ASSISTGUI: Task-Oriented Desktop Graphical User Interface Automation Paper • 2312.13108 • Published Dec 20, 2023 • 3
TextAtlas5M: A Large-scale Dataset for Dense Text Image Generation Paper • 2502.07870 • Published Feb 11, 2025 • 45
From Charts to Code: A Hierarchical Benchmark for Multimodal Models Paper • 2510.17932 • Published Oct 20, 2025 • 8
VCode: a Multimodal Coding Benchmark with SVG as Symbolic Visual Representation Paper • 2511.02778 • Published Nov 4, 2025 • 103
Is Heuristic Sampling Necessary in Training Deep Object Detectors? Paper • 1909.04868 • Published Sep 11, 2019
Bootstrapping SparseFormers from Vision Foundation Models Paper • 2312.01987 • Published Dec 4, 2023 • 1
AssistGPT: A General Multi-modal Assistant that can Plan, Execute, Inspect, and Learn Paper • 2306.08640 • Published Jun 14, 2023 • 27
Learning Video Context as Interleaved Multimodal Sequences Paper • 2407.21757 • Published Jul 31, 2024
VideoLLM-MoD: Efficient Video-Language Streaming with Mixture-of-Depths Vision Computation Paper • 2408.16730 • Published Aug 29, 2024
LiveCC: Learning Video LLM with Streaming Speech Transcription at Scale Paper • 2504.16030 • Published Apr 22, 2025 • 38
One Token to Seg Them All: Language Instructed Reasoning Segmentation in Videos Paper • 2409.19603 • Published Sep 29, 2024 • 19
VideoLLM-online: Online Video Large Language Model for Streaming Video Paper • 2406.11816 • Published Jun 17, 2024 • 26
UniVTG: Towards Unified Video-Language Temporal Grounding Paper • 2307.16715 • Published Jul 31, 2023 • 12