Lexicon3D: Probing Visual Foundation Models for Complex 3D Scene Understanding Paper • 2409.03757 • Published Sep 5, 2024 • 3
RandAR: Decoder-only Autoregressive Visual Generation in Random Orders Paper • 2412.01827 • Published Dec 2, 2024
PaintScene4D: Consistent 4D Scene Generation from Text Prompts Paper • 2412.04471 • Published Dec 5, 2024
AgMMU: A Comprehensive Agricultural Multimodal Understanding and Reasoning Benchmark Paper • 2504.10568 • Published Apr 14, 2025
Argus: Vision-Centric Reasoning with Grounded Chain-of-Thought Paper • 2505.23766 • Published May 29, 2025
LocateAnything3D: Vision-Language 3D Detection with Chain-of-Sight Paper • 2511.20648 • Published Nov 25, 2025 • 1
Fast-ThinkAct: Efficient Vision-Language-Action Reasoning via Verbalizable Latent Planning Paper • 2601.09708 • Published Jan 14, 2026 • 55
OSGym: Scalable Distributed Data Engine for Generalizable Computer Agents Paper • 2511.11672 • Published Nov 11, 2025
LocateAnything: Fast and High-Quality Vision-Language Grounding with Parallel Box Decoding Paper • 2605.27365 • Published 5 days ago • 127
Frozen Transformers in Language Models Are Effective Visual Encoder Layers Paper • 2310.12973 • Published Oct 19, 2023 • 1
Situational Awareness Matters in 3D Vision Language Reasoning Paper • 2406.07544 • Published Jun 11, 2024 • 1
Floating No More: Object-Ground Reconstruction from a Single Image Paper • 2407.18914 • Published Jul 26, 2024 • 20