- Modality Gap-Driven Subspace Alignment Training Paradigm for Multimodal Large Language Models • Paper 2602.07026 • Published 12 days ago • 133 upvotes
- Watching, Reasoning, and Searching: A Video Deep Research Benchmark on Open Web for Agentic Video Reasoning • Paper 2601.06943 • Published Jan 11 • 211 upvotes
- Thinking with Video: Video Generation as a Promising Multimodal Reasoning Paradigm • Paper 2511.04570 • Published Nov 6, 2025 • 216 upvotes
- RoboOmni: Proactive Robot Manipulation in Omni-modal Context • Paper 2510.23763 • Published Oct 27, 2025 • 56 upvotes
- EO-Robotics • Collection • EmbodiedOneVision is a unified framework for multimodal embodied reasoning and robot control, featuring interleaved vision-text-action pretraining. • 8 items • Updated Dec 7, 2025 • 8 upvotes
- Unicorn: Text-Only Data Synthesis for Vision Language Model Training • Paper 2503.22655 • Published Mar 28, 2025 • 39 upvotes
- DropletVideo: A Dataset and Approach to Explore Integral Spatio-Temporal Consistent Video Generation • Paper 2503.06053 • Published Mar 8, 2025 • 138 upvotes
- RULE: Reliable Multimodal RAG for Factuality in Medical Vision Language Models • Paper 2407.05131 • Published Jul 6, 2024 • 26 upvotes