Video4Spatial: Towards Visuospatial Intelligence with Context-Guided Video Generation
Abstract
Video4Spatial demonstrates that video diffusion models can perform complex spatial tasks using only visual data, achieving strong spatial understanding and generalization.
We investigate whether video generative models can exhibit visuospatial intelligence, a capability central to human cognition, using only visual data. To this end, we present Video4Spatial, a framework showing that video diffusion models conditioned solely on video-based scene context can perform complex spatial tasks. We validate the framework on two tasks: scene navigation, which requires following camera-pose instructions while remaining consistent with the 3D geometry of the scene, and object grounding, which requires semantic localization, instruction following, and planning. Both tasks use video-only inputs, without auxiliary modalities such as depth or camera poses. With simple yet effective design choices in the framework and data curation, Video4Spatial demonstrates strong spatial understanding from video context: it plans navigation and grounds target objects end-to-end, follows camera-pose instructions while maintaining spatial consistency, and generalizes to long contexts and out-of-domain environments. Taken together, these results advance video generative models toward general visuospatial reasoning.
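The abstract describes generation conditioned on video-based scene context plus a camera-pose instruction. The sketch below illustrates that interface only; it is a hypothetical stand-in, not the paper's implementation. The function name, the blend-toward-anchor loop, and the 6-DoF pose vector are all illustrative assumptions standing in for a learned diffusion denoiser.

```python
import numpy as np

def generate_navigation_video(context_frames, pose_instruction, num_steps=4, seed=0):
    """Hypothetical sketch of context-conditioned video generation.

    context_frames: (T, H, W, 3) array of observed scene frames (the video context).
    pose_instruction: (6,) camera-pose delta (translation + rotation), illustrative.
    Returns (num_steps, H, W, 3) generated future frames.
    """
    rng = np.random.default_rng(seed)
    _, h, w, c = context_frames.shape
    # Diffusion sampling starts from Gaussian noise over the frames to generate.
    x = rng.standard_normal((num_steps, h, w, c))
    # A real model would run a learned denoiser conditioned on the context
    # frames and the pose instruction at every reverse step; here we simply
    # blend toward the last context frame as a placeholder for that process.
    anchor = context_frames[-1]
    for step in range(10):  # stand-in for reverse diffusion steps
        alpha = (step + 1) / 10
        x = (1 - alpha) * x + alpha * anchor  # broadcasts over num_steps
    return x

frames = np.zeros((8, 16, 16, 3))          # dummy video context
video = generate_navigation_video(frames, np.zeros(6))
print(video.shape)  # (4, 16, 16, 3)
```

The key point the sketch mirrors is that the only conditioning signals are raw context frames and a pose instruction: no depth maps or ground-truth poses enter the generator.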