Model Card for UniPR-3D
UniPR-3D is a universal visual place recognition (VPR) framework that supports both single-frame and sequence-to-sequence matching. It leverages 3D visual geometry grounded tokens within a transformer architecture to produce robust, viewpoint-invariant descriptors for long-term place recognition under challenging environmental variations (e.g., seasonal, weather, lighting, and viewpoint changes).
Model Details
Model Description
- Developed by: Tianchen Deng, Xun Chen, Ziming Li, Hongming Shen, Danwei Wang, Javier Civera, Hesheng Wang
- Shared by: Tianchen Deng
- Model type: Vision Transformer with 3D-aware token aggregation for visual place recognition
- Language(s): English (dataset metadata); model is vision-only
- License: MIT
Model Sources
- Repository: repo
- Paper: UniPR-3D: Towards Universal Visual Place Recognition with 3D Visual Geometry Grounded Transformer (arXiv:2512.21078, 2025)
- Demo: No demo available
Uses
Direct Use
This model can be used out-of-the-box to extract compact, discriminative global descriptors from:
- Single RGB images (for frame-to-frame VPR)
- Sequences of images (for sequence-to-sequence VPR)
These descriptors are suitable for large-scale localization, robot navigation, and SLAM systems requiring robustness to appearance changes.
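To illustrate how such descriptors are typically consumed downstream, here is a minimal retrieval sketch in NumPy; the function name is ours and the random arrays stand in for real descriptors, so this is an illustration rather than the repository's API.

```python
# Minimal retrieval sketch (our illustration, not the repository API):
# rank database images by cosine similarity of global descriptors.
import numpy as np

def top_k_matches(query_desc: np.ndarray, db_desc: np.ndarray, k: int = 5) -> np.ndarray:
    """query_desc: (Q, D), db_desc: (N, D); returns (Q, k) database indices."""
    q = query_desc / np.linalg.norm(query_desc, axis=1, keepdims=True)
    d = db_desc / np.linalg.norm(db_desc, axis=1, keepdims=True)
    sims = q @ d.T                            # (Q, N) cosine similarities
    return np.argsort(-sims, axis=1)[:, :k]   # k best database indices per query

# Random stand-ins for real descriptors (D = 17152 for UniPR-3D):
queries = np.random.randn(4, 17152).astype(np.float32)
database = np.random.randn(100, 17152).astype(np.float32)
print(top_k_matches(queries, database, k=5))
```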
Downstream Use
- Integration into visual SLAM or long-term autonomous navigation pipelines
- Replacement for traditional VPR backbones (e.g., NetVLAD, MixVPR, EigenPlaces)
- Fine-tuning on domain-specific datasets (e.g., underground, aerial, or underwater environments); a LoRA-style adaptation sketch follows this list
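Since the repository ships eval_lora.py, LoRA-style adaptation appears to be the intended fine-tuning route. The snippet below is a hedged illustration using Hugging Face peft on a transformers DINOv2 backbone; it is not the repository's actual fine-tuning code, and the rank/alpha values are common defaults.

```python
# Hedged illustration of LoRA-style adaptation (suggested by eval_lora.py in
# the repo) using Hugging Face peft on a DINOv2 backbone; this is NOT the
# repository's actual fine-tuning code.
from peft import LoraConfig, get_peft_model
from transformers import Dinov2Model

backbone = Dinov2Model.from_pretrained("facebook/dinov2-large")
lora_cfg = LoraConfig(r=16, lora_alpha=32, lora_dropout=0.05,
                      target_modules=["query", "value"])  # attention projections
model = get_peft_model(backbone, lora_cfg)
model.print_trainable_parameters()  # only the low-rank adapters are trainable
```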
Out-of-Scope Use
- Not intended for real-time inference on low-power embedded devices without optimization (single-frame latency is 8.23 ms even on an RTX 4090)
- Not designed for non-visual modalities (e.g., LiDAR, audio, text)
- Performance may degrade in extreme occlusion, textureless scenes, or indoor environments not seen during training
Bias, Risks, and Limitations
- Trained primarily on urban street-level imagery (GSV-Cities, Mapillary MSLS), so generalization to rural, indoor, or non-Western cities may be limited
- Inherits biases from training data (e.g., geographic overrepresentation of North America/Europe)
- No explicit fairness or demographic considerations (as it is a geometric vision model)
Recommendations
- Evaluate on target domain before deployment
- Monitor recall performance on your specific dataset using standard VPR metrics (R@1, R@5)
How to Get Started with the Model
Inference and fine-tuning scripts are provided in the GitHub repository (eval_lora.py, main_ft.py). Pretrained weights are available on Hugging Face or via the repository releases.
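A hedged usage sketch follows, assuming a constructor and weight file whose real names live in the repository; build_unipr3d and unipr3d_weights.pth below are placeholders, and the normalization statistics are the standard ImageNet values, not confirmed by the paper.

```python
# Hedged usage sketch: the real entry points are eval_lora.py / main_ft.py in
# the repository; build_unipr3d() and the weight filename are placeholders.
import torch
from PIL import Image
import torchvision.transforms as T

preprocess = T.Compose([
    T.Resize((518, 518)),                    # matches the reported input size
    T.ToTensor(),
    T.Normalize(mean=[0.485, 0.456, 0.406],  # standard ImageNet statistics
                std=[0.229, 0.224, 0.225]),
])

model = build_unipr3d()                      # hypothetical constructor from the repo
model.load_state_dict(torch.load("unipr3d_weights.pth", map_location="cpu"))
model.cuda().eval()

image = preprocess(Image.open("query.jpg").convert("RGB")).unsqueeze(0).cuda()
with torch.no_grad():
    descriptor = model(image)                # expected shape: (1, 17152)
```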
Training Details
Training Data
- Single-frame model: Trained on GSV-Cities
- Multi-frame model: Trained on Mapillary Street-Level Sequences (MSLS)
- Both datasets contain millions of geo-tagged urban street-view images across diverse cities, seasons, and conditions.
Training Procedure
Preprocessing
- Images resized to 518×518
- Sequences sampled by spatial proximity for multi-frame training (see the sketch after this list)
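As an illustration of proximity-based sequence sampling, the sketch below keeps a window of frames only if consecutive GPS fixes lie within a distance threshold. The threshold, sequence length, and data layout are our assumptions, not the paper's exact procedure.

```python
# Illustrative sketch of proximity-based sequence sampling (threshold and
# data layout are assumptions, not the paper's exact procedure).
import math

def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in meters between two GPS points."""
    r = 6371000.0
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dp, dl = math.radians(lat2 - lat1), math.radians(lon2 - lon1)
    a = math.sin(dp / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dl / 2) ** 2
    return 2 * r * math.asin(math.sqrt(a))

def sample_sequence(frames, seq_len=5, max_gap_m=10.0):
    """Return the first run of seq_len consecutive frames whose successive
    GPS positions are within max_gap_m of each other."""
    for i in range(len(frames) - seq_len + 1):
        window = frames[i:i + seq_len]
        if all(haversine_m(a["lat"], a["lon"], b["lat"], b["lon"]) <= max_gap_m
               for a, b in zip(window, window[1:])):
            return window
    return None

# Tiny example: 8 frames spaced roughly 1 m apart in latitude.
frames = [{"lat": 48.8584 + 1e-5 * i, "lon": 2.2945} for i in range(8)]
print(sample_sequence(frames))  # first 5-frame window with gaps <= 10 m
```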
Training Hyperparameters
- Backbone: DINOv2 (ViT-large)
- Optimization: AdamW, learning rate scheduling
- Loss: Multi-similarity loss with pair weighting (a minimal sketch follows this list)
- Training regime: Mixed-precision (fp16) on NVIDIA GPUs
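For reference, here is a minimal sketch of the multi-similarity loss with pair weighting (Wang et al., CVPR 2019), omitting the hard-pair mining step; the alpha/beta/base values are common defaults, not necessarily those used for UniPR-3D.

```python
# Minimal multi-similarity loss sketch (Wang et al., CVPR 2019), without
# hard-pair mining; hyperparameters are common defaults, not the paper's.
import torch
import torch.nn.functional as F

def multi_similarity_loss(embeddings, labels, alpha=2.0, beta=50.0, base=0.5):
    emb = F.normalize(embeddings, dim=1)
    sims = emb @ emb.t()                         # pairwise cosine similarities
    idx = torch.arange(len(labels), device=labels.device)
    losses = []
    for i in range(len(labels)):
        pos = (labels == labels[i]) & (idx != i)
        neg = labels != labels[i]
        pos_s, neg_s = sims[i][pos], sims[i][neg]
        if pos_s.numel() == 0 or neg_s.numel() == 0:
            continue
        # Pair weighting enters through the log-sum-exp terms.
        pos_term = torch.log1p(torch.exp(-alpha * (pos_s - base)).sum()) / alpha
        neg_term = torch.log1p(torch.exp(beta * (neg_s - base)).sum()) / beta
        losses.append(pos_term + neg_term)
    return torch.stack(losses).mean() if losses else embeddings.sum() * 0.0

emb = torch.randn(8, 128, requires_grad=True)
labels = torch.tensor([0, 0, 1, 1, 2, 2, 3, 3])
print(multi_similarity_loss(emb, labels))
```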
Speeds, Sizes, Times
- Inference latency: 8.23 ms per image for single-frame matching on an RTX 4090 (see the measurement sketch after this list)
- Descriptor dimension: 17152 (for UniPR-3D)
- Training time: Not disclosed (multi-day runs on multi-GPU setup)
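Latency figures like the one above are commonly measured with CUDA events after a warm-up phase; a sketch, where `model` and `inp` are placeholders for the loaded network and a preprocessed input batch:

```python
# Latency measurement sketch using CUDA events; `model` and `inp` are
# placeholders for the loaded network and a preprocessed input batch.
import torch

def measure_latency_ms(model, inp, warmup=10, iters=100):
    start = torch.cuda.Event(enable_timing=True)
    end = torch.cuda.Event(enable_timing=True)
    with torch.no_grad():
        for _ in range(warmup):              # discard one-time startup costs
            model(inp)
        torch.cuda.synchronize()
        start.record()
        for _ in range(iters):
            model(inp)
        end.record()
        torch.cuda.synchronize()
    return start.elapsed_time(end) / iters   # average ms per forward pass
```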
Evaluation
Testing Data, Factors & Metrics
Testing Data
- Single-frame evaluation:
  - MSLS Challenge set (predictions are submitted to the official evaluation server)
  - Single-frame MSLS Validation set
  - Nordland, Pittsburgh, and SPED datasets, available for download here, following the splits used by DINOv2 SALAD
- Multi-frame evaluation:
  - Multi-frame MSLS Validation set
  - Two sequences from Oxford RobotCar, available for download here:
    - 2014-12-16-18-44-24 (winter night) queries against the 2014-11-18-13-20-12 (fall day) database
    - 2014-11-14-16-34-33 (fall night) queries against the 2015-11-13-10-28-08 (fall day) database
  - Nordland (filtered) dataset
Factors
- Seasonal variation (summer vs. winter)
- Day vs. night
- Weather (sunny, rainy, snowy)
- Viewpoint change (lateral shift, orientation)
Metrics
- Recall@K (R@1, R@5, R@10): the standard VPR metric, i.e. the fraction of queries whose correct match appears among the top-K retrieved database images (a computation sketch follows)
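A minimal sketch of how Recall@K is computed, assuming ranked retrieval indices and per-query ground-truth sets; the 25 m positive radius mentioned in the comment is the common MSLS convention, not a claim about this paper's exact setup.

```python
# Recall@K sketch: a query counts as a hit if any of its top-K retrieved
# database indices is a ground-truth match (MSLS commonly defines matches
# within a 25 m radius; that convention is assumed here).
import numpy as np

def recall_at_k(ranked_indices: np.ndarray, ground_truth: list, ks=(1, 5, 10)) -> dict:
    """ranked_indices: (Q, N) DB indices sorted best-first;
    ground_truth: list of Q sets of correct DB indices."""
    recalls = {}
    for k in ks:
        hits = sum(bool(set(ranked_indices[q, :k].tolist()) & ground_truth[q])
                   for q in range(len(ground_truth)))
        recalls[f"R@{k}"] = hits / len(ground_truth)
    return recalls
```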
Results
Summary
UniPR-3D achieves significantly higher recall than competing approaches, setting a new state of the art on both single-frame and multi-frame benchmarks.
Single-frame matching results
| Method | Latency (ms) | MSLS Challenge R@1 | MSLS Challenge R@5 | MSLS Val R@1 | MSLS Val R@5 | Nordland R@1 | Nordland R@5 | Pitts250k-test R@1 | Pitts250k-test R@5 | SPED R@1 | SPED R@5 |
|---|---|---|---|---|---|---|---|---|---|---|---|
| MixVPR | 1.37 | 64.0 | 75.9 | 88.0 | 92.7 | 58.4 | 74.6 | 94.6 | 98.3 | 85.2 | 92.1 |
| EigenPlaces | 2.65 | 67.4 | 77.1 | 89.3 | 93.7 | 54.4 | 68.8 | 94.1 | 98.0 | 69.9 | 82.9 |
| DINOv2 SALAD | 2.41 | 73.0 | 86.8 | 91.2 | 95.3 | 69.6 | 84.4 | 94.5 | 98.7 | 89.5 | 94.4 |
| UniPR-3D (ours) | 8.23 | 74.3 | 87.5 | 91.4 | 96.0 | 76.2 | 87.3 | 94.9 | 98.1 | 89.6 | 94.5 |
Sequence matching results
| Method | MSLS Val R@1 | MSLS Val R@5 | MSLS Val R@10 | Nordland R@1 | Nordland R@5 | Nordland R@10 | Oxford1 R@1 | Oxford1 R@5 | Oxford1 R@10 | Oxford2 R@1 | Oxford2 R@5 | Oxford2 R@10 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| SeqMatchNet | 65.5 | 77.5 | 80.3 | 56.1 | 71.4 | 76.9 | 36.8 | 43.3 | 48.3 | 27.9 | 38.5 | 45.3 |
| SeqVLAD | 89.9 | 92.4 | 94.1 | 65.5 | 75.2 | 80.0 | 58.4 | 72.8 | 80.8 | 19.1 | 29.9 | 37.3 |
| CaseVPR | 91.2 | 94.1 | 95.0 | 84.1 | 89.9 | 92.2 | 90.5 | 95.2 | 96.5 | 72.8 | 85.8 | 89.9 |
| UniPR-3D (ours) | 93.7 | 95.7 | 96.9 | 86.8 | 91.7 | 93.8 | 95.4 | 98.1 | 98.7 | 80.6 | 90.3 | 93.9 |
Compute Infrastructure
Hardware
- NVIDIA RTX 4090
Software
- Python training and evaluation scripts (eval_lora.py, main_ft.py); see the repository for the full dependency list
Citation
BibTeX:
@article{deng2025unipr3d,
  title={UniPR-3D: Towards Universal Visual Place Recognition with 3D Visual Geometry Grounded Transformer},
  author={Deng, Tianchen and Chen, Xun and Li, Ziming and Shen, Hongming and Wang, Danwei and Civera, Javier and Wang, Hesheng},
  journal={arXiv preprint arXiv:2512.21078},
  year={2025}
}
APA: Deng, T., Chen, X., Li, Z., Shen, H., Wang, D., Civera, J., & Wang, H. (2025). UniPR-3D: Towards Universal Visual Place Recognition with 3D Visual Geometry Grounded Transformer. arXiv preprint arXiv:2512.21078.
Contact
For questions, pretrained model access, or qualitative comparisons, please contact:
Tianchen Deng (dengtianchen@sjtu.edu.cn)
Acknowledgement: This implementation builds upon SALAD and VGGT. Please cite those works if you use their components.