# SARCLIP: Multimodal Foundation Model for SAR Imagery

## Overview
SARCLIP is a multimodal foundation model specifically designed for Synthetic Aperture Radar (SAR) imagery based on the Contrastive Language-Image Pre-training (CLIP) framework. SARCLIP enables cross-modal understanding between SAR images and textual information, supporting zero-shot classification, cross-modal retrieval, and image-text inference.
## Installation

### Environment Requirements

- Operating System: Linux or Windows
- Python: ≥ 3.8
- CUDA: a CUDA version supported by your PyTorch build

### Dependencies

Install the required Python libraries:

```shell
pip install -r requirements.txt
```

### Hardware Recommendations

- GPU: NVIDIA RTX 3060 or higher
- Memory: ≥ 16 GB RAM
- VRAM: ≥ 12 GB GPU memory
- Disk: ≥ 30 GB free disk space
## Project Structure

```
SARCLIP-main/
├── sar_clip/
│   ├── model_configs/       # Model configs & pre-trained weights
│   └── *.py                 # Core model code
├── data/                    # Dataset directory
├── retrieval.py             # Cross-modal retrieval script
├── zero-shot.py             # Zero-shot classification script
├── zero-shot-inference.py   # Image-text inference script
├── example.py               # Demonstration script
├── requirements.txt
└── README.md
```
## Quick Start

### Zero-Shot Classification

Update `CLASSNAMES` and `TEMPLATES` in `zero-shot.py`, then execute:

```shell
python zero-shot.py \
    --imagenet-val "./data/zero-shot" \
    --batch-size 8 \
    --model "ViT-B-32" \
    --cache-dir "./sar_clip/model_configs/ViT-B-32" \
    --pretrained "./sar_clip/model_configs/ViT-B-32/vit_b_32_model.safetensors"
```
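A rough sketch of how `CLASSNAMES` and `TEMPLATES` typically combine into text prompts for CLIP-style zero-shot classification; the class names, templates, and `build_prompts` helper below are illustrative placeholders, not the repository's actual values:

```python
# Hypothetical values -- replace with your own classes and prompt templates.
CLASSNAMES = ["urban zones", "water areas", "croplands"]
TEMPLATES = [
    "an SAR image of {}",
    "a satellite radar image of {}",
]

def build_prompts(classname):
    """Expand one class name with every template (prompt ensembling)."""
    return [t.format(classname) for t in TEMPLATES]

prompts = {c: build_prompts(c) for c in CLASSNAMES}
print(prompts["urban zones"])
# ['an SAR image of urban zones', 'a satellite radar image of urban zones']
```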
### Cross-Modal Retrieval

Extract `./data/retrieval/retrieval.rar` first, then execute the retrieval script:

```shell
python retrieval.py \
    --val-data "./data/retrieval_file_list.csv" \
    --csv-img-key "filename" \
    --csv-caption-key "caption" \
    --batch-size 8 \
    --model "ViT-B-32" \
    --cache-dir "./sar_clip/model_configs/ViT-B-32" \
    --pretrained "./sar_clip/model_configs/ViT-B-32/vit_b_32_model.safetensors"
```
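Conceptually, retrieval ranks captions (or images) by cosine similarity between L2-normalized embeddings. A minimal, dependency-free sketch with toy vectors standing in for SARCLIP features (`rank_texts` is an illustrative helper, not part of the repository):

```python
import math

def normalize(v):
    """Scale a vector to unit L2 norm."""
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v]

def rank_texts(image_emb, text_embs):
    """Return caption indices sorted by descending cosine similarity."""
    img = normalize(image_emb)
    sims = [sum(a * b for a, b in zip(img, normalize(t))) for t in text_embs]
    return sorted(range(len(text_embs)), key=lambda i: -sims[i])

# Toy 3-D embeddings: the second caption is closest to the image.
order = rank_texts([1.0, 0.0, 0.0], [[0.0, 1.0, 0.0], [0.9, 0.1, 0.0]])
print(order)  # [1, 0]
```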
### Image-Text Inference

Run inference directly on a directory of images:

```shell
python zero-shot-inference.py \
    --image-dir "path/to/images" \
    --batch-size 8 \
    --model "ViT-B-32" \
    --cache-dir "./sar_clip/model_configs/ViT-B-32" \
    --pretrained "./sar_clip/model_configs/ViT-B-32/vit_b_32_model.safetensors"
```
### Example Output

Running `example.py` produces a visualization and prints textual predictions:

```
Predictions:
- an SAR image of urban zones 1.0000
- an SAR image of water areas 0.0000
- an SAR image of croplands 0.0000
- one solitary marine craft is visible in the right region . 0.0000
- along the right side , several storage tanks are be detected . 0.0000
- 1 aircraft is found throughout the frame . 0.0000
```
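The near-binary probabilities follow the usual CLIP recipe: cosine similarities are multiplied by a learned logit scale (commonly around 100) and passed through a softmax, so one clearly matching prompt saturates toward 1.0. A sketch with made-up similarity values:

```python
import math

def softmax(logits):
    """Numerically stable softmax."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

cosine_sims = [0.31, 0.12, 0.10]   # illustrative values, not SARCLIP output
logit_scale = 100.0                # typical CLIP temperature
probs = softmax([logit_scale * s for s in cosine_sims])
print([round(p, 4) for p in probs])  # [1.0, 0.0, 0.0]
```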
## Troubleshooting

- Out of Memory (OOM): decrease `--batch-size`.
- Model loading failed: verify the path to the pretrained weights.
- GPU not used: ensure your CUDA and PyTorch versions are compatible.
## License
- Code: Released under the MIT License.
- Dataset (SARCAP): Released under a separate Dataset License, for non-commercial research and educational use only.
## Model Weights & Dataset Access

### Pretrained Model Weights

The pretrained SARCLIP weights are publicly available for research and non-commercial use.

- SARCLIP Weights: Baidu Netdisk (extraction code: `dizf`)

To use the pretrained weights, place them under:

```
./sar_clip/model_configs/{MODEL_NAME}/
```
### Dataset Access

All released data are intended for non-commercial research and educational purposes only.

- SARCAP Dataset: Baidu Netdisk (extraction code: `2nxm`)
- Zero-Shot: Baidu Netdisk (extraction code: `quh2`)

Dataset structure:

```
SARCAP/
├── img/             # SAR image patches
└── img_caption.csv  # Image-text pairs
```
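Image-text pairs can be loaded with the standard `csv` module. The column names `filename` and `caption` are assumed here because they match the retrieval script's defaults, and the inline sample stands in for the real `img_caption.csv`:

```python
import csv
import io

# Inline stand-in for ./SARCAP/img_caption.csv.
sample = "filename,caption\npatch_0001.png,an SAR image of urban zones\n"

pairs = [(row["filename"], row["caption"])
         for row in csv.DictReader(io.StringIO(sample))]
print(pairs[0])  # ('patch_0001.png', 'an SAR image of urban zones')
```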
To use the zero-shot examples, place them under:

```
./data/zero-shot/
```
## Citation

If you use SARCLIP, please cite:

```bibtex
@misc{SARCLIP2025,
  author    = {CAESAR-Radi},
  title     = {SARCLIP: A Multimodal Foundation Framework for SAR Imagery via Contrastive Language-Image Pre-Training},
  year      = {2025},
  publisher = {GitHub},
  url       = {https://github.com/CAESAR-Radi/SARCLIP}
}
```
## Acknowledgements
We thank the following organizations for providing datasets and inspiration:
- Capella Space (Capella SAR Data)
- ESA Copernicus Programme (WorldCover)
- Anhui University (OGSOD)
- University of Electronic Science and Technology of China (RSDD)
- Huazhong University of Science and Technology (SADD)
- Chinese Academy of Sciences (SIVED)
- Technical University of Munich (SEN12MS)
Special thanks to the OpenCLIP team for their significant contributions.