SARCLIP: Multimodal Foundation Model for SAR Imagery


πŸš€ Overview

SARCLIP is a multimodal foundation model specifically designed for Synthetic Aperture Radar (SAR) imagery based on the Contrastive Language-Image Pre-training (CLIP) framework. SARCLIP enables cross-modal understanding between SAR images and textual information, supporting zero-shot classification, cross-modal retrieval, and image-text inference.


πŸ›  Installation

Environment Requirements

  • Operating System: Linux or Windows
  • Python: β‰₯ 3.8
  • CUDA: A CUDA version compatible with your PyTorch installation

Dependencies

Install required Python libraries:

pip install -r requirements.txt

Hardware Recommendations

  • GPU: NVIDIA RTX 3060 or higher
  • Memory: β‰₯ 16GB RAM
  • VRAM: β‰₯ 12GB GPU Memory
  • Disk: β‰₯ 30GB free disk space

πŸ“‚ Project Structure

SARCLIP-main/
β”œβ”€β”€ sar_clip/
β”‚   β”œβ”€β”€ model_configs/     # Model configs & pre-trained weights
β”‚   β”œβ”€β”€ *.py               # Core model code
β”œβ”€β”€ data/                  # Dataset directory
β”œβ”€β”€ retrieval.py           # Cross-modal retrieval script
β”œβ”€β”€ zero-shot.py           # Zero-shot classification script
β”œβ”€β”€ zero-shot-inference.py # Image-text inference script
β”œβ”€β”€ example.py             # Demonstration script
β”œβ”€β”€ requirements.txt
β”œβ”€β”€ README.md

🚩 Quick Start

Zero-Shot Classification

Update the CLASSNAMES and TEMPLATES variables in zero-shot.py to match your target classes and prompts (an illustrative example follows the command), then execute:

python zero-shot.py \
  --imagenet-val "./data/zero-shot" \
  --batch-size 8 \
  --model "ViT-B-32" \
  --cache-dir "./sar_clip/model_configs/ViT-B-32" \
  --pretrained "./sar_clip/model_configs/ViT-B-32/vit_b_32_model.safetensors"
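
The script expects CLASSNAMES and TEMPLATES to define the label set and the prompt patterns used to build text queries. The snippet below is only an illustrative sketch: the class names are borrowed from the example output further down, and the callable-template style follows the OpenCLIP convention, so the exact format expected by zero-shot.py may differ.

# Illustrative CLASSNAMES / TEMPLATES for zero-shot.py. The format is an
# assumption following the OpenCLIP prompt-template convention; adjust to
# whatever zero-shot.py actually expects.
CLASSNAMES = ["urban zones", "water areas", "croplands"]

TEMPLATES = [
    lambda c: f"an SAR image of {c}",
    lambda c: f"a satellite SAR scene showing {c}",
]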

Cross-Modal Retrieval

Extract the ./data/retrieval/retrieval.rar archive first, then execute the retrieval script (the expected CSV layout is sketched after the command):

python retrieval.py \
  --val-data "./data/retrieval_file_list.csv" \
  --csv-img-key "filename" \
  --csv-caption-key "caption" \
  --batch-size 8 \
  --model "ViT-B-32" \
  --cache-dir "./sar_clip/model_configs/ViT-B-32" \
  --pretrained "./sar_clip/model_configs/ViT-B-32/vit_b_32_model.safetensors"
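
The retrieval CSV should contain one row per image-text pair, with column names matching --csv-img-key and --csv-caption-key. The sketch below writes such a file with pandas; the image paths and captions are placeholders, not part of the released data.

# Minimal sketch of the expected retrieval CSV layout; the columns match
# --csv-img-key "filename" and --csv-caption-key "caption".
# The paths and captions below are placeholders.
import pandas as pd

pairs = pd.DataFrame({
    "filename": ["./data/retrieval/0001.png", "./data/retrieval/0002.png"],
    "caption": [
        "an SAR image of urban zones",
        "an SAR image of water areas",
    ],
})
pairs.to_csv("./data/retrieval_file_list.csv", index=False)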

Image-Text Inference

Run inference directly on images:

python zero-shot-inference.py \
  --image-dir "path/to/images" \
  --batch-size 8 \
  --model "ViT-B-32" \
  --cache-dir "./sar_clip/model_configs/ViT-B-32" \
  --pretrained "./sar_clip/model_configs/ViT-B-32/vit_b_32_model.safetensors"

Example Output

Running example.py produces a visualization and prints textual predictions such as the following (a sketch of the underlying scoring recipe follows the output):

Predictions:
- an SAR image of urban zones                        1.0000
- an SAR image of water areas                        0.0000
- an SAR image of croplands                          0.0000
- one solitary marine craft is visible in the right region . 0.0000
- along the right side , several storage tanks are be detected . 0.0000
- 1 aircraft is found throughout the frame .         0.0000
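
For reference, these scores follow the standard CLIP recipe: encode the image and each candidate text, L2-normalize both embeddings, and take a softmax over the scaled cosine similarities. The sketch below expresses that recipe with the OpenCLIP API that SARCLIP builds on; the actual entry points in sar_clip and example.py may differ, and the image path is a placeholder.

# Sketch of CLIP-style image-text scoring with an OpenCLIP-compatible API.
# sar_clip may expose its own wrappers; this mirrors the open_clip package.
import torch
import open_clip
from PIL import Image

model, _, preprocess = open_clip.create_model_and_transforms(
    "ViT-B-32",
    pretrained="./sar_clip/model_configs/ViT-B-32/vit_b_32_model.safetensors",
)
tokenizer = open_clip.get_tokenizer("ViT-B-32")
model.eval()

texts = [
    "an SAR image of urban zones",
    "an SAR image of water areas",
    "an SAR image of croplands",
]
# Placeholder image path; replace with an actual SAR patch.
image = preprocess(Image.open("path/to/sar_patch.png").convert("RGB")).unsqueeze(0)

with torch.no_grad():
    img_feat = model.encode_image(image)
    txt_feat = model.encode_text(tokenizer(texts))
    img_feat /= img_feat.norm(dim=-1, keepdim=True)
    txt_feat /= txt_feat.norm(dim=-1, keepdim=True)
    probs = (100.0 * img_feat @ txt_feat.T).softmax(dim=-1)

for text, p in zip(texts, probs[0].tolist()):
    print(f"{text:45s} {p:.4f}")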

❓ Troubleshooting

  • Out of Memory (OOM): Reduce --batch-size.
  • Model Loading Failed: Verify that --pretrained and --cache-dir point to the correct model files.
  • GPU Not Used: Ensure your installed PyTorch build matches your CUDA version (a quick check is sketched below).
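
A quick way to confirm that PyTorch can see the GPU:

# Sanity check that PyTorch was built with CUDA support and a GPU is visible;
# otherwise the scripts will fall back to (much slower) CPU execution.
import torch

print("CUDA available:", torch.cuda.is_available())
if torch.cuda.is_available():
    print("Device:", torch.cuda.get_device_name(0))
    print("VRAM (GB):", round(torch.cuda.get_device_properties(0).total_memory / 1024**3, 1))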

πŸ“Œ License

  • Code: Released under the MIT License.
  • Dataset (SARCAP): Released under a separate Dataset License, for non-commercial research and educational use only.

πŸ’Ύ Model Weights & Dataset Access

Pretrained Model Weights

The pretrained SARCLIP weights are publicly available for research and non-commercial use.

To use the pretrained weights, place them under:

./sar_clip/model_configs/{MODEL_NAME}/
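
Once the weights are in place, a quick check that the safetensors checkpoint can be read is sketched below; the path matches the --pretrained argument used in the Quick Start commands, so adjust the file name if your weights differ.

# Verify the checkpoint loads before running the scripts.
from safetensors.torch import load_file

state_dict = load_file("./sar_clip/model_configs/ViT-B-32/vit_b_32_model.safetensors")
print(f"Loaded {len(state_dict)} tensors")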

Dataset Access

All released data are intended for non-commercial research and educational purposes only.

Dataset structure:

SARCAP/
β”œβ”€β”€ img/                   # SAR image patches
β”œβ”€β”€ img_caption.csv        # Image-text pairs
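
A simple integrity check that pairs img_caption.csv with the image folder is sketched below; the "filename" column name is an assumption (matching the retrieval script's --csv-img-key) and may need adjusting.

# Sketch of a SARCAP integrity check: every caption row should reference an
# existing image patch under img/. The "filename" column name is an assumption.
from pathlib import Path
import pandas as pd

root = Path("SARCAP")
pairs = pd.read_csv(root / "img_caption.csv")
missing = [f for f in pairs["filename"] if not (root / "img" / Path(f).name).exists()]
print(f"{len(pairs)} pairs, {len(missing)} missing images")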

To use the zero-shot examples, place them under:

./data/zero-shot/

πŸ“š Citation

If you use SARCLIP, please cite:

@misc{SARCLIP2025,
  author = {CAESAR-Radi},
  title = {SARCLIP: A Multimodal Foundation Framework for SAR Imagery via Contrastive Language-Image Pre-Training},
  year = {2025},
  publisher = {GitHub},
  url = {https://github.com/CAESAR-Radi/SARCLIP}
}

🌟 Acknowledgements

We thank the following organizations for providing datasets and inspiration:

  • Capella Space (Capella SAR Data)
  • ESA Copernicus Programme (WorldCover)
  • Anhui University (OGSOD)
  • University of Electronic Science and Technology of China (RSDD)
  • Huazhong University of Science and Technology (SADD)
  • Chinese Academy of Sciences (SIVED)
  • Technical University of Munich (SEN12MS)

Special thanks to the OpenCLIP team for their significant contributions.
