# SARCLIP: Multimodal Foundation Model for SAR Imagery

## Overview
SARCLIP is a multimodal foundation model specifically designed for Synthetic Aperture Radar (SAR) imagery based on the Contrastive Language-Image Pre-training (CLIP) framework. SARCLIP enables cross-modal understanding between SAR images and textual information, supporting zero-shot classification, cross-modal retrieval, and image-text inference.
## Installation

### Environment Requirements

- Operating System: Linux or Windows
- Python: ≥ 3.8
- CUDA: a CUDA version supported by your PyTorch build

### Dependencies

Install the required Python libraries:

```shell
pip install -r requirements.txt
```

### Hardware Recommendations

- GPU: NVIDIA RTX 3060 or higher
- Memory: ≥ 16 GB RAM
- VRAM: ≥ 12 GB GPU memory
- Disk: ≥ 30 GB free disk space
## Project Structure

```
SARCLIP-main/
├── sar_clip/
│   ├── model_configs/       # Model configs & pre-trained weights
│   └── *.py                 # Core model code
├── data/                    # Dataset directory
├── retrieval.py             # Cross-modal retrieval script
├── zero-shot.py             # Zero-shot classification script
├── zero-shot-inference.py   # Image-text inference script
├── example.py               # Demonstration script
├── requirements.txt
└── README.md
```
## Quick Start

### Zero-Shot Classification

Update `CLASSNAMES` and `TEMPLATES` in `zero-shot.py`, then execute:

```shell
python zero-shot.py \
    --imagenet-val "./data/zero-shot" \
    --batch-size 8 \
    --model "ViT-B-32" \
    --cache-dir "./sar_clip/model_configs/ViT-B-32" \
    --pretrained "./sar_clip/model_configs/ViT-B-32/vit_b_32_model.safetensors"
```
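A rough sketch of how `CLASSNAMES` and `TEMPLATES` typically combine into text prompts for CLIP-style zero-shot classification; the class names, templates, and `build_prompts` helper below are illustrative placeholders, not the repository's actual values:

```python
# Hypothetical values -- replace with your own classes and prompt templates.
CLASSNAMES = ["urban zones", "water areas", "croplands"]
TEMPLATES = [
    "an SAR image of {}",
    "a satellite radar image of {}",
]

def build_prompts(classname):
    """Expand one class name with every template (prompt ensembling)."""
    return [t.format(classname) for t in TEMPLATES]

prompts = {c: build_prompts(c) for c in CLASSNAMES}
print(prompts["urban zones"])
# ['an SAR image of urban zones', 'a satellite radar image of urban zones']
```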
### Cross-Modal Retrieval

Extract `./data/retrieval/retrieval.rar` first, then execute the retrieval script:

```shell
python retrieval.py \
    --val-data "./data/retrieval_file_list.csv" \
    --csv-img-key "filename" \
    --csv-caption-key "caption" \
    --batch-size 8 \
    --model "ViT-B-32" \
    --cache-dir "./sar_clip/model_configs/ViT-B-32" \
    --pretrained "./sar_clip/model_configs/ViT-B-32/vit_b_32_model.safetensors"
```
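Conceptually, retrieval ranks captions (or images) by cosine similarity between L2-normalized embeddings. A minimal, dependency-free sketch with toy vectors standing in for SARCLIP features (`rank_texts` is an illustrative helper, not part of the repository):

```python
import math

def normalize(v):
    """Scale a vector to unit L2 norm."""
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v]

def rank_texts(image_emb, text_embs):
    """Return caption indices sorted by descending cosine similarity."""
    img = normalize(image_emb)
    sims = [sum(a * b for a, b in zip(img, normalize(t))) for t in text_embs]
    return sorted(range(len(text_embs)), key=lambda i: -sims[i])

# Toy 3-D embeddings: the second caption is closest to the image.
order = rank_texts([1.0, 0.0, 0.0], [[0.0, 1.0, 0.0], [0.9, 0.1, 0.0]])
print(order)  # [1, 0]
```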
### Image-Text Inference

Run inference directly on a directory of images:

```shell
python zero-shot-inference.py \
    --image-dir "path/to/images" \
    --batch-size 8 \
    --model "ViT-B-32" \
    --cache-dir "./sar_clip/model_configs/ViT-B-32" \
    --pretrained "./sar_clip/model_configs/ViT-B-32/vit_b_32_model.safetensors"
```
### Example Output

Running `example.py` produces a visualization and prints textual predictions:

```
Predictions:
- an SAR image of urban zones 1.0000
- an SAR image of water areas 0.0000
- an SAR image of croplands 0.0000
- one solitary marine craft is visible in the right region . 0.0000
- along the right side , several storage tanks are be detected . 0.0000
- 1 aircraft is found throughout the frame . 0.0000
```
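The near-binary probabilities follow the usual CLIP recipe: cosine similarities are multiplied by a learned logit scale (commonly around 100) and passed through a softmax, so one clearly matching prompt saturates toward 1.0. A sketch with made-up similarity values:

```python
import math

def softmax(logits):
    """Numerically stable softmax."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

cosine_sims = [0.31, 0.12, 0.10]   # illustrative values, not SARCLIP output
logit_scale = 100.0                # typical CLIP temperature
probs = softmax([logit_scale * s for s in cosine_sims])
print([round(p, 4) for p in probs])  # [1.0, 0.0, 0.0]
```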
## Troubleshooting

- Out of Memory (OOM): decrease `--batch-size`.
- Model loading failed: verify the path to the pretrained weights.
- GPU not used: ensure your CUDA and PyTorch versions are compatible.
## License
- Code: Released under the MIT License.
- Dataset (SARCAP): Released under a separate Dataset License, for non-commercial research and educational use only.
## Model Weights & Dataset Access

### Pretrained Model Weights

The pretrained SARCLIP weights are publicly available for research and non-commercial use.

- SARCLIP Weights: Baidu Netdisk (extraction code: `dizf`)

To use the pretrained weights, place them under:

```
./sar_clip/model_configs/{MODEL_NAME}/
```
### Dataset Access

All released data are intended for non-commercial research and educational purposes only.

- SARCAP Dataset: Baidu Netdisk (extraction code: `2nxm`)
- Zero-Shot: Baidu Netdisk (extraction code: `quh2`)

Dataset structure:

```
SARCAP/
├── img/             # SAR image patches
└── img_caption.csv  # Image-text pairs
```
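Image-text pairs can be loaded with the standard `csv` module. The column names `filename` and `caption` are assumed here because they match the retrieval script's defaults, and the inline sample stands in for the real `img_caption.csv`:

```python
import csv
import io

# Inline stand-in for ./SARCAP/img_caption.csv.
sample = "filename,caption\npatch_0001.png,an SAR image of urban zones\n"

pairs = [(row["filename"], row["caption"])
         for row in csv.DictReader(io.StringIO(sample))]
print(pairs[0])  # ('patch_0001.png', 'an SAR image of urban zones')
```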
To use the zero-shot examples, place them under:

```
./data/zero-shot/
```
## Citation

If you use SARCLIP, please cite:

```bibtex
@misc{SARCLIP2025,
  author    = {CAESAR-Radi},
  title     = {SARCLIP: A Multimodal Foundation Framework for SAR Imagery via Contrastive Language-Image Pre-Training},
  year      = {2025},
  publisher = {GitHub},
  url       = {https://github.com/CAESAR-Radi/SARCLIP}
}
```
## Acknowledgements
We thank the following organizations for providing datasets and inspiration:
- Capella Space (Capella SAR Data)
- ESA Copernicus Programme (WorldCover)
- Anhui University (OGSOD)
- University of Electronic Science and Technology of China (RSDD)
- Huazhong University of Science and Technology (SADD)
- Chinese Academy of Sciences (SIVED)
- Technical University of Munich (SEN12MS)
Special thanks to the OpenCLIP team for their significant contributions.