RETFound ViT-L/14 (DINOv2 → Transformers) — MEH AlzEye
Author of this fork: Dávid Isztl
Upstream project: RETFound_dinov2_meh by Yukun Zhou et al.
Paper: A foundation model for generalizable disease detection from retinal images, Nature (2023)
This repository provides a Transformers-compatible export of the RETFound DINOv2 encoder trained on a subset of MEH AlzEye (retinal CFP).
It includes `config.json`, `model.safetensors`, and an `AutoImageProcessor`, so you can load it directly with 🤗 `AutoModel` / `AutoModelForImageClassification`.
Model Details
Model Description
This is a ViT-Large/14 encoder pretrained with the DINOv2 objective on retinal color fundus photographs (CFP).
This fork converts the original PyTorch .pth checkpoint into a standard 🤗 Transformers format and removes DINOv2-only components.
- Developed by (upstream): Yukun Zhou et al.
- Shared by (this fork): Dávid Isztl
- Model type: Vision Transformer (encoder only)
- License: CC BY-NC 4.0 (inherited from upstream)
- Finetuned from: Upstream RETFound DINOv2 checkpoint (ViT-L/14)
Architecture (DINOv2 ViT-L/14 @ 224):
- `hidden_size=1024`, `num_hidden_layers=24`, `num_attention_heads=16`, `mlp_ratio=4`
- `patch_size=14`, `image_size=224`, `num_channels=3`
- `hidden_act="gelu"`, `qkv_bias=True`, `layer_norm_eps=1e-6`
- `use_swiglu_ffn=False`, `use_mask_token=False`
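These values can be read back from the exported `config.json`; a minimal sanity check, assuming the fork's repo id used later in this card:

```python
from transformers import AutoConfig

# Quick check of the exported configuration values listed above.
config = AutoConfig.from_pretrained("iszt/RETFound_dinov2_meh")
print(config.hidden_size, config.num_hidden_layers, config.num_attention_heads)
print(config.patch_size, config.image_size, config.layer_norm_eps)
```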
Conversion notes:
- Dropped DINOv2-only tensors: teacher components, momentum updates
- Remapped fused qkv weights (timm-style) → separate Q/K/V matrices (Transformers style)
- Set `layer_norm_eps=1e-6` to match timm numerics
- Positional embeddings sized for 224×224 inputs (patch size 14×14)
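The fused-to-separate qkv remap is essentially a split of timm's stacked projection weights. A minimal sketch of the idea (illustrative tensor names, not the exact checkpoint keys used during conversion):

```python
import torch

def split_fused_qkv(qkv_weight: torch.Tensor, qkv_bias: torch.Tensor):
    """Split a timm-style fused qkv projection into separate Q/K/V tensors.

    timm stacks the three projections along dim 0, so the weight has shape
    [3 * hidden_size, hidden_size] and the bias [3 * hidden_size].
    """
    q_w, k_w, v_w = qkv_weight.chunk(3, dim=0)
    q_b, k_b, v_b = qkv_bias.chunk(3, dim=0)
    return (q_w, q_b), (k_w, k_b), (v_w, v_b)

# Example with the ViT-L hidden size used here (1024):
fused_w = torch.randn(3 * 1024, 1024)
fused_b = torch.randn(3 * 1024)
(q_w, q_b), _, _ = split_fused_qkv(fused_w, fused_b)
print(q_w.shape)  # torch.Size([1024, 1024])
```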
Model Sources
- Repository (upstream): https://github.com/rmaphoh/RETFound
- Paper: https://www.nature.com/articles/s41586-023-06555-x
Uses
Direct Use
- Feature extraction from retinal images for downstream tasks
- Initial encoder for transfer learning on medical imaging research tasks (e.g., classification, retrieval)
Downstream Use
- Fine-tuning for image classification and related tasks via `AutoModelForImageClassification`
- Using the CLS token or pooled features in custom pipelines
Out-of-Scope Use
- Clinical decision-making without proper validation and regulatory approval
- Commercial use beyond the CC BY-NC 4.0 license terms
Bias, Risks, and Limitations
- Trained on specific retinal data (subset of MEH AlzEye); distribution shifts (device, population, protocol) can degrade performance.
- Not a medical device; requires independent validation before any real-world or clinical deployment.
- Potential biases relate to dataset composition, imaging hardware, and labeling procedures.
Recommendations
- Perform task- and population-specific validation.
- Monitor for domain shift; consider domain adaptation where appropriate.
- Document preprocessing and augmentation pipelines for reproducibility.
How to Get Started with the Model
Feature extraction (encoder)
```python
from transformers import AutoModel, AutoImageProcessor
from PIL import Image
import torch

repo = "iszt/RETFound_dinov2_meh"  # this fork
processor = AutoImageProcessor.from_pretrained(repo)
model = AutoModel.from_pretrained(repo)
model.eval()

img = Image.open("example_retina_cfp.jpg").convert("RGB")
inputs = processor(images=img, return_tensors="pt")

with torch.no_grad():
    out = model(**inputs)

cls = out.last_hidden_state[:, 0]         # [B, 1024] — CLS embedding after final norm
tokens = out.last_hidden_state[:, 1:, :]  # [B, N, 1024] — patch tokens
```
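Continuing the example above, if you prefer a global descriptor other than the CLS embedding, one common alternative (an illustrative choice, not part of the upstream recipe) is to mean-pool the patch tokens:

```python
import torch.nn.functional as F

# Mean-pool the patch tokens as an alternative global descriptor,
# then L2-normalize for cosine-similarity retrieval.
mean_pooled = tokens.mean(dim=1)               # [B, 1024]
mean_pooled = F.normalize(mean_pooled, dim=-1)
```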
Classification fine-tuning (use `AutoModelForImageClassification`)
```python
from transformers import AutoConfig, AutoImageProcessor, AutoModelForImageClassification

repo = "iszt/RETFound_dinov2_meh"
id2label = {0: "negative", 1: "positive"}  # example
label2id = {v: k for k, v in id2label.items()}

processor = AutoImageProcessor.from_pretrained(repo)
config = AutoConfig.from_pretrained(repo)
config.num_labels = len(id2label)
config.id2label = id2label
config.label2id = label2id

# Loads encoder weights from the repo and initializes a fresh classifier head
model = AutoModelForImageClassification.from_pretrained(
    repo,
    config=config,
    ignore_mismatched_sizes=True,  # replaces the classification head if shapes differ
)

# now train `model` with your dataloader/Trainer
```
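A minimal training-loop sketch with the 🤗 `Trainer`, assuming a hypothetical `train_ds` that yields dicts with `pixel_values` and `labels`; the hyperparameters are placeholders, not recommended settings:

```python
import torch
from transformers import Trainer, TrainingArguments

def collate_fn(batch):
    # Stack preprocessed images and integer labels into a model-ready batch.
    return {
        "pixel_values": torch.stack([example["pixel_values"] for example in batch]),
        "labels": torch.tensor([example["labels"] for example in batch]),
    }

args = TrainingArguments(
    output_dir="retfound-cfp-finetune",  # placeholder path
    per_device_train_batch_size=16,
    learning_rate=5e-5,
    num_train_epochs=5,
)

trainer = Trainer(
    model=model,             # from the snippet above
    args=args,
    train_dataset=train_ds,  # hypothetical dataset
    data_collator=collate_fn,
)
trainer.train()
```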
Training Details
Training Data
- Upstream pretraining: retinal CFP from a portion of MEH AlzEye.
Training Procedure
- Objective: DINOv2 self-supervised pretraining.
- This fork: no additional training; checkpoint conversion only.
Preprocessing
An `AutoImageProcessor` is provided for 224×224 inputs. If your dataset uses a different normalization or resolution, adjust accordingly (and, if needed, interpolate the positional embeddings).
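As one example, a custom training pipeline can reuse the processor's normalization statistics; a sketch assuming torchvision is available, with placeholder augmentations:

```python
from torchvision import transforms
from transformers import AutoImageProcessor

processor = AutoImageProcessor.from_pretrained("iszt/RETFound_dinov2_meh")

# Reuse the exported normalization statistics in a custom torchvision pipeline.
train_transform = transforms.Compose([
    transforms.Resize(256),
    transforms.CenterCrop(224),         # keep 224×224 unless you also interpolate positional embeddings
    transforms.RandomHorizontalFlip(),  # placeholder augmentation
    transforms.ToTensor(),
    transforms.Normalize(mean=processor.image_mean, std=processor.image_std),
])
```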
Training Hyperparameters
- Not specified by upstream for this exact subset; see the paper and repository for general DINOv2 settings.
Speeds, Sizes, Times
- This fork only performs conversion; refer to upstream for compute details.
Evaluation
Testing Data, Factors & Metrics
- No new evaluation performed in this fork.
- For downstream tasks, report metrics relevant to the task (e.g., AUROC, accuracy, F1), and stratify by pertinent factors (device, demographics, pathology prevalence).
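An illustrative metric computation with scikit-learn (toy arrays stand in for the outputs of your evaluation loop):

```python
from sklearn.metrics import roc_auc_score, accuracy_score, f1_score

# Toy values standing in for a real evaluation loop's outputs.
y_true = [0, 1, 1, 0, 1]
y_score = [0.20, 0.85, 0.60, 0.30, 0.90]       # positive-class probabilities
y_pred = [int(score >= 0.5) for score in y_score]

print("AUROC:", roc_auc_score(y_true, y_score))
print("Accuracy:", accuracy_score(y_true, y_pred))
print("F1:", f1_score(y_true, y_pred))
```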
Results
- N/A for this fork; please cite/consult upstream results for baseline pretraining performance.
Summary
- Use this encoder as initialization; measure and report results on your target dataset.
Environmental Impact
This repository performs a format conversion only. Upstream pretraining compute and emissions are described in the paper and may be estimated via tools like the ML CO2 calculator.
- Hardware Type: N/A (conversion only)
- Hours used: N/A (conversion only)
- Cloud Provider / Region: N/A
- Carbon Emitted: N/A
Technical Specifications
Model Architecture and Objective
- Architecture: DINOv2 Vision Transformer Large, patch size 14, image size 224.
- Configuration: Uses `Dinov2Config` with standard GELU activation, no SwiGLU FFN, no mask token.
- Objective: DINOv2 self-supervised pretraining (only the encoder is kept in this fork).
- Pooling: No pooling layer (use CLS token or custom pooling).
Compute Infrastructure
- This fork does not introduce new training; conversion was done locally.
Hardware
- N/A for conversion.
Software
- Conversion used PyTorch, timm, and 🤗 Transformers.
Citation
If you use this model, please cite the original RETFound paper:
BibTeX:
@article{zhou2023foundation,
title={A foundation model for generalizable disease detection from retinal images},
author={Zhou, Yukun and Chia, Mark A and Wagner, Siegfried K and Ayhan, Murat S and Williamson, Dominic J and Struyven, Robbert R and Liu, Timing and Xu, Moucheng and Lozano, Mateo G and Woodward-Court, Peter and others},
journal={Nature},
volume={622},
number={7981},
pages={156--163},
year={2023},
publisher={Nature Publishing Group UK London}
}
APA: Zhou, Y., Chia, M. A., Wagner, S. K., Ayhan, M. S., Williamson, D. J., Struyven, R. R., et al. (2023). A foundation model for generalizable disease detection from retinal images. Nature, 622(7981), 156–163.
Glossary
- CFP: Color Fundus Photography
- DINOv2: Self-supervised Vision Transformer training method
- CLS token: Special token prepended to the patch sequence in ViT; often used as a global image representation.
More Information
- Upstream code and instructions: https://github.com/rmaphoh/RETFound
- Nature paper: https://www.nature.com/articles/s41586-023-06555-x
Model Card Authors
- Dávid Isztl (fork & conversion)
Model Card Contact
- For this fork/conversion: contact Dávid Isztl via Hugging Face.
- For upstream model/training code: ykzhoua@gmail.com or yukun.zhou.19@ucl.ac.uk.