RETFound ViT-L/14 (DINOv2 → Transformers) — MEH AlzEye
Author of this fork: Dávid Isztl
Upstream project: RETFound_dinov2_meh by Yukun Zhou et al.
Paper: A foundation model for generalizable disease detection from retinal images, Nature (2023)
This repository provides a Transformers-compatible export of the RETFound DINOv2 encoder trained on a subset of MEH AlzEye (retinal CFP).
It includes `config.json`, `model.safetensors`, and an `AutoImageProcessor`, so you can load it directly with 🤗 `AutoModel` / `AutoModelForImageClassification`.
Model Details
Model Description
This is a ViT-Large/14 encoder pretrained with the DINOv2 objective on retinal color fundus photographs (CFP).
This fork converts the original PyTorch .pth checkpoint into a standard 🤗 Transformers format and removes DINOv2-only components.
- Developed by (upstream): Yukun Zhou et al.
- Shared by (this fork): Dávid Isztl
- Model type: Vision Transformer (encoder only)
- License: CC BY-NC 4.0 (inherited from upstream)
- Finetuned from: Upstream RETFound DINOv2 checkpoint (ViT-L/14)
Architecture (DINOv2 ViT-L/14 @ 224):
- `hidden_size=1024`, `num_hidden_layers=24`, `num_attention_heads=16`, `mlp_ratio=4`
- `patch_size=14`, `image_size=224`, `num_channels=3`
- `hidden_act="gelu"`, `qkv_bias=True`, `layer_norm_eps=1e-6`
- `use_swiglu_ffn=False`, `use_mask_token=False`
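These values can be read back from the exported `config.json`; a minimal sanity check, assuming the fork's repo id used later in this card:

```python
from transformers import AutoConfig

# Quick check of the exported configuration values listed above.
config = AutoConfig.from_pretrained("iszt/RETFound_dinov2_meh")
print(config.hidden_size, config.num_hidden_layers, config.num_attention_heads)
print(config.patch_size, config.image_size, config.layer_norm_eps)
```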
Conversion notes:
- Dropped DINOv2-only tensors: teacher components, momentum updates
- Remapped fused qkv weights (timm-style) → separate Q/K/V matrices (Transformers style)
- Set `layer_norm_eps=1e-6` to match timm numerics
- Positional embeddings sized for 224×224 inputs (patch size 14×14)
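The fused-to-separate qkv remap is essentially a split of timm's stacked projection weights. A minimal sketch of the idea (illustrative tensor names, not the exact checkpoint keys used during conversion):

```python
import torch

def split_fused_qkv(qkv_weight: torch.Tensor, qkv_bias: torch.Tensor):
    """Split a timm-style fused qkv projection into separate Q/K/V tensors.

    timm stacks the three projections along dim 0, so the weight has shape
    [3 * hidden_size, hidden_size] and the bias [3 * hidden_size].
    """
    q_w, k_w, v_w = qkv_weight.chunk(3, dim=0)
    q_b, k_b, v_b = qkv_bias.chunk(3, dim=0)
    return (q_w, q_b), (k_w, k_b), (v_w, v_b)

# Example with the ViT-L hidden size used here (1024):
fused_w = torch.randn(3 * 1024, 1024)
fused_b = torch.randn(3 * 1024)
(q_w, q_b), _, _ = split_fused_qkv(fused_w, fused_b)
print(q_w.shape)  # torch.Size([1024, 1024])
```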
Model Sources
- Repository (upstream): https://github.com/rmaphoh/RETFound
- Paper: https://www.nature.com/articles/s41586-023-06555-x
Uses
Direct Use
- Feature extraction from retinal images for downstream tasks
- Initial encoder for transfer learning on medical imaging research tasks (e.g., classification, retrieval)
Downstream Use
- Fine-tuning for image classification and related tasks via `AutoModelForImageClassification`
- Using the CLS token or pooled features in custom pipelines
Out-of-Scope Use
- Clinical decision-making without proper validation and regulatory approval
- Commercial use beyond the CC BY-NC 4.0 license terms
Bias, Risks, and Limitations
- Trained on specific retinal data (subset of MEH AlzEye); distribution shifts (device, population, protocol) can degrade performance.
- Not a medical device; requires independent validation before any real-world or clinical deployment.
- Potential biases relate to dataset composition, imaging hardware, and labeling procedures.
Recommendations
- Perform task- and population-specific validation.
- Monitor for domain shift; consider domain adaptation where appropriate.
- Document preprocessing and augmentation pipelines for reproducibility.
How to Get Started with the Model
Feature extraction (encoder)
```python
from transformers import AutoModel, AutoImageProcessor
from PIL import Image
import torch

repo = "iszt/RETFound_dinov2_meh"  # this fork
processor = AutoImageProcessor.from_pretrained(repo)
model = AutoModel.from_pretrained(repo)
model.eval()

img = Image.open("example_retina_cfp.jpg").convert("RGB")
inputs = processor(images=img, return_tensors="pt")

with torch.no_grad():
    out = model(**inputs)

cls = out.last_hidden_state[:, 0]         # [B, 1024] — CLS embedding after final norm
tokens = out.last_hidden_state[:, 1:, :]  # [B, N, 1024] — patch tokens
```
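Continuing the example above, if you prefer a global descriptor other than the CLS embedding, one common alternative (an illustrative choice, not part of the upstream recipe) is to mean-pool the patch tokens:

```python
import torch.nn.functional as F

# Mean-pool the patch tokens as an alternative global descriptor,
# then L2-normalize for cosine-similarity retrieval.
mean_pooled = tokens.mean(dim=1)               # [B, 1024]
mean_pooled = F.normalize(mean_pooled, dim=-1)
```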
Classification fine-tuning (use `AutoModelForImageClassification`)
```python
from transformers import AutoConfig, AutoImageProcessor, AutoModelForImageClassification

repo = "iszt/RETFound_dinov2_meh"
id2label = {0: "negative", 1: "positive"}  # example
label2id = {v: k for k, v in id2label.items()}

processor = AutoImageProcessor.from_pretrained(repo)
config = AutoConfig.from_pretrained(repo)
config.num_labels = len(id2label)
config.id2label = id2label
config.label2id = label2id

# Loads encoder weights from the repo and initializes a fresh classifier head
model = AutoModelForImageClassification.from_pretrained(
    repo,
    config=config,
    ignore_mismatched_sizes=True,  # replaces the classification head if shapes differ
)

# now train `model` with your dataloader/Trainer
```
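A minimal training-loop sketch with the 🤗 `Trainer`, assuming a hypothetical `train_ds` that yields dicts with `pixel_values` and `labels`; the hyperparameters are placeholders, not recommended settings:

```python
import torch
from transformers import Trainer, TrainingArguments

def collate_fn(batch):
    # Stack preprocessed images and integer labels into a model-ready batch.
    return {
        "pixel_values": torch.stack([example["pixel_values"] for example in batch]),
        "labels": torch.tensor([example["labels"] for example in batch]),
    }

args = TrainingArguments(
    output_dir="retfound-cfp-finetune",  # placeholder path
    per_device_train_batch_size=16,
    learning_rate=5e-5,
    num_train_epochs=5,
)

trainer = Trainer(
    model=model,             # from the snippet above
    args=args,
    train_dataset=train_ds,  # hypothetical dataset
    data_collator=collate_fn,
)
trainer.train()
```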
Training Details
Training Data
- Upstream pretraining: retinal CFP from a portion of MEH AlzEye.
Training Procedure
- Objective: DINOv2 self-supervised pretraining.
- This fork: no additional training; checkpoint conversion only.
Preprocessing
An `AutoImageProcessor` is provided for 224×224 inputs. If your dataset uses a different normalization or resolution, adjust accordingly (and, if needed, interpolate the positional embeddings).
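As one example, a custom training pipeline can reuse the processor's normalization statistics; a sketch assuming torchvision is available, with placeholder augmentations:

```python
from torchvision import transforms
from transformers import AutoImageProcessor

processor = AutoImageProcessor.from_pretrained("iszt/RETFound_dinov2_meh")

# Reuse the exported normalization statistics in a custom torchvision pipeline.
train_transform = transforms.Compose([
    transforms.Resize(256),
    transforms.CenterCrop(224),         # keep 224×224 unless you also interpolate positional embeddings
    transforms.RandomHorizontalFlip(),  # placeholder augmentation
    transforms.ToTensor(),
    transforms.Normalize(mean=processor.image_mean, std=processor.image_std),
])
```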
Training Hyperparameters
- Not specified by upstream for this exact subset; see the paper and repository for general DINOv2 settings.
Speeds, Sizes, Times
- This fork only performs conversion; refer to upstream for compute details.
Evaluation
Testing Data, Factors & Metrics
- No new evaluation performed in this fork.
- For downstream tasks, report metrics relevant to the task (e.g., AUROC, accuracy, F1), and stratify by pertinent factors (device, demographics, pathology prevalence).
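An illustrative metric computation with scikit-learn (toy arrays stand in for the outputs of your evaluation loop):

```python
from sklearn.metrics import roc_auc_score, accuracy_score, f1_score

# Toy values standing in for a real evaluation loop's outputs.
y_true = [0, 1, 1, 0, 1]
y_score = [0.20, 0.85, 0.60, 0.30, 0.90]       # positive-class probabilities
y_pred = [int(score >= 0.5) for score in y_score]

print("AUROC:", roc_auc_score(y_true, y_score))
print("Accuracy:", accuracy_score(y_true, y_pred))
print("F1:", f1_score(y_true, y_pred))
```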
Results
- N/A for this fork; please cite/consult upstream results for baseline pretraining performance.
Summary
- Use this encoder as initialization; measure and report results on your target dataset.
Environmental Impact
This repository performs a format conversion only. Upstream pretraining compute and emissions are described in the paper and may be estimated via tools like the ML CO2 calculator.
- Hardware Type: N/A (conversion only)
- Hours used: N/A (conversion only)
- Cloud Provider / Region: N/A
- Carbon Emitted: N/A
Technical Specifications
Model Architecture and Objective
- Architecture: DINOv2 Vision Transformer Large, patch size 14, image size 224.
- Configuration: Uses `Dinov2Config` with standard GELU activation, no SwiGLU FFN, no mask token.
- Objective: DINOv2 self-supervised pretraining (only the encoder is kept in this fork).
- Pooling: No pooling layer (use CLS token or custom pooling).
Compute Infrastructure
- This fork does not introduce new training; conversion was done locally.
Hardware
- N/A for conversion.
Software
- Conversion used PyTorch, timm, and 🤗 Transformers.
Citation
If you use this model, please cite the original RETFound paper:
BibTeX:
@article{zhou2023foundation,
title={A foundation model for generalizable disease detection from retinal images},
author={Zhou, Yukun and Chia, Mark A and Wagner, Siegfried K and Ayhan, Murat S and Williamson, Dominic J and Struyven, Robbert R and Liu, Timing and Xu, Moucheng and Lozano, Mateo G and Woodward-Court, Peter and others},
journal={Nature},
volume={622},
number={7981},
pages={156--163},
year={2023},
publisher={Nature Publishing Group UK London}
}
APA: Zhou, Y., Chia, M. A., Wagner, S. K., Ayhan, M. S., Williamson, D. J., Struyven, R. R., et al. (2023). A foundation model for generalizable disease detection from retinal images. Nature, 622(7981), 156–163.
Glossary
- CFP: Color Fundus Photography
- DINOv2: Self-supervised Vision Transformer training method
- CLS token: Special token prepended to the patch sequence in ViT; often used as a global image representation.
More Information
- Upstream code and instructions: https://github.com/rmaphoh/RETFound
- Nature paper: https://www.nature.com/articles/s41586-023-06555-x
Model Card Authors
- Dávid Isztl (fork & conversion)
Model Card Contact
- For this fork/conversion: contact Dávid Isztl via Hugging Face.
- For upstream model/training code: ykzhoua@gmail.com or yukun.zhou.19@ucl.ac.uk.