Regression CLIP - with strong typographic robustness!

  • Fine-tuned using CLS-Patch Linear Regression teachers
  • This model: Strong robustness to typographic attacks, good generalization
  • Check the benchmarks below - or read the 📄 Latent Crossroads paper
  • ➕
  • New full-auto CLIP-fine-tune suite, (almost) config-free & super fast:
  • Get the code: 👉 github.com/zer0int/CLIP-fine-tune
  • Dataset heuristics (automatically infers your dataset, whether local or on HuggingFace)
  • Loads HuggingFace models, pickles, state dicts / local safetensors, ...
  • Geometry analysis tools: get human-language answers to 'what went wrong', if it did

Love ❤️ this CLIP?

ᐅ Buy me a coffee on Ko-Fi ☕

Or send 🪙₿ BTC to this address:
3PscBrWYvrutXedLmvpcnQbE12Py8qLqMK


📊 Standard Benchmark Evaluation

🌟 = This Model

Zero-Shot (Typographic Attack)

| Task / Dataset | Metric | pretrained | 🌟 regr-norm | regr-brut |
|---|---|---|---|---|
| SCAM::NoSCAM | acc | 0.9905 | 0.9897 | 0.9897 |
| SCAM::SCAM | acc | 0.4191 | 0.8046 | 0.8830 |
| SCAM::SynthSCAM | acc | 0.3227 | 0.8029 | 0.8804 |
| RTA100 | acc | 0.4330 | 0.7880 | 0.8930 |
👉 To reproduce: SCAM typographic attack benchmark code ⚡💻

```python
from datasets import load_dataset
from transformers import CLIPModel, CLIPProcessor
import torch
from tqdm import tqdm

device = "cuda" if torch.cuda.is_available() else "cpu"

# BLISS / SCAM Typographic Attack Dataset
# https://huggingface.co/datasets/BLISS-e-V/SCAM
ds = load_dataset("BLISS-e-V/SCAM", split="train")

# Benchmark the pre-trained model against my fine-tunes
model_variants = [
    ("pretrained", "openai/clip-vit-large-patch14-336", "openai/clip-vit-large-patch14-336"),
    ("regr-norm", "zer0int/CLIP-Regression-ViT-L-14", "zer0int/CLIP-Regression-ViT-L-14"),
    ("regr-brut", "zer0int/CLIP-Regression-BRUT-ViT-L-14", "zer0int/CLIP-Regression-BRUT-ViT-L-14"),
]

models = {}
for name, model_path, processor_path in model_variants:
    model = CLIPModel.from_pretrained(model_path).to(device).float().eval()
    processor = CLIPProcessor.from_pretrained(processor_path)
    models[name] = (model, processor)

for variant in ["NoSCAM", "SCAM", "SynthSCAM"]:
    print(f"\n=== Evaluating var.: {variant} ===")
    idxs = [i for i, v in enumerate(ds['id']) if v.startswith(variant)]
    if not idxs:
        print(f"  No samples for {variant}")
        continue
    subset = [ds[i] for i in idxs]

    for model_name, (model, processor) in models.items():
        results = []
        for entry in tqdm(subset, desc=f"{model_name}", ncols=30, bar_format="{l_bar}{bar}| {n_fmt}/{total_fmt} |"):
            img = entry['image']
            object_label = entry['object_label']
            attack_word = entry['attack_word']

            # Binary zero-shot choice: true object label vs. the typographic attack word
            texts = [f"a photo of a {object_label}", f"a photo of a {attack_word}"]
            inputs = processor(
                text=texts,
                images=img,
                return_tensors="pt",
                padding=True
            )
            for k in inputs:
                if isinstance(inputs[k], torch.Tensor):
                    inputs[k] = inputs[k].to(device)

            with torch.no_grad():
                outputs = model(**inputs)
                image_features = outputs.image_embeds
                text_features = outputs.text_embeds

                logits = image_features @ text_features.T
                probs = logits.softmax(dim=-1).cpu().numpy().flatten()
                pred_idx = probs.argmax()
                pred_label = [object_label, attack_word][pred_idx]
                is_correct = (pred_label == object_label)

            results.append({
                "id": entry['id'],
                "object_label": object_label,
                "attack_word": attack_word,
                "pred_label": pred_label,
                "is_correct": is_correct,
                "type": entry['type'],
                "model": model_name
            })

        n_total = len(results)
        n_correct = sum(r['is_correct'] for r in results)
        acc = n_correct / n_total if n_total else float('nan')
        print(f"| > > > > Zero-shot accuracy for {variant}, {model_name}: {n_correct}/{n_total} = {acc:.4f}")
```

Zero-Shot (CLIP Benchmark)

| Task / Dataset | Metric | pretrained | 🌟 regr-norm | regr-brut |
|---|---|---|---|---|
| VOC-2007 multilabel | Zero-Shot acc | 0.7615 | 0.8523 | 0.8350 |
| ImageNet-1k (train) | Zero-Shot acc@1 | 0.3270 | 0.4566 | 0.4100 |
| ImageNet-1k (train) | Zero-Shot acc@5 | 0.5300 | 0.6817 | 0.6513 |
| ImageNet-1k (train) | Zero-Shot mean per-class recall | 0.3261 | 0.4547 | 0.4078 |

Retrieval (CLIP Benchmark)

| Dataset | Metric | pretrained | 🌟 regr-norm | regr-brut |
|---|---|---|---|---|
| MSCOCO Captions (COCO 2014 val) | image retrieval R@5 | 0.2196 | 0.3510 | 0.3308 |
| MSCOCO Captions (COCO 2014 val) | text retrieval R@5 | 0.3032 | 0.5042 | 0.4758 |
| XM3600 | image retrieval R@5 | 0.3059 | 0.4254 | 0.4138 |
| XM3600 | text retrieval R@5 | 0.2429 | 0.4091 | 0.3874 |

Retrieval (MSCOCO Captions, COCO 2014 val) — own scripts

| Task | Metric | pretrained | 🌟 regr-norm | regr-brut |
|---|---|---|---|---|
| Image-to-Text (I2T) | R@1 | 0.3366 | 0.3748 | 0.3508 |
| Image-to-Text (I2T) | R@5 | 0.7882 | 0.8706 | 0.8502 |
| Text-to-Image (T2I) | R@1 | 0.2153 | 0.3264 | 0.3184 |
| Text-to-Image (T2I) | R@5 | 0.5902 | 0.7851 | 0.7821 |
| Text-to-Text (T2T) | R@1 | 0.2064 | 0.2423 | 0.2359 |
| Text-to-Text (T2T) | R@5 | 0.5516 | 0.6175 | 0.6130 |
| Text-to-Text (T2T_IMG) | R@1 | 0.3120 | 0.3506 | 0.3275 |
| Text-to-Text (T2T_IMG) | R@5 | 0.7466 | 0.8386 | 0.8179 |
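For reference, R@K is the fraction of queries whose ground-truth match appears among the top-K retrieved candidates. A minimal sketch of that computation (random stand-in embeddings instead of real CLIP features; `recall_at_k` is my own illustrative helper, not one of the repo's scripts):

```python
import numpy as np

def recall_at_k(sim, k):
    """Fraction of queries whose true match (index i for query i)
    appears among the k highest-similarity candidates."""
    top_k = np.argsort(-sim, axis=1)[:, :k]
    return float(np.mean([i in top_k[i] for i in range(sim.shape[0])]))

rng = np.random.default_rng(0)
# Stand-ins for L2-normalized image/text embeddings (1 caption per image)
img = rng.normal(size=(100, 32))
img /= np.linalg.norm(img, axis=1, keepdims=True)
txt = img + 0.1 * rng.normal(size=img.shape)  # paired captions: noisy copies
txt /= np.linalg.norm(txt, axis=1, keepdims=True)

print(recall_at_k(img @ txt.T, 5))  # text retrieval (image queries)
print(recall_at_k(txt @ img.T, 5))  # image retrieval (text queries)
```

With real embeddings, the only change is swapping the random matrices for the model's `image_embeds` / `text_embeds` over the COCO validation set.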

Retrieval (SugarCrepe, COCO 2017 val) — own scripts

| Split | Metric | pretrained | 🌟 regr-norm | regr-brut |
|---|---|---|---|---|
| add_obj | acc | 0.7842 | 0.9627 | 0.9515 |
| add_att | acc | 0.7168 | 0.9205 | 0.8743 |
| replace_obj | acc | 0.9407 | 0.9752 | 0.9740 |
| replace_att | acc | 0.7919 | 0.8579 | 0.8388 |
| replace_rel | acc | 0.6529 | 0.7752 | 0.7696 |
| swap_obj | acc | 0.6041 | 0.7224 | 0.6898 |
| swap_att | acc | 0.6261 | 0.7282 | 0.7102 |
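SugarCrepe counts a sample as correct when the image is more similar to the true caption than to a minimally edited hard negative (e.g. with an added or swapped object/attribute). A minimal sketch of that criterion, with dummy embeddings in place of real CLIP features (`sugarcrepe_accuracy` is my own illustrative name):

```python
import numpy as np

def sugarcrepe_accuracy(img_emb, pos_emb, neg_emb):
    """Fraction of (image, positive, hard-negative) triplets where the
    positive caption scores higher; embeddings assumed L2-normalized."""
    pos_sim = np.sum(img_emb * pos_emb, axis=1)  # row-wise cosine similarity
    neg_sim = np.sum(img_emb * neg_emb, axis=1)
    return float(np.mean(pos_sim > neg_sim))

rng = np.random.default_rng(0)
img = rng.normal(size=(8, 4))
img /= np.linalg.norm(img, axis=1, keepdims=True)
pos = img                          # perfectly aligned positives
neg = rng.normal(size=(8, 4))      # unrelated hard negatives
neg /= np.linalg.norm(neg, axis=1, keepdims=True)
print(sugarcrepe_accuracy(img, pos, neg))
```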

Linear Probe (ImageNet-1k) — own scripts

| Metric | pretrained | 🌟 regr-norm | regr-brut |
|---|---|---|---|
| Linear Probe Top-1 (%) | 72.35 | 70.94 | 65.09 |
| Linear Probe Top-5 (%) | 93.42 | 93.29 | 89.60 |
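A linear probe trains only a linear classifier on top of frozen image embeddings. A minimal sketch of the setup, using scikit-learn's LogisticRegression on synthetic blobs as a stand-in for CLIP ViT-L/14 features and ImageNet labels (hyperparameters here are illustrative, not those used for the table above):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n_classes, dim, n = 10, 64, 2000

# Stand-ins for frozen CLIP image embeddings: one Gaussian blob per class
centers = rng.normal(size=(n_classes, dim))
labels = rng.integers(0, n_classes, size=n)
feats = centers[labels] + 0.5 * rng.normal(size=(n, dim))

# The probe itself: a single linear layer fit on frozen features
probe = LogisticRegression(max_iter=1000).fit(feats[:1500], labels[:1500])
top1 = probe.score(feats[1500:], labels[1500:])
print(f"linear-probe top-1: {top1:.4f}")
```

With real data, `feats` would be the model's `image_embeds` precomputed over ImageNet-1k, and top-5 accuracy would come from `probe.predict_proba` rankings.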

🔗 Note: 'own scripts' available at github.com/zer0int/CLIP-fine-tune


🎯 Special Evaluation

Please see the paper for more information!

Zero-Shot Accuracy

| Dataset (n) | Method | pretrained | 🌟 regr-norm | regr-brut |
|---|---|---|---|---|
| NoSCAM (1162) | CLS | 0.9905 | 0.9897 | 0.9897 |
| NoSCAM (1162) | CLS-PATCHSUB | 0.9544 | 0.9845 | 0.9811 |
| NoSCAM (1162) | CLS-PATCHREG-I | 0.9466 | 0.9888 | 0.9888 |
| NoSCAM (1162) | CLS-PATCHREG-N | 0.9871 | 0.9897 | 0.9888 |
| NoSCAM (1162) | REG-L23-NOPC | 0.9380 | 0.9613 | 0.9570 |
| NoSCAM (1162) | REG-L23-1PC | 0.9630 | 0.9802 | 0.9802 |
| NoSCAM (1162) | REG-L23-8PC | 0.9509 | 0.9664 | 0.9604 |
| NoSCAM (1162) | PATCH-L23 | 0.7349 | 0.9725 | 0.9716 |
| NoSCAM (1162) | PATCHΔ | 0.9690 | 0.9905 | 0.9888 |
| SCAM (1162) | CLS | 0.4182 | 0.8038 | 0.8830 |
| SCAM (1162) | CLS-PATCHSUB | 0.4957 | 0.8632 | 0.9002 |
| SCAM (1162) | CLS-PATCHREG-I | 0.8761 | 0.8537 | 0.9174 |
| SCAM (1162) | CLS-PATCHREG-N | 0.9286 | 0.8537 | 0.9165 |
| SCAM (1162) | REG-L23-NOPC | 0.7410 | 0.8244 | 0.7719 |
| SCAM (1162) | REG-L23-1PC | 0.7539 | 0.8726 | 0.7943 |
| SCAM (1162) | REG-L23-8PC | 0.7057 | 0.8038 | 0.7143 |
| SCAM (1162) | PATCH-L23 | 0.6024 | 0.7470 | 0.8623 |
| SCAM (1162) | PATCHΔ | 0.8778 | 0.8451 | 0.8744 |
| SynthSCAM (1162) | CLS | 0.3219 | 0.8021 | 0.8804 |
| SynthSCAM (1162) | CLS-PATCHSUB | 0.4406 | 0.8580 | 0.9071 |
| SynthSCAM (1162) | CLS-PATCHREG-I | 0.8890 | 0.8460 | 0.9200 |
| SynthSCAM (1162) | CLS-PATCHREG-N | 0.9449 | 0.8494 | 0.9200 |
| SynthSCAM (1162) | REG-L23-NOPC | 0.7823 | 0.8382 | 0.7771 |
| SynthSCAM (1162) | REG-L23-1PC | 0.8055 | 0.8812 | 0.8072 |
| SynthSCAM (1162) | REG-L23-8PC | 0.7289 | 0.8167 | 0.7126 |
| SynthSCAM (1162) | PATCH-L23 | 0.6317 | 0.7470 | 0.8632 |
| SynthSCAM (1162) | PATCHΔ | 0.9217 | 0.8614 | 0.8769 |
| MVT (200382) | CLS | 0.8830 | 0.8730 | 0.8573 |
| MVT (200382) | CLS-PATCHSUB | 0.4720 | 0.8246 | 0.8057 |
| MVT (200382) | CLS-PATCHREG-I | 0.7166 | 0.8703 | 0.8518 |
| MVT (200382) | CLS-PATCHREG-N | 0.5695 | 0.8675 | 0.8478 |
| MVT (200382) | REG-L23-NOPC | 0.7640 | 0.7935 | 0.7680 |
| MVT (200382) | REG-L23-1PC | 0.7921 | 0.8193 | 0.8032 |
| MVT (200382) | REG-L23-8PC | 0.7724 | 0.8057 | 0.7812 |
| MVT (200382) | PATCH-L23 | 0.3414 | 0.8652 | 0.8191 |
| MVT (200382) | PATCHΔ | 0.6881 | 0.8667 | 0.8510 |