Regression CLIP - with strong typographic robustness!

  • Fine-tuned using CLS-Patch Linear Regression teachers
  • This model: Strong robustness to typographic attacks, good generalization
  • Check the benchmarks below - or read the 📄 Latent Crossroads paper
  • ➕
  • New full-auto CLIP-fine-tune suite, (almost) config-free & super fast:
  • Get the code: 👉 github.com/zer0int/CLIP-fine-tune
  • Dataset heuristics (automatically infers your dataset, whether local or on HuggingFace)
  • Loads HuggingFace models, pickles, state dicts / local safetensors, ...
  • Geometry analysis tools: get human-language answers to 'what went wrong', if it did

Love ❤️ this CLIP?

ᐅ Buy me a coffee on Ko-Fi ☕

Or send 🪙₿ BTC to this address:
3PscBrWYvrutXedLmvpcnQbE12Py8qLqMK


📊 Standard Benchmark Evaluation

🌟 = This Model

Zero-Shot (Typographic Attack)

| Task / Dataset | Metric | pretrained | 🌟 regr-norm | regr-brut |
|---|---|---|---|---|
| SCAM::NoSCAM | acc | 0.9905 | 0.9897 | 0.9897 |
| SCAM::SCAM | acc | 0.4191 | 0.8046 | 0.8830 |
| SCAM::SynthSCAM | acc | 0.3227 | 0.8029 | 0.8804 |
| RTA100 | acc | 0.4330 | 0.7880 | 0.8930 |
👉 To reproduce: SCAM typographic attack benchmark code ⚡💻

```python
from datasets import load_dataset
from transformers import CLIPModel, CLIPProcessor
import torch
from tqdm import tqdm

device = "cuda" if torch.cuda.is_available() else "cpu"

# BLISS / SCAM Typographic Attack Dataset
# https://huggingface.co/datasets/BLISS-e-V/SCAM
ds = load_dataset("BLISS-e-V/SCAM", split="train")

# Benchmark the pre-trained model against my fine-tunes
model_variants = [
    ("pretrained", "openai/clip-vit-large-patch14-336", "openai/clip-vit-large-patch14-336"),
    ("regr-norm", "zer0int/CLIP-Regression-ViT-L-14", "zer0int/CLIP-Regression-ViT-L-14"),
    ("regr-brut", "zer0int/CLIP-Regression-BRUT-ViT-L-14", "zer0int/CLIP-Regression-BRUT-ViT-L-14"),
]

models = {}
for name, model_path, processor_path in model_variants:
    model = CLIPModel.from_pretrained(model_path).to(device).float().eval()
    processor = CLIPProcessor.from_pretrained(processor_path)
    models[name] = (model, processor)

for variant in ["NoSCAM", "SCAM", "SynthSCAM"]:
    print(f"\n=== Evaluating var.: {variant} ===")
    idxs = [i for i, v in enumerate(ds['id']) if v.startswith(variant)]
    if not idxs:
        print(f"  No samples for {variant}")
        continue
    subset = [ds[i] for i in idxs]

    for model_name, (model, processor) in models.items():
        results = []
        for entry in tqdm(subset, desc=f"{model_name}", ncols=30, bar_format="{l_bar}{bar}| {n_fmt}/{total_fmt} |"):
            img = entry['image']
            object_label = entry['object_label']
            attack_word = entry['attack_word']

            # Binary zero-shot choice: true object label vs. the typographic attack word
            texts = [f"a photo of a {object_label}", f"a photo of a {attack_word}"]
            inputs = processor(
                text=texts,
                images=img,
                return_tensors="pt",
                padding=True
            )
            for k in inputs:
                if isinstance(inputs[k], torch.Tensor):
                    inputs[k] = inputs[k].to(device)

            with torch.no_grad():
                outputs = model(**inputs)
                image_features = outputs.image_embeds
                text_features = outputs.text_embeds

                logits = image_features @ text_features.T
                probs = logits.softmax(dim=-1).cpu().numpy().flatten()
                pred_idx = probs.argmax()
                pred_label = [object_label, attack_word][pred_idx]
                is_correct = (pred_label == object_label)

            results.append({
                "id": entry['id'],
                "object_label": object_label,
                "attack_word": attack_word,
                "pred_label": pred_label,
                "is_correct": is_correct,
                "type": entry['type'],
                "model": model_name
            })

        n_total = len(results)
        n_correct = sum(r['is_correct'] for r in results)
        acc = n_correct / n_total if n_total else float('nan')
        print(f"| > > > > Zero-shot accuracy for {variant}, {model_name}: {n_correct}/{n_total} = {acc:.4f}")
```

Zero-Shot (CLIP Benchmark)

| Task / Dataset | Metric | pretrained | 🌟 regr-norm | regr-brut |
|---|---|---|---|---|
| VOC-2007 multilabel | Zero-Shot acc | 0.7615 | 0.8523 | 0.8350 |
| ImageNet-1k (train) | Zero-Shot acc@1 | 0.3270 | 0.4566 | 0.4100 |
| ImageNet-1k (train) | Zero-Shot acc@5 | 0.5300 | 0.6817 | 0.6513 |
| ImageNet-1k (train) | Zero-Shot mean per-class recall | 0.3261 | 0.4547 | 0.4078 |

Retrieval (CLIP Benchmark)

| Dataset | Metric | pretrained | 🌟 regr-norm | regr-brut |
|---|---|---|---|---|
| MSCOCO Captions (COCO 2014 val) | image retrieval R@5 | 0.2196 | 0.3510 | 0.3308 |
| MSCOCO Captions (COCO 2014 val) | text retrieval R@5 | 0.3032 | 0.5042 | 0.4758 |
| XM3600 | image retrieval R@5 | 0.3059 | 0.4254 | 0.4138 |
| XM3600 | text retrieval R@5 | 0.2429 | 0.4091 | 0.3874 |

Retrieval (MSCOCO Captions, COCO 2014 val) — own scripts

| Task | Metric | pretrained | 🌟 regr-norm | regr-brut |
|---|---|---|---|---|
| Image-to-Text (I2T) | R@1 | 0.3366 | 0.3748 | 0.3508 |
| Image-to-Text (I2T) | R@5 | 0.7882 | 0.8706 | 0.8502 |
| Text-to-Image (T2I) | R@1 | 0.2153 | 0.3264 | 0.3184 |
| Text-to-Image (T2I) | R@5 | 0.5902 | 0.7851 | 0.7821 |
| Text-to-Text (T2T) | R@1 | 0.2064 | 0.2423 | 0.2359 |
| Text-to-Text (T2T) | R@5 | 0.5516 | 0.6175 | 0.6130 |
| Text-to-Text (T2T_IMG) | R@1 | 0.3120 | 0.3506 | 0.3275 |
| Text-to-Text (T2T_IMG) | R@5 | 0.7466 | 0.8386 | 0.8179 |
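For reference, R@K is the fraction of queries whose ground-truth match appears among the top-K retrieved candidates. A minimal sketch of that computation (random stand-in embeddings instead of real CLIP features; `recall_at_k` is my own illustrative helper, not one of the repo's scripts):

```python
import numpy as np

def recall_at_k(sim, k):
    """Fraction of queries whose true match (index i for query i)
    appears among the k highest-similarity candidates."""
    top_k = np.argsort(-sim, axis=1)[:, :k]
    return float(np.mean([i in top_k[i] for i in range(sim.shape[0])]))

rng = np.random.default_rng(0)
# Stand-ins for L2-normalized image/text embeddings (1 caption per image)
img = rng.normal(size=(100, 32))
img /= np.linalg.norm(img, axis=1, keepdims=True)
txt = img + 0.1 * rng.normal(size=img.shape)  # paired captions: noisy copies
txt /= np.linalg.norm(txt, axis=1, keepdims=True)

print(recall_at_k(img @ txt.T, 5))  # text retrieval (image queries)
print(recall_at_k(txt @ img.T, 5))  # image retrieval (text queries)
```

With real embeddings, the only change is swapping the random matrices for the model's `image_embeds` / `text_embeds` over the COCO validation set.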

Retrieval (SugarCrepe, COCO 2017 val) — own scripts

| Split | Metric | pretrained | 🌟 regr-norm | regr-brut |
|---|---|---|---|---|
| add_obj | acc | 0.7842 | 0.9627 | 0.9515 |
| add_att | acc | 0.7168 | 0.9205 | 0.8743 |
| replace_obj | acc | 0.9407 | 0.9752 | 0.9740 |
| replace_att | acc | 0.7919 | 0.8579 | 0.8388 |
| replace_rel | acc | 0.6529 | 0.7752 | 0.7696 |
| swap_obj | acc | 0.6041 | 0.7224 | 0.6898 |
| swap_att | acc | 0.6261 | 0.7282 | 0.7102 |
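SugarCrepe counts a sample as correct when the image is more similar to the true caption than to a minimally edited hard negative (e.g. with an added or swapped object/attribute). A minimal sketch of that criterion, with dummy embeddings in place of real CLIP features (`sugarcrepe_accuracy` is my own illustrative name):

```python
import numpy as np

def sugarcrepe_accuracy(img_emb, pos_emb, neg_emb):
    """Fraction of (image, positive, hard-negative) triplets where the
    positive caption scores higher; embeddings assumed L2-normalized."""
    pos_sim = np.sum(img_emb * pos_emb, axis=1)  # row-wise cosine similarity
    neg_sim = np.sum(img_emb * neg_emb, axis=1)
    return float(np.mean(pos_sim > neg_sim))

rng = np.random.default_rng(0)
img = rng.normal(size=(8, 4))
img /= np.linalg.norm(img, axis=1, keepdims=True)
pos = img                          # perfectly aligned positives
neg = rng.normal(size=(8, 4))      # unrelated hard negatives
neg /= np.linalg.norm(neg, axis=1, keepdims=True)
print(sugarcrepe_accuracy(img, pos, neg))
```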

Linear Probe (ImageNet-1k) — own scripts

| Metric | pretrained | 🌟 regr-norm | regr-brut |
|---|---|---|---|
| Linear Probe Top-1 (%) | 72.35 | 70.94 | 65.09 |
| Linear Probe Top-5 (%) | 93.42 | 93.29 | 89.60 |
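A linear probe trains only a linear classifier on top of frozen image embeddings. A minimal sketch of the setup, using scikit-learn's LogisticRegression on synthetic blobs as a stand-in for CLIP ViT-L/14 features and ImageNet labels (hyperparameters here are illustrative, not those used for the table above):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n_classes, dim, n = 10, 64, 2000

# Stand-ins for frozen CLIP image embeddings: one Gaussian blob per class
centers = rng.normal(size=(n_classes, dim))
labels = rng.integers(0, n_classes, size=n)
feats = centers[labels] + 0.5 * rng.normal(size=(n, dim))

# The probe itself: a single linear layer fit on frozen features
probe = LogisticRegression(max_iter=1000).fit(feats[:1500], labels[:1500])
top1 = probe.score(feats[1500:], labels[1500:])
print(f"linear-probe top-1: {top1:.4f}")
```

With real data, `feats` would be the model's `image_embeds` precomputed over ImageNet-1k, and top-5 accuracy would come from `probe.predict_proba` rankings.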

🔗 Note: 'own scripts' available at github.com/zer0int/CLIP-fine-tune


🎯 Special Evaluation

Please see the paper for more information!

Zero-Shot Accuracy

| Dataset (n) | Method | pretrained | 🌟 regr-norm | regr-brut |
|---|---|---|---|---|
| NoSCAM (1162) | CLS | 0.9905 | 0.9897 | 0.9897 |
| NoSCAM (1162) | CLS-PATCHSUB | 0.9544 | 0.9845 | 0.9811 |
| NoSCAM (1162) | CLS-PATCHREG-I | 0.9466 | 0.9888 | 0.9888 |
| NoSCAM (1162) | CLS-PATCHREG-N | 0.9871 | 0.9897 | 0.9888 |
| NoSCAM (1162) | REG-L23-NOPC | 0.9380 | 0.9613 | 0.9570 |
| NoSCAM (1162) | REG-L23-1PC | 0.9630 | 0.9802 | 0.9802 |
| NoSCAM (1162) | REG-L23-8PC | 0.9509 | 0.9664 | 0.9604 |
| NoSCAM (1162) | PATCH-L23 | 0.7349 | 0.9725 | 0.9716 |
| NoSCAM (1162) | PATCHΔ | 0.9690 | 0.9905 | 0.9888 |
| SCAM (1162) | CLS | 0.4182 | 0.8038 | 0.8830 |
| SCAM (1162) | CLS-PATCHSUB | 0.4957 | 0.8632 | 0.9002 |
| SCAM (1162) | CLS-PATCHREG-I | 0.8761 | 0.8537 | 0.9174 |
| SCAM (1162) | CLS-PATCHREG-N | 0.9286 | 0.8537 | 0.9165 |
| SCAM (1162) | REG-L23-NOPC | 0.7410 | 0.8244 | 0.7719 |
| SCAM (1162) | REG-L23-1PC | 0.7539 | 0.8726 | 0.7943 |
| SCAM (1162) | REG-L23-8PC | 0.7057 | 0.8038 | 0.7143 |
| SCAM (1162) | PATCH-L23 | 0.6024 | 0.7470 | 0.8623 |
| SCAM (1162) | PATCHΔ | 0.8778 | 0.8451 | 0.8744 |
| SynthSCAM (1162) | CLS | 0.3219 | 0.8021 | 0.8804 |
| SynthSCAM (1162) | CLS-PATCHSUB | 0.4406 | 0.8580 | 0.9071 |
| SynthSCAM (1162) | CLS-PATCHREG-I | 0.8890 | 0.8460 | 0.9200 |
| SynthSCAM (1162) | CLS-PATCHREG-N | 0.9449 | 0.8494 | 0.9200 |
| SynthSCAM (1162) | REG-L23-NOPC | 0.7823 | 0.8382 | 0.7771 |
| SynthSCAM (1162) | REG-L23-1PC | 0.8055 | 0.8812 | 0.8072 |
| SynthSCAM (1162) | REG-L23-8PC | 0.7289 | 0.8167 | 0.7126 |
| SynthSCAM (1162) | PATCH-L23 | 0.6317 | 0.7470 | 0.8632 |
| SynthSCAM (1162) | PATCHΔ | 0.9217 | 0.8614 | 0.8769 |
| MVT (200382) | CLS | 0.8830 | 0.8730 | 0.8573 |
| MVT (200382) | CLS-PATCHSUB | 0.4720 | 0.8246 | 0.8057 |
| MVT (200382) | CLS-PATCHREG-I | 0.7166 | 0.8703 | 0.8518 |
| MVT (200382) | CLS-PATCHREG-N | 0.5695 | 0.8675 | 0.8478 |
| MVT (200382) | REG-L23-NOPC | 0.7640 | 0.7935 | 0.7680 |
| MVT (200382) | REG-L23-1PC | 0.7921 | 0.8193 | 0.8032 |
| MVT (200382) | REG-L23-8PC | 0.7724 | 0.8057 | 0.7812 |
| MVT (200382) | PATCH-L23 | 0.3414 | 0.8652 | 0.8191 |
| MVT (200382) | PATCHΔ | 0.6881 | 0.8667 | 0.8510 |