---
library_name: transformers
pipeline_tag: zero-shot-image-classification
license: cc-by-nc-4.0
tags:
- clip
- multilingual
---

# Model Card for Distilled MetaCLIP 2 ViT-B/32 (mT5 Tokenizer) (worldwide)

Distilled MetaCLIP 2 (worldwide) was presented in [MetaCLIP 2: A Worldwide Scaling Recipe](https://huggingface.co/papers/2507.22062).

This checkpoint corresponds to "ViT-B-32-mT5-worldwide" of the [original implementation](https://github.com/facebookresearch/MetaCLIP).

## Install

First, install the Transformers library (from source for now, since MetaCLIP 2 support is not yet in a stable release):

```bash
pip install -q git+https://github.com/huggingface/transformers.git
```
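
Optionally, you can check that a development build of Transformers is now installed (a quick sanity check, not required by the model itself):

```bash
python -c "import transformers; print(transformers.__version__)"
```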

## Usage

You can then run zero-shot image classification with the `pipeline` API:

```python
import torch
from transformers import pipeline

# Load the model as a zero-shot image classification pipeline
clip = pipeline(
    task="zero-shot-image-classification",
    model="facebook/metaclip-2-mt5-worldwide-b32",
    torch_dtype=torch.bfloat16,
    device=0,
)

# Candidate labels to score against the image
labels = ["a photo of a cat", "a photo of a dog", "a photo of a car"]

results = clip("http://images.cocodataset.org/val2017/000000039769.jpg", candidate_labels=labels)
print(results)
```
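
The pipeline returns one dict per candidate label with `label` and `score` keys, sorted by descending score, so you can print the scores directly:

```python
# Each entry pairs a candidate label with its score, from most to least likely
for result in results:
    print(f"{result['label']}: {result['score']:.3f}")
```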

If you want to perform pre- and post-processing yourself, you can use the `AutoModel` API:

```python
import requests
import torch
from PIL import Image
from transformers import AutoProcessor, AutoModel

# The loaded model should be an instance of `MetaClip2Model`
model = AutoModel.from_pretrained(
    "facebook/metaclip-2-mt5-worldwide-b32",
    torch_dtype=torch.bfloat16,
    attn_implementation="sdpa",
)
processor = AutoProcessor.from_pretrained("facebook/metaclip-2-mt5-worldwide-b32")

url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = Image.open(requests.get(url, stream=True).raw)
labels = ["a photo of a cat", "a photo of a dog", "a photo of a car"]

# Preprocess the text and image inputs together
inputs = processor(text=labels, images=image, return_tensors="pt", padding=True)

with torch.no_grad():
    outputs = model(**inputs)

# Image-text similarity scores, converted to probabilities over the labels
logits_per_image = outputs.logits_per_image
probs = logits_per_image.softmax(dim=1)
most_likely_idx = probs.argmax(dim=1).item()
most_likely_label = labels[most_likely_idx]
print(f"Most likely label: {most_likely_label} with probability: {probs[0][most_likely_idx].item():.3f}")
```
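
If you only need embeddings (for example, for image-text retrieval), CLIP-style models in Transformers typically expose `get_text_features` and `get_image_features`. A minimal sketch, assuming the MetaCLIP 2 model follows this convention:

```python
import requests
import torch
from PIL import Image
from transformers import AutoProcessor, AutoModel

# Assumes the checkpoint exposes CLIP-style feature extraction methods
model = AutoModel.from_pretrained("facebook/metaclip-2-mt5-worldwide-b32")
processor = AutoProcessor.from_pretrained("facebook/metaclip-2-mt5-worldwide-b32")

url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = Image.open(requests.get(url, stream=True).raw)

with torch.no_grad():
    # Project the image and a text query into the shared embedding space
    image_embeds = model.get_image_features(**processor(images=image, return_tensors="pt"))
    text_embeds = model.get_text_features(**processor(text=["a photo of a cat"], return_tensors="pt", padding=True))

# Normalize and compute the cosine similarity between the two embeddings
image_embeds = image_embeds / image_embeds.norm(dim=-1, keepdim=True)
text_embeds = text_embeds / text_embeds.norm(dim=-1, keepdim=True)
print((image_embeds @ text_embeds.T).item())
```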