---
library_name: transformers
pipeline_tag: zero-shot-image-classification
license: cc-by-nc-4.0
tags:
- clip
- multilingual
---
# Model Card for Distilled MetaCLIP 2 ViT-B/32 (mT5 Tokenizer) (worldwide)
Distilled MetaCLIP 2 (worldwide) was presented in [MetaCLIP 2: A Worldwide Scaling Recipe](https://huggingface.co/papers/2507.22062).
This checkpoint corresponds to "ViT-B-32-mT5-worldwide" of the [original implementation](https://github.com/facebookresearch/MetaCLIP).
## Install
First, install the Transformers library (from source for now, since support for this model has not yet landed in a stable release):
```bash
pip install -q git+https://github.com/huggingface/transformers.git
```
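To confirm that the development build is in place, you can check the installed version (development builds carry a `.dev0` suffix):
```python
import transformers

print(transformers.__version__)  # development builds end in ".dev0"
```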
## Usage
Next, you can use the model with the `pipeline` API:
```python
import torch
from transformers import pipeline

# load the model as a zero-shot image classification pipeline
clip = pipeline(
    task="zero-shot-image-classification",
    model="facebook/metaclip-2-mt5-worldwide-b32",
    torch_dtype=torch.bfloat16,
    device=0,  # first GPU; pass device="cpu" to run on CPU
)

# score an image (here fetched by URL) against the candidate labels
labels = ["a photo of a cat", "a photo of a dog", "a photo of a car"]
results = clip("http://images.cocodataset.org/val2017/000000039769.jpg", candidate_labels=labels)
print(results)
```
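Since the model was trained on worldwide data with an mT5 tokenizer, the candidate labels do not have to be in English. A minimal sketch (the non-English label strings below are illustrative translations of the labels above):
```python
import torch
from transformers import pipeline

clip = pipeline(
    task="zero-shot-image-classification",
    model="facebook/metaclip-2-mt5-worldwide-b32",
    torch_dtype=torch.bfloat16,
    device=0,
)

# candidate labels in Chinese, French, and Spanish
labels = ["一张猫的照片", "une photo d'un chien", "una foto de un coche"]
results = clip("http://images.cocodataset.org/val2017/000000039769.jpg", candidate_labels=labels)
print(results)
```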
If you want to perform pre- and post-processing yourself, you can use the `AutoModel` API:
```python
import requests
import torch
from PIL import Image
from transformers import AutoProcessor, AutoModel

# the loaded model should be an instance of `MetaClip2Model`
model = AutoModel.from_pretrained(
    "facebook/metaclip-2-mt5-worldwide-b32",
    torch_dtype=torch.bfloat16,
    attn_implementation="sdpa",
)
processor = AutoProcessor.from_pretrained("facebook/metaclip-2-mt5-worldwide-b32")

url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = Image.open(requests.get(url, stream=True).raw)

labels = ["a photo of a cat", "a photo of a dog", "a photo of a car"]
inputs = processor(text=labels, images=image, return_tensors="pt", padding=True)
# cast the pixel values to the model's dtype (bfloat16) to avoid a dtype mismatch
inputs["pixel_values"] = inputs["pixel_values"].to(model.dtype)

with torch.no_grad():
    outputs = model(**inputs)

# image-text similarity scores, turned into probabilities over the labels
logits_per_image = outputs.logits_per_image
probs = logits_per_image.softmax(dim=1)
most_likely_idx = probs.argmax(dim=1).item()
most_likely_label = labels[most_likely_idx]
print(f"Most likely label: {most_likely_label} with probability: {probs[0][most_likely_idx].item():.3f}")
```
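You can also use the model to compute image and text embeddings directly. A minimal sketch, assuming `MetaClip2Model` exposes the standard CLIP-style `get_image_features` and `get_text_features` methods that other CLIP variants in Transformers provide:
```python
import requests
import torch
from PIL import Image
from transformers import AutoProcessor, AutoModel

model = AutoModel.from_pretrained("facebook/metaclip-2-mt5-worldwide-b32")
processor = AutoProcessor.from_pretrained("facebook/metaclip-2-mt5-worldwide-b32")

url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = Image.open(requests.get(url, stream=True).raw)

with torch.no_grad():
    # assumed CLIP-style feature extraction methods
    image_embeds = model.get_image_features(**processor(images=image, return_tensors="pt"))
    text_embeds = model.get_text_features(**processor(text=["a photo of a cat"], return_tensors="pt", padding=True))

# L2-normalize so the dot product equals cosine similarity
image_embeds = image_embeds / image_embeds.norm(dim=-1, keepdim=True)
text_embeds = text_embeds / text_embeds.norm(dim=-1, keepdim=True)
print((image_embeds @ text_embeds.T).item())
```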