# CLIP Prefix Caption Model - COCO
This model generates image captions by projecting CLIP image embeddings into a prefix that conditions a GPT-2 language model.
## Model Details
- Model Type: CLIP Prefix Caption
- Dataset: COCO
- Prefix Length: 10
- CLIP Model: ViT-B/32
- Language Model: GPT-2
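
The checkpoint contains only the model weights, so a `ClipCaptionModel` class matching the training architecture must be defined before loading it. The following is a minimal sketch of a ClipCap-style model, assuming the common design of an MLP that maps a single CLIP embedding to `prefix_length` GPT-2 token embeddings; the attribute names (e.g. `clip_project`) and layer sizes are assumptions and must match the training notebook exactly for the weights to load.

```python
import torch
import torch.nn as nn
from transformers import GPT2LMHeadModel


class ClipCaptionModel(nn.Module):
    """Sketch of a ClipCap-style model: CLIP embedding -> GPT-2 prefix.

    Layer names and sizes are assumptions; they must match the training code.
    """

    def __init__(self, prefix_length: int = 10, clip_dim: int = 512):
        super().__init__()
        self.prefix_length = prefix_length
        self.gpt = GPT2LMHeadModel.from_pretrained("gpt2")
        gpt_dim = self.gpt.transformer.wte.weight.shape[1]  # 768 for GPT-2
        # MLP projecting one CLIP embedding into `prefix_length` GPT-2 embeddings
        self.clip_project = nn.Sequential(
            nn.Linear(clip_dim, (gpt_dim * prefix_length) // 2),
            nn.Tanh(),
            nn.Linear((gpt_dim * prefix_length) // 2, gpt_dim * prefix_length),
        )

    def forward(self, clip_embed: torch.Tensor, tokens: torch.Tensor):
        # Build the prefix from CLIP features, then prepend it to the token embeddings
        gpt_dim = self.gpt.transformer.wte.weight.shape[1]
        prefix = self.clip_project(clip_embed).view(-1, self.prefix_length, gpt_dim)
        token_embeds = self.gpt.transformer.wte(tokens)
        inputs = torch.cat((prefix, token_embeds), dim=1)
        return self.gpt(inputs_embeds=inputs)
```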
## Usage
```python
from huggingface_hub import hf_hub_download
import torch
from transformers import GPT2Tokenizer, GPT2LMHeadModel
import clip

# Download the trained checkpoint from the Hub
checkpoint_path = hf_hub_download(
    repo_id="Hamza66628/clip-prefix-caption-coco",
    filename="model.pt",
)
checkpoint = torch.load(checkpoint_path, map_location="cpu")

# Initialize the model with the same architecture used for training
model = ClipCaptionModel(prefix_length=10)
model.load_state_dict(checkpoint, strict=False)
model.eval()

# Generate a caption
# (see the full usage example in the notebook)
```
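
The full generation loop lives in the notebook; as an illustration, the sketch below encodes an image with CLIP, projects the embedding through the model's prefix mapping, and decodes a caption greedily with GPT-2. It assumes the `ClipCaptionModel` sketch and the imports from the snippets above, and the image path is a placeholder.

```python
from PIL import Image

device = "cpu"

# Load CLIP for image encoding and the GPT-2 tokenizer for decoding
clip_model, preprocess = clip.load("ViT-B/32", device=device)
tokenizer = GPT2Tokenizer.from_pretrained("gpt2")

# Encode an image (placeholder path) into a CLIP embedding
image = preprocess(Image.open("example.jpg")).unsqueeze(0).to(device)
with torch.no_grad():
    clip_embed = clip_model.encode_image(image).float()

    # Project the CLIP embedding into a GPT-2 prefix and decode greedily
    generated = model.clip_project(clip_embed).view(1, 10, -1)
    tokens = []
    for _ in range(30):  # maximum caption length
        outputs = model.gpt(inputs_embeds=generated)
        next_token = outputs.logits[:, -1, :].argmax(dim=-1)
        if next_token.item() == tokenizer.eos_token_id:
            break
        tokens.append(next_token.item())
        next_embed = model.gpt.transformer.wte(next_token).unsqueeze(1)
        generated = torch.cat((generated, next_embed), dim=1)

print(tokenizer.decode(tokens))
```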
## Citation
If you use this model, please cite the original CLIP Prefix Caption paper.