CLIP Prefix Caption - Conceptual Captions Model

An image captioning model based on CLIP and GPT-2, trained on the Conceptual Captions dataset.

Model Details

  • Model Type: CLIP Prefix Captioning
  • Architecture: CLIP Vision Encoder + MLP mapping network + GPT-2 Text Decoder (see the sketch after this list)
  • Dataset: Conceptual Captions
  • Prefix Length: 10 tokens
  • CLIP Model: ViT-B/32
  • GPT-2 Model: gpt2
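
For concreteness, here is a minimal sketch of the mapping network, assuming the MLP variant of ClipCap. The class name, attribute names, and layer sizes are illustrative (following the paper's description), not a copy of the released code; the default dimensions are 512 for ViT-B/32 image embeddings and 768 for gpt2 token embeddings.

```python
# A minimal sketch of the MLP mapping network (ClipCap "MLP" variant);
# names and layer sizes are illustrative, not the released code's.
import torch
import torch.nn as nn
from transformers import GPT2LMHeadModel


class ClipCaptionModel(nn.Module):
    def __init__(self, prefix_length: int = 10, clip_dim: int = 512):
        super().__init__()
        self.prefix_length = prefix_length
        self.gpt = GPT2LMHeadModel.from_pretrained("gpt2")
        self.gpt_dim = self.gpt.transformer.wte.weight.shape[1]  # 768 for gpt2
        # Projects one CLIP image embedding (512-d for ViT-B/32) to
        # prefix_length GPT-2 token embeddings that act as a learned prefix.
        self.clip_project = nn.Sequential(
            nn.Linear(clip_dim, (self.gpt_dim * prefix_length) // 2),
            nn.Tanh(),
            nn.Linear((self.gpt_dim * prefix_length) // 2,
                      self.gpt_dim * prefix_length),
        )

    def forward(self, clip_embed: torch.Tensor) -> torch.Tensor:
        # (batch, clip_dim) -> (batch, prefix_length, gpt_dim)
        return self.clip_project(clip_embed).view(-1, self.prefix_length, self.gpt_dim)
```

The MLP output is reshaped into 10 prefix embeddings; at training time these are prepended to the caption's token embeddings, and at inference time they alone seed generation.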

Usage

See the test notebook for complete, runnable usage examples; a minimal inference sketch follows.
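
For orientation only, the sketch below (untested) loads the checkpoint, encodes an image with CLIP ViT-B/32, and decodes a caption greedily. It assumes the ClipCaptionModel class sketched above, OpenAI's `clip` package, and a hypothetical image path; consult the notebook for the authoritative pipeline.

```python
# Minimal inference sketch; assumes the ClipCaptionModel class sketched above.
import clip
import torch
from PIL import Image
from transformers import GPT2Tokenizer

device = "cuda" if torch.cuda.is_available() else "cpu"
clip_model, preprocess = clip.load("ViT-B/32", device=device)
tokenizer = GPT2Tokenizer.from_pretrained("gpt2")

model = ClipCaptionModel(prefix_length=10)
# State-dict keys must line up with the class above; adapt names if they differ.
model.load_state_dict(torch.load("model.pt", map_location="cpu"))
model.to(device).eval()

image = preprocess(Image.open("example.jpg")).unsqueeze(0).to(device)  # hypothetical path
with torch.no_grad():
    clip_embed = clip_model.encode_image(image).float()  # (1, 512)
    embeds = model(clip_embed)                           # (1, 10, 768) prefix
    tokens = []
    for _ in range(40):  # greedy decoding, capped at 40 tokens
        logits = model.gpt(inputs_embeds=embeds).logits[:, -1, :]
        next_token = logits.argmax(dim=-1, keepdim=True)
        if next_token.item() == tokenizer.eos_token_id:
            break
        tokens.append(next_token.item())
        embeds = torch.cat([embeds, model.gpt.transformer.wte(next_token)], dim=1)

print(tokenizer.decode(tokens))
```

Greedy decoding is used here for brevity; beam search is a common alternative for this model family.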

Files

  • model.pt: model checkpoint (a PyTorch state_dict; see the loading step in the Usage sketch above)

Citation

If you use this model, please cite:

```bibtex
@article{mokady2021clipcap,
  title={ClipCap: CLIP Prefix for Image Captioning},
  author={Mokady, Ron and Hertz, Amir and Bermano, Amit H},
  journal={arXiv preprint arXiv:2111.09734},
  year={2021}
}
```