CLIP Prefix Caption - Conceptual Captions Model

An image captioning model based on CLIP and GPT-2, trained on the Conceptual Captions dataset.

Model Details

  • Model Type: CLIP Prefix Captioning
  • Architecture: CLIP Vision Encoder + MLP mapping network + GPT-2 Text Decoder (see the sketch after this list)
  • Dataset: Conceptual Captions
  • Prefix Length: 10 tokens
  • CLIP Model: ViT-B/32
  • GPT-2 Model: gpt2
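
For concreteness, here is a minimal sketch of the mapping network, assuming the MLP variant of ClipCap. The class name, attribute names, and layer sizes are illustrative (following the paper's description), not a copy of the released code; the default dimensions are 512 for ViT-B/32 image embeddings and 768 for gpt2 token embeddings.

```python
# A minimal sketch of the MLP mapping network (ClipCap "MLP" variant);
# names and layer sizes are illustrative, not the released code's.
import torch
import torch.nn as nn
from transformers import GPT2LMHeadModel


class ClipCaptionModel(nn.Module):
    def __init__(self, prefix_length: int = 10, clip_dim: int = 512):
        super().__init__()
        self.prefix_length = prefix_length
        self.gpt = GPT2LMHeadModel.from_pretrained("gpt2")
        self.gpt_dim = self.gpt.transformer.wte.weight.shape[1]  # 768 for gpt2
        # Projects one CLIP image embedding (512-d for ViT-B/32) to
        # prefix_length GPT-2 token embeddings that act as a learned prefix.
        self.clip_project = nn.Sequential(
            nn.Linear(clip_dim, (self.gpt_dim * prefix_length) // 2),
            nn.Tanh(),
            nn.Linear((self.gpt_dim * prefix_length) // 2,
                      self.gpt_dim * prefix_length),
        )

    def forward(self, clip_embed: torch.Tensor) -> torch.Tensor:
        # (batch, clip_dim) -> (batch, prefix_length, gpt_dim)
        return self.clip_project(clip_embed).view(-1, self.prefix_length, self.gpt_dim)
```

The MLP output is reshaped into 10 prefix embeddings; at training time these are prepended to the caption's token embeddings, and at inference time they alone seed generation.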

Usage

See the test notebook for complete, runnable usage examples; a minimal inference sketch follows.
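
For orientation only, the sketch below (untested) loads the checkpoint, encodes an image with CLIP ViT-B/32, and decodes a caption greedily. It assumes the ClipCaptionModel class sketched above, OpenAI's `clip` package, and a hypothetical image path; consult the notebook for the authoritative pipeline.

```python
# Minimal inference sketch; assumes the ClipCaptionModel class sketched above.
import clip
import torch
from PIL import Image
from transformers import GPT2Tokenizer

device = "cuda" if torch.cuda.is_available() else "cpu"
clip_model, preprocess = clip.load("ViT-B/32", device=device)
tokenizer = GPT2Tokenizer.from_pretrained("gpt2")

model = ClipCaptionModel(prefix_length=10)
# State-dict keys must line up with the class above; adapt names if they differ.
model.load_state_dict(torch.load("model.pt", map_location="cpu"))
model.to(device).eval()

image = preprocess(Image.open("example.jpg")).unsqueeze(0).to(device)  # hypothetical path
with torch.no_grad():
    clip_embed = clip_model.encode_image(image).float()  # (1, 512)
    embeds = model(clip_embed)                           # (1, 10, 768) prefix
    tokens = []
    for _ in range(40):  # greedy decoding, capped at 40 tokens
        logits = model.gpt(inputs_embeds=embeds).logits[:, -1, :]
        next_token = logits.argmax(dim=-1, keepdim=True)
        if next_token.item() == tokenizer.eos_token_id:
            break
        tokens.append(next_token.item())
        embeds = torch.cat([embeds, model.gpt.transformer.wte(next_token)], dim=1)

print(tokenizer.decode(tokens))
```

Greedy decoding is used here for brevity; beam search is a common alternative for this model family.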

Files

  • model.pt: model checkpoint (a PyTorch state_dict; see the loading step in the Usage sketch above)

Citation

If you use this model, please cite:

```bibtex
@article{mokady2021clipcap,
  title={ClipCap: CLIP Prefix for Image Captioning},
  author={Mokady, Ron and Hertz, Amir and Bermano, Amit H},
  journal={arXiv preprint arXiv:2111.09734},
  year={2021}
}
```