Sample code?
#1 by seedmanc - opened
Please provide sample code for inference; I'm not sure how this is supposed to work. What I know is that CLIP produces text embeddings and image embeddings, and matching them via cosine similarity lets you pick the best-matching caption for an image from a set of candidates. But that requires text strings to embed and compare against; CLIP is not an LLM that can generate text on its own. Does your safetensors file produce text out of image embeddings? How do I feed those embeddings into it?
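
For reference, this is roughly the zero-shot matching workflow I described above, using the stock OpenAI CLIP checkpoint (the model name, candidate captions, and image path are just placeholders). What I can't figure out is where your safetensors file fits into this:

```python
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

# Stock CLIP checkpoint as an example, not this repo's weights
model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

# CLIP can only rank captions I supply, it cannot generate them
candidates = ["a photo of a cat", "a photo of a dog", "a photo of a car"]
image = Image.open("example.jpg")  # placeholder image path

inputs = processor(text=candidates, images=image, return_tensors="pt", padding=True)
with torch.no_grad():
    outputs = model(**inputs)

# logits_per_image are scaled cosine similarities between the image
# embedding and each candidate text embedding
probs = outputs.logits_per_image.softmax(dim=-1)
print(candidates[probs.argmax().item()])
```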