jinaai/jina-vlm

@@ -25,7 +25,7 @@ inference: false
 # jina-vlm-v1: Small Multilingual Vision Language Model
-[Blog](https://jina.ai/news/jina-vlm-v1) | [API](https://jina.ai/api) | [Arxiv](https://arxiv.org/abs/2512.04032)
 `jina-vlm-v1` is a 2.4B parameter vision-language model that achieves state-of-the-art multilingual visual question answering among open 2B-scale VLMs. The model couples a SigLIP2 vision encoder with a Qwen3 language backbone through an attention-pooling connector that enables token-efficient processing of arbitrary-resolution images.
@@ -81,6 +81,40 @@ python infer.py -p "What is the capital of France?"
 - `--max-pixels`: Max pixels per image, larger images are resized preserving aspect ratio.
 - `--stream`: Enable streaming output.
 ### Using Transformers
 ```python

 # jina-vlm-v1: Small Multilingual Vision Language Model
+Blog | API | [Arxiv](https://arxiv.org/abs/2512.04032)
 `jina-vlm-v1` is a 2.4B parameter vision-language model that achieves state-of-the-art multilingual visual question answering among open 2B-scale VLMs. The model couples a SigLIP2 vision encoder with a Qwen3 language backbone through an attention-pooling connector that enables token-efficient processing of arbitrary-resolution images.
 - `--max-pixels`: Max pixels per image, larger images are resized preserving aspect ratio.
 - `--stream`: Enable streaming output.
+**Example:**
+```bash
+python infer.py -i assets/the_persistence_of_memory.jpg -p "Describe this picture"
+```
+<table>
+<tr>
+<td width="40%"><b>Input</b></td>
+<td width="60%"><b>Output</b></td>
+</tr>
+<tr>
+<td><img src="./assets/the_persistence_of_memory.jpg" width="100%"></td>
+<td>
+```
+├── 🖼️ Images: ['the_persistence_of_memory.jpg']
+├── 📜 Prompt: Describe this picture
+└── 🧠 Response: This image is a surrealistic
+painting by Salvador Dalí, titled "The Persistence
+of Memory." The painting is characterized by its
+dreamlike and distorted elements, which are
+hallmarks of Dalí's style. The central focus of
+the painting is a melting clock, which is a key
+symbol in the artwork...
+Token usage: 1753 tokens (4.3%)
+Generated in 33.08s | 8.16 tok/s
+```
+</td>
+</tr>
+</table>
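
The `--max-pixels` option listed above says that larger images are resized while preserving aspect ratio. The following is a minimal sketch of how such a pixel budget could be applied. It is not taken from `infer.py`: the function name `fit_within_max_pixels` and the rounding strategy are assumptions made for illustration only, and the actual script may resize differently.

```python
import math


def fit_within_max_pixels(width: int, height: int, max_pixels: int) -> tuple[int, int]:
    """Hypothetical helper: shrink (width, height) to fit a pixel budget.

    Images already within the budget are returned unchanged. Larger images
    are scaled by a single factor on both axes, so the aspect ratio is
    preserved up to integer rounding.
    """
    if width <= 0 or height <= 0 or max_pixels <= 0:
        raise ValueError("width, height and max_pixels must be positive")

    if width * height <= max_pixels:
        return width, height

    # One scale factor for both axes keeps the aspect ratio.
    scale = math.sqrt(max_pixels / (width * height))

    # Rounding down keeps the product at or below the budget. The clamp to 1
    # avoids a zero-sized axis, but for extremely elongated images it can
    # push the result slightly over the budget.
    new_width = max(1, math.floor(width * scale))
    new_height = max(1, math.floor(height * scale))
    return new_width, new_height


if __name__ == "__main__":
    # A 4000x3000 image under a one-megapixel budget (example values only).
    print(fit_within_max_pixels(4000, 3000, 1_000_000))
```

Rounding down rather than to the nearest integer is a deliberate choice in this sketch: it trades a few pixels of resolution for the guarantee that ordinary images never exceed the budget.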
 ### Using Transformers
 ```python