Remove placeholder links, add CLI example with input/output table
README.md
CHANGED
@@ -25,7 +25,7 @@ inference: false
 
 # jina-vlm-v1: Small Multilingual Vision Language Model
 
-
+Blog | API | [Arxiv](https://arxiv.org/abs/2512.04032)
 
 `jina-vlm-v1` is a 2.4B parameter vision-language model that achieves state-of-the-art multilingual visual question answering among open 2B-scale VLMs. The model couples a SigLIP2 vision encoder with a Qwen3 language backbone through an attention-pooling connector that enables token-efficient processing of arbitrary-resolution images.
 
@@ -81,6 +81,40 @@ python infer.py -p "What is the capital of France?"
 - `--max-pixels`: Max pixels per image, larger images are resized preserving aspect ratio.
 - `--stream`: Enable streaming output.
 
+**Example:**
+
+```bash
+python infer.py -i assets/the_persistence_of_memory.jpg -p "Describe this picture"
+```
+
+<table>
+<tr>
+<td width="40%"><b>Input</b></td>
+<td width="60%"><b>Output</b></td>
+</tr>
+<tr>
+<td><img src="./assets/the_persistence_of_memory.jpg" width="100%"></td>
+<td>
+
+```
+├── 🖼️ Images: ['the_persistence_of_memory.jpg']
+├── 📜 Prompt: Describe this picture
+└── 🧠 Response: This image is a surrealistic
+painting by Salvador Dalí, titled "The Persistence
+of Memory." The painting is characterized by its
+dreamlike and distorted elements, which are
+hallmarks of Dalí's style. The central focus of
+the painting is a melting clock, which is a key
+symbol in the artwork...
+
+Token usage: 1753 tokens (4.3%)
+Generated in 33.08s | 8.16 tok/s
+```
+
+</td>
+</tr>
+</table>
+
 ### Using Transformers
 
 ```python