---
library_name: transformers
pipeline_tag: image-text-to-text
license: mit
tags:
- multimodal
- vision-language
- reasoning
- qwen2
---
|
|
|
|
|
# Model Card for Virgo-72B
|
|
|
|
|
Virgo-72B is a multimodal slow-thinking reasoning model based on Qwen2-VL-72B-Instruct. It targets image-text-to-text tasks and performs strongly across multimodal benchmarks, using a long-form thought process to integrate visual information into its reasoning and responses.
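
Because the model is based on Qwen2-VL-72B-Instruct and this card lists `transformers` as its library, it should also be loadable through the standard Qwen2-VL classes. The snippet below is a minimal sketch under that assumption (the image path and question are illustrative placeholders, not files from the repository); the vLLM-based example from the authors follows in the Quick Start section.

```python
# Minimal sketch, assuming Virgo-72B keeps the Qwen2-VL architecture of its base model.
# The image path and question below are placeholders, not part of the repository.
from transformers import AutoProcessor, Qwen2VLForConditionalGeneration
from PIL import Image

model = Qwen2VLForConditionalGeneration.from_pretrained(
    "RUC-AIBOX/Virgo-72B", torch_dtype="auto", device_map="auto"
)
processor = AutoProcessor.from_pretrained("RUC-AIBOX/Virgo-72B")

image = Image.open("example.jpg")
messages = [
    {"role": "user", "content": [
        {"type": "image"},
        {"type": "text", "text": "Describe the diagram and reason step by step."},
    ]}
]
text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = processor(text=[text], images=[image], return_tensors="pt").to(model.device)

# Long generations are expected: the model emits an extended reasoning trace.
output_ids = model.generate(**inputs, max_new_tokens=2048)
print(processor.batch_decode(
    output_ids[:, inputs["input_ids"].shape[1]:], skip_special_tokens=True
)[0])
```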
|
|
|
|
|
## Model Details
|
|
|
|
|
### Model Sources
|
|
|
|
|
- **Repository:** https://github.com/RUCAIBox/Virgo
- **Paper:** https://arxiv.org/pdf/2501.01904
|
|
|
|
|
## Quick Start
|
|
|
|
|
This example shows how to run Virgo-72B with the `vllm` library to generate text from an image and a text prompt. Make sure `vllm` and `Pillow` are installed (`pip install vllm Pillow`) and that a suitable image file is available (`case/2246_image_1.jpg` in this example).
|
|
|
|
|
```python
from vllm import LLM, SamplingParams
from PIL import Image

model_name = "RUC-AIBOX/Virgo-72B"
placeholder = "<|image_pad|>"

# Load the model; the 72B weights typically require multiple GPUs.
llm = LLM(
    model=model_name,
    trust_remote_code=True,
    tensor_parallel_size=8,  # Adjust based on your hardware
)

question = (
    "Please first think deeply about the question, and then put the final answer in \\boxed{}.\n"
    "In the diagram, $\\angle E A D=90^{\\circ}, \\angle A C D=90^{\\circ}$, and $\\angle A B C=90^{\\circ}$. "
    "Also, $E D=13, E A=12$, $D C=4$, and $C B=2$. Determine the length of $A B$."
)

# Qwen2-VL chat template with an image placeholder in the user turn.
prompt = (
    "<|im_start|>system\nYou are a helpful assistant.<|im_end|>\n"
    f"<|im_start|>user\n<|vision_start|>{placeholder}<|vision_end|>"
    f"{question}<|im_end|>\n"
    "<|im_start|>assistant\n"
)

# Greedy decoding with a large token budget for the long reasoning trace.
sampling_params = SamplingParams(
    temperature=0.0,
    top_k=1,
    top_p=1.0,
    repetition_penalty=1.05,
    max_tokens=8192,
)

image = Image.open("case/2246_image_1.jpg")

inputs = {
    "prompt": prompt,
    "multi_modal_data": {
        "image": image
    },
}

outputs = llm.generate(inputs, sampling_params)
print(outputs[0].outputs[0].text)
```
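
Since the prompt instructs the model to wrap its final answer in `\boxed{}`, the answer can be pulled out of the long reasoning trace after generation. The helper below is an illustrative sketch (not part of the Virgo repository) and only handles answers without nested braces.

```python
import re

def extract_boxed_answer(text):
    # Return the content of the last \boxed{...} in the trace
    # (simple answers only, no nested braces); None if absent.
    matches = re.findall(r"\\boxed\{([^{}]*)\}", text)
    return matches[-1] if matches else None

print("Final answer:", extract_boxed_answer(outputs[0].outputs[0].text))
```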
|
|
|
|
|
## Citation
|
|
|
|
|
```
@article{du2025virgo,
  title={Virgo: A Preliminary Exploration on Reproducing o1-like MLLM},
  author={Yifan Du and Zikang Liu and Yifan Li and Wayne Xin Zhao and Yuqi Huo and Bingning Wang and Weipeng Chen and Zheng Liu and Zhongyuan Wang and Ji-Rong Wen},
  journal={arXiv preprint arXiv:2501.01904},
  year={2025}
}
```