Update README.md

README.md CHANGED

@@ -1,5 +1,7 @@
 ---
-license:
+license: other
+license_name: qwen
+license_link: https://huggingface.co/Qwen/Qwen2.5-72B-Instruct/blob/main/LICENSE
 pipeline_tag: image-text-to-text
 library_name: transformers
 base_model:
@@ -65,7 +67,7 @@ To construct this dataset, we propose an efficient data construction pipeline. S

 - **For samples with clear ground truths:**
 the model is prompted to first provide the reasoning process and then give the final answer in the format like `Final Answer: ***`.
-Responses matching the ground truth answer constitute the positive set \\(mathcal{Y}_p\\), while those that do not match make up the negative set \\(\mathcal{Y}_n\\). Additionally, responses that fail to provide a clear final answer are also merged into \\(\mathcal{Y}_n\\).
+Responses matching the ground truth answer constitute the positive set \\(\mathcal{Y}_p\\), while those that do not match make up the negative set \\(\mathcal{Y}_n\\). Additionally, responses that fail to provide a clear final answer are also merged into \\(\mathcal{Y}_n\\).
 Given these responses labeled as positive or negative, we build the preference pairs by selecting a chosen response \\(y_c\\) from \\(\mathcal{Y}_p\\) and a negative response \\(y_r\\) from \\(\mathcal{Y}_n\\).

 - **For samples without clear ground truths:**
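As an aside, a minimal sketch of how such preference pairs could be assembled from sampled responses; the `matches_ground_truth` checker and the exhaustive positive-negative pairing are illustrative assumptions, not the authors' pipeline code:

```python
import re
from itertools import product

def build_preference_pairs(responses, ground_truth, matches_ground_truth):
    """Split sampled responses into positive/negative sets and pair them up."""
    positives, negatives = [], []  # the sets Y_p and Y_n from the text above
    for resp in responses:
        answer = re.search(r"Final Answer:\s*(.+)", resp)
        if answer is None:
            negatives.append(resp)  # no parseable final answer -> negative set
        elif matches_ground_truth(answer.group(1), ground_truth):
            positives.append(resp)
        else:
            negatives.append(resp)
    # Pair a chosen response y_c with a rejected response y_r.
    return [(y_c, y_r) for y_c, y_r in product(positives, negatives)]
```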
@@ -160,7 +162,7 @@ To comprehensively compare InternVL's performance before and after MPO, we emplo

 ## Quick Start

-We provide an example code to run `InternVL2_5-
+We provide an example code to run `InternVL2_5-78B-MPO` using `transformers`.

 > Please use transformers>=4.37.2 to ensure the model works normally.

@@ -171,7 +173,7 @@ We provide an example code to run `InternVL2_5-1B` using `transformers`.
 ```python
 import torch
 from transformers import AutoTokenizer, AutoModel
-path = "OpenGVLab/InternVL2_5-
+path = "OpenGVLab/InternVL2_5-78B-MPO"
 model = AutoModel.from_pretrained(
     path,
     torch_dtype=torch.bfloat16,
@@ -185,7 +187,7 @@ model = AutoModel.from_pretrained(
 ```python
 import torch
 from transformers import AutoTokenizer, AutoModel
-path = "OpenGVLab/InternVL2_5-
+path = "OpenGVLab/InternVL2_5-78B-MPO"
 model = AutoModel.from_pretrained(
     path,
     torch_dtype=torch.bfloat16,
@@ -230,8 +232,8 @@ def split_model(model_name):

     return device_map

-path = "OpenGVLab/InternVL2_5-
-device_map = split_model('InternVL2_5-
+path = "OpenGVLab/InternVL2_5-78B-MPO"
+device_map = split_model('InternVL2_5-78B')
 model = AutoModel.from_pretrained(
     path,
     torch_dtype=torch.bfloat16,
@@ -244,6 +246,7 @@ model = AutoModel.from_pretrained(
 ### Inference with Transformers

 ```python
+import math
 import numpy as np
 import torch
 import torchvision.transforms as T
@@ -326,14 +329,44 @@ def load_image(image_file, input_size=448, max_num=12):
     pixel_values = torch.stack(pixel_values)
     return pixel_values

-
-
+def split_model(model_name):
+    device_map = {}
+    world_size = torch.cuda.device_count()
+    num_layers = {
+        'InternVL2_5-1B': 24, 'InternVL2_5-2B': 24, 'InternVL2_5-4B': 36, 'InternVL2_5-8B': 32,
+        'InternVL2_5-26B': 48, 'InternVL2_5-38B': 64, 'InternVL2_5-78B': 80}[model_name]
+    # Since the first GPU will be used for ViT, treat it as half a GPU.
+    num_layers_per_gpu = math.ceil(num_layers / (world_size - 0.5))
+    num_layers_per_gpu = [num_layers_per_gpu] * world_size
+    num_layers_per_gpu[0] = math.ceil(num_layers_per_gpu[0] * 0.5)
+    layer_cnt = 0
+    for i, num_layer in enumerate(num_layers_per_gpu):
+        for j in range(num_layer):
+            device_map[f'language_model.model.layers.{layer_cnt}'] = i
+            layer_cnt += 1
+    device_map['vision_model'] = 0
+    device_map['mlp1'] = 0
+    device_map['language_model.model.tok_embeddings'] = 0
+    device_map['language_model.model.embed_tokens'] = 0
+    device_map['language_model.output'] = 0
+    device_map['language_model.model.norm'] = 0
+    device_map['language_model.lm_head'] = 0
+    device_map[f'language_model.model.layers.{num_layers - 1}'] = 0
+
+    return device_map
+
+# If you set `load_in_8bit=True`, you will need two 80GB GPUs.
+# If you set `load_in_8bit=False`, you will need at least three 80GB GPUs.
+path = 'OpenGVLab/InternVL2_5-78B-MPO'
+device_map = split_model('InternVL2_5-78B')
 model = AutoModel.from_pretrained(
     path,
     torch_dtype=torch.bfloat16,
+    load_in_8bit=False,
     low_cpu_mem_usage=True,
     use_flash_attn=True,
-    trust_remote_code=True
+    trust_remote_code=True,
+    device_map=device_map).eval()
 tokenizer = AutoTokenizer.from_pretrained(path, trust_remote_code=True, use_fast=False)

 # set the max number of tiles in `max_num`
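For intuition about the budget the added `split_model` computes (an illustrative calculation, not part of the model card): with 8 GPUs and the 80 transformer layers of InternVL2_5-78B, the first GPU is counted as half a device because it also hosts the vision encoder.

```python
import math

# Illustrative only: reproduce the per-GPU layer budget for InternVL2_5-78B on 8 GPUs.
world_size, num_layers = 8, 80
per_gpu = math.ceil(num_layers / (world_size - 0.5))  # ceil(80 / 7.5) = 11 layers per full GPU
gpu0 = math.ceil(per_gpu * 0.5)                       # ceil(11 * 0.5) = 6 layers on GPU 0
print(per_gpu, gpu0)  # -> 11 6
```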
@@ -510,9 +543,9 @@ LMDeploy abstracts the complex inference process of multi-modal Vision-Language
 from lmdeploy import pipeline, TurbomindEngineConfig
 from lmdeploy.vl import load_image

-model = 'OpenGVLab/InternVL2_5-
+model = 'OpenGVLab/InternVL2_5-78B-MPO'
 image = load_image('https://raw.githubusercontent.com/open-mmlab/mmdeploy/main/tests/data/tiger.jpeg')
-pipe = pipeline(model, backend_config=TurbomindEngineConfig(session_len=8192))
+pipe = pipeline(model, backend_config=TurbomindEngineConfig(session_len=8192, tp=4))
 response = pipe(('describe this image', image))
 print(response.text)
 ```
@@ -528,8 +561,8 @@ from lmdeploy import pipeline, TurbomindEngineConfig
 from lmdeploy.vl import load_image
 from lmdeploy.vl.constants import IMAGE_TOKEN

-model = 'OpenGVLab/InternVL2_5-
-pipe = pipeline(model, backend_config=TurbomindEngineConfig(session_len=8192))
+model = 'OpenGVLab/InternVL2_5-78B-MPO'
+pipe = pipeline(model, backend_config=TurbomindEngineConfig(session_len=8192, tp=4))

 image_urls=[
     'https://raw.githubusercontent.com/open-mmlab/mmdeploy/main/demo/resources/human-pose.jpg',
@@ -550,8 +583,8 @@ Conducting inference with batch prompts is quite straightforward; just place the
 from lmdeploy import pipeline, TurbomindEngineConfig
 from lmdeploy.vl import load_image

-model = 'OpenGVLab/InternVL2_5-
-pipe = pipeline(model, backend_config=TurbomindEngineConfig(session_len=8192))
+model = 'OpenGVLab/InternVL2_5-78B-MPO'
+pipe = pipeline(model, backend_config=TurbomindEngineConfig(session_len=8192, tp=4))

 image_urls=[
     "https://raw.githubusercontent.com/open-mmlab/mmdeploy/main/demo/resources/human-pose.jpg",
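A continuation sketch for the batch case, assuming the `pipe`, `load_image`, and `image_urls` defined in the snippet above and LMDeploy's documented batch usage (a list of (prompt, image) tuples yields one response per item):

```python
# Batch the (prompt, image) tuples in a list; one response object comes back per prompt.
prompts = [('describe this image', load_image(url)) for url in image_urls]
responses = pipe(prompts)
print([r.text for r in responses])
```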
@@ -570,8 +603,8 @@ There are two ways to do the multi-turn conversations with the pipeline. One is
 from lmdeploy import pipeline, TurbomindEngineConfig, GenerationConfig
 from lmdeploy.vl import load_image

-model = 'OpenGVLab/InternVL2_5-
-pipe = pipeline(model, backend_config=TurbomindEngineConfig(session_len=8192))
+model = 'OpenGVLab/InternVL2_5-78B-MPO'
+pipe = pipeline(model, backend_config=TurbomindEngineConfig(session_len=8192, tp=4))

 image = load_image('https://raw.githubusercontent.com/open-mmlab/mmdeploy/main/demo/resources/human-pose.jpg')
 gen_config = GenerationConfig(top_k=40, top_p=0.8, temperature=0.8)
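The multi-turn continuation is visible only as context in the next hunk (`print(sess.response.text)`); a sketch of how it presumably proceeds, assuming LMDeploy's session-based `pipe.chat` interface and reusing `pipe`, `image`, and `gen_config` from the snippet above:

```python
# First turn: ask about the image and keep the returned session.
sess = pipe.chat(('describe this image', image), gen_config=gen_config)
print(sess.response.text)
# Follow-up turn: pass the session back so the conversation history is reused.
sess = pipe.chat('What is the woman doing?', session=sess, gen_config=gen_config)
print(sess.response.text)
```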
@@ -586,7 +619,7 @@ print(sess.response.text)
 LMDeploy's `api_server` enables models to be easily packed into services with a single command. The provided RESTful APIs are compatible with OpenAI's interfaces. Below are an example of service startup:

 ```shell
-lmdeploy serve api_server OpenGVLab/InternVL2_5-
+lmdeploy serve api_server OpenGVLab/InternVL2_5-78B-MPO --server-port 23333 --tp 4
 ```

 To use the OpenAI-style interface, you need to install OpenAI:
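For reference, a hedged sketch of querying that server through the OpenAI-compatible interface; the port follows the `--server-port 23333` flag above, while the placeholder API key and the example image URL are assumptions:

```python
from openai import OpenAI

# Assumes the api_server above is running locally on port 23333.
client = OpenAI(api_key='YOUR_API_KEY', base_url='http://0.0.0.0:23333/v1')
model_name = client.models.list().data[0].id
response = client.chat.completions.create(
    model=model_name,
    messages=[{
        'role': 'user',
        'content': [
            {'type': 'text', 'text': 'describe this image'},
            {'type': 'image_url', 'image_url': {
                'url': 'https://raw.githubusercontent.com/open-mmlab/mmdeploy/main/tests/data/tiger.jpeg'}},
        ],
    }],
    temperature=0.8,
    top_p=0.8)
print(response.choices[0].message.content)
```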
@@ -625,7 +658,7 @@ print(response)

 ## License

-This project is released under the MIT License. This project uses the pre-trained Qwen2.5-
+This project is released under the MIT License. This project uses the pre-trained Qwen2.5-72B-Instruct as a component, which is licensed under the Qwen License.

 ## Citation
