Update README.md

README.md CHANGED

@@ -1,5 +1,7 @@
 ---
-license:
+license: other
+license_name: qwen
+license_link: https://huggingface.co/Qwen/Qwen2.5-72B-Instruct/blob/main/LICENSE
 pipeline_tag: image-text-to-text
 library_name: transformers
 base_model:
@@ -65,7 +67,7 @@ To construct this dataset, we propose an efficient data construction pipeline. S

 - **For samples with clear ground truths:**
 the model is prompted to first provide the reasoning process and then give the final answer in the format like `Final Answer: ***`.
-Responses matching the ground truth answer constitute the positive set \\(mathcal{Y}_p\\), while those that do not match make up the negative set \\(\mathcal{Y}_n\\). Additionally, responses that fail to provide a clear final answer are also merged into \\(\mathcal{Y}_n\\).
+Responses matching the ground truth answer constitute the positive set \\(\mathcal{Y}_p\\), while those that do not match make up the negative set \\(\mathcal{Y}_n\\). Additionally, responses that fail to provide a clear final answer are also merged into \\(\mathcal{Y}_n\\).
 Given these responses labeled as positive or negative, we build the preference pairs by selecting a chosen response \\(y_c\\) from \\(\mathcal{Y}_p\\) and a negative response \\(y_r\\) from \\(\mathcal{Y}_n\\).

 - **For samples without clear ground truths:**
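As an aside, a minimal sketch of how such preference pairs could be assembled from sampled responses; the `matches_ground_truth` checker and the exhaustive positive-negative pairing are illustrative assumptions, not the authors' pipeline code:

```python
import re
from itertools import product

def build_preference_pairs(responses, ground_truth, matches_ground_truth):
    """Split sampled responses into positive/negative sets and pair them up."""
    positives, negatives = [], []  # the sets Y_p and Y_n from the text above
    for resp in responses:
        answer = re.search(r"Final Answer:\s*(.+)", resp)
        if answer is None:
            negatives.append(resp)  # no parseable final answer -> negative set
        elif matches_ground_truth(answer.group(1), ground_truth):
            positives.append(resp)
        else:
            negatives.append(resp)
    # Pair a chosen response y_c with a rejected response y_r.
    return [(y_c, y_r) for y_c, y_r in product(positives, negatives)]
```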
@@ -160,7 +162,7 @@ To comprehensively compare InternVL's performance before and after MPO, we emplo

 ## Quick Start

-We provide an example code to run `InternVL2_5-
+We provide an example code to run `InternVL2_5-78B-MPO` using `transformers`.

 > Please use transformers>=4.37.2 to ensure the model works normally.

@@ -171,7 +173,7 @@ We provide an example code to run `InternVL2_5-1B` using `transformers`.
 ```python
 import torch
 from transformers import AutoTokenizer, AutoModel
-path = "OpenGVLab/InternVL2_5-
+path = "OpenGVLab/InternVL2_5-78B-MPO"
 model = AutoModel.from_pretrained(
     path,
     torch_dtype=torch.bfloat16,
@@ -185,7 +187,7 @@ model = AutoModel.from_pretrained(
 ```python
 import torch
 from transformers import AutoTokenizer, AutoModel
-path = "OpenGVLab/InternVL2_5-
+path = "OpenGVLab/InternVL2_5-78B-MPO"
 model = AutoModel.from_pretrained(
     path,
     torch_dtype=torch.bfloat16,
@@ -230,8 +232,8 @@ def split_model(model_name):

     return device_map

-path = "OpenGVLab/InternVL2_5-
-device_map = split_model('InternVL2_5-
+path = "OpenGVLab/InternVL2_5-78B-MPO"
+device_map = split_model('InternVL2_5-78B')
 model = AutoModel.from_pretrained(
     path,
     torch_dtype=torch.bfloat16,
@@ -244,6 +246,7 @@ model = AutoModel.from_pretrained(
 ### Inference with Transformers

 ```python
+import math
 import numpy as np
 import torch
 import torchvision.transforms as T
@@ -326,14 +329,44 @@ def load_image(image_file, input_size=448, max_num=12):
     pixel_values = torch.stack(pixel_values)
     return pixel_values

-
-
+def split_model(model_name):
+    device_map = {}
+    world_size = torch.cuda.device_count()
+    num_layers = {
+        'InternVL2_5-1B': 24, 'InternVL2_5-2B': 24, 'InternVL2_5-4B': 36, 'InternVL2_5-8B': 32,
+        'InternVL2_5-26B': 48, 'InternVL2_5-38B': 64, 'InternVL2_5-78B': 80}[model_name]
+    # Since the first GPU will be used for ViT, treat it as half a GPU.
+    num_layers_per_gpu = math.ceil(num_layers / (world_size - 0.5))
+    num_layers_per_gpu = [num_layers_per_gpu] * world_size
+    num_layers_per_gpu[0] = math.ceil(num_layers_per_gpu[0] * 0.5)
+    layer_cnt = 0
+    for i, num_layer in enumerate(num_layers_per_gpu):
+        for j in range(num_layer):
+            device_map[f'language_model.model.layers.{layer_cnt}'] = i
+            layer_cnt += 1
+    device_map['vision_model'] = 0
+    device_map['mlp1'] = 0
+    device_map['language_model.model.tok_embeddings'] = 0
+    device_map['language_model.model.embed_tokens'] = 0
+    device_map['language_model.output'] = 0
+    device_map['language_model.model.norm'] = 0
+    device_map['language_model.lm_head'] = 0
+    device_map[f'language_model.model.layers.{num_layers - 1}'] = 0
+
+    return device_map
+
+# If you set `load_in_8bit=True`, you will need two 80GB GPUs.
+# If you set `load_in_8bit=False`, you will need at least three 80GB GPUs.
+path = 'OpenGVLab/InternVL2_5-78B-MPO'
+device_map = split_model('InternVL2_5-78B')
 model = AutoModel.from_pretrained(
     path,
     torch_dtype=torch.bfloat16,
+    load_in_8bit=False,
     low_cpu_mem_usage=True,
     use_flash_attn=True,
-    trust_remote_code=True
+    trust_remote_code=True,
+    device_map=device_map).eval()
 tokenizer = AutoTokenizer.from_pretrained(path, trust_remote_code=True, use_fast=False)

 # set the max number of tiles in `max_num`
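For intuition about the budget the added `split_model` computes (an illustrative calculation, not part of the model card): with 8 GPUs and the 80 transformer layers of InternVL2_5-78B, the first GPU is counted as half a device because it also hosts the vision encoder.

```python
import math

# Illustrative only: reproduce the per-GPU layer budget for InternVL2_5-78B on 8 GPUs.
world_size, num_layers = 8, 80
per_gpu = math.ceil(num_layers / (world_size - 0.5))  # ceil(80 / 7.5) = 11 layers per full GPU
gpu0 = math.ceil(per_gpu * 0.5)                       # ceil(11 * 0.5) = 6 layers on GPU 0
print(per_gpu, gpu0)  # -> 11 6
```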
@@ -510,9 +543,9 @@ LMDeploy abstracts the complex inference process of multi-modal Vision-Language
 from lmdeploy import pipeline, TurbomindEngineConfig
 from lmdeploy.vl import load_image

-model = 'OpenGVLab/InternVL2_5-
+model = 'OpenGVLab/InternVL2_5-78B-MPO'
 image = load_image('https://raw.githubusercontent.com/open-mmlab/mmdeploy/main/tests/data/tiger.jpeg')
-pipe = pipeline(model, backend_config=TurbomindEngineConfig(session_len=8192))
+pipe = pipeline(model, backend_config=TurbomindEngineConfig(session_len=8192, tp=4))
 response = pipe(('describe this image', image))
 print(response.text)
 ```
@@ -528,8 +561,8 @@ from lmdeploy import pipeline, TurbomindEngineConfig
 from lmdeploy.vl import load_image
 from lmdeploy.vl.constants import IMAGE_TOKEN

-model = 'OpenGVLab/InternVL2_5-
-pipe = pipeline(model, backend_config=TurbomindEngineConfig(session_len=8192))
+model = 'OpenGVLab/InternVL2_5-78B-MPO'
+pipe = pipeline(model, backend_config=TurbomindEngineConfig(session_len=8192, tp=4))

 image_urls=[
     'https://raw.githubusercontent.com/open-mmlab/mmdeploy/main/demo/resources/human-pose.jpg',
@@ -550,8 +583,8 @@ Conducting inference with batch prompts is quite straightforward; just place the
 from lmdeploy import pipeline, TurbomindEngineConfig
 from lmdeploy.vl import load_image

-model = 'OpenGVLab/InternVL2_5-
-pipe = pipeline(model, backend_config=TurbomindEngineConfig(session_len=8192))
+model = 'OpenGVLab/InternVL2_5-78B-MPO'
+pipe = pipeline(model, backend_config=TurbomindEngineConfig(session_len=8192, tp=4))

 image_urls=[
     "https://raw.githubusercontent.com/open-mmlab/mmdeploy/main/demo/resources/human-pose.jpg",
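A continuation sketch for the batch case, assuming the `pipe`, `load_image`, and `image_urls` defined in the snippet above and LMDeploy's documented batch usage (a list of (prompt, image) tuples yields one response per item):

```python
# Batch the (prompt, image) tuples in a list; one response object comes back per prompt.
prompts = [('describe this image', load_image(url)) for url in image_urls]
responses = pipe(prompts)
print([r.text for r in responses])
```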
@@ -570,8 +603,8 @@ There are two ways to do the multi-turn conversations with the pipeline. One is
 from lmdeploy import pipeline, TurbomindEngineConfig, GenerationConfig
 from lmdeploy.vl import load_image

-model = 'OpenGVLab/InternVL2_5-
-pipe = pipeline(model, backend_config=TurbomindEngineConfig(session_len=8192))
+model = 'OpenGVLab/InternVL2_5-78B-MPO'
+pipe = pipeline(model, backend_config=TurbomindEngineConfig(session_len=8192, tp=4))

 image = load_image('https://raw.githubusercontent.com/open-mmlab/mmdeploy/main/demo/resources/human-pose.jpg')
 gen_config = GenerationConfig(top_k=40, top_p=0.8, temperature=0.8)
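The multi-turn continuation is visible only as context in the next hunk (`print(sess.response.text)`); a sketch of how it presumably proceeds, assuming LMDeploy's session-based `pipe.chat` interface and reusing `pipe`, `image`, and `gen_config` from the snippet above:

```python
# First turn: ask about the image and keep the returned session.
sess = pipe.chat(('describe this image', image), gen_config=gen_config)
print(sess.response.text)
# Follow-up turn: pass the session back so the conversation history is reused.
sess = pipe.chat('What is the woman doing?', session=sess, gen_config=gen_config)
print(sess.response.text)
```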
@@ -586,7 +619,7 @@ print(sess.response.text)
 LMDeploy's `api_server` enables models to be easily packed into services with a single command. The provided RESTful APIs are compatible with OpenAI's interfaces. Below are an example of service startup:

 ```shell
-lmdeploy serve api_server OpenGVLab/InternVL2_5-
+lmdeploy serve api_server OpenGVLab/InternVL2_5-78B-MPO --server-port 23333 --tp 4
 ```

 To use the OpenAI-style interface, you need to install OpenAI:
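For reference, a hedged sketch of querying that server through the OpenAI-compatible interface; the port follows the `--server-port 23333` flag above, while the placeholder API key and the example image URL are assumptions:

```python
from openai import OpenAI

# Assumes the api_server above is running locally on port 23333.
client = OpenAI(api_key='YOUR_API_KEY', base_url='http://0.0.0.0:23333/v1')
model_name = client.models.list().data[0].id
response = client.chat.completions.create(
    model=model_name,
    messages=[{
        'role': 'user',
        'content': [
            {'type': 'text', 'text': 'describe this image'},
            {'type': 'image_url', 'image_url': {
                'url': 'https://raw.githubusercontent.com/open-mmlab/mmdeploy/main/tests/data/tiger.jpeg'}},
        ],
    }],
    temperature=0.8,
    top_p=0.8)
print(response.choices[0].message.content)
```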
@@ -625,7 +658,7 @@ print(response)

 ## License

-This project is released under the MIT License. This project uses the pre-trained Qwen2.5-
+This project is released under the MIT License. This project uses the pre-trained Qwen2.5-72B-Instruct as a component, which is licensed under the Qwen License.

 ## Citation
