OctoMed-7B

Introduction

OctoMed-7B is a high-performance multimodal medical reasoning model created through large-scale data curation and supervised fine-tuning (SFT). To support reliable clinical reasoning, we developed a scalable data pipeline that distills structured reasoning traces from DeepSeek-R1 and GPT-4o and produced the largest multimodal medical reasoning dataset to date with more than 8 million traces and 6.8 billion response tokens.

Using Qwen2.5-VL-7B-Instruct as the base model, OctoMed-7B is trained on this curated corpus and achieves strong, robust performance on a wide range of out-of-distribution medical benchmarks.

OctoMed-7B produces internal reasoning traces within <think>...</think> tokens before writing out its final answer. In general, the model tends to think longer for harder or ill-defined questions, while keeping its reasoning traces short for easier queries.
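
If you only need the final answer, the reasoning trace can be stripped in post-processing. Below is a minimal sketch (not part of the official tooling); `split_reasoning` is a hypothetical helper that assumes the <think>...</think> format described above and a decoded output string like the ones produced in the Quickstart examples:

import re

def split_reasoning(output_text):
    """Separate the <think>...</think> trace from the final answer.

    Assumes at most one <think> block appears before the answer, as
    described above; otherwise the full text is returned as the answer.
    """
    match = re.search(r"<think>(.*?)</think>", output_text, flags=re.DOTALL)
    if match is None:
        return None, output_text.strip()
    return match.group(1).strip(), output_text[match.end():].strip()

# Example usage with a decoded output string from the Quickstart:
# reasoning, answer = split_reasoning(output_text[0])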

Evaluation

Medical Benchmark Performances


Notes:

  • Green = OSS smaller models (<10B), Cyan = large proprietary models.
  • † = 10-sample majority vote ensemble result.

Legacy Medical Benchmark Performance

Dataset   Setting             Performance
VQA-RAD   Open (Token F1)     64.23
VQA-RAD   Closed (Accuracy)   85.66
SLAKE     Open (Token F1)     84.96
SLAKE     Closed (Accuracy)   89.66

We also train on the train splits of the VQA-RAD and SLAKE datasets and report those results here. For these results, we apply a direct prompt by appending the phrase "Answer in a short word or phrase." to the end of each sample. Following prior work, GPT-2 is used as the tokenizer to compute Token F1 for open-ended questions.
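
For reference, a token-level F1 in this style can be computed roughly as follows. This is a minimal sketch using the GPT-2 tokenizer from `transformers`; the helper name `token_f1` and the normalization (lowercasing and stripping) are illustrative assumptions and may differ from the exact evaluation used in prior work:

from collections import Counter
from transformers import GPT2TokenizerFast

gpt2_tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")

def token_f1(prediction, reference):
    """Multiset token-overlap F1 between prediction and reference,
    tokenized with GPT-2. Normalization here is an assumption."""
    pred_tokens = gpt2_tokenizer.encode(prediction.lower().strip())
    ref_tokens = gpt2_tokenizer.encode(reference.lower().strip())
    if not pred_tokens or not ref_tokens:
        return float(pred_tokens == ref_tokens)
    num_same = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if num_same == 0:
        return 0.0
    precision = num_same / len(pred_tokens)
    recall = num_same / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)

# score = token_f1(model_answer, gold_answer)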

Requirements

We recommend installing the transformers version used in our experiments and other dependencies with this command:

pip install transformers==4.57.1 accelerate==1.12.0 torchvision==0.24.1 qwen-vl-utils==0.0.14

Quickstart

Below, we provide some examples showing how to use OctoMed-7B with 🤗 Transformers or vLLM.

Inference with HF Transformers 🤗

Here we show a code snippet demonstrating how to chat with OctoMed-7B using `transformers` and `qwen_vl_utils`:
import torch
from transformers import Qwen2_5_VLForConditionalGeneration, AutoProcessor
from qwen_vl_utils import process_vision_info

# default: Load the model on the available device(s)
model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    "OctoMed/OctoMed-7B", dtype=torch.bfloat16, device_map="auto"
)

# We recommend enabling flash_attention_2 for better acceleration and memory saving, especially in multi-image and video scenarios.
# model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
#     "OctoMed/OctoMed-7B",
#     dtype=torch.bfloat16,
#     attn_implementation="flash_attention_2",
#     device_map="auto",
# )

# The default range for the number of visual tokens per image in the model is 4-16384.
# You can set min_pixels and max_pixels according to your needs, such as a token range of 256-1280, to balance performance and cost.
min_pixels = 262144
max_pixels = 262144
processor = AutoProcessor.from_pretrained("OctoMed/OctoMed-7B", min_pixels=min_pixels, max_pixels=max_pixels)

# Text-Only Query
# messages = [
#     {
#         "role": "user",
#         "content": [
#             {"type": "text", "text": "I've had a persistent dry cough for two weeks but no fever. Could this be allergies, and when should I see a doctor?"},
#         ],
#     }
# ]

# General Query
# messages = [
#     {
#         "role": "user",
#         "content": [
#             {
#                 "type": "image",
#                 "image": "https://cdn.ncbi.nlm.nih.gov/pmc/blobs/51b2/10835941/13323b55fbb5/13256_2024_4349_Fig1_HTML.jpg",
#             },
#             {"type": "text", "text": "Describe this image."},
#         ],
#     }
# ]

# Multiple Choice Query
messages = [
    {
        "role": "user",
        "content": [
            {
                "type": "image",
                "image": "https://cdn.ncbi.nlm.nih.gov/pmc/blobs/51b2/10835941/13323b55fbb5/13256_2024_4349_Fig1_HTML.jpg",
            },
            {"type": "text", "text": "What orientation was the MRI in image B taken in?\nA. Axial\nB. Coronal\nC. Sagittal\nD. Oblique\n\nPlease reason step-by-step, and put your final answer within \\boxed{}."},
        ],
    }
]

# Preparation for inference
text = processor.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)
image_inputs, video_inputs = process_vision_info(messages)
inputs = processor(
    text=[text],
    images=image_inputs,
    videos=video_inputs,
    padding=True,
    return_tensors="pt",
)

inputs = inputs.to(device="cuda")

# Inference: Generation of the output
generated_ids = model.generate(**inputs, max_new_tokens=8192)
generated_ids_trimmed = [
    out_ids[len(in_ids) :] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
]
output_text = processor.batch_decode(
    generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False
)
print(output_text)

Inference with vLLM

Here we show an example of how to use OctoMed with vLLM (tested with vLLM==0.11.2 and transformers==4.57.1):

from vllm import LLM, SamplingParams
from transformers import AutoProcessor

min_pixels = 262144
max_pixels = 262144
processor = AutoProcessor.from_pretrained("OctoMed/OctoMed-7B", min_pixels=min_pixels, max_pixels=max_pixels)

llm = LLM(
    model="OctoMed/OctoMed-7B",
    trust_remote_code=True,
    dtype="bfloat16",
    max_model_len=8192,
    tensor_parallel_size=4,
    gpu_memory_utilization=0.8,
    limit_mm_per_prompt={"image": 1}
)

# Set up sampling parameters
sampling_params = SamplingParams(
    temperature=0.6,
    top_p=0.95,
    max_tokens=8192,
)

image_data = []

# Text-Only Query
messages = [
    {
        "role": "user",
        "content": [
            {"type": "text", "text": "Explain the difference between type 1 and type 2 diabetes."},
        ],
    }
]

# General Query
# image_data = ['https://cdn.ncbi.nlm.nih.gov/pmc/blobs/51b2/10835941/13323b55fbb5/13256_2024_4349_Fig1_HTML.jpg']
# messages = [
#     {
#         "role": "user",
#         "content": [
#             {
#                 "type": "image",
#                 "image": image_data[0],
#             },
#             {"type": "text", "text": "Describe this image."},
#         ],
#     }
# ]

# Multiple Choice Query
# image_data = ['https://cdn.ncbi.nlm.nih.gov/pmc/blobs/51b2/10835941/13323b55fbb5/13256_2024_4349_Fig1_HTML.jpg']
# messages = [
#     {
#         "role": "user",
#         "content": [
#             {
#                 "type": "image",
#                 "image": image_data[0],
#             },
#             {"type": "text", "text": "What orientation was the MRI in image B taken in?\nA. Axial\nB. Coronal\nC. Sagittal\nD. Oblique\n\nPlease reason step-by-step, and put your final answer within \\boxed{}."},
#         ],
#     }
# ]

prompt = processor.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True)

if image_data:
    mm_prompt = {
        "prompt": prompt,
        "multi_modal_data": {"image": image_data}
    }
else:
    mm_prompt = {"prompt": prompt}

# Generate response
outputs = llm.generate([mm_prompt], sampling_params)

# Print the generated response
for output in outputs:
    prompt = output.prompt
    generated_text = output.outputs[0].text
    print(f"Prompt: {prompt}")
    print(f"Generated text: {generated_text}")
    print("-" * 50)

Suggested Hyperparameters

To reproduce our results, we suggest using the same settings as in our evaluation:

Format multiple choice questions with the following template:

{optional image(s)}
{question}
{options, 1 on each line}

Please reason step-by-step, and put your final answer within \boxed{}.

Example Prompt:

{image(s)}
What orientation was the MRI in image B taken in?
A: Axial
B: Coronal
C: Sagittal
D: Oblique

Please reason step-by-step, and put your final answer within \boxed{}.
  • Use the default system prompt ("You are a helpful assistant.")
  • Extract the answer from the content within the last \boxed{} (see the sketch after this list).
  • Temperature of 0.6
  • Top-p of 0.95
  • min_pixels = 262144
  • max_pixels = 262144
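
For answer extraction, a minimal sketch along these lines works for the multiple-choice format above; `extract_boxed_answer` is a hypothetical helper, and it assumes the boxed content contains no nested braces:

import re

def extract_boxed_answer(text):
    """Return the content of the last \\boxed{...} in the model output,
    or None if no boxed answer is present. Assumes no nested braces,
    which holds for single-letter choices such as \\boxed{B}."""
    matches = re.findall(r"\\boxed\{([^}]*)\}", text)
    return matches[-1].strip() if matches else None

# extract_boxed_answer("... so the answer is \\boxed{B}.")  # -> "B"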

Known Issues

  • The model is sensitive to the system prompt. We recommend using the default one.
  • The model is fine-tuned for multiple-choice VQA. It may follow instructions for other tasks but has not been extensively tested or post-trained for them.
  • The model occasionally states that it cannot see the image even when one is provided. This stems from our training data: some text-only samples describe an image in words, and their expected reasoning traces note that the image itself is missing. Despite this, we observe high benchmark performance, and the model can still reliably use visual information.

We hope to address these concerns moving forward in future iterations!

Citation

If you find our work helpful, please consider citing us:

@article{OctoMed,
  title={OctoMed: Data Recipes for State-of-the-Art Multimodal Medical Reasoning},
  author={Ossowski, Timothy and Zhang, Sheng and Liu, Qianchu and Qin, Guanghui and Tan, Reuben and Naumann, Tristan and Hu, Junjie and Poon, Hoifung},
  journal={arXiv preprint arXiv:2511.23269},
  year={2025}
}