MOSS-VL-Base-0408

📌 Introduction

MOSS-VL-Base-0408 is the foundation checkpoint of the MOSS-VL series, part of the OpenMOSS ecosystem dedicated to advancing visual understanding.

Built solely through four stages of multimodal pretraining, this checkpoint serves as a high-capacity multimodal base model for offline inference. It provides strong general-purpose visual-linguistic representations across image and video inputs, and is intended primarily as the base for downstream supervised fine-tuning, alignment, and domain adaptation.

Specifically, the pretraining pipeline is structured into the following four progressive stages:

  • Stage 1: Vision-language alignment
  • Stage 2: Large-scale multimodal pretraining
  • Stage 3: High-quality multimodal pretraining
  • Stage 4: Annealing and long-context extension

✨ Highlights

  • 📐 Native Dynamic Resolution: MOSS-VL-Base-0408 natively processes images and video frames at their original aspect ratios and resolutions. By preserving the raw spatial layout, it faithfully captures fine visual details across diverse formats, from high-resolution photographs and dense document scans to ultra-wide screenshots.
  • 🎞️ Native Interleaved Image & Video Inputs: The model accepts arbitrary combinations of images and videos within a single sequence. Through a unified end-to-end pipeline, it seamlessly handles complex mixed-modality prompts, multi-image comparisons, and interleaved visual narratives without requiring modality-specific pre-processing.
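
To make the native-resolution claim concrete, here is a rough sketch of how a dynamic-resolution processor's visual-token count scales with image size. The helper name and the rounding rule are assumptions for illustration; the `patch_size=16` and `merge_size=2` values come from the Quickstart arguments below, and the real processor additionally clamps total pixels to a configured budget, which is omitted here.

```python
import math

def count_vision_tokens(height, width, patch_size=16, merge_size=2):
    """Approximate visual-token count for an image kept at its native aspect ratio.

    Hypothetical helper: assumes dimensions are rounded up to a multiple of
    patch_size * merge_size and that each merged token covers that square.
    The actual processor also enforces min/max pixel budgets (see the
    shortest_edge / longest_edge arguments in the Quickstart).
    """
    factor = patch_size * merge_size          # each merged token covers a 32x32 px cell
    h = math.ceil(height / factor) * factor   # round both dims up to a multiple of 32
    w = math.ceil(width / factor) * factor
    return (h // factor) * (w // factor)
```

Because no fixed square resize is applied, a 720p frame and a tall document scan with the same pixel count yield different token grids that preserve their layouts.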

🏗 Model Architecture

MOSS-VL-Base-0408 adopts a cross-attention-based architecture that decouples visual encoding from cognitive reasoning. Natively supporting interleaved modalities, it provides a multimodal backbone for image and video understanding.

MOSS-VL Architecture
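
The decoupling described above can be illustrated with a toy single-head cross-attention step in plain Python: the visual encoder produces keys/values once, and text tokens read from them as queries instead of having visual tokens inlined into the language sequence. This is a minimal sketch of the mechanism in general, not MOSS-VL's actual implementation (which uses learned projections, multiple heads, and interleaved layers).

```python
import math

def _softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def cross_attend(text_queries, vision_keys, vision_values):
    """Single-head cross-attention: every text token queries the visual features.

    Toy sketch (no learned projections, one head). Each query scores all
    visual keys, the scores are softmax-normalized, and the output is the
    weighted mix of visual values for that text token.
    """
    d = len(text_queries[0])
    out = []
    for q in text_queries:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d)
                  for k in vision_keys]
        weights = _softmax(scores)
        out.append([sum(w * v[j] for w, v in zip(weights, vision_values))
                    for j in range(len(vision_values[0]))])
    return out
```

A design consequence worth noting: because visual features live only in the key/value stream, sequence length on the language side stays independent of image resolution.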

🧩 Absolute Timestamps

To help the model perceive the pacing and duration of events, MOSS-VL-Base-0408 injects absolute timestamps alongside sampled video frames, giving the reasoning process an explicit temporal reference even at the pretrained base stage.

Timestamped Sequence Input Illustration
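
The sampling arithmetic implied by the Quickstart's `video_fps` / `min_frames` / `max_frames` arguments can be sketched as follows. The function names and the `<t=…s>` textual form are hypothetical; the exact timestamp format MOSS-VL injects alongside frames is internal to the processor.

```python
def frame_timestamps(duration_s, video_fps=1.0, min_frames=1, max_frames=256):
    """Absolute timestamps (seconds) for uniformly sampled video frames.

    Illustrative sketch: sample at video_fps, then clamp the frame count to
    [min_frames, max_frames] by spreading the samples uniformly over the clip.
    """
    n = int(duration_s * video_fps)
    n = max(min_frames, min(max_frames, n))
    return [round(i * duration_s / n, 2) for i in range(n)]

def timestamp_tag(ts):
    # hypothetical textual marker paired with each sampled frame
    return f"<t={ts:.2f}s>"
```

For a 10-second clip at 1 fps this yields one timestamp per second, while an hour-long video is clamped to 256 uniformly spaced timestamps, so the model still sees how far apart its frames are in wall-clock time.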

🧬 Cross-attention RoPE (XRoPE)

MOSS-VL utilizes Cross-attention Rotary Position Embedding (XRoPE), tailored to its cross-attention-based vision-language architecture. This mechanism maps text tokens and visual features into a unified 3D coordinate space defined by Time (t), Height (h), and Width (w), improving spatial-temporal grounding during multimodal reasoning.

MOSS-VL XRoPE Architecture Illustration
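
The unified (t, h, w) coordinate space can be sketched as position-index assignment, modeled on published multimodal RoPE schemes. The exact XRoPE index layout is not spelled out here, so treat this as an illustration of the idea: visual tokens get distinct coordinates per frame and grid cell, while text tokens advance along all three axes together.

```python
def vision_positions(n_frames, grid_h, grid_w, t_start=0):
    """(t, h, w) position triple for every visual token, frame by frame.

    Sketch only: assumes row-major traversal of each frame's merged-token
    grid, with the time axis advancing once per frame.
    """
    return [(t_start + t, h, w)
            for t in range(n_frames)
            for h in range(grid_h)
            for w in range(grid_w)]

def text_positions(n_tokens, p_start=0):
    # text tokens share one scalar index across Time/Height/Width
    return [(p_start + i,) * 3 for i in range(n_tokens)]
```

Under this scheme, two tokens from the same frame share a time coordinate but differ spatially, giving the attention mechanism an explicit handle on both layout and pacing.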

🚀 Quickstart

🛠️ Installation

conda create -n moss_vl python=3.12 pip -y
conda activate moss_vl
pip install -i https://pypi.org/simple --no-build-isolation -r requirements.txt

🏃 Run Inference

Single-image offline inference (Python)
import torch
from transformers import AutoModelForCausalLM, AutoProcessor

checkpoint = "path/to/checkpoint"
image_path = "data/example_image.jpg"


def load_model(checkpoint: str):
    processor = AutoProcessor.from_pretrained(
        checkpoint,
        trust_remote_code=True,
        frame_extract_num_threads=1,
    )
    model = AutoModelForCausalLM.from_pretrained(
        checkpoint,
        trust_remote_code=True,
        device_map="auto",
        torch_dtype=torch.bfloat16,
        attn_implementation="flash_attention_2",
    )
    return model, processor


model, processor = load_model(checkpoint)

text = model.offline_image_generate(
    processor,
    prompt="",
    image=image_path,
    shortest_edge=4096,
    longest_edge=16777216,
    multi_image_max_pixels=201326592,
    patch_size=16,
    temporal_patch_size=1,
    merge_size=2,
    image_mean=[0.5, 0.5, 0.5],
    image_std=[0.5, 0.5, 0.5],
    max_new_tokens=256,
    temperature=1.0,
    top_k=50,
    top_p=1.0,
    repetition_penalty=1.0,
    do_sample=False,
    vision_chunked_length=64,
    use_template=False,
)

print(text)

Single-video offline inference (Python)
import torch
from transformers import AutoModelForCausalLM, AutoProcessor

checkpoint = "path/to/checkpoint"
video_path = "data/example_video.mp4"


def load_model(checkpoint: str):
    processor = AutoProcessor.from_pretrained(
        checkpoint,
        trust_remote_code=True,
        frame_extract_num_threads=1,
    )
    model = AutoModelForCausalLM.from_pretrained(
        checkpoint,
        trust_remote_code=True,
        device_map="auto",
        torch_dtype=torch.bfloat16,
        attn_implementation="flash_attention_2",
    )
    return model, processor


model, processor = load_model(checkpoint)

text = model.offline_video_generate(
    processor,
    prompt="",
    video=video_path,
    shortest_edge=4096,
    longest_edge=16777216,
    video_max_pixels=201326592,
    patch_size=16,
    temporal_patch_size=1,
    merge_size=2,
    video_fps=1.0,
    min_frames=1,
    max_frames=256,
    num_extract_threads=4,
    image_mean=[0.5, 0.5, 0.5],
    image_std=[0.5, 0.5, 0.5],
    max_new_tokens=256,
    temperature=1.0,
    top_k=50,
    top_p=1.0,
    repetition_penalty=1.0,
    do_sample=False,
    vision_chunked_length=64,
    use_template=False,
)

print(text)

Batched offline inference (Python)
import torch
from transformers import AutoModelForCausalLM, AutoProcessor

checkpoint = "path/to/checkpoint"
shared_generate_kwargs = {
    "temperature": 1.0,
    "top_k": 50,
    "top_p": 1.0,
    "max_new_tokens": 256,
    "repetition_penalty": 1.0,
    "do_sample": False,
}
shared_video_media_kwargs = {
    "min_pixels": 4096,
    "max_pixels": 16777216,
    "video_max_pixels": 201326592,
    "video_fps": 1.0,
    "min_frames": 1,
    "max_frames": 256,
}


def load_model(checkpoint: str):
    processor = AutoProcessor.from_pretrained(
        checkpoint,
        trust_remote_code=True,
        frame_extract_num_threads=1,
    )
    model = AutoModelForCausalLM.from_pretrained(
        checkpoint,
        trust_remote_code=True,
        device_map="auto",
        torch_dtype=torch.bfloat16,
        attn_implementation="flash_attention_2",
    )
    return model, processor


model, processor = load_model(checkpoint)
queries = [
    {
        "images": ["data/sample_a.jpg"],
        "generate_kwargs": dict(shared_generate_kwargs),
    },
    {
        "videos": ["data/sample_b.mp4"],
        "media_kwargs": dict(shared_video_media_kwargs),
        "generate_kwargs": dict(shared_generate_kwargs),
    },
]

with torch.no_grad():
    result = model.offline_batch_generate(
        processor,
        queries,
        session_states=None,
        vision_chunked_length=64,
    )

texts = [item["text"] for item in result["results"]]
print(texts)

🚧 Limitations and Future Work

MOSS-VL-Base-0408 is a pretrained base checkpoint, and we are actively improving several core capabilities for future iterations:

  • 📄 Stronger OCR, Especially for Long Documents — We plan to further improve text recognition, document parsing, and long-document understanding. A key focus is achieving near-lossless information extraction and understanding for extremely long and structurally complex inputs, such as accurately parsing texts, tables, and mathematical layouts from multi-page academic papers (dozens of pages) or dense PDF reports without degrading context or structural integrity.
  • 🎬 Expanded Extremely Long Video Understanding — We aim to significantly extend the model's capacity for comprehending extremely long videos spanning several hours to dozens of hours. This includes advancing temporal reasoning and cross-frame event tracking for continuous analysis of full-length movies, lengthy meetings, or extended surveillance streams, enabling robust retrieval and understanding over ultra-long visual contexts.

We expect future releases to continue strengthening the base model itself while also enabling stronger downstream aligned variants built on top of it.

📜 Citation

@misc{moss_vl_2026,
  title         = {{MOSS-VL Technical Report}},
  author        = {OpenMOSS Team},
  year          = {2026},
  howpublished  = {\url{https://github.com/fnlp-vision/MOSS-VL}},
  note          = {GitHub repository}
}
Model size: 11B parameters (BF16, Safetensors)