Cosmos-Reason2-2B-W4A16-Edge2

Cosmos-Reason2-2B Benchmark Results

Optimized version of nvidia/Cosmos-Reason2-2B using quantization and targeted mixed-precision exclusions. This release builds on the W4A16 line and adds a mixed-precision quantization recipe that yields almost no accuracy drop while preserving the 2x speedup.

This model was created by quantizing the base language model to INT4 weights while keeping activations and a small selection of sensitive layers in FP16 precision. The model preserves Text + Image / Video -> Text behavior and reasoning capabilities while reducing memory footprint, enabling the Cosmos-Reason2 model to run even on an NVIDIA Jetson Orin Nano Super.
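To give a rough sense of the memory saving, the weight footprint can be estimated from the parameter count and bits per weight. The group size of 128 and per-group FP16 scales below are illustrative assumptions, not the exact recipe used for this release:

```python
# Illustrative weight-memory estimate for a 2B-parameter model.
# Assumption: INT4 weights with one FP16 scale per group of 128
# weights (a common W4A16 layout, not necessarily this model's).

def weight_bytes(n_params: float, bits_per_weight: float) -> float:
    """Total weight storage in bytes."""
    return n_params * bits_per_weight / 8

N = 2e9
fp16 = weight_bytes(N, 16)             # FP16 baseline
w4a16 = weight_bytes(N, 4 + 16 / 128)  # INT4 + per-group FP16 scale overhead

print(f"FP16 : {fp16 / 1e9:.2f} GB")   # 4.00 GB
print(f"W4A16: {w4a16 / 1e9:.2f} GB")  # 1.03 GB
```

Under these assumptions the weights shrink to roughly a quarter of their FP16 size, which is what brings the model within reach of a Jetson Orin Nano Super.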

Embedl's proprietary optimizations and architectural enhancements enable more efficient inference, but may require a patched vLLM build on some edge setups.

Output Example

Real output captured from NVIDIA Jetson Orin Nano Super.

system: You are a helpful assistant.

user: Describe the sequence of actions from start to end, and explain what changes over time in this video: https://nvidia-cosmos.github.io/cosmos-cookbook/gallery/assets/kitchen_fridge_input.mp4

assistant: Initially, the skeleton character stands facing the refrigerator, positioned on the left side of the frame.

The skeleton character then reaches out with its right hand to open the refrigerator door.

As the skeleton opens the refrigerator, the door swings outward, revealing the interior shelves.


Model Details

| Field | Value |
|---|---|
| Model | embedl/Cosmos-Reason2-2B-W4A16-Edge2 |
| Base Model | nvidia/Cosmos-Reason2-2B |
| Input / Output | Text + Image / Video -> Text |
| Release Date | 2026-02-25 |
| Version | 1.0 |
| Optimizations | Quantization (W4A16) + Mixed Precision Exclusions (edge2) |
| Developers | Embedl |
| Licenses | Upstream: NVIDIA Open Model License. Additional Information: Apache License 2.0. Optimized components: Embedl Models Community Licence v1.0 (no redistribution) |
| Intended Use | Text generation, reasoning, assistant-style interaction, video analytics, planning, and general-purpose multimodal NLP on NVIDIA GPUs |

Optimizations

  • Quantization (W4A16): INT4 weights + FP16 activations.
  • Edge2 mixed exclusions: Keep sensitive layers in FP16 precision.
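The W4A16 scheme itself can be sketched in a few lines: each group of weights shares one FP16 scale, weights are rounded to the INT4 range, and activations are never quantized. This is a minimal symmetric per-group sketch for illustration, not Embedl's actual recipe:

```python
# Minimal sketch of group-wise symmetric W4A16 weight quantization.
# Assumptions (not the exact Embedl recipe): group size 128,
# symmetric rounding, one FP16 scale per group; activations stay FP16.
import numpy as np

def quantize_w4(w: np.ndarray, group_size: int = 128):
    """Quantize a flat FP32 weight vector to INT4 with per-group scales."""
    groups = w.reshape(-1, group_size)
    # Map the largest magnitude in each group to 7 (INT4 range is [-8, 7]).
    scales = np.abs(groups).max(axis=1, keepdims=True) / 7
    q = np.clip(np.round(groups / scales), -8, 7).astype(np.int8)
    return q, scales.astype(np.float16)

def dequantize(q: np.ndarray, scales: np.ndarray) -> np.ndarray:
    """Recover approximate FP32 weights for the FP16 matmul."""
    return (q * scales.astype(np.float32)).reshape(-1)

rng = np.random.default_rng(0)
w = rng.standard_normal(1024).astype(np.float32)
q, s = quantize_w4(w)
print(f"max abs reconstruction error: {np.abs(dequantize(q, s) - w).max():.4f}")
```

The Edge2 exclusions then simply skip this step for a small set of quantization-sensitive layers, leaving their weights in FP16.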

Accuracy

For comparative evaluation, we report benchmark scores using the Physical AI Bench Reason Task.

We have not been able to reproduce the baseline benchmarks reported for nvidia/Cosmos-Reason2-2B on the Physical AI Bench Leaderboard; see the related issue: https://github.com/nvidia-cosmos/cosmos-reason2/issues/52

Overall + Category Scores

| Model | Overall | Embodied Reasoning | Common Sense |
|---|---|---|---|
| nvidia/Cosmos-Reason2-2B | 50.60 | 53.93 | 47.19 |
| embedl/Cosmos-Reason2-2B-W4A16 | 48.68 | 50.49 | 46.85 |
| embedl/Cosmos-Reason2-2B-W4A16-Edge2 | 50.58 | 53.61 | 47.52 |
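The "almost no accuracy drop" claim can be read directly off the overall scores. A few lines of arithmetic make the comparison against the FP16 base explicit:

```python
# Overall Physical AI Bench scores from the table above.
scores = {
    "nvidia/Cosmos-Reason2-2B": 50.60,
    "embedl/Cosmos-Reason2-2B-W4A16": 48.68,
    "embedl/Cosmos-Reason2-2B-W4A16-Edge2": 50.58,
}
base = scores["nvidia/Cosmos-Reason2-2B"]
for name, score in scores.items():
    print(f"{name}: {score - base:+.2f}")
# -> +0.00 (base), -1.92 (W4A16), -0.02 (W4A16-Edge2)
```

The Edge2 exclusions recover nearly all of the 1.92-point drop that plain W4A16 quantization incurs.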

Subcategory Scores

| Model | AV | Physical | World | Time | Space | Agibot | HoloAssist | RoboFail | RoboVQA | BridgeData V2 |
|---|---|---|---|---|---|---|---|---|---|---|
| nvidia/Cosmos-Reason2-2B | 44.00 | 46.90 | 45.30 | 55.00 | 34.00 | 60.00 | 49.00 | 90.91 | 42.00 |  |
| embedl/Cosmos-Reason2-2B-W4A16 | 36.00 | 47.79 | 44.30 | 53.75 | 36.00 | 61.00 | 42.00 | 80.91 | 44.00 |  |
| embedl/Cosmos-Reason2-2B-W4A16-Edge2 | 45.00 | 44.25 | 48.66 | 52.50 | 32.00 | 59.00 | 54.00 | 85.45 | 43.00 |  |

Performance

On-device performance benchmarks can be explored on embedl/Edge-Inference-Benchmarks.

[Screenshot: Edge Inference Benchmarks]

Usage Examples

vLLM Video Inference

vLLM image: NVIDIA vLLM 0.14.0 for Jetson

Test Hardware (serve command): NVIDIA Jetson AGX Orin

```shell
docker run --rm -it \
  --network host \
  --shm-size=8g \
  --ulimit memlock=-1 \
  --ulimit stack=67108864 \
  --runtime=nvidia \
  --name=vllm-serve \
  -e HF_TOKEN=hf_*** \
  -e HF_HOME=/root/.cache/huggingface \
  ghcr.io/nvidia-ai-iot/vllm:latest-jetson-orin \
  vllm serve "embedl/Cosmos-Reason2-2B-W4A16-Edge2" \
    --max-model-len 8192 \
    --gpu-memory-utilization 0.75 \
    --max-num-seqs 2
```
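Once the server is up, it exposes vLLM's OpenAI-compatible API (by default on port 8000). A minimal chat-completions payload for video input might look like the following; the endpoint path is vLLM's default, and the video URL and prompt are illustrative:

```python
import json

# Request body for the OpenAI-compatible chat-completions endpoint
# (default: http://localhost:8000/v1/chat/completions).
# The video URL and prompt are illustrative.
payload = {
    "model": "embedl/Cosmos-Reason2-2B-W4A16-Edge2",
    "messages": [
        {
            "role": "user",
            "content": [
                {
                    "type": "video_url",
                    "video_url": {
                        "url": "https://nvidia-cosmos.github.io/cosmos-cookbook/gallery/assets/kitchen_fridge_input.mp4"
                    },
                },
                {"type": "text", "text": "Describe this video in detail."},
            ],
        }
    ],
    "max_tokens": 256,
    "temperature": 0.0,
}

print(json.dumps(payload, indent=2))
```

Saved as `payload.json`, this can be sent with `curl -X POST http://localhost:8000/v1/chat/completions -H "Content-Type: application/json" -d @payload.json`.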

Test Hardware (Python example): NVIDIA Jetson AGX Orin, NVIDIA Jetson Orin Nano Super

```python
from vllm import LLM, SamplingParams

if __name__ == "__main__":
    model = "embedl/Cosmos-Reason2-2B-W4A16-Edge2"
    video_url = "https://nvidia-cosmos.github.io/cosmos-cookbook/gallery/vs_assets/clip_1_short.mp4"

    messages = [
        {
            "role": "system",
            "content": [{"type": "text", "text": "You are a helpful assistant."}],
        },
        {
            "role": "user",
            "content": [
                {"type": "video_url", "video_url": {"url": video_url, "fps": 4}},
                {"type": "text", "text": "Describe this video in detail."},
            ],
        },
    ]

    llm = LLM(
        model=model,
        limit_mm_per_prompt={
            "video": {"count": 1, "num_frames": 12, "width": 1280, "height": 720},
            "image": 0,
            "audio": 0,
        },
        media_io_kwargs={"video": {"num_frames": -1}},
        max_model_len=8192,
        mm_processor_kwargs={"truncation": False},
        gpu_memory_utilization=0.75,
        max_num_seqs=2,
    )

    output = llm.chat(messages, sampling_params=SamplingParams(temperature=0.0, max_tokens=256))
    print(output[0].outputs[0].text)
```

Transformers Inference

Test Hardware: NVIDIA L4 GPU

```python
import transformers

if __name__ == "__main__":
    model_name = "embedl/Cosmos-Reason2-2B-W4A16-Edge2"

    model = transformers.Qwen3VLForConditionalGeneration.from_pretrained(
        model_name,
        device_map="auto",
        attn_implementation="sdpa",
    )

    processor = transformers.AutoProcessor.from_pretrained(model_name)

    messages = [
        {
            "role": "user",
            "content": [{"type": "text", "text": "Write me a short story about an elephant."}],
        }
    ]

    inputs = processor.apply_chat_template(
        messages,
        tokenize=True,
        add_generation_prompt=True,
        return_dict=True,
        return_tensors="pt",
        truncation=False,
    ).to(model.device)

    generated_ids = model.generate(**inputs, max_new_tokens=256)
    generated_ids_trimmed = [
        out_ids[len(in_ids):] for in_ids, out_ids in zip(inputs.input_ids, generated_ids, strict=False)
    ]
    output_text = processor.batch_decode(
        generated_ids_trimmed,
        skip_special_tokens=True,
        clean_up_tokenization_spaces=False,
    )
    print(output_text[0])
```

License

Built on NVIDIA Cosmos

This model is a derivative of nvidia/Cosmos-Reason2-2B.

Licensed by NVIDIA Corporation under the NVIDIA Open Model License.


Contact

  • Enterprise and Commercial Inquiries: [email protected]
  • Technical Issues and Early Access: https://github.com/embedl/embedl-models
  • More Information and Model Releases: https://embedl.com

Partner & Developer Opportunities

If you are evaluating on-device inference, building products on this model, or exploring custom model optimization, reach out for:

  • Engineering support for on-prem and edge deployments
  • Early access and partner co-marketing opportunities

Contact: [email protected]
