Cosmos-Reason2-2B-W4A16-Edge2

Cosmos-Reason2-2B Benchmark Results

Optimized version of nvidia/Cosmos-Reason2-2B using quantization and targeted mixed-precision exclusions. This release builds on the W4A16 line and adds a mixed-precision quantization recipe that yields almost no accuracy drop while preserving the 2x speedup.

This model was created by quantizing the base language model to INT4 weights while keeping activations and a small selection of sensitive layers in FP16 precision. The model preserves Text + Image / Video -> Text behavior and reasoning capabilities while reducing memory footprint, enabling the Cosmos-Reason2 model to run even on an NVIDIA Jetson Orin Nano Super.
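To give a rough sense of the memory saving, the weight footprint can be estimated from the parameter count and bits per weight. The group size of 128 and per-group FP16 scales below are illustrative assumptions, not the exact recipe used for this release:

```python
# Illustrative weight-memory estimate for a 2B-parameter model.
# Assumption: INT4 weights with one FP16 scale per group of 128
# weights (a common W4A16 layout, not necessarily this model's).

def weight_bytes(n_params: float, bits_per_weight: float) -> float:
    """Total weight storage in bytes."""
    return n_params * bits_per_weight / 8

N = 2e9
fp16 = weight_bytes(N, 16)             # FP16 baseline
w4a16 = weight_bytes(N, 4 + 16 / 128)  # INT4 + per-group FP16 scale overhead

print(f"FP16 : {fp16 / 1e9:.2f} GB")   # 4.00 GB
print(f"W4A16: {w4a16 / 1e9:.2f} GB")  # 1.03 GB
```

Under these assumptions the weights shrink to roughly a quarter of their FP16 size, which is what brings the model within reach of a Jetson Orin Nano Super.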

Embedl's proprietary optimizations and architectural enhancements enable more efficient inference, but may require a patched vLLM build on some edge setups.

Output Example

Real output captured from NVIDIA Jetson Orin Nano Super.

system: You are a helpful assistant.

user: Describe the sequence of actions from start to end, and explain what changes over time in this video: https://nvidia-cosmos.github.io/cosmos-cookbook/gallery/assets/kitchen_fridge_input.mp4

assistant: Initially, the skeleton character stands facing the refrigerator, positioned on the left side of the frame.

The skeleton character then reaches out with its right hand to open the refrigerator door.

As the skeleton opens the refrigerator, the door swings outward, revealing the interior shelves.


Model Details

| Field | Value |
|---|---|
| Model | embedl/Cosmos-Reason2-2B-W4A16-Edge2 |
| Base Model | nvidia/Cosmos-Reason2-2B |
| Input / Output | Text + Image / Video -> Text |
| Release Date | 2026-02-25 |
| Version | 1.0 |
| Optimizations | Quantization (W4A16) + Mixed Precision Exclusions (edge2) |
| Developers | Embedl |
| Licenses | Upstream: NVIDIA Open Model License. Additional Information: Apache License 2.0. Optimized components: Embedl Models Community Licence v1.0 (no redistribution) |
| Intended Use | Text generation, reasoning, assistant-style interaction, video analytics, planning, and general-purpose multimodal NLP on NVIDIA GPUs |

Optimizations

  • Quantization (W4A16): INT4 weights + FP16 activations.
  • Edge2 mixed exclusions: Keep sensitive layers in FP16 precision.
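The W4A16 scheme itself can be sketched in a few lines: each group of weights shares one FP16 scale, weights are rounded to the INT4 range, and activations are never quantized. This is a minimal symmetric per-group sketch for illustration, not Embedl's actual recipe:

```python
# Minimal sketch of group-wise symmetric W4A16 weight quantization.
# Assumptions (not the exact Embedl recipe): group size 128,
# symmetric rounding, one FP16 scale per group; activations stay FP16.
import numpy as np

def quantize_w4(w: np.ndarray, group_size: int = 128):
    """Quantize a flat FP32 weight vector to INT4 with per-group scales."""
    groups = w.reshape(-1, group_size)
    # Map the largest magnitude in each group to 7 (INT4 range is [-8, 7]).
    scales = np.abs(groups).max(axis=1, keepdims=True) / 7
    q = np.clip(np.round(groups / scales), -8, 7).astype(np.int8)
    return q, scales.astype(np.float16)

def dequantize(q: np.ndarray, scales: np.ndarray) -> np.ndarray:
    """Recover approximate FP32 weights for the FP16 matmul."""
    return (q * scales.astype(np.float32)).reshape(-1)

rng = np.random.default_rng(0)
w = rng.standard_normal(1024).astype(np.float32)
q, s = quantize_w4(w)
print(f"max abs reconstruction error: {np.abs(dequantize(q, s) - w).max():.4f}")
```

The Edge2 exclusions then simply skip this step for a small set of quantization-sensitive layers, leaving their weights in FP16.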

Accuracy

For comparative evaluation, we report benchmark scores using the Physical AI Bench Reason Task.

We have not been able to reproduce the baseline benchmarks reported for nvidia/Cosmos-Reason2-2B on the Physical AI Bench Leaderboard; see the related issue: https://github.com/nvidia-cosmos/cosmos-reason2/issues/52

Overall + Category Scores

| Model | Overall | Embodied Reasoning | Common Sense |
|---|---|---|---|
| nvidia/Cosmos-Reason2-2B | 50.60 | 53.93 | 47.19 |
| embedl/Cosmos-Reason2-2B-W4A16 | 48.68 | 50.49 | 46.85 |
| embedl/Cosmos-Reason2-2B-W4A16-Edge2 | 50.58 | 53.61 | 47.52 |
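The "almost no accuracy drop" claim can be read directly off the overall scores. A few lines of arithmetic make the comparison against the FP16 base explicit:

```python
# Overall Physical AI Bench scores from the table above.
scores = {
    "nvidia/Cosmos-Reason2-2B": 50.60,
    "embedl/Cosmos-Reason2-2B-W4A16": 48.68,
    "embedl/Cosmos-Reason2-2B-W4A16-Edge2": 50.58,
}
base = scores["nvidia/Cosmos-Reason2-2B"]
for name, score in scores.items():
    print(f"{name}: {score - base:+.2f}")
# -> +0.00 (base), -1.92 (W4A16), -0.02 (W4A16-Edge2)
```

The Edge2 exclusions recover nearly all of the 1.92-point drop that plain W4A16 quantization incurs.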

Subcategory Scores

| Model | AV | Physical | World | Time | Space | Agibot | HoloAssist | RoboFail | RoboVQA | BridgeData V2 |
|---|---|---|---|---|---|---|---|---|---|---|
| nvidia/Cosmos-Reason2-2B | 44.00 | 46.90 | 45.30 | 55.00 | 34.00 | 60.00 | 49.00 | 90.91 | 42.00 |  |
| embedl/Cosmos-Reason2-2B-W4A16 | 36.00 | 47.79 | 44.30 | 53.75 | 36.00 | 61.00 | 42.00 | 80.91 | 44.00 |  |
| embedl/Cosmos-Reason2-2B-W4A16-Edge2 | 45.00 | 44.25 | 48.66 | 52.50 | 32.00 | 59.00 | 54.00 | 85.45 | 43.00 |  |

Performance

On-device performance benchmarks can be explored on embedl/Edge-Inference-Benchmarks.

[Screenshot: Edge Inference Benchmarks]

Usage Examples

vLLM Video Inference

vLLM image: NVIDIA vLLM 0.14.0 for Jetson

Test Hardware (serve command): NVIDIA Jetson AGX Orin

```shell
docker run --rm -it \
  --network host \
  --shm-size=8g \
  --ulimit memlock=-1 \
  --ulimit stack=67108864 \
  --runtime=nvidia \
  --name=vllm-serve \
  -e HF_TOKEN=hf_*** \
  -e HF_HOME=/root/.cache/huggingface \
  ghcr.io/nvidia-ai-iot/vllm:latest-jetson-orin \
  vllm serve "embedl/Cosmos-Reason2-2B-W4A16-Edge2" \
    --max-model-len 8192 \
    --gpu-memory-utilization 0.75 \
    --max-num-seqs 2
```
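Once the server is up, it exposes vLLM's OpenAI-compatible API (by default on port 8000). A minimal chat-completions payload for video input might look like the following; the endpoint path is vLLM's default, and the video URL and prompt are illustrative:

```python
import json

# Request body for the OpenAI-compatible chat-completions endpoint
# (default: http://localhost:8000/v1/chat/completions).
# The video URL and prompt are illustrative.
payload = {
    "model": "embedl/Cosmos-Reason2-2B-W4A16-Edge2",
    "messages": [
        {
            "role": "user",
            "content": [
                {
                    "type": "video_url",
                    "video_url": {
                        "url": "https://nvidia-cosmos.github.io/cosmos-cookbook/gallery/assets/kitchen_fridge_input.mp4"
                    },
                },
                {"type": "text", "text": "Describe this video in detail."},
            ],
        }
    ],
    "max_tokens": 256,
    "temperature": 0.0,
}

print(json.dumps(payload, indent=2))
```

Saved as `payload.json`, this can be sent with `curl -X POST http://localhost:8000/v1/chat/completions -H "Content-Type: application/json" -d @payload.json`.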

Test Hardware (Python example): NVIDIA Jetson AGX Orin, NVIDIA Jetson Orin Nano Super

```python
from vllm import LLM, SamplingParams

if __name__ == "__main__":
    model = "embedl/Cosmos-Reason2-2B-W4A16-Edge2"
    video_url = "https://nvidia-cosmos.github.io/cosmos-cookbook/gallery/vs_assets/clip_1_short.mp4"

    messages = [
        {
            "role": "system",
            "content": [{"type": "text", "text": "You are a helpful assistant."}],
        },
        {
            "role": "user",
            "content": [
                {"type": "video_url", "video_url": {"url": video_url, "fps": 4}},
                {"type": "text", "text": "Describe this video in detail."},
            ],
        },
    ]

    llm = LLM(
        model=model,
        limit_mm_per_prompt={
            "video": {"count": 1, "num_frames": 12, "width": 1280, "height": 720},
            "image": 0,
            "audio": 0,
        },
        media_io_kwargs={"video": {"num_frames": -1}},
        max_model_len=8192,
        mm_processor_kwargs={"truncation": False},
        gpu_memory_utilization=0.75,
        max_num_seqs=2,
    )

    output = llm.chat(messages, sampling_params=SamplingParams(temperature=0.0, max_tokens=256))
    print(output[0].outputs[0].text)
```

Transformers Inference

Test Hardware: NVIDIA L4 GPU

```python
import transformers

if __name__ == "__main__":
    model_name = "embedl/Cosmos-Reason2-2B-W4A16-Edge2"

    model = transformers.Qwen3VLForConditionalGeneration.from_pretrained(
        model_name,
        device_map="auto",
        attn_implementation="sdpa",
    )

    processor = transformers.AutoProcessor.from_pretrained(model_name)

    messages = [
        {
            "role": "user",
            "content": [{"type": "text", "text": "Write me a short story about an elephant."}],
        }
    ]

    inputs = processor.apply_chat_template(
        messages,
        tokenize=True,
        add_generation_prompt=True,
        return_dict=True,
        return_tensors="pt",
        truncation=False,
    ).to(model.device)

    generated_ids = model.generate(**inputs, max_new_tokens=256)
    generated_ids_trimmed = [
        out_ids[len(in_ids):] for in_ids, out_ids in zip(inputs.input_ids, generated_ids, strict=False)
    ]
    output_text = processor.batch_decode(
        generated_ids_trimmed,
        skip_special_tokens=True,
        clean_up_tokenization_spaces=False,
    )
    print(output_text[0])
```

License

Built on NVIDIA Cosmos

This model is a derivative of nvidia/Cosmos-Reason2-2B.

Licensed by NVIDIA Corporation under the NVIDIA Open Model License.


Contact

  • Enterprise and Commercial Inquiries: [email protected]
  • Technical Issues and Early Access: https://github.com/embedl/embedl-models
  • More Information and Model Releases: https://embedl.com

Partner & Developer Opportunities

If you are evaluating on-device inference, building products on this model, or exploring custom model optimization, reach out for:

  • Engineering support for on-prem and edge deployments
  • Early access and partner co-marketing opportunities

Contact: [email protected]
