---
library_name: transformers
pipeline_tag: image-text-to-text
license: other
base_model:
- moondream/moondream3-preview
---

# Moondream 3 (Preview) 4-Bit

![4bit-efficiency-gains-and-performance-tradeoffs](https://cdn-uploads.huggingface.co/production/uploads/6557b9b4deee83130ac92941/_puZs7EqffYYMFaNpaxsu.jpeg)

**Moondream 3 (Preview) 4-Bit** is the INT4 quantized version of [Moondream3-Preview](https://huggingface.co/moondream/moondream3-preview), reducing model size from ~18 GB to ~6 GB (a ~66% reduction) and allowing it to run in <12 GB VRAM environments while mostly maintaining quality.

This is a vision language model with a mixture-of-experts architecture (9B total parameters, 2B active), now optimized for deployment with as little as 8 GB VRAM.

## Features

- **66% smaller**: ~6 GB vs ~18 GB original
- **Lower memory**: runs in ~7.3 GB VRAM (vs ~19.6 GB for FP16)
- **Same capabilities**: retains the original Moondream3 skills & API
- **Minimal quality loss**: ~2-5% degradation on benchmarks
- **HuggingFace compatible**: load with `AutoModelForCausalLM.from_pretrained()`

## VRAM & Time Savings

| Configuration   | Model Size | VRAM Usage | s/query* |
|-----------------|------------|------------|----------|
| FP16 (original) | 18.5 GB    | 19,594 MiB | 4.19     |
| INT4 (this one) | 6.18 GB    | 7,332 MiB  | 2.65     |
| Reduction       | **66%**    | **62%**    | **37%**  |

_(* averaged over the vision-ai-checkup & CountBenchQA benchmarks on an L40S GPU)_

## Evaluation Results

| Test              | Time (4-bit) | Accuracy (4-bit) | Time (base) | Accuracy (base) |
|-------------------|--------------|------------------|-------------|-----------------|
| vision-ai-checkup | **156 s**    | 42.8%            | 223 s       | **47.2%**       |
| CountBenchQA      | **22.9 min** | 91.2%            | 36.6 min    | **93.2%**       |

![4bit-vs-base-evaluation-results](https://cdn-uploads.huggingface.co/production/uploads/6557b9b4deee83130ac92941/s3m_TiW0ASZ6jVGHSivdL.webp)

## Architecture

**Quantized Components (INT4):**

- Text attention QKV/projection layers
- Dense MLP layers (layers 0-3)
- MoE expert weights (layers 4-23, 64 experts each)
- Region model encoder/decoder

**Preserved in FP16:**

- Vision encoder (SigLIP)
- MoE routers (critical for expert selection)
- Temperature (tau) parameters
- LayerNorms, embeddings, LM head

![moondream3-preview-4bit-visualization](https://cdn-uploads.huggingface.co/production/uploads/6557b9b4deee83130ac92941/8hYfJv76Q605JhTA6-qOg.jpeg)

**Slow First-Time Compile and Inference**

_A note on first-time compilation: due to the MoE architecture and the nature of INT4 quants, I had to do some voodoo to get input-invariant compilation graphs for both execution paths (T=1 and T>1). This results in a longer first-time compilation (1-3 minutes for me) compared to the original Moondream3-preview model (~30 seconds). Torch's end-to-end caching (also known as Mega-Cache) makes subsequent compilations on the same machine much faster, provided it is [correctly configured](https://docs.pytorch.org/tutorials/recipes/torch_compile_caching_tutorial.html). I'll remove this note once I find a faster solution, in case that's possible (contributions always welcome, of course!); until then, caches are your friend. :)_
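If cold-start compile time matters in your deployment, persisting the compile caches helps. Below is a minimal sketch, assuming a recent PyTorch whose Mega-Cache APIs (`torch.compiler.save_cache_artifacts` / `torch.compiler.load_cache_artifacts`) match the tutorial linked above; the cache directory and file name are illustrative:

```python
import os

# Point the Inductor cache at a persistent directory *before* importing torch,
# so per-machine compile artifacts survive restarts (illustrative path).
os.environ.setdefault("TORCHINDUCTOR_CACHE_DIR", "/cache/torchinductor")

import torch

# ... load the model, call moondream.compile(), and run one warm-up query ...

# Optionally bundle everything into portable "Mega-Cache" bytes that another
# machine with the same torch/GPU/model setup can preload:
artifacts = torch.compiler.save_cache_artifacts()
if artifacts is not None:
    artifact_bytes, cache_info = artifacts
    with open("moondream_compile_cache.bin", "wb") as f:
        f.write(artifact_bytes)

# On a later run, restore the artifacts before compiling:
# with open("moondream_compile_cache.bin", "rb") as f:
#     torch.compiler.load_cache_artifacts(f.read())
```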
## Quick Start (HuggingFace Style)

The easiest way to use Moondream3-4bit is via the HuggingFace Transformers API:

```python
import torch
from PIL import Image
from transformers import AutoModelForCausalLM

# Load the quantized model (same API as the original Moondream3-preview)
moondream = AutoModelForCausalLM.from_pretrained(
    "alecccdd/moondream3-preview-4bit",
    trust_remote_code=True,
    dtype=torch.bfloat16,
    device_map={"": "cuda"},
)
moondream.compile()  # Critical for fast inference

# Load an image
image = Image.open("photo.jpg")

# Ask a question
result = moondream.query(image=image, question="What's in this image?")
print(result["answer"])
```

## Alternative: Manual Loading

If you prefer more control, you can load the model directly (this assumes you are running from a local checkout of this repository, so that the `config`, `moondream`, and `weights` modules are importable):

```python
import torch
from PIL import Image

from config import MoondreamConfig
from moondream import MoondreamModel
from weights import load_weights

# Build the model skeleton and load the quantized weights from this repo
model = MoondreamModel(MoondreamConfig())
load_weights("./", model, device="cuda")
model.compile()  # Critical for fast inference

# Load an image
image = Image.open("photo.jpg")

# Ask a question
result = model.query(image=image, question="What's in this image?")
print(result["answer"])
```

## Skills

The API for all skills remains identical to the [original moondream3-preview model](https://huggingface.co/moondream/moondream3-preview); a short sketch of the remaining skills follows the license section below.

## License

This is a derivative work of Moondream 3 (Preview), which was originally released under the Business Source License 1.1.

Original Copyright (c) M87 Labs, Inc.

Quantization and conversion code: Copyright (c) 2025 Alicius Schröder
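For completeness, here is a hedged sketch of the remaining skills (captioning, object detection, pointing), assuming the quantized model keeps the method names and return keys documented on the original model card; `moondream` is the model loaded in the Quick Start above:

```python
from PIL import Image

image = Image.open("photo.jpg")

# Captioning ("short" and "normal" lengths, per the original card)
caption = moondream.caption(image, length="short")
print(caption["caption"])

# Open-vocabulary detection: bounding boxes for an object described in text
detection = moondream.detect(image, "red car")
print(detection["objects"])

# Pointing: center coordinates for an object described in text
points = moondream.point(image, "red car")
print(points["points"])
```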