---
library_name: transformers
pipeline_tag: image-text-to-text
license: other
base_model:
- moondream/moondream3-preview
---

# Moondream 3 (Preview) 4-Bit

![4bit-efficiency-gains-and-performance-tradeoffs](https://cdn-uploads.huggingface.co/production/uploads/6557b9b4deee83130ac92941/_puZs7EqffYYMFaNpaxsu.jpeg)

**Moondream 3 (Preview) 4-Bit** is the INT4 quantized version of [Moondream3-Preview](https://huggingface.co/moondream/moondream3-preview), reducing model size from ~18 GB to ~6 GB (a ~66% reduction) and allowing it to run in <12 GB VRAM environments while mostly maintaining quality.

This is a vision language model with a mixture-of-experts architecture (9B total parameters, 2B active), now optimized for deployment with as little as 8 GB VRAM.

## Features

- **66% smaller**: ~6 GB vs ~18 GB original
- **Lower memory**: runs in ~7.3 GB VRAM (vs ~19.6 GB for FP16)
- **Same capabilities**: retains the original Moondream3 skills & API
- **Minimal quality loss**: ~2-5% degradation on benchmarks
- **HuggingFace compatible**: load with `AutoModelForCausalLM.from_pretrained()`

## VRAM & Time Savings

| Configuration   | Model Size | VRAM Usage | s/query* |
|-----------------|------------|------------|----------|
| FP16 (original) | 18.5 GB    | 19,594 MiB | 4.19     |
| INT4 (this one) | 6.18 GB    | 7,332 MiB  | 2.65     |
| Reduction       | **66%**    | **62%**    | **37%**  |

_(* averaged over the vision-ai-checkup & CountBenchQA benchmarks on an L40S GPU)_

## Evaluation Results

| Test              | Time (4-bit) | Accuracy (4-bit) | Time (base) | Accuracy (base) |
|-------------------|--------------|------------------|-------------|-----------------|
| vision-ai-checkup | **156 s**    | 42.8%            | 223 s       | **47.2%**       |
| CountBenchQA      | **22.9 min** | 91.2%            | 36.6 min    | **93.2%**       |

![4bit-vs-base-evaluation-results](https://cdn-uploads.huggingface.co/production/uploads/6557b9b4deee83130ac92941/s3m_TiW0ASZ6jVGHSivdL.webp)

## Architecture

**Quantized Components (INT4):**

- Text attention QKV/projection layers
- Dense MLP layers (layers 0-3)
- MoE expert weights (layers 4-23, 64 experts each)
- Region model encoder/decoder

**Preserved in FP16:**

- Vision encoder (SigLIP)
- MoE routers (critical for expert selection)
- Temperature (tau) parameters
- LayerNorms, embeddings, LM head

![moondream3-preview-4bit-visualization](https://cdn-uploads.huggingface.co/production/uploads/6557b9b4deee83130ac92941/8hYfJv76Q605JhTA6-qOg.jpeg)

**Slow First-Time Compile and Inference**

_A note on first-time compilation: due to the MoE architecture and the nature of INT4 quants, I had to do some voodoo to get input-invariant compilation graphs for both execution paths (T=1 and T>1). This results in a longer first-time compilation (1-3 minutes for me) compared to the original Moondream3-preview model (~30 seconds). Torch's end-to-end caching (also known as Mega-Cache) makes subsequent compilations on the same machine much faster, provided it is [correctly configured](https://docs.pytorch.org/tutorials/recipes/torch_compile_caching_tutorial.html). I'll remove this note once I find a faster solution, in case that's possible (contributions always welcome, of course!); until then, caches are your friend. :)_
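If cold-start compile time matters in your deployment, persisting the compile caches helps. Below is a minimal sketch, assuming a recent PyTorch whose Mega-Cache APIs (`torch.compiler.save_cache_artifacts` / `torch.compiler.load_cache_artifacts`) match the tutorial linked above; the cache directory and file name are illustrative:

```python
import os

# Point the Inductor cache at a persistent directory *before* importing torch,
# so per-machine compile artifacts survive restarts (illustrative path).
os.environ.setdefault("TORCHINDUCTOR_CACHE_DIR", "/cache/torchinductor")

import torch

# ... load the model, call moondream.compile(), and run one warm-up query ...

# Optionally bundle everything into portable "Mega-Cache" bytes that another
# machine with the same torch/GPU/model setup can preload:
artifacts = torch.compiler.save_cache_artifacts()
if artifacts is not None:
    artifact_bytes, cache_info = artifacts
    with open("moondream_compile_cache.bin", "wb") as f:
        f.write(artifact_bytes)

# On a later run, restore the artifacts before compiling:
# with open("moondream_compile_cache.bin", "rb") as f:
#     torch.compiler.load_cache_artifacts(f.read())
```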
## Quick Start (HuggingFace Style)

The easiest way to use Moondream3-4bit is via the HuggingFace Transformers API:

```python
import torch
from PIL import Image
from transformers import AutoModelForCausalLM

# Load the quantized model (same API as the original Moondream3-preview)
moondream = AutoModelForCausalLM.from_pretrained(
    "alecccdd/moondream3-preview-4bit",
    trust_remote_code=True,
    dtype=torch.bfloat16,
    device_map={"": "cuda"},
)
moondream.compile()  # Critical for fast inference

# Load an image
image = Image.open("photo.jpg")

# Ask a question
result = moondream.query(image=image, question="What's in this image?")
print(result["answer"])
```

## Alternative: Manual Loading

If you prefer more control, you can load the model directly (this assumes you are running from a local checkout of this repository, so that the `config`, `moondream`, and `weights` modules are importable):

```python
import torch
from PIL import Image

from config import MoondreamConfig
from moondream import MoondreamModel
from weights import load_weights

# Build the model skeleton and load the quantized weights from this repo
model = MoondreamModel(MoondreamConfig())
load_weights("./", model, device="cuda")
model.compile()  # Critical for fast inference

# Load an image
image = Image.open("photo.jpg")

# Ask a question
result = model.query(image=image, question="What's in this image?")
print(result["answer"])
```

## Skills

The API for all skills remains identical to the [original moondream3-preview model](https://huggingface.co/moondream/moondream3-preview); a short sketch of the remaining skills follows the license section below.

## License

This is a derivative work of Moondream 3 (Preview), which was originally released under the Business Source License 1.1.

Original Copyright (c) M87 Labs, Inc.

Quantization and conversion code: Copyright (c) 2025 Alicius Schröder
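For completeness, here is a hedged sketch of the remaining skills (captioning, object detection, pointing), assuming the quantized model keeps the method names and return keys documented on the original model card; `moondream` is the model loaded in the Quick Start above:

```python
from PIL import Image

image = Image.open("photo.jpg")

# Captioning ("short" and "normal" lengths, per the original card)
caption = moondream.caption(image, length="short")
print(caption["caption"])

# Open-vocabulary detection: bounding boxes for an object described in text
detection = moondream.detect(image, "red car")
print(detection["objects"])

# Pointing: center coordinates for an object described in text
points = moondream.point(image, "red car")
print(points["points"])
```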