Qwen3-VL-4B-Instruct-FP8-Dynamic

This is a quantized version of Qwen/Qwen3-VL-4B-Instruct, produced with SmoothQuant activation smoothing followed by FP8_DYNAMIC (W8A8) quantization of all text linear layers.

Quantization Strategy

Component                 Scheme               Details
------------------------  -------------------  ---------------------------------------
All text linear layers    FP8_DYNAMIC          W8A8 dynamic quantization
Vision encoder            BF16 (unquantized)   Full precision for visual understanding
LM head                   BF16 (unquantized)   Full precision for output quality
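
A quick way to verify this layout is to inspect the tensor dtypes stored in the checkpoint. A minimal sketch using the safetensors library (the single-file name below is an assumption; a sharded checkpoint would need the loop run per shard):

from safetensors import safe_open

# List each tensor's on-disk dtype: quantized linear weights should
# report F8_E4M3, while vision-tower and lm_head tensors stay BF16.
with safe_open("model.safetensors", framework="pt") as f:
    for name in f.keys():
        print(name, f.get_slice(name).get_dtype())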

SmoothQuant

Applied with smoothing strength 0.8 to migrate activation outliers into the weights before quantization. Each set of projections is smoothed against the preceding norm:

  • Q/K/V projections ← input_layernorm
  • Gate/Up projections ← post_attention_layernorm
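
For reference, the mappings above correspond to an llm-compressor recipe along these lines (a minimal sketch; the regex targets are assumptions based on Qwen3-VL's module naming):

from llmcompressor.modifiers.smoothquant import SmoothQuantModifier
from llmcompressor.modifiers.quantization import QuantizationModifier

recipe = [
    # Migrate activation outliers into the preceding norm weights
    # (smoothing strength 0.8, as described above).
    SmoothQuantModifier(
        smoothing_strength=0.8,
        mappings=[
            [["re:.*q_proj", "re:.*k_proj", "re:.*v_proj"], "re:.*input_layernorm"],
            [["re:.*gate_proj", "re:.*up_proj"], "re:.*post_attention_layernorm"],
        ],
    ),
    # FP8 weights plus dynamic per-token FP8 activations on text linears;
    # the vision tower and lm_head are left in BF16.
    QuantizationModifier(
        targets="Linear",
        scheme="FP8_DYNAMIC",
        ignore=["lm_head", "re:visual.*"],
    ),
]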

Model Details

  • Base Model: Qwen/Qwen3-VL-4B-Instruct (4.4B parameters)
  • Quantization Method: compressed-tensors (llm-compressor)
  • Model Size: ~5.6 GB (reduced from ~8.9 GB BF16)
  • Calibration: 512 samples from flickr30k, max_seq_length=2048
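
Put together, the calibration run would look roughly like this sketch (the dataset identifier and loading details are assumptions; `recipe` is the modifier list from the sketch above):

from llmcompressor import oneshot

# One-shot calibration: 512 flickr30k samples at max_seq_length=2048,
# matching the details listed above. `model` is the BF16 base model.
oneshot(
    model=model,
    dataset="flickr30k",   # assumed identifier; a custom loader also works
    recipe=recipe,
    max_seq_length=2048,
    num_calibration_samples=512,
)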

Usage with vLLM

Start an OpenAI-compatible server:

vllm serve JEILDLWLRMA/Qwen3-VL-4B-Instruct-FP8-Dynamic \
    --quantization compressed-tensors \
    --kv-cache-dtype fp8 \
    --gpu-memory-utilization 0.9 \
    --max-model-len 8192
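
Once the server is up, it can be queried through the OpenAI-compatible API. A minimal sketch of a multimodal request with the openai client (the image URL is a placeholder):

from openai import OpenAI

# vLLM listens on port 8000 by default.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="JEILDLWLRMA/Qwen3-VL-4B-Instruct-FP8-Dynamic",
    messages=[{
        "role": "user",
        "content": [
            {"type": "image_url", "image_url": {"url": "https://example.com/image.jpg"}},
            {"type": "text", "text": "Describe this image."},
        ],
    }],
)
print(response.choices[0].message.content)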
Offline inference with the Python API:

from vllm import LLM, SamplingParams

llm = LLM(
    model="JEILDLWLRMA/Qwen3-VL-4B-Instruct-FP8-Dynamic",
    quantization="compressed-tensors",
    trust_remote_code=True,
    kv_cache_dtype="fp8",
    max_model_len=8192,
)

sampling_params = SamplingParams(temperature=0.7, top_p=0.95)
prompts = ["Your prompt here"]
outputs = llm.generate(prompts, sampling_params)
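
The prompts above are text-only; for image inputs in offline mode, llm.chat applies the model's chat template, including the vision tokens. A minimal sketch (the image URL is a placeholder):

# Reuses `llm` and `sampling_params` from the block above.
messages = [{
    "role": "user",
    "content": [
        {"type": "image_url", "image_url": {"url": "https://example.com/image.jpg"}},
        {"type": "text", "text": "What is in this image?"},
    ],
}]
outputs = llm.chat(messages, sampling_params)
print(outputs[0].outputs[0].text)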

Usage with Transformers

from transformers import AutoProcessor, Qwen3VLForConditionalGeneration
import torch

model_id = "JEILDLWLRMA/Qwen3-VL-4B-Instruct-FP8-Dynamic"
processor = AutoProcessor.from_pretrained(model_id)
model = Qwen3VLForConditionalGeneration.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    device_map="auto",
    trust_remote_code=True,
)
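
The snippet above only loads the model; generation follows the base model's usual pattern, roughly as in this sketch (the image URL is a placeholder):

messages = [{
    "role": "user",
    "content": [
        {"type": "image", "url": "https://example.com/image.jpg"},
        {"type": "text", "text": "Describe this image."},
    ],
}]

# apply_chat_template handles both the text template and image features.
inputs = processor.apply_chat_template(
    messages,
    add_generation_prompt=True,
    tokenize=True,
    return_dict=True,
    return_tensors="pt",
).to(model.device)

output_ids = model.generate(**inputs, max_new_tokens=128)
print(processor.batch_decode(
    output_ids[:, inputs["input_ids"].shape[1]:], skip_special_tokens=True
)[0])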

License

Apache 2.0, same as the base model.
