Qwen3-VL-4B-Instruct-FP8-Dynamic

This is a quantized version of Qwen/Qwen3-VL-4B-Instruct, produced with SmoothQuant activation smoothing followed by FP8_DYNAMIC (W8A8) quantization of all text linear layers.

Quantization Strategy

Component                 Scheme               Details
------------------------  -------------------  ---------------------------------------
All text linear layers    FP8_DYNAMIC          W8A8 dynamic quantization
Vision encoder            BF16 (unquantized)   Full precision for visual understanding
LM head                   BF16 (unquantized)   Full precision for output quality
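
A quick way to verify this layout is to inspect the tensor dtypes stored in the checkpoint. A minimal sketch using the safetensors library (the single-file name below is an assumption; a sharded checkpoint would need the loop run per shard):

from safetensors import safe_open

# List each tensor's on-disk dtype: quantized linear weights should
# report F8_E4M3, while vision-tower and lm_head tensors stay BF16.
with safe_open("model.safetensors", framework="pt") as f:
    for name in f.keys():
        print(name, f.get_slice(name).get_dtype())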

SmoothQuant

Applied with smoothing strength 0.8 to migrate activation outliers into the weights before quantization. Each set of projections is smoothed against the preceding norm:

  • Q/K/V projections ← input_layernorm
  • Gate/Up projections ← post_attention_layernorm
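
For reference, the mappings above correspond to an llm-compressor recipe along these lines (a minimal sketch; the regex targets are assumptions based on Qwen3-VL's module naming):

from llmcompressor.modifiers.smoothquant import SmoothQuantModifier
from llmcompressor.modifiers.quantization import QuantizationModifier

recipe = [
    # Migrate activation outliers into the preceding norm weights
    # (smoothing strength 0.8, as described above).
    SmoothQuantModifier(
        smoothing_strength=0.8,
        mappings=[
            [["re:.*q_proj", "re:.*k_proj", "re:.*v_proj"], "re:.*input_layernorm"],
            [["re:.*gate_proj", "re:.*up_proj"], "re:.*post_attention_layernorm"],
        ],
    ),
    # FP8 weights plus dynamic per-token FP8 activations on text linears;
    # the vision tower and lm_head are left in BF16.
    QuantizationModifier(
        targets="Linear",
        scheme="FP8_DYNAMIC",
        ignore=["lm_head", "re:visual.*"],
    ),
]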

Model Details

  • Base Model: Qwen/Qwen3-VL-4B-Instruct (4.4B parameters)
  • Quantization Method: compressed-tensors (llm-compressor)
  • Model Size: ~5.6 GB (reduced from ~8.9 GB BF16)
  • Calibration: 512 samples from flickr30k, max_seq_length=2048
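
Put together, the calibration run would look roughly like this sketch (the dataset identifier and loading details are assumptions; `recipe` is the modifier list from the sketch above):

from llmcompressor import oneshot

# One-shot calibration: 512 flickr30k samples at max_seq_length=2048,
# matching the details listed above. `model` is the BF16 base model.
oneshot(
    model=model,
    dataset="flickr30k",   # assumed identifier; a custom loader also works
    recipe=recipe,
    max_seq_length=2048,
    num_calibration_samples=512,
)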

Usage with vLLM

Start an OpenAI-compatible server:

vllm serve JEILDLWLRMA/Qwen3-VL-4B-Instruct-FP8-Dynamic \
    --quantization compressed-tensors \
    --kv-cache-dtype fp8 \
    --gpu-memory-utilization 0.9 \
    --max-model-len 8192
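
Once the server is up, it can be queried through the OpenAI-compatible API. A minimal sketch of a multimodal request with the openai client (the image URL is a placeholder):

from openai import OpenAI

# vLLM listens on port 8000 by default.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="JEILDLWLRMA/Qwen3-VL-4B-Instruct-FP8-Dynamic",
    messages=[{
        "role": "user",
        "content": [
            {"type": "image_url", "image_url": {"url": "https://example.com/image.jpg"}},
            {"type": "text", "text": "Describe this image."},
        ],
    }],
)
print(response.choices[0].message.content)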
Offline inference with the Python API:

from vllm import LLM, SamplingParams

llm = LLM(
    model="JEILDLWLRMA/Qwen3-VL-4B-Instruct-FP8-Dynamic",
    quantization="compressed-tensors",
    trust_remote_code=True,
    kv_cache_dtype="fp8",
    max_model_len=8192,
)

sampling_params = SamplingParams(temperature=0.7, top_p=0.95)
prompts = ["Your prompt here"]
outputs = llm.generate(prompts, sampling_params)
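
The prompts above are text-only; for image inputs in offline mode, llm.chat applies the model's chat template, including the vision tokens. A minimal sketch (the image URL is a placeholder):

# Reuses `llm` and `sampling_params` from the block above.
messages = [{
    "role": "user",
    "content": [
        {"type": "image_url", "image_url": {"url": "https://example.com/image.jpg"}},
        {"type": "text", "text": "What is in this image?"},
    ],
}]
outputs = llm.chat(messages, sampling_params)
print(outputs[0].outputs[0].text)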

Usage with Transformers

from transformers import AutoProcessor, Qwen3VLForConditionalGeneration
import torch

model_id = "JEILDLWLRMA/Qwen3-VL-4B-Instruct-FP8-Dynamic"
processor = AutoProcessor.from_pretrained(model_id)
model = Qwen3VLForConditionalGeneration.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    device_map="auto",
    trust_remote_code=True,
)
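
The snippet above only loads the model; generation follows the base model's usual pattern, roughly as in this sketch (the image URL is a placeholder):

messages = [{
    "role": "user",
    "content": [
        {"type": "image", "url": "https://example.com/image.jpg"},
        {"type": "text", "text": "Describe this image."},
    ],
}]

# apply_chat_template handles both the text template and image features.
inputs = processor.apply_chat_template(
    messages,
    add_generation_prompt=True,
    tokenize=True,
    return_dict=True,
    return_tensors="pt",
).to(model.device)

output_ids = model.generate(**inputs, max_new_tokens=128)
print(processor.batch_decode(
    output_ids[:, inputs["input_ids"].shape[1]:], skip_special_tokens=True
)[0])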

License

Apache 2.0, same as the base model.
