# Qwen3-Embedding-0.6B (ONNX INT8)

This repository contains an INT8-quantized, ONNX-exported version of Qwen/Qwen3-Embedding-0.6B.

It is optimized for high-throughput CPU inference with Hugging Face's Text Embeddings Inference (TEI) or the Optimum library.

## Model Details

| Attribute | Detail |
|---|---|
| Base Model | Qwen/Qwen3-Embedding-0.6B |
| Format | ONNX (Opset 17) |
| Quantization | INT8 (AVX2 optimized) |
| Task | Feature Extraction / Semantic Embedding |
| File Size | ~0.6 GB (vs. ~1.2 GB original) |

## Usage with Text Embeddings Inference (TEI)

This model is pre-configured for TEI and can be run directly with Docker. Note: `--auto-truncate` is required because the model supports a 32k context while TEI's default token limits are much smaller; without it, over-length inputs are rejected instead of being truncated.

### Option A: Docker CLI

```bash
docker run --rm -p 8080:80 \
    -v $PWD/data:/data \
    ghcr.io/huggingface/text-embeddings-inference:cpu-latest \
    --model-id Svenni551/Qwen3-Embedding-0.6B-ONNX-INT8 \
    --pooling mean \
    --auto-truncate
```
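
Once the container is up, you can request embeddings over HTTP via TEI's `/embed` endpoint. A minimal sketch using Python's `requests` library (assuming the service is reachable at `localhost:8080`, as mapped above):

```python
import requests

# POST to the TEI /embed endpoint (port 8080 is mapped in the docker run above).
resp = requests.post(
    "http://localhost:8080/embed",
    json={"inputs": ["This is an example sentence.", "Qwen3 is a powerful model."]},
)
resp.raise_for_status()

embeddings = resp.json()  # one float vector per input
print(len(embeddings), len(embeddings[0]))
```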

### Option B: Docker Compose

```yaml
services:
  embedding-service:
    image: ghcr.io/huggingface/text-embeddings-inference:cpu-latest
    environment:
      - MODEL_ID=Svenni551/Qwen3-Embedding-0.6B-ONNX-INT8
      - POOLING=mean
      - MAX_CLIENT_BATCH_SIZE=8
      - MAX_BATCH_TOKENS=2048
      - AUTO_TRUNCATE=true
    volumes:
      - ./data:/data
    ports:
      - "8080:80"
```

## Usage with Python (Optimum)

Install the dependencies:

```bash
pip install "optimum[onnxruntime]" transformers
```
```python
from optimum.onnxruntime import ORTModelForFeatureExtraction
from transformers import AutoTokenizer
import torch

model_id = "Svenni551/Qwen3-Embedding-0.6B-ONNX-INT8"

# Load tokenizer and ONNX model
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = ORTModelForFeatureExtraction.from_pretrained(model_id)

# Input text
sentences = ["This is an example sentence.", "Qwen3 is a powerful model."]

# Tokenize
inputs = tokenizer(sentences, padding=True, truncation=True, return_tensors="pt")

# Perform inference
with torch.no_grad():
    outputs = model(**inputs)

# Mean pooling: average token embeddings, masking out padding positions
attention_mask = inputs["attention_mask"]
token_embeddings = outputs.last_hidden_state
input_mask_expanded = attention_mask.unsqueeze(-1).expand(token_embeddings.size()).float()
embeddings = torch.sum(token_embeddings * input_mask_expanded, 1) / torch.clamp(
    input_mask_expanded.sum(1), min=1e-9
)

print(f"Embeddings shape: {embeddings.shape}")
```
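
For retrieval or semantic similarity, embeddings are typically L2-normalized so that dot products equal cosine similarity. A short follow-on sketch, continuing from the variables above:

```python
import torch.nn.functional as F

# L2-normalize so that dot products equal cosine similarity
embeddings = F.normalize(embeddings, p=2, dim=1)
similarity = embeddings[0] @ embeddings[1]
print(f"Cosine similarity: {similarity.item():.4f}")
```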