---
tags:
- text-embeddings-inference
- onnx
- tei
- int8
- feature-extraction
- sentence-transformers
- qwen
library_name: transformers
pipeline_tag: feature-extraction
base_model: Qwen/Qwen3-Embedding-0.6B
language:
- en
- zh
- de
license: apache-2.0
---

# Qwen3-Embedding-0.6B (ONNX INT8)

This repository contains a **quantized (INT8)**, **ONNX-exported** version of [Qwen/Qwen3-Embedding-0.6B](https://huggingface.co/Qwen/Qwen3-Embedding-0.6B). It is optimized for high-throughput CPU inference using Hugging Face's [Text Embeddings Inference (TEI)](https://github.com/huggingface/text-embeddings-inference) or the `optimum` library.

## Model Details

| Attribute | Detail |
| :--- | :--- |
| **Base Model** | Qwen/Qwen3-Embedding-0.6B |
| **Format** | ONNX (Opset 17) |
| **Quantization** | INT8 (AVX2 optimized) |
| **Task** | Feature Extraction / Semantic Embedding |
| **File Size** | ~0.6 GB (vs. ~1.2 GB original) |

## Usage with Text Embeddings Inference (TEI)

This model is pre-configured for TEI and can be run directly with Docker.

**Note:** `--auto-truncate` is recommended: the model supports a 32k context, but the TEI server enforces much smaller batch/input token limits by default, so over-long inputs would otherwise be rejected instead of truncated.

### Option A: Docker CLI

```bash
docker run --rm -p 8080:80 \
  -v $PWD/data:/data \
  ghcr.io/huggingface/text-embeddings-inference:cpu-latest \
  --model-id Svenni551/Qwen3-Embedding-0.6B-ONNX-INT8 \
  --pooling mean \
  --auto-truncate
```

### Option B: Docker Compose

```yaml
services:
  embedding-service:
    image: ghcr.io/huggingface/text-embeddings-inference:cpu-latest
    environment:
      - MODEL_ID=Svenni551/Qwen3-Embedding-0.6B-ONNX-INT8
      - POOLING=mean
      - MAX_CLIENT_BATCH_SIZE=8
      - MAX_BATCH_TOKENS=2048
      - AUTO_TRUNCATE=true
    volumes:
      - ./data:/data
    ports:
      - "8080:80"
```

## Usage with Python (Optimum)

```bash
pip install optimum[onnxruntime] transformers
```

```python
from optimum.onnxruntime import ORTModelForFeatureExtraction
from transformers import AutoTokenizer
import torch

model_id = "Svenni551/Qwen3-Embedding-0.6B-ONNX-INT8"

# Load tokenizer and ONNX model
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = ORTModelForFeatureExtraction.from_pretrained(model_id)

# Input text
sentences = ["This is an example sentence.", "Qwen3 is a powerful model."]

# Tokenize
inputs = tokenizer(sentences, padding=True, truncation=True, return_tensors="pt")

# Perform inference
with torch.no_grad():
    outputs = model(**inputs)

# Mean pooling over token embeddings, ignoring padding
attention_mask = inputs["attention_mask"]
token_embeddings = outputs.last_hidden_state
input_mask_expanded = attention_mask.unsqueeze(-1).expand(token_embeddings.size()).float()
embeddings = torch.sum(token_embeddings * input_mask_expanded, 1) / torch.clamp(input_mask_expanded.sum(1), min=1e-9)

print(f"Embeddings shape: {embeddings.shape}")
```
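
The pooled embeddings above are not L2-normalized. If you want cosine similarity between sentences (the usual choice for sentence-transformers-style retrieval), here is a minimal sketch continuing the Python example above:

```python
import torch.nn.functional as F

# L2-normalize so that a dot product equals cosine similarity
embeddings = F.normalize(embeddings, p=2, dim=1)

# Pairwise cosine similarity between the two example sentences
similarity = embeddings @ embeddings.T
print(similarity)
```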
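
## Querying the TEI Endpoint

With one of the TEI containers above running, embeddings can be requested over HTTP via TEI's `/embed` route. Below is a minimal sketch using the `requests` library (assumed installed via `pip install requests`; adjust host and port to your deployment):

```python
import requests  # assumed installed: pip install requests

# Query the TEI container started in the Docker examples above
response = requests.post(
    "http://localhost:8080/embed",
    json={"inputs": ["This is an example sentence.", "Qwen3 is a powerful model."]},
)
response.raise_for_status()

# TEI returns one embedding vector per input string
embeddings = response.json()
print(len(embeddings), len(embeddings[0]))
```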