---
tags:
- text-embeddings-inference
- onnx
- tei
- int8
- feature-extraction
- sentence-transformers
- qwen
library_name: transformers
pipeline_tag: feature-extraction
base_model: Qwen/Qwen3-Embedding-0.6B
language:
- en
- zh
- de
license: apache-2.0
---

# Qwen3-Embedding-0.6B (ONNX INT8)

This repository contains a **quantized (INT8)**, **ONNX-exported** version of [Qwen/Qwen3-Embedding-0.6B](https://huggingface.co/Qwen/Qwen3-Embedding-0.6B). It is optimized for high-throughput CPU inference using Hugging Face's [Text Embeddings Inference (TEI)](https://github.com/huggingface/text-embeddings-inference) or the `optimum` library.

## Model Details

| Attribute | Detail |
| :--- | :--- |
| **Base Model** | Qwen/Qwen3-Embedding-0.6B |
| **Format** | ONNX (Opset 17) |
| **Quantization** | INT8 (AVX2 optimized) |
| **Task** | Feature Extraction / Semantic Embedding |
| **File Size** | ~0.6 GB (vs. ~1.2 GB original) |

## Usage with Text Embeddings Inference (TEI)

This model is pre-configured for TEI and can be run directly with Docker.

**Note:** `--auto-truncate` is recommended: the model supports a 32k context, but the TEI server enforces much smaller batch/input token limits by default, so over-long inputs would otherwise be rejected instead of truncated.

### Option A: Docker CLI

```bash
docker run --rm -p 8080:80 \
  -v $PWD/data:/data \
  ghcr.io/huggingface/text-embeddings-inference:cpu-latest \
  --model-id Svenni551/Qwen3-Embedding-0.6B-ONNX-INT8 \
  --pooling mean \
  --auto-truncate
```

### Option B: Docker Compose

```yaml
services:
  embedding-service:
    image: ghcr.io/huggingface/text-embeddings-inference:cpu-latest
    environment:
      - MODEL_ID=Svenni551/Qwen3-Embedding-0.6B-ONNX-INT8
      - POOLING=mean
      - MAX_CLIENT_BATCH_SIZE=8
      - MAX_BATCH_TOKENS=2048
      - AUTO_TRUNCATE=true
    volumes:
      - ./data:/data
    ports:
      - "8080:80"
```

## Usage with Python (Optimum)

```bash
pip install optimum[onnxruntime] transformers
```

```python
from optimum.onnxruntime import ORTModelForFeatureExtraction
from transformers import AutoTokenizer
import torch

model_id = "Svenni551/Qwen3-Embedding-0.6B-ONNX-INT8"

# Load tokenizer and ONNX model
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = ORTModelForFeatureExtraction.from_pretrained(model_id)

# Input text
sentences = ["This is an example sentence.", "Qwen3 is a powerful model."]

# Tokenize
inputs = tokenizer(sentences, padding=True, truncation=True, return_tensors="pt")

# Perform inference
with torch.no_grad():
    outputs = model(**inputs)

# Mean pooling over token embeddings, ignoring padding
attention_mask = inputs["attention_mask"]
token_embeddings = outputs.last_hidden_state
input_mask_expanded = attention_mask.unsqueeze(-1).expand(token_embeddings.size()).float()
embeddings = torch.sum(token_embeddings * input_mask_expanded, 1) / torch.clamp(input_mask_expanded.sum(1), min=1e-9)

print(f"Embeddings shape: {embeddings.shape}")
```
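
The pooled embeddings above are not L2-normalized. If you want cosine similarity between sentences (the usual choice for sentence-transformers-style retrieval), here is a minimal sketch continuing the Python example above:

```python
import torch.nn.functional as F

# L2-normalize so that a dot product equals cosine similarity
embeddings = F.normalize(embeddings, p=2, dim=1)

# Pairwise cosine similarity between the two example sentences
similarity = embeddings @ embeddings.T
print(similarity)
```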
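
## Querying the TEI Endpoint

With one of the TEI containers above running, embeddings can be requested over HTTP via TEI's `/embed` route. Below is a minimal sketch using the `requests` library (assumed installed via `pip install requests`; adjust host and port to your deployment):

```python
import requests  # assumed installed: pip install requests

# Query the TEI container started in the Docker examples above
response = requests.post(
    "http://localhost:8080/embed",
    json={"inputs": ["This is an example sentence.", "Qwen3 is a powerful model."]},
)
response.raise_for_status()

# TEI returns one embedding vector per input string
embeddings = response.json()
print(len(embeddings), len(embeddings[0]))
```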