Svenni551 committed
Commit 5a87674 · verified · Parent: 89856bf

Update README.md

Files changed (1): README.md (+4, -29)
README.md CHANGED
@@ -35,7 +35,8 @@ It is optimized for high-throughput CPU inference using Hugging Face's [Text Emb

 ## Usage with Text Embeddings Inference (TEI)

-This model is pre-configured for TEI.
+This model is pre-configured for TEI. You can run it directly using Docker.
+**Note:** `auto-truncate` is required because the model supports a 32k context, while TEI's default batch limits are smaller.

 ### Option A: Docker CLI

@@ -50,8 +51,6 @@ docker run --rm -p 8080:80 \

 ### Option B: Docker Compose

-Use this configuration to integrate the model into your stack:
-
 ```yaml
 services:
   embedding-service:
@@ -68,27 +67,12 @@ services:
       - "8080:80"
 ```

-### API Request Example
-
-Once the container is running, you can generate embeddings via the HTTP API:
-
-```bash
-curl 127.0.0.1:8080/embed \
-    -X POST \
-    -d '{"inputs":"Deep learning is a subset of machine learning."}' \
-    -H 'Content-Type: application/json'
-```
-
 ## Usage with Python (Optimum)

-You can also run this model locally using the `optimum` library with ONNX Runtime.
-
-**Installation:**
 ```bash
 pip install optimum[onnxruntime] transformers
 ```

-**Inference Code:**
 ```python
 from optimum.onnxruntime import ORTModelForFeatureExtraction
 from transformers import AutoTokenizer
@@ -110,20 +94,11 @@ inputs = tokenizer(sentences, padding=True, truncation=True, return_tensors="pt"
 with torch.no_grad():
     outputs = model(**inputs)

-# Mean Pooling (getting the sentence embeddings)
-# Attention mask is needed to exclude padding tokens from the average
+# Mean Pooling
 attention_mask = inputs['attention_mask']
 token_embeddings = outputs.last_hidden_state
 input_mask_expanded = attention_mask.unsqueeze(-1).expand(token_embeddings.size()).float()
 embeddings = torch.sum(token_embeddings * input_mask_expanded, 1) / torch.clamp(input_mask_expanded.sum(1), min=1e-9)

 print(f"Embeddings shape: {embeddings.shape}")
-# Output: torch.Size([2, 1024]) (Dimension depends on specific Qwen model config)
-```
-
-## Performance Comparison
-
-By converting to ONNX and quantizing to INT8, this model achieves significantly lower latency and reduced memory footprint compared to the original PyTorch model, with minimal impact on embedding quality.
-
-- **Memory Usage:** Reduced by approximately 50%.
-- **Inference Speed:** Up to 3x-5x faster on modern CPUs (depending on batch size and sequence length).
+```
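The `auto-truncate` behavior mentioned in the new note is enabled by passing a startup flag to the TEI router when the container is launched. Below is a minimal sketch of how that flag attaches to a command like the README's Option A; it assumes the official TEI CPU image, and `<your-model-id>` is an illustrative placeholder rather than this repository's actual id, so the README's own Option A command remains authoritative.

```bash
# Sketch only: arguments after the image name are forwarded to the TEI router.
# <your-model-id> is a placeholder for this repository's model id on the Hub.
docker run --rm -p 8080:80 \
  -v $PWD/tei-data:/data \
  ghcr.io/huggingface/text-embeddings-inference:cpu-latest \
  --model-id <your-model-id> \
  --auto-truncate
```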