Svenni551 committed
Commit 5a87674 · verified · Parent: 89856bf

Update README.md

Files changed (1): README.md (+4, -29)
README.md CHANGED
@@ -35,7 +35,8 @@ It is optimized for high-throughput CPU inference using Hugging Face's [Text Emb

 ## Usage with Text Embeddings Inference (TEI)

-This model is pre-configured for TEI.
+This model is pre-configured for TEI. You can run it directly using Docker.
+**Note:** `auto-truncate` is required because the model supports a 32k context, while TEI's default batch limits are smaller.

 ### Option A: Docker CLI

@@ -50,8 +51,6 @@ docker run --rm -p 8080:80 \

 ### Option B: Docker Compose

-Use this configuration to integrate the model into your stack:
-
 ```yaml
 services:
   embedding-service:
@@ -68,27 +67,12 @@ services:
       - "8080:80"
 ```

-### API Request Example
-
-Once the container is running, you can generate embeddings via the HTTP API:
-
-```bash
-curl 127.0.0.1:8080/embed \
-    -X POST \
-    -d '{"inputs":"Deep learning is a subset of machine learning."}' \
-    -H 'Content-Type: application/json'
-```
-
 ## Usage with Python (Optimum)

-You can also run this model locally using the `optimum` library with ONNX Runtime.
-
-**Installation:**
 ```bash
 pip install optimum[onnxruntime] transformers
 ```

-**Inference Code:**
 ```python
 from optimum.onnxruntime import ORTModelForFeatureExtraction
 from transformers import AutoTokenizer
@@ -110,20 +94,11 @@ inputs = tokenizer(sentences, padding=True, truncation=True, return_tensors="pt"
 with torch.no_grad():
     outputs = model(**inputs)

-# Mean Pooling (getting the sentence embeddings)
-# Attention mask is needed to exclude padding tokens from the average
+# Mean Pooling
 attention_mask = inputs['attention_mask']
 token_embeddings = outputs.last_hidden_state
 input_mask_expanded = attention_mask.unsqueeze(-1).expand(token_embeddings.size()).float()
 embeddings = torch.sum(token_embeddings * input_mask_expanded, 1) / torch.clamp(input_mask_expanded.sum(1), min=1e-9)

 print(f"Embeddings shape: {embeddings.shape}")
-# Output: torch.Size([2, 1024]) (Dimension depends on specific Qwen model config)
-```
-
-## Performance Comparison
-
-By converting to ONNX and quantizing to INT8, this model achieves significantly lower latency and reduced memory footprint compared to the original PyTorch model, with minimal impact on embedding quality.
-
-- **Memory Usage:** Reduced by approximately 50%.
-- **Inference Speed:** Up to 3x-5x faster on modern CPUs (depending on batch size and sequence length).
+```
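The `auto-truncate` behavior mentioned in the new note is enabled by passing a startup flag to the TEI router when the container is launched. Below is a minimal sketch of how that flag attaches to a command like the README's Option A; it assumes the official TEI CPU image, and `<your-model-id>` is an illustrative placeholder rather than this repository's actual id, so the README's own Option A command remains authoritative.

```bash
# Sketch only: arguments after the image name are forwarded to the TEI router.
# <your-model-id> is a placeholder for this repository's model id on the Hub.
docker run --rm -p 8080:80 \
  -v $PWD/tei-data:/data \
  ghcr.io/huggingface/text-embeddings-inference:cpu-latest \
  --model-id <your-model-id> \
  --auto-truncate
```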