update readme

README.md

MiniCPM-SALA is an efficient hybrid model in which 25% of the layers adopt [InfL
- **Efficient Inference on Long Sequences**

  Achieves up to 3.5x the inference speed of Qwen3-8B at a sequence length of 256K tokens on an A6000D, and supports inference at context lengths of up to 1M tokens on both NVIDIA A6000D and 5090 GPUs, whereas Qwen3-8B fails at this length due to out-of-memory (OOM) errors.
## Inference

To achieve optimal performance, we recommend using `Temperature=0.9`.
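The temperature setting divides the logits before softmax sampling, so values below 1.0 sharpen the distribution while still allowing some randomness. A minimal NumPy sketch of what `Temperature=0.9` does during decoding (the function name and toy logits are illustrative, not part of the model's API):

```python
import numpy as np

def sample_with_temperature(logits, temperature=0.9, seed=0):
    """Divide logits by the temperature, softmax, then sample one token id."""
    rng = np.random.default_rng(seed)
    scaled = np.asarray(logits, dtype=np.float64) / temperature
    scaled -= scaled.max()          # subtract the max for numerical stability
    probs = np.exp(scaled)
    probs /= probs.sum()
    return int(rng.choice(len(probs), p=probs))

# Sample one token id from three toy logits at temperature 0.9.
token_id = sample_with_temperature([2.0, 1.0, 0.5], temperature=0.9)
```

In practice you would pass `temperature=0.9` to your inference framework's generation config rather than sampling by hand; this sketch only shows the effect of the knob.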
### HuggingFace