Update README.md

README.md CHANGED

@@ -106,26 +106,21 @@ For deployment, you can use the latest `sglang` or `vllm` to create an OpenAI-co
 [SGLang](https://github.com/sgl-project/sglang) is a fast serving framework for large language models and vision language models.
 SGLang can be used to launch a server with an OpenAI-compatible API service.
 
-`sglang>=
+`sglang>=v0.5.8` is required for Qwen3-Coder-Next, which can be installed using:
 ```shell
-pip install 'sglang[all]>=
+pip install 'sglang[all]>=v0.5.8'
 ```
 See [its documentation](https://docs.sglang.ai/get_started/install.html) for more details.
 
-The following command can be used to create an API endpoint at `http://localhost:30000/v1` with maximum context length 256K tokens using tensor parallel on 4 GPUs.
+The following command can be used to create an API endpoint at `http://localhost:30000/v1` with maximum context length 256K tokens using tensor parallel on 2 GPUs.
 ```shell
-python -m sglang.launch_server --model-path Qwen/Qwen3-Next-80B-A3B-Instruct --port 30000 --tp-size 4 --context-length 262144 --mem-fraction-static 0.8
+python -m sglang.launch_server --model Qwen/Qwen3-Coder-Next --tp-size 2 --tool-call-parser qwen3_coder
 ```
 
-The following command is recommended for MTP with the rest settings the same as above:
-```shell
-python -m sglang.launch_server --model-path Qwen/Qwen3-Next-80B-A3B-Instruct --port 30000 --tp-size 4 --context-length 262144 --mem-fraction-static 0.8 --speculative-algo NEXTN --speculative-num-steps 3 --speculative-eagle-topk 1 --speculative-num-draft-tokens 4
-```
 
 > [!Note]
 > The default context length is 256K. Consider reducing the context length to a smaller value, e.g., `32768`, if the server fails to start.
 
-Please also refer to SGLang's usage guide on [Qwen3-Next](https://docs.sglang.ai/basic_usage/qwen3.html).
 
 ### vLLM
 
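To confirm that the installed build satisfies the `sglang>=v0.5.8` requirement above, a minimal check (assuming sglang was installed via pip as shown) is:

```shell
# Print metadata for the installed sglang package; the Version line
# should read 0.5.8 or newer for Qwen3-Coder-Next.
pip show sglang | grep -i '^version'
```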
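Once the server is running, the OpenAI-compatible endpoint at `http://localhost:30000/v1` can be smoke-tested over plain HTTP. A minimal sketch, assuming the default port 30000 and that the served model name is `Qwen/Qwen3-Coder-Next`:

```shell
# Send a single chat request through the standard OpenAI-style
# /v1/chat/completions route and print the raw JSON response.
curl -s http://localhost:30000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "Qwen/Qwen3-Coder-Next",
    "messages": [{"role": "user", "content": "Write a hello-world in Python."}],
    "max_tokens": 64
  }'
```

Any OpenAI-compatible client should work the same way, e.g. an SDK pointed at `base_url="http://localhost:30000/v1"`.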
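If the server fails to start at the default 256K context, the note above suggests reducing the context length; `--context-length` (the flag used in the pre-change command) takes the smaller value directly. A minimal sketch with a 32K window and the other flags unchanged:

```shell
# Launch with a reduced 32K context window in case the 256K default
# exhausts GPU memory at startup.
python -m sglang.launch_server --model Qwen/Qwen3-Coder-Next --tp-size 2 \
  --tool-call-parser qwen3_coder --context-length 32768
```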