For local use, applications such as Ollama, LMStudio, MLX-LM, llama.cpp, and KTransformers also support Qwen3.

## Deployment

For deployment, you can use the latest `sglang` or `vllm` to create an OpenAI-compatible API endpoint.

### SGLang

[SGLang](https://github.com/sgl-project/sglang) is a fast serving framework for large language models and vision language models.
SGLang can be used to launch a server with an OpenAI-compatible API service.

`sglang>=0.5.2` is required for Qwen3-Next, which can be installed using:
```shell
pip install 'sglang[all]>=0.5.2'
```
See [its documentation](https://docs.sglang.ai/get_started/install.html) for more details.

The following command can be used to create an API endpoint at `http://localhost:30000/v1` with a maximum context length of 256K tokens, using tensor parallelism across 4 GPUs:
```shell
python -m sglang.launch_server --model-path Qwen/Qwen3-Next-80B-A3B-Instruct --port 30000 --tp-size 4 --context-length 262144 --mem-fraction-static 0.8
```

The following command is recommended for multi-token prediction (MTP), with the remaining settings the same as above:
```shell
python -m sglang.launch_server --model-path Qwen/Qwen3-Next-80B-A3B-Instruct --port 30000 --tp-size 4 --context-length 262144 --mem-fraction-static 0.8 --speculative-algo NEXTN --speculative-num-steps 3 --speculative-eagle-topk 1 --speculative-num-draft-tokens 4
```

> [!Note]
> The default context length is 256K. Consider reducing the context length to a smaller value, e.g., `32768`, if the server fails to start.

Please also refer to SGLang's usage guide on [Qwen3-Next](https://docs.sglang.ai/basic_usage/qwen3.html).
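
Once the server is up, any OpenAI-compatible client can query it. Below is a minimal sketch using the official `openai` Python SDK against the endpoint launched above; the `api_key` value is a placeholder, since the local server does not require one by default.

```python
from openai import OpenAI

# Point the client at the local SGLang server started above.
client = OpenAI(base_url="http://localhost:30000/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="Qwen/Qwen3-Next-80B-A3B-Instruct",
    messages=[
        {"role": "user", "content": "Write a one-line Python function that reverses a string."},
    ],
)
print(response.choices[0].message.content)
```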

### vLLM

[vLLM](https://github.com/vllm-project/vllm) is a high-throughput and memory-efficient inference and serving engine for LLMs.
vLLM can be used to launch a server with an OpenAI-compatible API service.

`vllm>=0.15.0` is required for Qwen3-Coder-Next, which can be installed using:
```shell
pip install 'vllm>=0.15.0'
```
See [its documentation](https://docs.vllm.ai/en/stable/getting_started/installation/index.html) for more details.

The following command can be used to create an API endpoint at `http://localhost:8000/v1` with a maximum context length of 256K tokens, using tensor parallelism across 2 GPUs and automatic tool-call parsing:
```shell
vllm serve Qwen/Qwen3-Coder-Next --tensor-parallel-size 2 --enable-auto-tool-choice --tool-call-parser qwen3_coder
```

> [!Note]
> The default context length is 256K. Consider reducing the context length to a smaller value, e.g., `32768`, if the server fails to start.
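
Since the server is launched with `--enable-auto-tool-choice` and `--tool-call-parser qwen3_coder`, tool calls come back as structured `tool_calls` in the OpenAI format. Below is a minimal sketch against the endpoint above; the `get_file_contents` tool is a hypothetical example, not part of the model or vLLM.

```python
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

# Hypothetical tool definition in the OpenAI function-calling schema.
tools = [{
    "type": "function",
    "function": {
        "name": "get_file_contents",
        "description": "Read a file from the workspace and return its contents.",
        "parameters": {
            "type": "object",
            "properties": {
                "path": {"type": "string", "description": "Path to the file."},
            },
            "required": ["path"],
        },
    },
}]

response = client.chat.completions.create(
    model="Qwen/Qwen3-Coder-Next",
    messages=[{"role": "user", "content": "What does setup.py in this repo do?"}],
    tools=tools,
)

# With auto tool choice, the parser returns the call as structured data
# instead of raw text.
for call in response.choices[0].message.tool_calls or []:
    print(call.function.name, call.function.arguments)
```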

## Agentic Coding

Qwen3-Coder-Next excels at tool calling.
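
In practice this means the model can drive a simple agent loop: the client executes whatever tool the model requests and feeds the result back until the model replies in plain text. Below is a minimal sketch against the vLLM endpoint above; the `run_shell` tool and its implementation are hypothetical placeholders, not part of the model's API.

```python
import json
import subprocess

from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

# Hypothetical tool: run a shell command and return its output.
tools = [{
    "type": "function",
    "function": {
        "name": "run_shell",
        "description": "Run a shell command and return its stdout.",
        "parameters": {
            "type": "object",
            "properties": {"command": {"type": "string"}},
            "required": ["command"],
        },
    },
}]

def run_shell(command: str) -> str:
    return subprocess.run(command, shell=True, capture_output=True, text=True).stdout

messages = [{"role": "user", "content": "How many Python files are in the current directory?"}]
while True:
    response = client.chat.completions.create(
        model="Qwen/Qwen3-Coder-Next", messages=messages, tools=tools
    )
    message = response.choices[0].message
    if not message.tool_calls:
        print(message.content)  # final plain-text answer
        break
    messages.append(message)  # keep the assistant turn with its tool calls
    for call in message.tool_calls:
        args = json.loads(call.function.arguments)
        messages.append({
            "role": "tool",
            "tool_call_id": call.id,
            "content": run_shell(**args),
        })
```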