littlebird13 committed
Commit b85cca4 · verified · Parent(s): 867c91e

Update README.md

Files changed (1): README.md (+50 -0)
README.md CHANGED
@@ -97,6 +97,56 @@ print("content:", content)
 
  For local use, applications such as Ollama, LMStudio, MLX-LM, llama.cpp, and KTransformers also support Qwen3.
 
+ ## Deployment
+
+ For deployment, you can use the latest `sglang` or `vllm` to create an OpenAI-compatible API endpoint.
+
+ ### SGLang
+
+ [SGLang](https://github.com/sgl-project/sglang) is a fast serving framework for large language models and vision language models.
+ SGLang can be used to launch a server with an OpenAI-compatible API.
+
+ `sglang>=0.5.2` is required for Qwen3-Coder-Next, which can be installed using:
+ ```shell
+ pip install 'sglang[all]>=0.5.2'
+ ```
+ See [its documentation](https://docs.sglang.ai/get_started/install.html) for more details.
+
+ The following command creates an API endpoint at `http://localhost:30000/v1` with a maximum context length of 256K tokens, using tensor parallelism across 4 GPUs.
+ ```shell
+ python -m sglang.launch_server --model-path Qwen/Qwen3-Coder-Next --port 30000 --tp-size 4 --context-length 262144 --mem-fraction-static 0.8
+ ```
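+
+ With the server running, any OpenAI-compatible client can query it. The following is a minimal sketch using the official `openai` Python client; the `base_url` and model name follow the launch command above, and the placeholder API key is arbitrary since a local server requires none by default.
+ ```python
+ from openai import OpenAI
+
+ # Point the client at the local SGLang server started above.
+ client = OpenAI(base_url="http://localhost:30000/v1", api_key="EMPTY")
+
+ response = client.chat.completions.create(
+     model="Qwen/Qwen3-Coder-Next",
+     messages=[{"role": "user", "content": "Write a quicksort function in Python."}],
+     max_tokens=512,
+ )
+ print(response.choices[0].message.content)
+ ```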
+
+ To enable multi-token prediction (MTP, i.e., speculative decoding), the following command is recommended, with the remaining settings the same as above:
+ ```shell
+ python -m sglang.launch_server --model-path Qwen/Qwen3-Coder-Next --port 30000 --tp-size 4 --context-length 262144 --mem-fraction-static 0.8 --speculative-algo NEXTN --speculative-num-steps 3 --speculative-eagle-topk 1 --speculative-num-draft-tokens 4
+ ```
+
+ > [!Note]
+ > The default context length is 256K. If the server fails to start, consider reducing the context length to a smaller value, e.g., `32768`.
+
+ Please also refer to SGLang's usage guide on [Qwen3-Next](https://docs.sglang.ai/basic_usage/qwen3.html).
+
+ ### vLLM
+
+ [vLLM](https://github.com/vllm-project/vllm) is a high-throughput and memory-efficient inference and serving engine for LLMs.
+ vLLM can be used to launch a server with an OpenAI-compatible API.
+
+ `vllm>=0.15.0` is required for Qwen3-Coder-Next, which can be installed using:
+ ```shell
+ pip install 'vllm>=0.15.0'
+ ```
+ See [its documentation](https://docs.vllm.ai/en/stable/getting_started/installation/index.html) for more details.
+
+ The following command creates an API endpoint at `http://localhost:8000/v1` with the default maximum context length of 256K tokens, using tensor parallelism across 2 GPUs. Tool calling is enabled via vLLM's `qwen3_coder` tool-call parser.
+ ```shell
+ vllm serve Qwen/Qwen3-Coder-Next --tensor-parallel-size 2 --enable-auto-tool-choice --tool-call-parser qwen3_coder
+ ```
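+
+ Because the server is launched with `--enable-auto-tool-choice`, OpenAI-style function calling works against this endpoint. The following is a minimal sketch using the `openai` Python client; the `get_weather` tool and its schema are illustrative placeholders, not part of the model or vLLM.
+ ```python
+ from openai import OpenAI
+
+ client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")
+
+ # A hypothetical tool definition, for illustration only.
+ tools = [{
+     "type": "function",
+     "function": {
+         "name": "get_weather",
+         "description": "Get the current weather for a city.",
+         "parameters": {
+             "type": "object",
+             "properties": {"city": {"type": "string"}},
+             "required": ["city"],
+         },
+     },
+ }]
+
+ response = client.chat.completions.create(
+     model="Qwen/Qwen3-Coder-Next",
+     messages=[{"role": "user", "content": "What is the weather in Beijing right now?"}],
+     tools=tools,
+ )
+ # The qwen3_coder parser extracts tool calls into the standard field.
+ print(response.choices[0].message.tool_calls)
+ ```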
+
+ > [!Note]
+ > The default context length is 256K. If the server fails to start, consider reducing the context length to a smaller value, e.g., `32768`.
+
  ## Agentic Coding
 
  Qwen3-Coder-Next excels in tool calling capabilities.