For local use, applications such as Ollama, LMStudio, MLX-LM, llama.cpp, and KTransformers also support Qwen3.

## Deployment

For deployment, you can use the latest `sglang` or `vllm` to create an OpenAI-compatible API endpoint.

### SGLang

[SGLang](https://github.com/sgl-project/sglang) is a fast serving framework for large language models and vision language models.
SGLang can be used to launch a server with an OpenAI-compatible API service.

`sglang>=0.5.2` is required for Qwen3-Next, which can be installed using:
```shell
pip install 'sglang[all]>=0.5.2'
```
See [its documentation](https://docs.sglang.ai/get_started/install.html) for more details.

The following command can be used to create an API endpoint at `http://localhost:30000/v1` with a maximum context length of 256K tokens, using tensor parallelism across 4 GPUs:
```shell
python -m sglang.launch_server --model-path Qwen/Qwen3-Next-80B-A3B-Instruct --port 30000 --tp-size 4 --context-length 262144 --mem-fraction-static 0.8
```

The following command is recommended for multi-token prediction (MTP), with the remaining settings the same as above:
```shell
python -m sglang.launch_server --model-path Qwen/Qwen3-Next-80B-A3B-Instruct --port 30000 --tp-size 4 --context-length 262144 --mem-fraction-static 0.8 --speculative-algo NEXTN --speculative-num-steps 3 --speculative-eagle-topk 1 --speculative-num-draft-tokens 4
```

> [!Note]
> The default context length is 256K. Consider reducing the context length to a smaller value, e.g., `32768`, if the server fails to start.

Please also refer to SGLang's usage guide on [Qwen3-Next](https://docs.sglang.ai/basic_usage/qwen3.html).
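
Once the server is up, any OpenAI-compatible client can query it. Below is a minimal sketch using the official `openai` Python SDK against the endpoint launched above; the `api_key` value is a placeholder, since the local server does not require one by default.

```python
from openai import OpenAI

# Point the client at the local SGLang server started above.
client = OpenAI(base_url="http://localhost:30000/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="Qwen/Qwen3-Next-80B-A3B-Instruct",
    messages=[
        {"role": "user", "content": "Write a one-line Python function that reverses a string."},
    ],
)
print(response.choices[0].message.content)
```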

### vLLM

[vLLM](https://github.com/vllm-project/vllm) is a high-throughput and memory-efficient inference and serving engine for LLMs.
vLLM can be used to launch a server with an OpenAI-compatible API service.

`vllm>=0.15.0` is required for Qwen3-Coder-Next, which can be installed using:
```shell
pip install 'vllm>=0.15.0'
```
See [its documentation](https://docs.vllm.ai/en/stable/getting_started/installation/index.html) for more details.

The following command can be used to create an API endpoint at `http://localhost:8000/v1` with a maximum context length of 256K tokens, using tensor parallelism across 2 GPUs and automatic tool-call parsing:
```shell
vllm serve Qwen/Qwen3-Coder-Next --tensor-parallel-size 2 --enable-auto-tool-choice --tool-call-parser qwen3_coder
```

> [!Note]
> The default context length is 256K. Consider reducing the context length to a smaller value, e.g., `32768`, if the server fails to start.
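
Since the server is launched with `--enable-auto-tool-choice` and `--tool-call-parser qwen3_coder`, tool calls come back as structured `tool_calls` in the OpenAI format. Below is a minimal sketch against the endpoint above; the `get_file_contents` tool is a hypothetical example, not part of the model or vLLM.

```python
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

# Hypothetical tool definition in the OpenAI function-calling schema.
tools = [{
    "type": "function",
    "function": {
        "name": "get_file_contents",
        "description": "Read a file from the workspace and return its contents.",
        "parameters": {
            "type": "object",
            "properties": {
                "path": {"type": "string", "description": "Path to the file."},
            },
            "required": ["path"],
        },
    },
}]

response = client.chat.completions.create(
    model="Qwen/Qwen3-Coder-Next",
    messages=[{"role": "user", "content": "What does setup.py in this repo do?"}],
    tools=tools,
)

# With auto tool choice, the parser returns the call as structured data
# instead of raw text.
for call in response.choices[0].message.tool_calls or []:
    print(call.function.name, call.function.arguments)
```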

## Agentic Coding

Qwen3-Coder-Next excels at tool calling.
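
In practice this means the model can drive a simple agent loop: the client executes whatever tool the model requests and feeds the result back until the model replies in plain text. Below is a minimal sketch against the vLLM endpoint above; the `run_shell` tool and its implementation are hypothetical placeholders, not part of the model's API.

```python
import json
import subprocess

from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

# Hypothetical tool: run a shell command and return its output.
tools = [{
    "type": "function",
    "function": {
        "name": "run_shell",
        "description": "Run a shell command and return its stdout.",
        "parameters": {
            "type": "object",
            "properties": {"command": {"type": "string"}},
            "required": ["command"],
        },
    },
}]

def run_shell(command: str) -> str:
    return subprocess.run(command, shell=True, capture_output=True, text=True).stdout

messages = [{"role": "user", "content": "How many Python files are in the current directory?"}]
while True:
    response = client.chat.completions.create(
        model="Qwen/Qwen3-Coder-Next", messages=messages, tools=tools
    )
    message = response.choices[0].message
    if not message.tool_calls:
        print(message.content)  # final plain-text answer
        break
    messages.append(message)  # keep the assistant turn with its tool calls
    for call in message.tool_calls:
        args = json.loads(call.function.arguments)
        messages.append({
            "role": "tool",
            "tool_call_id": call.id,
            "content": run_shell(**args),
        })
```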