dasLOL committed on
Commit cf64e6a · verified · 1 Parent(s): 3e06ec0

Upload docs/deploy_guidance.md with huggingface_hub

Files changed (1)
  1. docs/deploy_guidance.md +196 -0
docs/deploy_guidance.md ADDED
@@ -0,0 +1,196 @@
+ # Kimi-K2 Deployment Guide
+
+ > [!Note]
+ > This guide only provides example deployment commands for Kimi-K2, which may not be the optimal configuration. Since inference engines are still updated frequently, please continue to follow the guidance on their homepages if you want to achieve better inference performance.
+
+
+ ## vLLM Deployment
+
+ The smallest deployment unit for Kimi-K2 FP8 weights with 256k seqlen on the mainstream H200 platform is a cluster of 16 GPUs using either Tensor Parallelism (TP) or "data parallel + expert parallel" (DP+EP).
+ Running parameters for this environment are provided below. You may scale up to more nodes and increase expert parallelism to enlarge the inference batch size and improve overall throughput.
+
+ ### Tensor Parallelism
+
+ When the parallelism degree is ≤ 16, you can run inference with pure Tensor Parallelism. A sample launch command is:
+
+ ``` bash
+ # start ray on node 0 and node 1
+
+ # node 0:
+ vllm serve $MODEL_PATH \
+   --port 8000 \
+   --served-model-name kimi-k2 \
+   --trust-remote-code \
+   --tensor-parallel-size 16 \
+   --enable-auto-tool-choice \
+   --tool-call-parser kimi_k2
+ ```
+
+ **Key parameter notes:**
+ - `--tensor-parallel-size 16`: If using more than 16 GPUs, combine with pipeline parallelism.
+ - `--enable-auto-tool-choice`: Required when enabling tool usage.
+ - `--tool-call-parser kimi_k2`: Required when enabling tool usage.
+
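+ Once the server is up, you can sanity-check tool calling through vLLM's OpenAI-compatible API. The request below is a minimal sketch rather than part of the official guide: the `get_weather` tool and its schema are invented for illustration, and the endpoint assumes the `--port 8000` and `--served-model-name kimi-k2` values used above.
+
+ ``` bash
+ # hypothetical tool-calling smoke test against the OpenAI-compatible endpoint
+ curl http://localhost:8000/v1/chat/completions \
+   -H "Content-Type: application/json" \
+   -d '{
+     "model": "kimi-k2",
+     "messages": [{"role": "user", "content": "What is the weather like in Beijing today?"}],
+     "tools": [{
+       "type": "function",
+       "function": {
+         "name": "get_weather",
+         "description": "Get the current weather for a city",
+         "parameters": {
+           "type": "object",
+           "properties": {"city": {"type": "string"}},
+           "required": ["city"]
+         }
+       }
+     }],
+     "tool_choice": "auto"
+   }'
+ ```
+
+ If `--tool-call-parser kimi_k2` is working, the response should typically contain a structured `tool_calls` entry instead of plain text.
+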
+ ### Data Parallelism + Expert Parallelism
+
+ You can install libraries like DeepEP and DeepGEMM as needed. Then run the following commands (example on H200):
+
+ ``` bash
+ # node 0
+ vllm serve $MODEL_PATH --port 8000 --served-model-name kimi-k2 --trust-remote-code --data-parallel-size 16 --data-parallel-size-local 8 --data-parallel-address $MASTER_IP --data-parallel-rpc-port $PORT --enable-expert-parallel --max-num-batched-tokens 8192 --max-num-seqs 256 --gpu-memory-utilization 0.85 --enable-auto-tool-choice --tool-call-parser kimi_k2
+
+ # node 1
+ vllm serve $MODEL_PATH --headless --data-parallel-start-rank 8 --port 8000 --served-model-name kimi-k2 --trust-remote-code --data-parallel-size 16 --data-parallel-size-local 8 --data-parallel-address $MASTER_IP --data-parallel-rpc-port $PORT --enable-expert-parallel --max-num-batched-tokens 8192 --max-num-seqs 256 --gpu-memory-utilization 0.85 --enable-auto-tool-choice --tool-call-parser kimi_k2
+ ```
+
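+ The launch commands above reference `$MODEL_PATH`, `$MASTER_IP`, and `$PORT` without defining them. A hypothetical setup, with placeholder values rather than recommendations, might look like:
+
+ ``` bash
+ # placeholders only; substitute your own checkpoint path, master-node IP, and a free RPC port
+ export MODEL_PATH=/models/Kimi-K2-Instruct
+ export MASTER_IP=192.168.0.10
+ export PORT=13345
+ ```
+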
+ ## SGLang Deployment
+
+ Similarly, we can use TP or DP+EP in SGLang for deployment. Here are the examples.
+
+
+ ### Tensor Parallelism
+
+ Here is a simple example of running TP16 across two nodes on H200:
+
+ ``` bash
+ # Node 0
+ python -m sglang.launch_server --model-path $MODEL_PATH --tp 16 --dist-init-addr $MASTER_IP:50000 --nnodes 2 --node-rank 0 --trust-remote-code --tool-call-parser kimi_k2
+
+ # Node 1
+ python -m sglang.launch_server --model-path $MODEL_PATH --tp 16 --dist-init-addr $MASTER_IP:50000 --nnodes 2 --node-rank 1 --trust-remote-code --tool-call-parser kimi_k2
+ ```
+
+ **Key parameter notes:**
+ - `--tool-call-parser kimi_k2`: Required when enabling tool usage.
+
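+ If the servers come up cleanly, a quick smoke test can be run against SGLang's OpenAI-compatible API. The sketch below assumes SGLang's default port 30000 and is issued from Node 0 itself; pass `--host 0.0.0.0` to the launch command if you want to query it from another machine.
+
+ ``` bash
+ # list the served model name (defaults to the model path unless --served-model-name is set)
+ curl http://localhost:30000/v1/models
+
+ # then send a short chat request; replace <SERVED_MODEL_NAME> with the name returned above
+ curl http://localhost:30000/v1/chat/completions \
+   -H "Content-Type: application/json" \
+   -d '{"model": "<SERVED_MODEL_NAME>", "messages": [{"role": "user", "content": "Hello"}]}'
+ ```
+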
+ ### Data Parallelism + Expert Parallelism
+
+ Here is an example of large-scale Prefill-Decode Disaggregation (4P12D on H200) with DP+EP in SGLang:
+
+ ``` bash
+ # for prefill node
+ MC_TE_METRIC=true SGLANG_DISAGGREGATION_HEARTBEAT_INTERVAL=10000000 SGLANG_DISAGGREGATION_BOOTSTRAP_TIMEOUT=100000 SGLANG_DISAGGREGATION_WAITING_TIMEOUT=100000 PYTHONUNBUFFERED=1 \
+ python -m sglang.launch_server --model-path $MODEL_PATH \
+ --trust-remote-code --disaggregation-mode prefill --dist-init-addr $PREFILL_NODE0:5757 --tp-size 32 --dp-size 32 --enable-dp-attention --host $LOCAL_IP --decode-log-interval 1 --disable-radix-cache --enable-deepep-moe --moe-dense-tp-size 1 --enable-dp-lm-head --disable-shared-experts-fusion --watchdog-timeout 1000000 --enable-two-batch-overlap --disaggregation-ib-device $IB_DEVICE --chunked-prefill-size 262144 --mem-fraction-static 0.85 --deepep-mode normal --ep-dispatch-algorithm dynamic --eplb-algorithm deepseek --max-running-requests 1024 --nnodes 4 --node-rank $RANK --tool-call-parser kimi_k2
+
+
+ # for decode node
+ SGLANG_DEEPEP_NUM_MAX_DISPATCH_TOKENS_PER_RANK=480 MC_TE_METRIC=true SGLANG_DISAGGREGATION_HEARTBEAT_INTERVAL=10000000 SGLANG_DISAGGREGATION_BOOTSTRAP_TIMEOUT=100000 SGLANG_DISAGGREGATION_WAITING_TIMEOUT=100000 PYTHONUNBUFFERED=1 \
+ python -m sglang.launch_server --model-path $MODEL_PATH --trust-remote-code --disaggregation-mode decode --dist-init-addr $DECODE_NODE0:5757 --tp-size 96 --dp-size 96 --enable-dp-attention --host $LOCAL_IP --decode-log-interval 1 --context-length 2176 --disable-radix-cache --enable-deepep-moe --moe-dense-tp-size 1 --enable-dp-lm-head --disable-shared-experts-fusion --watchdog-timeout 1000000 --enable-two-batch-overlap --disaggregation-ib-device $IB_DEVICE --deepep-mode low_latency --mem-fraction-static 0.8 --cuda-graph-bs 480 --max-running-requests 46080 --ep-num-redundant-experts 96 --nnodes 12 --node-rank $RANK --tool-call-parser kimi_k2
+
+ # pdlb (prefill-decode load balancer)
+ PYTHONUNBUFFERED=1 python -m sglang.srt.disaggregation.launch_lb --prefill http://${PREFILL_NODE0}:30000 --decode http://${DECODE_NODE0}:30000
+ ```
+
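+ The prefill and decode commands rely on several variables that the guide leaves undefined. A hypothetical sketch of what they might hold (all values are placeholders, not recommendations):
+
+ ``` bash
+ # placeholders only; adapt to your own cluster layout
+ export MODEL_PATH=/models/Kimi-K2-Instruct          # hypothetical checkpoint location
+ export PREFILL_NODE0=10.0.0.1                       # IP of the first prefill node
+ export DECODE_NODE0=10.0.1.1                        # IP of the first decode node
+ export LOCAL_IP=$(hostname -I | awk '{print $1}')   # IP of the node running the command
+ export IB_DEVICE=mlx5_0                             # InfiniBand device name, e.g. from ibv_devices
+ export RANK=0                                       # this node's rank within its prefill or decode group
+ ```
+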
+ ## KTransformers Deployment
+
+ Please copy all configuration files (i.e., everything except the `.safetensors` files) into the GGUF checkpoint folder at `/path/to/K2`. Then run:
+ ``` bash
+ python ktransformers/server/main.py --model_path /path/to/K2 --gguf_path /path/to/K2 --cache_lens 30000
+ ```
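+
+ For the copy step above, here is a minimal sketch, assuming the original Hugging Face checkpoint lives at a hypothetical `/path/to/K2-hf` and the GGUF files are already in `/path/to/K2`:
+
+ ``` bash
+ # copy tokenizer and config files, but not the original safetensors weights
+ rsync -av --exclude='*.safetensors' /path/to/K2-hf/ /path/to/K2/
+ ```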
+
+ To enable AMX optimization, run:
+
+ ``` bash
+ python ktransformers/server/main.py --model_path /path/to/K2 --gguf_path /path/to/K2 --cache_lens 30000 --optimize_config_path ktransformers/optimize/optimize_rules/DeepSeek-V3-Chat-fp8-linear-ggml-experts-serve-amx.yaml
+ ```
+
+ ## TensorRT-LLM Deployment
+ ### Prerequisite
+ Please refer to [this guide](https://nvidia.github.io/TensorRT-LLM/installation/build-from-source-linux.html) to build TensorRT-LLM v1.0.0-rc2 from source and start a TRT-LLM docker container.
+
+ Install blobfile:
+ ```bash
+ pip install blobfile
+ ```
+ ### Multi-node Serving
+ TensorRT-LLM supports multi-node inference. You can use mpirun to launch Kimi-K2 as a multi-node job. We will use two nodes for this example.
+
+ #### mpirun
+ mpirun requires each node to have passwordless ssh access to the other node. We need to set up the environment inside the docker container: run the container with the host network and mount the current directory as well as the model directory into the container.
+
+ ```bash
+ # use host network
+ IMAGE=<YOUR_IMAGE>
+ NAME=test_2node_docker
+ # host1
+ docker run -it --name ${NAME}_host1 --ipc=host --gpus=all --network host --privileged --ulimit memlock=-1 --ulimit stack=67108864 -v ${PWD}:/workspace -v <YOUR_MODEL_DIR>:/models/DeepSeek-V3 -w /workspace ${IMAGE}
+ # host2
+ docker run -it --name ${NAME}_host2 --ipc=host --gpus=all --network host --privileged --ulimit memlock=-1 --ulimit stack=67108864 -v ${PWD}:/workspace -v <YOUR_MODEL_DIR>:/models/DeepSeek-V3 -w /workspace ${IMAGE}
+ ```
+
+ Set up ssh inside the container:
+
+ ```bash
+ apt-get update && apt-get install -y openssh-server
+
+ # modify /etc/ssh/sshd_config
+ PermitRootLogin yes
+ PubkeyAuthentication yes
+ # modify /etc/ssh/sshd_config, change the default port 22 to another unused port
+ Port 2233
+ ```
+
+ Generate an ssh key on host1 and copy it to host2, and vice versa.
+
+ ```bash
+ # on host1
+ ssh-keygen -t ed25519 -f ~/.ssh/id_ed25519
+ ssh-copy-id -i ~/.ssh/id_ed25519.pub root@<HOST2>
+ # on host2
+ ssh-keygen -t ed25519 -f ~/.ssh/id_ed25519
+ ssh-copy-id -i ~/.ssh/id_ed25519.pub root@<HOST1>
+
+ # restart the ssh service on host1 and host2
+ service ssh restart # or
+ /etc/init.d/ssh restart # or
+ systemctl restart ssh
+ ```
+
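+ Before moving on, it is worth confirming that passwordless ssh works in both directions on the new port (2233 here, matching the `-p 2233` passed to mpirun below):
+
+ ```bash
+ # run on host1; repeat from host2 with <HOST1>
+ ssh -p 2233 root@<HOST2> hostname
+ ```
+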
+ Generate the additional config for trtllm-serve:
+ ```bash
+ cat >/path/to/TensorRT-LLM/extra-llm-api-config.yml <<EOF
+ cuda_graph_config:
+   padding_enabled: true
+   batch_sizes:
+     - 1
+     - 2
+     - 4
+     - 8
+     - 16
+     - 32
+     - 64
+     - 128
+ print_iter_log: true
+ enable_attention_dp: true
+ EOF
+ ```
+
172
+ After the preparations,you can run the trtllm-serve on two nodes using mpirun:
173
+
174
+ ```bash
175
+ mpirun -np 16 \
176
+ -H <HOST1>:8,<HOST2>:8 \
177
+ -mca plm_rsh_args "-p 2233" \
178
+ --allow-run-as-root \
179
+ trtllm-llmapi-launch trtllm-serve serve \
180
+ --backend pytorch \
181
+ --tp_size 16 \
182
+ --ep_size 8 \
183
+ --kv_cache_free_gpu_memory_fraction 0.95 \
184
+ --trust_remote_code \
185
+ --max_batch_size 128 \
186
+ --max_num_tokens 4096 \
187
+ --extra_llm_api_options /path/to/TensorRT-LLM/extra-llm-api-config.yml \
188
+ --port 8000 \
189
+ <YOUR_MODEL_DIR>
190
+ ```
191
+
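+ Once the server reports it is ready, a short check from host1 can confirm the endpoint responds. This is only a sketch: it assumes the `--port 8000` above and that this trtllm-serve version exposes the usual OpenAI-compatible routes; the model name in the request normally matches the path passed to trtllm-serve, so check `/v1/models` first if unsure.
+
+ ```bash
+ # list the served model, then send a short chat request
+ curl http://localhost:8000/v1/models
+ curl http://localhost:8000/v1/chat/completions \
+   -H "Content-Type: application/json" \
+   -d '{"model": "<YOUR_MODEL_DIR>", "messages": [{"role": "user", "content": "Hello"}]}'
+ ```
+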
+ ## Others
+
+ Kimi-K2 reuses the `DeepSeekV3CausalLM` architecture and converts its weights into the proper shape to save redevelopment effort. To let inference engines distinguish it from DeepSeek-V3 and apply the best optimizations, we set `"model_type": "kimi_k2"` in `config.json`.
+
+ If you are using a framework that is not on the recommended list, you can still run the model by manually changing `model_type` to "deepseek_v3" in `config.json` as a temporary workaround. You may need to manually parse tool calls if no tool-call parser is available in your framework.
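+
+ As a concrete illustration of that workaround, the following sketch (assuming the checkpoint lives at a hypothetical `/path/to/Kimi-K2`) backs up `config.json` and rewrites the field in place:
+
+ ```bash
+ # back up config.json, then switch model_type so the engine treats the checkpoint as DeepSeek-V3
+ cp /path/to/Kimi-K2/config.json /path/to/Kimi-K2/config.json.bak
+ python -c "
+ import json
+ p = '/path/to/Kimi-K2/config.json'
+ cfg = json.load(open(p))
+ cfg['model_type'] = 'deepseek_v3'
+ json.dump(cfg, open(p, 'w'), indent=2)
+ "
+ ```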