jerryzhu423 committed on
Commit 65191cf · verified · 1 parent: d30fdf6

Update README.md

Files changed (1)
  1. README.md +15 -591
README.md CHANGED
@@ -8,7 +8,6 @@ library_name: transformers
8
  <img src="figures/kimi-logo.png" width="30%" alt="Kimi K2: Open Agentic Intelligence">
9
  </picture>
10
  </div>
11
-
12
  <hr>
13
 
14
  <div align="center" style="line-height:1">
@@ -22,7 +21,6 @@ library_name: transformers
22
  <a href="https://twitter.com/kimi_moonshot" target="_blank"><img alt="Twitter Follow" src="https://img.shields.io/badge/Twitter-Kimi.ai-white?logo=x&logoColor=white"/></a>
23
  <a href="https://discord.gg/TYU2fdJykW" target="_blank"><img alt="Discord" src="https://img.shields.io/badge/Discord-Kimi.ai-white?logo=discord&logoColor=white"/></a>
24
  </div>
25
-
26
  <div align="center" style="line-height: 1;">
27
  <a href="https://github.com/moonshotai/Kimi-K2/blob/main/LICENSE"><img alt="License" src="https://img.shields.io/badge/License-Modified_MIT-f5de53?&color=f5de53"/></a>
28
  </div>
@@ -34,22 +32,13 @@ library_name: transformers
34
 
35
  ## 1. Model Introduction
36
 
37
- Kimi K2 is a state-of-the-art mixture-of-experts (MoE) language model with 32 billion activated parameters and 1 trillion total parameters. Trained with the Muon optimizer, Kimi K2 achieves exceptional performance across frontier knowledge, reasoning, and coding tasks while being meticulously optimized for agentic capabilities.
38
 
39
  ### Key Features
40
- - Large-Scale Training: Pre-trained a 1T parameter MoE model on 15.5T tokens with zero training instability.
41
- - MuonClip Optimizer: We apply the Muon optimizer to an unprecedented scale, and develop novel optimization techniques to resolve instabilities while scaling up.
42
- - Agentic Intelligence: Specifically designed for tool use, reasoning, and autonomous problem-solving.
43
 
44
- ### Model Variants
45
- - **Kimi-K2-Base**: The foundation model, a strong start for researchers and builders who want full control for fine-tuning and custom solutions.
46
- - **Kimi-K2-Instruct**: The post-trained model best for drop-in, general-purpose chat and agentic experiences. It is a reflex-grade model without long thinking.
47
-
48
- <div align="center">
49
- <picture>
50
- <img src="figures/banner.png" width="80%" alt="Evaluation Results">
51
- </picture>
52
- </div>
53
 
54
  ## 2. Model Summary
55
 
@@ -77,582 +66,21 @@ Kimi K2 is a state-of-the-art mixture-of-experts (MoE) language model with 32 bi
77
 
78
  ## 3. Evaluation Results
79
 
80
- #### Instruction model evaluation results
81
-
82
- <div align="center">
83
- <table>
84
- <thead>
85
- <tr>
86
- <th align="center">Benchmark</th>
87
- <th align="center">Metric</th>
88
- <th align="center"><sup>Kimi K2 Instruct</sup></th>
89
- <th align="center"><sup>DeepSeek-V3-0324</sup></th>
90
- <th align="center"><sup>Qwen3-235B-A22B <br><sup>(non-thinking)</sup></sup></th>
91
- <th align="center"><sup>Claude Sonnet 4 <br><sup>(w/o extended thinking)</sup></sup></th>
92
- <th align="center"><sup>Claude Opus 4 <br><sup>(w/o extended thinking)</sup></sup></th>
93
- <th align="center"><sup>GPT-4.1</sup></th>
94
- <th align="center"><sup>Gemini 2.5 Flash <br> Preview (05-20)</sup></th>
95
- </tr>
96
- </thead>
97
- <tbody>
98
- <tr>
99
- <td align="center" colspan=9><strong>Coding Tasks</strong></td>
100
- </tr>
101
- <tr>
102
- <td align="center">LiveCodeBench v6<br><sup>(Aug 24 - May 25)</sup></td>
103
- <td align="center">Pass@1</td>
104
- <td align="center"><strong>53.7</strong></td>
105
- <td align="center">46.9</td>
106
- <td align="center">37.0</td>
107
- <td align="center">48.5</td>
108
- <td align="center">47.4</td>
109
- <td align="center">44.7</td>
110
- <td align="center">44.7</td>
111
- </tr>
112
- <tr>
113
- <td align="center">OJBench</td>
114
- <td align="center">Pass@1</td>
115
- <td align="center"><strong>27.1</strong></td>
116
- <td align="center">24.0</td>
117
- <td align="center">11.3</td>
118
- <td align="center">15.3</td>
119
- <td align="center">19.6</td>
120
- <td align="center">19.5</td>
121
- <td align="center">19.5</td>
122
- </tr>
123
-
124
- <tr>
125
- <td align="center">MultiPL-E</td>
126
- <td align="center">Pass@1</td>
127
- <td align="center"><ins><strong>85.7</strong></ins></td>
128
- <td align="center">83.1</td>
129
- <td align="center">78.2</td>
130
- <td align="center">88.6</td>
131
- <td align="center"><strong>89.6</strong></td>
132
- <td align="center">86.7</td>
133
- <td align="center">85.6</td>
134
- </tr>
135
-
136
- <tr>
137
- <td align="center">SWE-bench Verified <br/><sup>(Agentless Coding)</sup></td>
138
- <td align="center">Single Patch w/o Test (Acc)</td>
139
- <td align="center"><ins><strong>51.8</strong></ins></td>
140
- <td align="center">36.6</td>
141
- <td align="center">39.4</td>
142
- <td align="center">50.2</td>
143
- <td align="center"><strong>53.0</strong></td>
144
- <td align="center">40.8</td>
145
- <td align="center">32.6</td>
146
- </tr>
147
-
148
- <tr>
149
- <td align="center" rowspan="2">SWE-bench Verified <br/> <sup>(Agentic Coding)</sup></td>
150
- <td align="center">Single Attempt (Acc)</td>
151
- <td align="center"><ins><strong>65.8</strong></ins></td>
152
- <td align="center">38.8</td>
153
- <td align="center">34.4</td>
154
- <td align="center"><strong>72.7</strong><sup>*</sup></td>
155
- <td align="center">72.5<sup>*</sup></td>
156
- <td align="center">54.6</td>
157
- <td align="center">—</td>
158
- </tr>
159
-
160
- <tr>
161
- <!--<td align="center">(Agentic Coding)</td>-->
162
- <td align="center">Multiple Attempts (Acc)</td>
163
- <td align="center"><ins><strong>71.6</strong></ins></td>
164
- <td align="center">—</td>
165
- <td align="center">—</td>
166
- <td align="center"><strong>80.2</strong></td>
167
- <td align="center">79.4<sup>*</sup></td>
168
- <td align="center">—</td>
169
- <td align="center">—</td>
170
- </tr>
171
-
172
- <tr>
173
- <td align="center">SWE-bench Multilingual<br /> <sup>(Agentic Coding)</sup></td>
174
- <td align="center">Single Attempt (Acc)</td>
175
- <td align="center"><ins><strong>47.3</strong> </ins></td>
176
- <td align="center">25.8</td>
177
- <td align="center">20.9</td>
178
- <td align="center"><strong>51.0</strong></td>
179
- <td align="center">—</td>
180
- <td align="center">31.5</td>
181
- <td align="center">—</td>
182
- </tr>
183
-
184
- <tr>
185
- <td align="center" rowspan="2">TerminalBench</td>
186
- <td align="center">Inhouse Framework (Acc)</td>
187
- <td align="center"><ins><strong>30.0</strong></ins></td>
188
- <td align="center">—</td>
189
- <td align="center">—</td>
190
- <td align="center">35.5</td>
191
- <td align="center"><strong>43.2</strong></td>
192
- <td align="center">8.3</td>
193
- <td align="center">—</td>
194
- </tr>
195
-
196
- <tr>
197
- <!--<td align="center">TerminalBench</td>-->
198
- <td align="center">Terminus (Acc)</td>
199
- <td align="center"><ins><strong>25.0</strong> </ins></td>
200
- <td align="center">16.3</td>
201
- <td align="center">6.6</td>
202
- <td align="center">—</td>
203
- <td align="center">—</td>
204
- <td align="center"><strong>30.3</strong></td>
205
- <td align="center">16.8</td>
206
- </tr>
207
- <tr>
208
- <td align="center">Aider-Polyglot</td>
209
- <td align="center">Acc</td>
210
- <td align="center">60.0</td>
211
- <td align="center">55.1</td>
212
- <td align="center"><ins><strong>61.8</strong></ins></td>
213
- <td align="center">56.4</td>
214
- <td align="center"><strong>70.7</strong></td>
215
- <td align="center">52.4</td>
216
- <td align="center">44.0</td>
217
- </tr>
218
- <tr>
219
- <td align="center" colspan=9><strong>Tool Use Tasks</strong></td>
220
- </tr>
221
- <tr>
222
- <td align="center">Tau2 retail</td>
223
- <td align="center">Avg@4</td>
224
- <td align="center"><ins><strong>70.6</strong></ins></td>
225
- <td align="center">69.1</td>
226
- <td align="center">57.0</td>
227
- <td align="center">75.0</td>
228
- <td align="center"><strong>81.8</strong></td>
229
- <td align="center">74.8</td>
230
- <td align="center">64.3</td>
231
- </tr>
232
- <tr>
233
- <td align="center">Tau2 airline</td>
234
- <td align="center">Avg@4</td>
235
- <td align="center"><ins><strong>56.5</strong></ins></td>
236
- <td align="center">39.0</td>
237
- <td align="center">26.5</td>
238
- <td align="center">55.5</td>
239
- <td align="center"><strong>60.0</strong></td>
240
- <td align="center">54.5</td>
241
- <td align="center">42.5</td>
242
- </tr>
243
- <tr>
244
- <td align="center">Tau2 telecom</td>
245
- <td align="center">Avg@4</td>
246
- <td align="center"><strong>65.8</strong></td>
247
- <td align="center">32.5</td>
248
- <td align="center">22.1</td>
249
- <td align="center">45.2</td>
250
- <td align="center">57.0</td>
251
- <td align="center">38.6</td>
252
- <td align="center">16.9</td>
253
- </tr>
254
- <tr>
255
- <td align="center">AceBench</td>
256
- <td align="center">Acc</td>
257
- <td align="center"><ins><strong>76.5</strong></ins></td>
258
- <td align="center">72.7</td>
259
- <td align="center">70.5</td>
260
- <td align="center">76.2</td>
261
- <td align="center">75.6</td>
262
- <td align="center"><strong>80.1</strong></td>
263
- <td align="center">74.5</td>
264
- </tr>
265
- <tr>
266
- <td align="center" colspan=9><strong>Math &amp; STEM Tasks</strong></td>
267
- </tr>
268
- <tr>
269
- <td align="center">AIME 2024</td>
270
- <td align="center">Avg@64</td>
271
- <td align="center"><strong>69.6</strong></td>
272
- <td align="center">59.4<sup>*</sup></td>
273
- <td align="center">40.1<sup>*</sup></td>
274
- <td align="center">43.4</td>
275
- <td align="center">48.2</td>
276
- <td align="center">46.5</td>
277
- <td align="center">61.3</td>
278
- </tr>
279
- <tr>
280
- <td align="center">AIME 2025</td>
281
- <td align="center">Avg@64</td>
282
- <td align="center"><strong>49.5</strong></td>
283
- <td align="center">46.7</td>
284
- <td align="center">24.7<sup>*</sup></td>
285
- <td align="center">33.1<sup>*</sup></td>
286
- <td align="center">33.9<sup>*</sup></td>
287
- <td align="center">37.0</td>
288
- <td align="center">46.6</td>
289
- </tr>
290
- <tr>
291
- <td align="center">MATH-500</td>
292
- <td align="center">Acc</td>
293
- <td align="center"><strong>97.4</strong></td>
294
- <td align="center">94.0<sup>*</sup></td>
295
- <td align="center">91.2<sup>*</sup></td>
296
- <td align="center">94.0</td>
297
- <td align="center">94.4</td>
298
- <td align="center">92.4</td>
299
- <td align="center">95.4</td>
300
- </tr>
301
- <tr>
302
- <td align="center">HMMT 2025</td>
303
- <td align="center">Avg@32</td>
304
- <td align="center"><strong>38.8</strong></td>
305
- <td align="center">27.5</td>
306
- <td align="center">11.9</td>
307
- <td align="center">15.9</td>
308
- <td align="center">15.9</td>
309
- <td align="center">19.4</td>
310
- <td align="center">34.7</td>
311
- </tr>
312
- <tr>
313
- <td align="center">CNMO 2024</td>
314
- <td align="center">Avg@16</td>
315
- <td align="center">74.3</td>
316
- <td align="center"><ins><strong>74.7</strong></ins></td>
317
- <td align="center">48.6</td>
318
- <td align="center">60.4</td>
319
- <td align="center">57.6</td>
320
- <td align="center">56.6</td>
321
- <td align="center"><strong>75.0</strong></td>
322
- </tr>
323
- <tr>
324
- <td align="center">PolyMath-en</td>
325
- <td align="center">Avg@4</td>
326
- <td align="center"><strong>65.1</strong></td>
327
- <td align="center">59.5</td>
328
- <td align="center">51.9</td>
329
- <td align="center">52.8</td>
330
- <td align="center">49.8</td>
331
- <td align="center">54.0</td>
332
- <td align="center">49.9</td>
333
- </tr>
334
-
335
- <tr>
336
- <td align="center">ZebraLogic</td>
337
- <td align="center">Acc</td>
338
- <td align="center"><strong>89.0</strong></td>
339
- <td align="center">84.0</td>
340
- <td align="center">37.7<sup>*</sup></td>
341
- <td align="center">73.7</td>
342
- <td align="center">59.3</td>
343
- <td align="center">58.5</td>
344
- <td align="center">57.9</td>
345
- </tr>
346
-
347
- <tr>
348
- <td align="center">AutoLogi</td>
349
- <td align="center">Acc</td>
350
- <td align="center"><ins><strong>89.5</strong></ins></td>
351
- <td align="center">88.9</td>
352
- <td align="center">83.3</td>
353
- <td align="center"><strong>89.8</strong></td>
354
- <td align="center">86.1</td>
355
- <td align="center">88.2</td>
356
- <td align="center">84.1</td>
357
- </tr>
358
-
359
- <tr>
360
- <td align="center">GPQA-Diamond</td>
361
- <td align="center">Avg@8</td>
362
- <td align="center"><strong>75.1</strong></td>
363
- <td align="center">68.4<sup>*</sup></td>
364
- <td align="center">62.9<sup>*</sup></td>
365
- <td align="center">70.0<sup>*</sup></td>
366
- <td align="center">74.9<sup>*</sup></td>
367
- <td align="center">66.3</td>
368
- <td align="center">68.2</td>
369
- </tr>
370
-
371
- <tr>
372
- <td align="center">SuperGPQA</td>
373
- <td align="center">Acc</td>
374
- <td align="center"><strong>57.2</strong></td>
375
- <td align="center">53.7</td>
376
- <td align="center">50.2</td>
377
- <td align="center">55.7</td>
378
- <td align="center">56.5</td>
379
- <td align="center">50.8</td>
380
- <td align="center">49.6</td>
381
- </tr>
382
-
383
- <tr>
384
- <td align="center">Humanity's Last Exam<br><sup>(Text Only)</sup></td>
385
- <td align="center">-</td>
386
- <td align="center">4.7</td>
387
- <td align="center">5.2</td>
388
- <td align="center"><ins><strong>5.7</strong></ins></td>
389
- <td align="center">5.8</td>
390
- <td align="center"><strong>7.1</strong></td>
391
- <td align="center">3.7</td>
392
- <td align="center">5.6</td>
393
- </tr>
394
-
395
- <tr>
396
- <td align="center" colspan=9><strong>General Tasks</strong></td>
397
- </tr>
398
-
399
- <tr>
400
- <td align="center">MMLU</td>
401
- <td align="center">EM</td>
402
- <td align="center"><ins><strong>89.5</strong></ins></td>
403
- <td align="center">89.4</td>
404
- <td align="center">87.0</td>
405
- <td align="center">91.5</td>
406
- <td align="center"><strong>92.9</strong></td>
407
- <td align="center">90.4</td>
408
- <td align="center">90.1</td>
409
- </tr>
410
-
411
- <tr>
412
- <td align="center">MMLU-Redux</td>
413
- <td align="center">EM</td>
414
- <td align="center"><ins><strong>92.7</strong></ins></td>
415
- <td align="center">90.5</td>
416
- <td align="center">89.2</td>
417
- <td align="center">93.6</td>
418
- <td align="center"><strong>94.2</strong></td>
419
- <td align="center">92.4</td>
420
- <td align="center">90.6</td>
421
- </tr>
422
-
423
- <tr>
424
- <td align="center">MMLU-Pro</td>
425
- <td align="center">EM</td>
426
- <td align="center">81.1</td>
427
- <td align="center"><ins><strong>81.2</strong></ins><sup>*</sup></td>
428
- <td align="center">77.3</td>
429
- <td align="center">83.7</td>
430
- <td align="center"><strong>86.6</strong></td>
431
- <td align="center">81.8</td>
432
- <td align="center">79.4</td>
433
- </tr>
434
-
435
- <tr>
436
- <td align="center">IFEval</td>
437
- <td align="center">Prompt Strict</td>
438
- <td align="center"><strong>89.8</strong></td>
439
- <td align="center">81.1</td>
440
- <td align="center">83.2<sup>*</sup></td>
441
- <td align="center">87.6</td>
442
- <td align="center">87.4</td>
443
- <td align="center">88.0</td>
444
- <td align="center">84.3</td>
445
- </tr>
446
-
447
- <tr>
448
- <td align="center">Multi-Challenge</td>
449
- <td align="center">Acc</td>
450
- <td align="center"><strong>54.1</strong></td>
451
- <td align="center">31.4</td>
452
- <td align="center">34.0</td>
453
- <td align="center">46.8</td>
454
- <td align="center">49.0</td>
455
- <td align="center">36.4</td>
456
- <td align="center">39.5</td>
457
- </tr>
458
-
459
- <tr>
460
- <td align="center">SimpleQA</td>
461
- <td align="center">Correct</td>
462
- <td align="center"><ins><strong>31.0</strong></ins></td>
463
- <td align="center">27.7</td>
464
- <td align="center">13.2</td>
465
- <td align="center">15.9</td>
466
- <td align="center">22.8</td>
467
- <td align="center"><strong>42.3</strong></td>
468
- <td align="center">23.3</td>
469
- </tr>
470
-
471
- <tr>
472
- <td align="center">Livebench</td>
473
- <td align="center">Pass@1</td>
474
- <td align="center"><strong>76.4</strong></td>
475
- <td align="center">72.4</td>
476
- <td align="center">67.6</td>
477
- <td align="center">74.8</td>
478
- <td align="center">74.6</td>
479
- <td align="center">69.8</td>
480
- <td align="center">67.8</td>
481
- </tr>
482
- </tbody>
483
- </table>
484
- </div>
485
- <sup>
486
- • Bold denotes global SOTA, and underlined denotes open-source SOTA.
487
- </sup><br/><sup>
488
- • Data points marked with * are taken directly from the model's tech report or blog.
489
- </sup><br/><sup>
490
- • All metrics, except for SWE-bench Verified (Agentless), are evaluated with an 8k output token length. SWE-bench Verified (Agentless) is limited to a 16k output token length.
491
- </sup><br/><sup>
492
- • Kimi K2 achieves 65.8% pass@1 on the SWE-bench Verified tests with bash/editor tools (single-attempt patches, no test-time compute). It also achieves a 47.3% pass@1 on the SWE-bench Multilingual tests under the same conditions. Additionally, we report results on SWE-bench Verified tests (71.6%) that leverage parallel test-time compute by sampling multiple sequences and selecting the single best via an internal scoring model.
493
- </sup><br/><sup>
494
- • To ensure the stability of the evaluation, we employed avg@k on AIME, HMMT, CNMO, PolyMath-en, GPQA-Diamond, EvalPlus, and Tau2.
495
- </sup><br/><sup>
496
- • Some data points have been omitted due to prohibitively expensive evaluation costs.
497
- </sup>
498
-
499
- ---
500
 
501
- #### Base model evaluation results
502
 
503
- <div align="center">
 
504
 
505
- <table>
506
- <thead>
507
- <tr>
508
- <th align="center">Benchmark</th>
509
- <th align="center">Metric</th>
510
- <th align="center">Shot</th>
511
- <th align="center">Kimi K2 Base</th>
512
- <th align="center">Deepseek-V3-Base</th>
513
- <th align="center">Qwen2.5-72B</th>
514
- <th align="center">Llama 4 Maverick</th>
515
- </tr>
516
- </thead>
517
- <tbody>
518
- <tr>
519
- <td align="center" colspan="7"><strong>General Tasks</strong></td>
520
- </tr>
521
- <tr>
522
- <td align="center">MMLU</td>
523
- <td align="center">EM</td>
524
- <td align="center">5-shot</td>
525
- <td align="center"><strong>87.8</strong></td>
526
- <td align="center">87.1</td>
527
- <td align="center">86.1</td>
528
- <td align="center">84.9</td>
529
- </tr>
530
- <tr>
531
- <td align="center">MMLU-pro</td>
532
- <td align="center">EM</td>
533
- <td align="center">5-shot</td>
534
- <td align="center"><strong>69.2</strong></td>
535
- <td align="center">60.6</td>
536
- <td align="center">62.8</td>
537
- <td align="center">63.5</td>
538
- </tr>
539
- <tr>
540
- <td align="center">MMLU-redux-2.0</td>
541
- <td align="center">EM</td>
542
- <td align="center">5-shot</td>
543
- <td align="center"><strong>90.2</strong></td>
544
- <td align="center">89.5</td>
545
- <td align="center">87.8</td>
546
- <td align="center">88.2</td>
547
- </tr>
548
- <tr>
549
- <td align="center">SimpleQA</td>
550
- <td align="center">Correct</td>
551
- <td align="center">5-shot</td>
552
- <td align="center"><strong>35.3</strong></td>
553
- <td align="center">26.5</td>
554
- <td align="center">10.3</td>
555
- <td align="center">23.7</td>
556
- </tr>
557
- <tr>
558
- <td align="center">TriviaQA</td>
559
- <td align="center">EM</td>
560
- <td align="center">5-shot</td>
561
- <td align="center"><strong>85.1</strong></td>
562
- <td align="center">84.1</td>
563
- <td align="center">76.0</td>
564
- <td align="center">79.3</td>
565
- </tr>
566
- <tr>
567
- <td align="center">GPQA-Diamond</td>
568
- <td align="center">Avg@8</td>
569
- <td align="center">5-shot</td>
570
- <td align="center">48.1</td>
571
- <td align="center"><strong>50.5</strong></td>
572
- <td align="center">40.8</td>
573
- <td align="center">49.4</td>
574
- </tr>
575
- <tr>
576
- <td align="center">SuperGPQA</td>
577
- <td align="center">EM</td>
578
- <td align="center">5-shot</td>
579
- <td align="center"><strong>44.7</strong></td>
580
- <td align="center">39.2</td>
581
- <td align="center">34.2</td>
582
- <td align="center">38.8</td>
583
- </tr>
584
- <tr>
585
- <td align="center" colspan="7"><strong>Coding Tasks</strong></td>
586
- </tr>
587
- <tr>
588
- <td align="center">LiveCodeBench v6</td>
589
- <td align="center">Pass@1</td>
590
- <td align="center">1-shot</td>
591
- <td align="center"><strong>26.3</strong></td>
592
- <td align="center">22.9</td>
593
- <td align="center">21.1</td>
594
- <td align="center">25.1</td>
595
- </tr>
596
- <tr>
597
- <td align="center">EvalPlus</td>
598
- <td align="center">Pass@1</td>
599
- <td align="center">-</td>
600
- <td align="center"><strong>80.3</strong></td>
601
- <td align="center">65.6</td>
602
- <td align="center">66.0</td>
603
- <td align="center">65.5</td>
604
- </tr>
605
- <tr>
606
- <td align="center" colspan="7"><strong>Mathematics Tasks</strong></td>
607
- </tr>
608
- <tr>
609
- <td align="center">MATH</td>
610
- <td align="center">EM</td>
611
- <td align="center">4-shot</td>
612
- <td align="center"><strong>70.2</strong></td>
613
- <td align="center">60.1</td>
614
- <td align="center">61.0</td>
615
- <td align="center">63.0</td>
616
- </tr>
617
- <tr>
618
- <td align="center">GSM8k</td>
619
- <td align="center">EM</td>
620
- <td align="center">8-shot</td>
621
- <td align="center"><strong>92.1</strong></td>
622
- <td align="center">91.7</td>
623
- <td align="center">90.4</td>
624
- <td align="center">86.3</td>
625
- </tr>
626
- <tr>
627
- <td align="center" colspan="7"><strong>Chinese Tasks</strong></td>
628
- </tr>
629
- <tr>
630
- <td align="center">C-Eval</td>
631
- <td align="center">EM</td>
632
- <td align="center">5-shot</td>
633
- <td align="center"><strong>92.5</strong></td>
634
- <td align="center">90.0</td>
635
- <td align="center">90.9</td>
636
- <td align="center">80.9</td>
637
- </tr>
638
- <tr>
639
- <td align="center">CSimpleQA</td>
640
- <td align="center">Correct</td>
641
- <td align="center">5-shot</td>
642
- <td align="center"><strong>77.6</strong></td>
643
- <td align="center">72.1</td>
644
- <td align="center">50.5</td>
645
- <td align="center">53.5</td>
646
- </tr>
647
- </tbody>
648
- </table>
649
- </div>
650
- <sup>
651
- • We only evaluate open-source pretrained models in this work. We report results for Qwen2.5-72B because the base checkpoint for Qwen3-235B-A22B was not open-sourced at the time of our study.
652
- </sup><br/><sup>
653
- • All models are evaluated using the same evaluation protocol.
654
 
655
- </sup>
656
 
657
 
658
  ## 4. Deployment
@@ -713,7 +141,6 @@ The following example demonstrates calling a weather tool end-to-end:
713
  # Your tool implementation
714
  def get_weather(city: str) -> dict:
715
  return {"weather": "Sunny"}
716
-
717
  # Tool schema definition
718
  tools = [{
719
  "type": "function",
@@ -732,12 +159,10 @@ tools = [{
732
  }
733
  }
734
  }]
735
-
736
  # Map tool names to their implementations
737
  tool_map = {
738
  "get_weather": get_weather
739
  }
740
-
741
  def tool_call_with_client(client: OpenAI, model_name: str):
742
  messages = [
743
  {"role": "system", "content": "You are Kimi, an AI assistant created by Moonshot AI."},
@@ -762,7 +187,6 @@ def tool_call_with_client(client: OpenAI, model_name: str):
762
  tool_function = tool_map[tool_call_name]
763
  tool_result = tool_function(**tool_call_arguments)
764
  print("tool_result:", tool_result)
765
-
766
  messages.append({
767
  "role": "tool",
768
  "tool_call_id": tool_call.id,
 
8
  <img src="figures/kimi-logo.png" width="30%" alt="Kimi K2: Open Agentic Intelligence">
9
  </picture>
10
  </div>
 
11
  <hr>
12
 
13
  <div align="center" style="line-height:1">
 
21
  <a href="https://twitter.com/kimi_moonshot" target="_blank"><img alt="Twitter Follow" src="https://img.shields.io/badge/Twitter-Kimi.ai-white?logo=x&logoColor=white"/></a>
22
  <a href="https://discord.gg/TYU2fdJykW" target="_blank"><img alt="Discord" src="https://img.shields.io/badge/Discord-Kimi.ai-white?logo=discord&logoColor=white"/></a>
23
  </div>
 
24
  <div align="center" style="line-height: 1;">
25
  <a href="https://github.com/moonshotai/Kimi-K2/blob/main/LICENSE"><img alt="License" src="https://img.shields.io/badge/License-Modified_MIT-f5de53?&color=f5de53"/></a>
26
  </div>
 
32
 
33
  ## 1. Model Introduction
34
 
35
+ Kimi K2-Instruct-0905 is the latest, most capable version of Kimi K2. It is a state-of-the-art mixture-of-experts (MoE) language model, featuring 32 billion activated parameters and a total of 1 trillion parameters.
36
 
37
  ### Key Features
38
+ - Enhanced agentic coding intelligence: Kimi K2-Instruct-0905 demonstrates significant performance improvements on public benchmarks and in real-world coding-agent tasks.
39
+ - Improved frontend coding experience: Kimi K2-Instruct-0905 produces frontend code that is both more aesthetically polished and more practical.
40
+ - Extended context length: Kimi K2-Instruct-0905’s context window has been increased from 128k to 256k tokens, providing better support for long-horizon tasks.
41

42
 
43
  ## 2. Model Summary
44
 
 
66
 
67
  ## 3. Evaluation Results
68
 
69
+ | Benchmark | Metric | K2-Instruct-0905 | K2-Instruct-0711 | Qwen3-Coder-480B-A35B-Instruct | GLM-4.5 | DeepSeek-V3.1 | Claude-Sonnet-4 | Claude-Opus-4 |
70
+ |------------------------|--------|------------------|------------------|--------|--------|--------|-----------------|---------------|
71
+ | SWE-Bench Verified | ACC | 69.2 ± 0.63 | 65.8 | 69.6* | 64.2* | 66.0* | 72.7 | 72.5 |
72
+ | SWE-Bench Multilingual | ACC | 55.9 ± 0.72 | 47.3 | 54.7* | 52.7 | 54.5* | 53.3* | - |
73
+ | Multi-SWE-Bench | ACC | 33.5 ± 0.28 | 31.3 | 32.7 | 31.7 | 29.0 | 35.7 | - |
74
+ | Terminal-Bench | ACC | 44.5 ± 2.03 | 37.5 | 37.5* | 39.9* | 31.3* | 36.4* | 43.2* |
75
+ | SWE-Dev | ACC | 66.6 ± 0.72 | 61.9 | 64.7 | 63.2 | 53.3 | 67.1 | - |

76
 
 
77
 
78
+ All K2-Instruct-0905 numbers are reported as mean ± std over five independent, full-test-set runs.
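A minimal sketch of this reporting convention, for readers reproducing the aggregation (the scores below are made-up placeholders, not actual run results):

```python
# Illustrative only: fold five full-test-set runs into "mean ± std".
# statistics.stdev is the sample standard deviation (n - 1 denominator).
from statistics import mean, stdev

def report(scores: list[float]) -> str:
    return f"{mean(scores):.1f} ± {stdev(scores):.2f}"

print(report([69.8, 68.9, 69.3, 70.1, 68.1]))  # placeholder scores -> "69.2 ± 0.79"
```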
79
+ Before each run we prune the repository so that every Git object unreachable from the target commit is removed; this guarantees the agent sees only the code that would legitimately be available at that point in history.
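A minimal sketch of such a pruning pass, assuming a local clone and a target commit SHA (the helper name and the exact git invocations are our illustration, not the harness's actual code):

```python
# Hypothetical helper: make every Git object unreachable from `commit_sha`
# collectable, then garbage-collect it away. Requires `git` on PATH.
import subprocess

def _git(repo_dir: str, *args: str) -> str:
    out = subprocess.run(["git", "-C", repo_dir, *args],
                         check=True, capture_output=True, text=True)
    return out.stdout

def prune_to_commit(repo_dir: str, commit_sha: str) -> None:
    # Detach HEAD at the target commit so no branch keeps newer objects alive.
    _git(repo_dir, "checkout", "--detach", commit_sha)
    # Delete every branch, tag, and remote-tracking ref.
    for ref in _git(repo_dir, "for-each-ref", "--format=%(refname)").split():
        _git(repo_dir, "update-ref", "-d", ref)
    # Expire the reflog, then drop all now-unreachable objects immediately.
    _git(repo_dir, "reflog", "expire", "--expire=now", "--all")
    _git(repo_dir, "gc", "--prune=now", "--aggressive")
```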
80
 
81
+ Except for Terminal-Bench (Terminus-2), every result was produced with our in-house evaluation harness. The harness is derived from SWE-agent, but we clamp the context windows of the Bash and Edit tools and rewrite the system prompt to match the task semantics. All baseline figures marked with an asterisk (*) are taken directly from the corresponding official reports or public leaderboards; the remaining metrics were evaluated by us under conditions identical to those used for K2-Instruct-0905.
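The clamping can be as simple as capping each tool observation at a fixed token budget before it re-enters the context. A purely illustrative sketch (the 4096-token budget and the keep-the-tail policy are our assumptions, not the actual harness settings; `tokenizer` is any object with `encode`/`decode`, e.g. a Hugging Face tokenizer):

```python
# Illustrative only: truncate long bash/editor output so a single observation
# cannot flood the model's context window.
MAX_TOOL_TOKENS = 4096  # assumed budget, not the real harness value

def clamp_tool_output(text: str, tokenizer) -> str:
    ids = tokenizer.encode(text)
    if len(ids) <= MAX_TOOL_TOKENS:
        return text
    # Keep the tail: for shell commands the last lines usually matter most
    # (exit status, final error, resulting file state).
    tail = tokenizer.decode(ids[-MAX_TOOL_TOKENS:])
    return "[... output truncated ...]\n" + tail
```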
 
82
 
83
+ For SWE-Dev we go one step further: we overwrite the original repository files and delete any test file that exercises the functions the agent is expected to generate, eliminating any indirect hints about the desired implementation.
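A sketch of what that sanitation step can look like (the glob patterns and the substring heuristic are illustrative assumptions, not our exact pipeline):

```python
# Illustrative only: delete any test file that mentions one of the functions
# the agent must implement, so tests cannot leak the expected behavior.
import re
from pathlib import Path

def drop_leaky_tests(repo: Path, target_functions: list[str]) -> None:
    if not target_functions:
        return
    pattern = re.compile("|".join(re.escape(name) for name in target_functions))
    for glob in ("test_*.py", "*_test.py"):
        for test_file in repo.rglob(glob):
            if pattern.search(test_file.read_text(errors="ignore")):
                test_file.unlink()  # remove the hint-bearing test entirely
```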
84
 
85
 
86
  ## 4. Deployment
 
141
  # Your tool implementation
142
  def get_weather(city: str) -> dict:
143
  return {"weather": "Sunny"}
 
144
  # Tool schema definition
145
  tools = [{
146
  "type": "function",
 
159
  }
160
  }
161
  }]
 
162
  # Map tool names to their implementations
163
  tool_map = {
164
  "get_weather": get_weather
165
  }
 
166
  def tool_call_with_client(client: OpenAI, model_name: str):
167
  messages = [
168
  {"role": "system", "content": "You are Kimi, an AI assistant created by Moonshot AI."},
 
187
  tool_function = tool_map[tool_call_name]
188
  tool_result = tool_function(**tool_call_arguments)
189
  print("tool_result:", tool_result)
 
190
  messages.append({
191
  "role": "tool",
192
  "tool_call_id": tool_call.id,