Commit 3e16da2 by xuebi (parent: c6c3633)

Update README.md

Files changed (1):
  1. README.md (+21 -6)

README.md CHANGED
@@ -62,13 +62,16 @@ library_name: transformers
     <a href="https://www.modelscope.cn/organization/MiniMax" target="_blank" style="margin: 2px;">
         🤖️ ModelScope
     </a> |
-    <a href="https://github.com/MiniMax-AI/MiniMax-M2.1/blob/main/LICENSE" style="margin: 2px;">
+    <a href="https://github.com/MiniMax-AI/MiniMax-M2.5/blob/main/LICENSE" style="margin: 2px;">
         📄 License: Modified-MIT
     </a>
 </div>

 <p align="center">
-  <img width="100%" src="figures/bench_11.png">
+  <picture>
+    <img class="hidden dark:block" width="100%" src="figures/bench_11.png">
+    <img class="dark:hidden" width="100%" src="figures/bench_12.png">
+  </picture>
 </p>

 Today we're introducing our latest model, **MiniMax-M2.5**.
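
The hunk header shows `library_name: transformers` in the card's frontmatter, so the checkpoint is meant to load through the standard transformers API. A minimal sketch of that path; the hub repo id, dtype handling, and chat template below are assumptions for illustration, not values confirmed by this diff:

```python
# Minimal sketch: loading a chat checkpoint via transformers.
# The repo id is a guess based on the org name in this README; adjust
# it to the actual hub id before running.
from transformers import AutoModelForCausalLM, AutoTokenizer

repo_id = "MiniMaxAI/MiniMax-M2.5"  # hypothetical hub id
tokenizer = AutoTokenizer.from_pretrained(repo_id)
model = AutoModelForCausalLM.from_pretrained(
    repo_id,
    torch_dtype="auto",      # keep the dtype stored in the checkpoint
    device_map="auto",       # shard across available GPUs (needs accelerate)
    trust_remote_code=True,  # only if the checkpoint ships custom modeling code
)

messages = [{"role": "user", "content": "Summarize what MiniMax-M2.5 is."}]
inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)
output = model.generate(inputs, max_new_tokens=256)
# Decode only the newly generated tokens, not the prompt.
print(tokenizer.decode(output[0][inputs.shape[-1]:], skip_special_tokens=True))
```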
@@ -84,7 +87,10 @@ M2.5 is the first frontier model where users do not need to worry about cost, de
 In programming evaluations, MiniMax-M2.5 saw substantial improvements compared to previous generations, reaching SOTA levels. The performance of M2.5 in multilingual tasks is especially pronounced.

 <p align="center">
-  <img width="100%" src="figures/bench_2.png">
+  <picture>
+    <img class="hidden dark:block" width="100%" src="figures/bench_2.png">
+    <img class="dark:hidden" width="100%" src="figures/bench_1.png">
+  </picture>
 </p>

 A significant improvement from previous generations is M2.5's ability to think and plan like an architect. The Spec-writing tendency of the model emerged during training: before writing any code, M2.5 actively decomposes and plans the features, structure, and UI design of the project from the perspective of an experienced software architect.
@@ -94,7 +100,10 @@ M2.5 was trained on over 10 languages (including Go, C, C++, TypeScript, Rust, K
 To evaluate these capabilities, we also upgraded the VIBE benchmark to a more complex and challenging Pro version, significantly increasing task complexity, domain coverage, and evaluation accuracy. Overall, M2.5 performs on par with Opus 4.5.

 <p align="center">
-  <img width="100%" src="figures/bench_4.png">
+  <picture>
+    <img class="hidden dark:block" width="100%" src="figures/bench_4.png">
+    <img class="dark:hidden" width="100%" src="figures/bench_3.png">
+  </picture>
 </p>

 We focused on the model's ability to generalize across out-of-distribution harnesses. We tested performance on the SWE-Bench Verified evaluation set using different coding agent harnesses.
@@ -104,7 +113,10 @@ We focused on the model's ability to generalize across out-of-distribution harne
 ## Search and Tool calling

 <p align="center">
-  <img width="100%" src="figures/bench_6.png">
+  <picture>
+    <img class="hidden dark:block" width="100%" src="figures/bench_6.png">
+    <img class="dark:hidden" width="100%" src="figures/bench_5.png">
+  </picture>
 </p>

 Effective tool calling and search are prerequisites for a model's ability to autonomously handle more complex tasks. In evaluations on benchmarks such as BrowseComp and Wide Search, M2.5 achieved industry-leading performance. At the same time, the model's generalization has also improved — M2.5 demonstrates more stable performance when facing unfamiliar scaffolding environments.
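
The tool-calling loop referenced in that hunk is the usual structured function-call round trip: the model emits a call, the harness executes it, and the result is fed back for a final answer. A minimal sketch against an OpenAI-compatible endpoint; the base URL, API key, model id, and `web_search` tool are placeholders, not values from this README:

```python
# Sketch of one tool-calling round trip through an OpenAI-compatible
# chat.completions API. Endpoint, key, and model name are assumptions.
import json
from openai import OpenAI

client = OpenAI(base_url="https://example-endpoint/v1", api_key="YOUR_KEY")

tools = [{
    "type": "function",
    "function": {
        "name": "web_search",  # hypothetical tool exposed by the harness
        "description": "Search the web and return top results as text.",
        "parameters": {
            "type": "object",
            "properties": {"query": {"type": "string"}},
            "required": ["query"],
        },
    },
}]

messages = [{"role": "user", "content": "Who won the 2024 Turing Award?"}]
resp = client.chat.completions.create(
    model="MiniMax-M2.5",  # placeholder model id
    messages=messages,
    tools=tools,
)
msg = resp.choices[0].message

if msg.tool_calls:
    call = msg.tool_calls[0]
    args = json.loads(call.function.arguments)
    result = f"stub result for {args['query']}"  # run the real tool here
    # Feed the assistant's call and the tool output back to the model.
    messages += [msg, {"role": "tool", "tool_call_id": call.id, "content": result}]
    final = client.chat.completions.create(
        model="MiniMax-M2.5", messages=messages, tools=tools
    )
    print(final.choices[0].message.content)
```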
@@ -118,7 +130,10 @@ Compared to its predecessors, M2.5 also demonstrates much better decision-making
 M2.5 was trained to produce truly deliverable outputs in office scenarios. To this end, we engaged in thorough collaboration with senior professionals in fields such as finance, law, and social sciences. They designed requirements, provided feedback, participated in defining standards, and directly contributed to data construction, bringing the tacit knowledge of their industries into the model's training pipeline. Based on this foundation, M2.5 has achieved significant capability improvements in high-value workspace scenarios such as Word, PowerPoint, and Excel financial modeling. On the evaluation side, we built an internal Cowork Agent evaluation framework (GDPval-MM) that assesses both the quality of the deliverable and the professionalism of the agent's trajectory through pairwise comparisons, while also monitoring token costs across the entire workflow to estimate the model's real-world productivity gains. In comparisons against other mainstream models, it achieved an average win rate of 59.0%.

 <p align="center">
-  <img width="100%" src="figures/bench_8.png">
+  <picture>
+    <img class="hidden dark:block" width="100%" src="figures/bench_8.png">
+    <img class="dark:hidden" width="100%" src="figures/bench_7.png">
+  </picture>
 </p>

 ## Efficiency
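
On the 59.0% figure: a pairwise win rate of this kind is typically the share of head-to-head judgments the model wins, averaged over opponents. A small sketch of that arithmetic, under the assumption that ties count as half a win (the README does not specify tie handling, and the numbers below are invented to illustrate the computation):

```python
# Hypothetical pairwise judgments per baseline: 1.0 = M2.5 preferred,
# 0.0 = baseline preferred, 0.5 = tie. Made-up data, not GDPval-MM results.
judgments = {
    "baseline_a": [1.0, 0.5, 1.0, 0.0, 1.0],
    "baseline_b": [0.5, 0.0, 1.0, 1.0, 0.0],
}

# Win rate per opponent, then the unweighted average across opponents.
per_opponent = {name: sum(s) / len(s) for name, s in judgments.items()}
average = sum(per_opponent.values()) / len(per_opponent)
print(per_opponent)      # {'baseline_a': 0.7, 'baseline_b': 0.5}
print(f"{average:.1%}")  # 60.0% with these made-up numbers
```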
 