Add model-index with benchmark evaluations
#8 opened by davidlms
Added structured evaluation results from the README benchmark tables:
Reasoning Benchmarks:
- AIME25: 0.721
- AIME24: 0.775
- GPQA Diamond: 0.534
- LiveCodeBench: 0.548

Instruct Benchmarks:
- Arena Hard: 0.305
- WildBench: 56.8
- MATH Maj@1: 0.830
- MM MTBench: 7.83

Base Model Benchmarks:
- Multilingual MMLU: 0.652
- MATH CoT 2-Shot: 0.601
- AGIEval 5-shot: 0.511
- MMLU Redux 5-shot: 0.735
- MMLU 5-shot: 0.707
- TriviaQA 5-shot: 0.592
Total: 14 benchmarks across reasoning, instruction-following, and base capabilities.
This enables the model to appear on leaderboards and makes it easier to compare it with other models.
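For reference, here is a minimal sketch of how such a model-index block looks in the README's YAML front matter, following the Hub's model card metadata format. The model name, task type, dataset `type` identifiers, and metric names below are illustrative placeholders, not the exact values used in this PR:

```yaml
model-index:
- name: Example-Model            # placeholder; the PR uses the repository's model name
  results:
  - task:
      type: text-generation      # assumed task type, for illustration only
    dataset:
      name: AIME25
      type: aime25               # dataset type identifier is illustrative
    metrics:
    - name: pass@1               # metric name/type are assumptions
      type: pass@1
      value: 0.721
  - task:
      type: text-generation
    dataset:
      name: GPQA Diamond
      type: gpqa-diamond
    metrics:
    - name: accuracy
      type: accuracy
      value: 0.534
```

Each of the 14 benchmarks maps to one entry under `results`, and the Hub parses this block to surface the scores in the model card's evaluation section. If useful, the metadata can be sanity-checked locally with `huggingface_hub.ModelCard.load()`, which exposes the parsed entries via `card.data.eval_results`.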
Note: PR #6 only adds the transformers tag and doesn't conflict with this benchmark metadata addition.