Add model-index with benchmark evaluations

#8
by davidlms - opened

Added structured evaluation results from the README benchmark tables (a sketch of the resulting model-index structure follows the list):

Reasoning Benchmarks:

  • AIME25: 0.721
  • AIME24: 0.775
  • GPQA Diamond: 0.534
  • LiveCodeBench: 0.548

Instruct Benchmarks:

  • Arena Hard: 0.305
  • WildBench: 56.8
  • MATH Maj@1: 0.830
  • MM MTBench: 7.83

Base Model Benchmarks:

  • Multilingual MMLU: 0.652
  • MATH CoT 2-shot: 0.601
  • AGIEval 5-shot: 0.511
  • MMLU Redux 5-shot: 0.735
  • MMLU 5-shot: 0.707
  • TriviaQA 5-shot: 0.592

Total: 14 benchmarks across reasoning, instruction-following, and base capabilities.
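For reference, the model-index block added to the README front matter has roughly the following shape. This is a minimal sketch showing only the first reasoning entry; the model name, task type, dataset identifier, and metric name below are illustrative placeholders rather than the exact values used in this PR. The score shown is taken from the reasoning table above.

```yaml
# Minimal sketch of the model-index structure (one of the 14 entries).
# Names marked as placeholders are assumptions, not values from this PR.
model-index:
- name: Example-Model            # placeholder model name
  results:
  - task:
      type: text-generation      # assumed task type
    dataset:
      name: AIME25               # benchmark from the list above
      type: aime25               # illustrative dataset identifier
    metrics:
    - name: pass@1               # assumed metric name
      type: pass@1
      value: 0.721               # AIME25 score from the reasoning table
```

Each entry under `results` maps one benchmark score to a dataset/metric pair, which is the structure leaderboard tooling on the Hub reads.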

This enables the model to appear on leaderboards and makes it easier to compare with other models.

Note: PR #6 only adds the transformers tag and doesn't conflict with this benchmark metadata addition.

