Static quant of https://huggingface.co/allenai/Olmo-3-7B-Instruct

Model Description

  • Developed by: Allen Institute for AI (Ai2)
  • Model type: a Transformer-style autoregressive language model.
  • Language(s) (NLP): English
  • License: This model is licensed under Apache 2.0. It is intended for research and educational use in accordance with Ai2's Responsible Use Guidelines.
  • Contact: Technical inquiries: olmo@allenai.org. Press: press@allenai.org
  • Date cutoff: Dec. 2024.

Evaluation

| Skill | Benchmark | Olmo 3 Instruct 7B SFT | Olmo 3 Instruct 7B DPO | Olmo 3 Instruct 7B | Qwen 3 8B (no reasoning) | Qwen 3 VL 8B Instruct | Qwen 2.5 7B | Olmo 2 7B Instruct | Apertus 8B Instruct | Granite 3.3 8B Instruct |
|---|---|---|---|---|---|---|---|---|---|---|
| Math | MATH | 65.1 | 79.6 | 87.3 | 82.3 | 91.6 | 71.0 | 30.1 | 21.9 | 67.3 |
| | AIME 2024 | 6.7 | 23.5 | 44.3 | 26.2 | 55.1 | 11.3 | 1.3 | 0.5 | 7.3 |
| | AIME 2025 | 7.2 | 20.4 | 32.5 | 21.7 | 43.3 | 6.3 | 0.4 | 0.2 | 6.3 |
| | OMEGA | 14.4 | 22.8 | 28.9 | 20.5 | 32.3 | 13.7 | 5.2 | 5.0 | 10.7 |
| Reasoning | BigBenchHard | 51.0 | 69.3 | 71.2 | 73.7 | 85.6 | 68.8 | 43.8 | 42.2 | 61.2 |
| | ZebraLogic | 18.0 | 28.4 | 32.9 | 25.4 | 64.3 | 10.7 | 5.3 | 5.3 | 17.6 |
| | AGI Eval English | 59.2 | 64.0 | 64.4 | 76.0 | 84.5 | 69.8 | 56.1 | 50.8 | 64.0 |
| Coding | HumanEvalPlus | 69.8 | 72.9 | 77.2 | 79.8 | 82.9 | 74.9 | 25.8 | 34.4 | 64.0 |
| | MBPP+ | 56.5 | 55.9 | 60.2 | 64.4 | 66.3 | 62.6 | 40.7 | 42.1 | 54.0 |
| | LiveCodeBench v3 | 20.0 | 18.8 | 29.5 | 53.2 | 55.9 | 34.5 | 7.2 | 7.8 | 11.5 |
| IF | IFEval | 81.7 | 82.0 | 85.6 | 86.3 | 87.8 | 73.4 | 72.2 | 71.4 | 77.5 |
| | IFBench | 27.4 | 29.3 | 32.3 | 29.3 | 34.0 | 28.4 | 26.7 | 22.1 | 22.3 |
| Knowledge | MMLU | 67.1 | 69.1 | 69.1 | 80.4 | 83.6 | 77.2 | 61.6 | 62.7 | 63.5 |
| QA | PopQA | 16.5 | 20.7 | 14.1 | 20.4 | 26.5 | 21.5 | 25.5 | 25.5 | 28.9 |
| | GPQA | 30.0 | 37.9 | 40.4 | 44.6 | 51.1 | 35.6 | 31.3 | 28.8 | 33.0 |
| Chat | AlpacaEval 2 LC | 21.8 | 43.3 | 40.9 | 49.8 | 73.5 | 23.0 | 18.3 | 8.1 | 28.6 |
| Tool Use | SimpleQA | 74.2 | 79.8 | 79.3 | 79.0 | 90.3 | 78.0 | – | – | – |
| | LitQA2 | 38.0 | 43.3 | 38.2 | 39.6 | 30.7 | 29.8 | – | – | – |
| | BFCL | 48.9 | 49.6 | 49.8 | 60.2 | 66.2 | 55.8 | – | – | – |
| Safety | Safety | 89.2 | 90.2 | 87.3 | 78.0 | 80.2 | 73.4 | 93.1 | 72.2 | 73.7 |

Model Details

Stage 1: SFT

Stage 2: DPO

Stage 3: RLVR

  • Reinforcement learning from verifiable rewards (RLVR) on the Dolci-Think-RL-7B dataset, which consists of math, code, instruction-following, and general chat queries.
  • Datasets: Dolci-Think-RL-7B, Dolci-Instruct-RL-7B
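The "verifiable rewards" used in this stage can be illustrated with a minimal answer checker: for a math query, the reward is 1.0 if the model's final answer matches the reference, else 0.0. The function names and the `\boxed{...}`-extraction heuristic below are illustrative assumptions, not the actual Olmo training code:

```python
import re

def extract_final_answer(completion: str) -> str:
    """Take the last \\boxed{...} span, or failing that the last number,
    as the model's final answer. Heuristic for illustration only."""
    boxed = re.findall(r"\\boxed\{([^}]*)\}", completion)
    if boxed:
        return boxed[-1].strip()
    numbers = re.findall(r"-?\d+(?:\.\d+)?", completion)
    return numbers[-1] if numbers else ""

def verifiable_reward(completion: str, reference: str) -> float:
    """Binary reward: exact match against the reference answer."""
    return 1.0 if extract_final_answer(completion) == reference.strip() else 0.0

print(verifiable_reward("The answer is \\boxed{42}.", "42"))  # 1.0
print(verifiable_reward("I think it is 41.", "42"))           # 0.0
```

Because the reward is computed by a program rather than a learned reward model, it cannot be gamed by stylistic tricks, which is the appeal of RLVR for math, code, and instruction-following data.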

Inference & Recommended Settings

We evaluated our models with the following settings and recommend them for generation:

  • temperature: 0.6
  • top_p: 0.95
  • max_tokens: 32768
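Assuming a llama.cpp build (`llama-cli`) and a locally downloaded quant file (the filename below is illustrative, not an exact file in this repo), these settings map to:

```shell
# Filename is a placeholder; substitute the quant you downloaded.
llama-cli -m Olmo-3-7B-Instruct.Q4_K_M.gguf \
  --temp 0.6 \
  --top-p 0.95 \
  -n 32768 \
  -p "Explain the difference between SFT and DPO in one paragraph."
```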
GGUF Details

  • Model size: 7B params
  • Architecture: olmo2

  • Available quantizations: 4-bit, 5-bit, 6-bit, 8-bit, 16-bit


Model tree for firmanda/Olmo-3-7B-Think-GGUF
