---
license: mit
pipeline_tag: text-generation
---

<h1 align="center">
On the Interplay of Pre-Training, Mid-Training, and RL on Reasoning Language Models
</h1>

<div align="center">

<a href="https://chenlong-clock.github.io">Charlie Zhang</a>, <a href="https://www.phontron.com">Graham Neubig</a>, 
<a href="https://xiangyue9607.github.io">Xiang Yue</a>

Carnegie Mellon University, Language Technologies Institute

</div>

<div align="center">

[![arXiv](https://img.shields.io/badge/arXiv-2512.07783-b31b1b.svg?logo=arxiv&logoColor=white)](https://arxiv.org/abs/2512.07783)
[![License: MIT](https://img.shields.io/badge/License-MIT-yellow.svg)](LICENSE)
![Python](https://img.shields.io/badge/python-3.9%2B-blue)

</div>

## Does Reinforcement Learning Truly Extend Reasoning?

This work examines the conflicting views on whether RL genuinely extends language models' reasoning abilities: some characterize RL as a mere capability refiner, while others see it as inducing new compositional skills. The disagreement persists largely because modern training pipelines offer little experimental control. We aim to resolve it through controlled analysis. This repository hosts the mid-training checkpoints used in our extrapolation experiments.
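
Since this repository hosts checkpoints, a minimal loading sketch may be useful. It uses the standard `transformers` API; the repository id below is a placeholder, since the exact checkpoint ids are not listed here.

```python
# Minimal sketch: load a checkpoint with Hugging Face transformers.
# NOTE: the repo id is a placeholder; substitute the id of the
# checkpoint you actually want to use.
from transformers import AutoModelForCausalLM, AutoTokenizer

repo_id = "Interplay-LM-Reasoning/mid-training-checkpoint"  # placeholder

tokenizer = AutoTokenizer.from_pretrained(repo_id)
model = AutoModelForCausalLM.from_pretrained(repo_id)

prompt = "Solve step by step:"
inputs = tokenizer(prompt, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```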

## 🔍 Overview

Our paper builds a fully controlled experimental framework to analyze how pre-training, mid-training, and RL-based post-training jointly shape the reasoning abilities of language models. Using synthetic math-style reasoning tasks with explicit atomic operations and process-verifiable reasoning traces, we study:

*   **Extrapolative generalization** to more complex compositions (deeper dependency graphs).
*   **Contextual generalization** across diverse surface forms and linguistic contexts.
*   How **RL interacts** with prior knowledge, and when it yields **genuine capability gains** beyond pre-training.
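
To make this setup concrete, here is a toy sketch of a compositional task built from atomic operations with a process-verifiable trace. It is an illustrative assumption, not the paper's actual task generator: the operations, the chain structure (the simplest case of a dependency graph), and all names are invented for exposition.

```python
# Toy illustration (NOT the paper's task generator): compose atomic
# operations into a chain of configurable depth and record one
# verifiable step per operation.
import random

ATOMIC_OPS = {  # illustrative atomic operations
    "add3": lambda x: x + 3,
    "double": lambda x: 2 * x,
    "mod7": lambda x: x % 7,
}

def make_task(depth: int, seed: int = 0):
    """Sample a composition of `depth` atomic ops and its verifiable trace."""
    rng = random.Random(seed)
    ops = [rng.choice(list(ATOMIC_OPS)) for _ in range(depth)]
    value = rng.randint(0, 20)
    question = f"Start from {value}, then apply: {', '.join(ops)}."
    trace = []
    for name in ops:
        value = ATOMIC_OPS[name](value)
        trace.append(f"{name} -> {value}")  # each step can be checked independently
    return question, trace, value

question, trace, answer = make_task(depth=4, seed=42)
print(question)
print("\n".join(trace))
print("answer:", answer)
```

In this toy version, increasing `depth` stands in for the deeper dependency graphs used to probe extrapolative generalization.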

## 🧠 Key Findings

<div align="center">
  <img src="assets/findings.png" alt="Key findings" width="500" />
</div>

You may also find the comic generated by NotebookLM [here](assets/Interplay-LM-Reasoning.pdf).

## Code

The code and data for this work will be released soon at the following GitHub repository: [https://github.com/Interplay-LM-Reasoning/Interplay-LM-Reasoning](https://github.com/Interplay-LM-Reasoning/Interplay-LM-Reasoning)

## 📚 Citation

If you find this work or code useful, please consider citing:

```bibtex
@misc{zhang2025interplaypretrainingmidtrainingrl,
      title={On the Interplay of Pre-Training, Mid-Training, and RL on Reasoning Language Models}, 
      author={Charlie Zhang and Graham Neubig and Xiang Yue},
      year={2025},
      eprint={2512.07783},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2512.07783}, 
}
```