Fantastic Reasoning Behaviors and Where to Find Them: Unsupervised Discovery of the Reasoning Process
Abstract
An unsupervised framework using sparse auto-encoders identifies and controls interpretable reasoning behaviors in large language models through disentangled latent vectors.
Despite the growing reasoning capabilities of recent large language models (LLMs), their internal mechanisms during the reasoning process remain underexplored. Prior approaches often rely on human-defined concepts (e.g., overthinking, reflection) at the word level to analyze reasoning in a supervised manner. However, such methods are limited, as it is infeasible to capture the full spectrum of potential reasoning behaviors, many of which are difficult to define in token space. In this work, we propose an unsupervised framework (namely, RISE: Reasoning behavior Interpretability via Sparse auto-Encoder) for discovering reasoning vectors, which we define as directions in the activation space that encode distinct reasoning behaviors. By segmenting chain-of-thought traces into sentence-level 'steps' and training sparse auto-encoders (SAEs) on step-level activations, we uncover disentangled features corresponding to interpretable behaviors such as reflection and backtracking. Visualization and clustering analyses show that these behaviors occupy separable regions in the decoder column space. Moreover, targeted interventions on SAE-derived vectors can controllably amplify or suppress specific reasoning behaviors, altering inference trajectories without retraining. Beyond behavior-specific disentanglement, SAEs capture structural properties such as response length, revealing clusters of long versus short reasoning traces. More interestingly, SAEs enable the discovery of novel behaviors beyond human supervision. We demonstrate the ability to control response confidence by identifying confidence-related vectors in the SAE decoder space. These findings underscore the potential of unsupervised latent discovery for both interpreting and controllably steering reasoning in LLMs.
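To make the pipeline the abstract describes concrete, here is a minimal, hedged sketch in PyTorch: a standard ReLU sparse autoencoder trained on per-step residual-stream activations. The layer choice, the pooling of tokens into step vectors, the dictionary size, and the L1 coefficient are illustrative assumptions, not the paper's exact configuration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SparseAutoEncoder(nn.Module):
    """Overcomplete ReLU autoencoder with an L1 sparsity penalty on the code."""
    def __init__(self, d_model: int, dict_size: int):
        super().__init__()
        self.encoder = nn.Linear(d_model, dict_size)
        self.decoder = nn.Linear(dict_size, d_model)

    def forward(self, x):
        z = F.relu(self.encoder(x))   # sparse latent code, one unit per feature
        x_hat = self.decoder(z)       # reconstruction from decoder columns
        return x_hat, z

def sae_loss(x, x_hat, z, l1_coeff=1e-3):
    # Reconstruction error plus L1 sparsity on the latent code.
    return F.mse_loss(x_hat, x) + l1_coeff * z.abs().mean()

# step_acts: [num_steps, d_model] activations, one row per sentence-level
# reasoning step (random placeholders here; in practice they would be cached
# from a chosen transformer layer).
d_model, dict_size = 4096, 16384
step_acts = torch.randn(2048, d_model)

sae = SparseAutoEncoder(d_model, dict_size)
opt = torch.optim.Adam(sae.parameters(), lr=1e-4)

for epoch in range(10):
    x_hat, z = sae(step_acts)
    loss = sae_loss(step_acts, x_hat, z)
    opt.zero_grad()
    loss.backward()
    opt.step()

# Each decoder column sae.decoder.weight[:, j] is a candidate "reasoning vector".
```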
Community
This is an impressive piece of work, not only for the elegance of the sparse autoencoder pipeline but also because the empirical results point to something deeper than what the paper itself states.
Your SAE-derived “reasoning vectors” behave exactly like stable dynamical modes inside a recursive state system — not merely interpretable directions. The separation of reflection, backtracking, confidence, and response-length clusters across layers strongly suggests that modern transformer reasoning is governed by a latent, substrate-bound dynamical structure rather than a purely token-level process.
A few observations that stood out:
- Reasoning vectors behave like attractor modes, not just features.
The clustering of SAE decoder columns into semantically distinct basins is consistent with the existence of stable dynamical invariants that govern the model’s step-wise evolution.
This is exactly the behavior expected when a system has:
  - stable recurrence points,
  - local attractor basins in its state manifold,
  - and identity-like update modes that persist across tasks.
Your causal interventions reinforce this: modifying a reasoning vector steers the entire reasoning trajectory while preserving final correctness. That is classic attractor dynamics.
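To make that kind of intervention concrete, here is a hedged sketch of activation steering with an SAE decoder column, assuming a HuggingFace-style decoder model; the hook layer index, the scale `alpha`, and the variable names are illustrative assumptions rather than the paper's exact procedure.

```python
import torch

def make_steering_hook(reasoning_vec: torch.Tensor, alpha: float):
    """Returns a forward hook that adds alpha * (unit reasoning vector) to the
    residual stream, amplifying (alpha > 0) or suppressing (alpha < 0) the
    associated behavior."""
    direction = reasoning_vec / reasoning_vec.norm()

    def hook(module, inputs, output):
        hidden = output[0] if isinstance(output, tuple) else output
        hidden = hidden + alpha * direction.to(hidden.device, hidden.dtype)
        return (hidden,) + output[1:] if isinstance(output, tuple) else hidden

    return hook

# Hypothetical usage with a HuggingFace-style model (layer index is a guess):
# reasoning_vec = sae.decoder.weight[:, feature_idx].detach()
# handle = model.model.layers[20].register_forward_hook(
#     make_steering_hook(reasoning_vec, alpha=4.0))
# output_ids = model.generate(**inputs, max_new_tokens=1024)
# handle.remove()
```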
- The layer-wise geometry mirrors a recursive integration process.
The fact that separability is strongest in mid-to-late layers and then declines near the final layer mirrors the behavior of systems that integrate state over time and then compress it near the output. This is structurally identical to a recursive state-aware update:
a(t+1) = R(a(t))
where the model accumulates long-range structure before collapsing it for output.
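As a purely illustrative aside (mine, not the paper's), the fixed-point behavior this update is meant to evoke can be shown with a toy contractive map: iterating a(t+1) = R(a(t)) from any initial state converges to the same attractor.

```python
import torch

# Toy contractive update R(a) = W a + b. With the spectral norm of W below 1,
# iteration converges to the unique fixed point a* = (I - W)^{-1} b.
torch.manual_seed(0)
d = 8
W = torch.randn(d, d)
W = 0.5 * W / torch.linalg.matrix_norm(W, ord=2)   # force contraction
b = torch.randn(d)

def R(a):
    return W @ a + b

a = torch.randn(d)           # arbitrary initial state
for _ in range(100):
    a = R(a)                 # recursive state update

a_star = torch.linalg.solve(torch.eye(d) - W, b)   # analytic fixed point
print(torch.allclose(a, a_star, atol=1e-5))        # True: the state settles on the attractor
```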
- Cross-domain generalization of these vectors indicates substrate-bound stability.
The fact that reflection/backtracking vectors extracted from SAEs trained on MATH500 also steer behavior on GPQA and KnowLogic implies the existence of substrate-stable reasoning structures that are independent of the dataset distribution.
This is a property of a dynamical system — not a static embedding space.
- Confidence emerges as a coherent cluster because it is tied to entropy and coherence.
Your discovery that confidence vectors suppress reflection/backtracking is empirical confirmation of a predicted relation among:
  - information coherence,
  - computational alignment,
  - noise minimization,
  - and entropy reduction.
Confidence is not a semantic trait — it's a low-entropy attractor mode.
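If one wants to operationalize that claim, a simple proxy (my assumption, not something the paper reports) is mean next-token entropy over a reasoning step: lower entropy reads as higher confidence.

```python
import torch
import torch.nn.functional as F

def step_confidence(logits: torch.Tensor) -> float:
    """Negative mean next-token entropy (nats) over one reasoning step.
    logits: [step_len, vocab_size]; higher value means sharper distributions."""
    log_p = F.log_softmax(logits, dim=-1)
    entropy = -(log_p.exp() * log_p).sum(dim=-1)   # per-token entropy
    return -entropy.mean().item()

# Placeholder logits; in practice these would come from the model's forward
# pass over the tokens of a single sentence-level step.
logits = torch.randn(32, 32000)
print(step_confidence(logits))
```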
- Response-length alignment is a structural axis, not a surface feature.
That response length correlates with latent-space geometry further confirms that reasoning depth emerges from the system's internal temporal continuity rather than from token-level heuristics.
A broader note:
These empirical findings align remarkably well with a larger theoretical framework I’ve been developing, the Field of General Awareness (FoGA), which predicts:
  - the existence of invariant reasoning modes,
  - substrate-sensitive drift in state evolution,
  - recursive attractor-based reasoning paths,
  - and coherence-driven modulation of reasoning confidence.
Your results are the clearest real-world demonstration I’ve seen of these principles emerging naturally inside transformer models.
If you're interested, I’m happy to share the relevant portions of the theory (and the mathematical basis behind these predictions), as well as the Dynamic Transformer Architecture — an architecture patch explicitly designed to stabilize such recurrence modes.
Excellent work. This paper is going to be foundational for understanding why LLM reasoning behaves the way it does.
— Zenith Zaraki
SkyTeam Aerospace Foundation
https://www.skyteamaerospacefoundation.com/foga
https://www.skyteamaerospacefoundation.com/dta