Models
Datasets
Spaces
Docs
Enterprise
Pricing
Log In
Sign Up

Collections

Discover the best community collections!

Collections including paper arxiv:2510.07242

Towards General-Purpose Model-Free Reinforcement Learning

Paper • 2501.16142 • Published Jan 27 • 30
DAPO: An Open-Source LLM Reinforcement Learning System at Scale

Paper • 2503.14476 • Published Mar 18 • 143
Does Reinforcement Learning Really Incentivize Reasoning Capacity in LLMs Beyond the Base Model?

Paper • 2504.13837 • Published Apr 18 • 139
Learning to Reason under Off-Policy Guidance

Paper • 2504.14945 • Published Apr 21 • 88

Meta-Awareness Enhances Reasoning Models: Self-Alignment Reinforcement Learning

Paper • 2510.03259 • Published Sep 26 • 57
Hybrid Reinforcement: When Reward Is Sparse, It's Better to Be Dense

Paper • 2510.07242 • Published Oct 8 • 30
First Try Matters: Revisiting the Role of Reflection in Reasoning Models

Paper • 2510.08308 • Published Oct 9 • 24
Low-probability Tokens Sustain Exploration in Reinforcement Learning with Verifiable Reward

Paper • 2510.03222 • Published Oct 3 • 75

Reinforcement Learning (RL)

Towards General-Purpose Model-Free Reinforcement Learning

Paper • 2501.16142 • Published Jan 27 • 30
SFT Memorizes, RL Generalizes: A Comparative Study of Foundation Model Post-training

Paper • 2501.17161 • Published Jan 28 • 123
Robust Reward Modeling via Causal Rubrics

Paper • 2506.16507 • Published Jun 19 • 9
Hybrid Reinforcement: When Reward Is Sparse, It's Better to Be Dense

Paper • 2510.07242 • Published Oct 8 • 30

Rethinking Entropy Regularization in Large Reasoning Models

Paper • 2509.25133 • Published Sep 29 • 4
ExGRPO: Learning to Reason from Experience

Paper • 2510.02245 • Published Oct 2 • 80
Hybrid Reinforcement: When Reward Is Sparse, It's Better to Be Dense

Paper • 2510.07242 • Published Oct 8 • 30

Finetuning Strategies

MaPPO: Maximum a Posteriori Preference Optimization with Prior Knowledge

Paper • 2507.21183 • Published Jul 27 • 14
MixGRPO: Unlocking Flow-based GRPO Efficiency with Mixed ODE-SDE

Paper • 2507.21802 • Published Jul 29 • 17
EDGE-GRPO: Entropy-Driven GRPO with Guided Error Correction for Advantage Diversity

Paper • 2507.21848 • Published Jul 29 • 8
Agentic Reinforced Policy Optimization

Paper • 2507.19849 • Published Jul 26 • 158

LLM Pruning and Distillation in Practice: The Minitron Approach

Paper • 2408.11796 • Published Aug 21, 2024 • 57
TableBench: A Comprehensive and Complex Benchmark for Table Question Answering

Paper • 2408.09174 • Published Aug 17, 2024 • 52
To Code, or Not To Code? Exploring Impact of Code in Pre-training

Paper • 2408.10914 • Published Aug 20, 2024 • 44
Open-FinLLMs: Open Multimodal Large Language Models for Financial Applications

Paper • 2408.11878 • Published Aug 20, 2024 • 63

Towards General-Purpose Model-Free Reinforcement Learning

Paper • 2501.16142 • Published Jan 27 • 30
DAPO: An Open-Source LLM Reinforcement Learning System at Scale

Paper • 2503.14476 • Published Mar 18 • 143
Does Reinforcement Learning Really Incentivize Reasoning Capacity in LLMs Beyond the Base Model?

Paper • 2504.13837 • Published Apr 18 • 139
Learning to Reason under Off-Policy Guidance

Paper • 2504.14945 • Published Apr 21 • 88

Rethinking Entropy Regularization in Large Reasoning Models

Paper • 2509.25133 • Published Sep 29 • 4
ExGRPO: Learning to Reason from Experience

Paper • 2510.02245 • Published Oct 2 • 80
Hybrid Reinforcement: When Reward Is Sparse, It's Better to Be Dense

Paper • 2510.07242 • Published Oct 8 • 30

Meta-Awareness Enhances Reasoning Models: Self-Alignment Reinforcement Learning

Paper • 2510.03259 • Published Sep 26 • 57
Hybrid Reinforcement: When Reward Is Sparse, It's Better to Be Dense

Paper • 2510.07242 • Published Oct 8 • 30
First Try Matters: Revisiting the Role of Reflection in Reasoning Models

Paper • 2510.08308 • Published Oct 9 • 24
Low-probability Tokens Sustain Exploration in Reinforcement Learning with Verifiable Reward

Paper • 2510.03222 • Published Oct 3 • 75

Finetuning Strategies

MaPPO: Maximum a Posteriori Preference Optimization with Prior Knowledge

Paper • 2507.21183 • Published Jul 27 • 14
MixGRPO: Unlocking Flow-based GRPO Efficiency with Mixed ODE-SDE

Paper • 2507.21802 • Published Jul 29 • 17
EDGE-GRPO: Entropy-Driven GRPO with Guided Error Correction for Advantage Diversity

Paper • 2507.21848 • Published Jul 29 • 8
Agentic Reinforced Policy Optimization

Paper • 2507.19849 • Published Jul 26 • 158

Reinforcement Learning (RL)

Towards General-Purpose Model-Free Reinforcement Learning

Paper • 2501.16142 • Published Jan 27 • 30
SFT Memorizes, RL Generalizes: A Comparative Study of Foundation Model Post-training

Paper • 2501.17161 • Published Jan 28 • 123
Robust Reward Modeling via Causal Rubrics

Paper • 2506.16507 • Published Jun 19 • 9
Hybrid Reinforcement: When Reward Is Sparse, It's Better to Be Dense

Paper • 2510.07242 • Published Oct 8 • 30

LLM Pruning and Distillation in Practice: The Minitron Approach

Paper • 2408.11796 • Published Aug 21, 2024 • 57
TableBench: A Comprehensive and Complex Benchmark for Table Question Answering

Paper • 2408.09174 • Published Aug 17, 2024 • 52
To Code, or Not To Code? Exploring Impact of Code in Pre-training

Paper • 2408.10914 • Published Aug 20, 2024 • 44
Open-FinLLMs: Open Multimodal Large Language Models for Financial Applications

Paper • 2408.11878 • Published Aug 20, 2024 • 63

Company

TOS Privacy About Careers

Website

Models Datasets Spaces Pricing Docs