InnoGym: Benchmarking the Innovation Potential of AI Agents • Paper • 2512.01822 • Published 6 days ago • 33
LightMem: Lightweight and Efficient Memory-Augmented Generation • Paper • 2510.18866 • Published Oct 21 • 110
OceanGym: A Benchmark Environment for Underwater Embodied Agents • Paper • 2509.26536 • Published Sep 30 • 34
When Benchmarks Age: Temporal Misalignment through Large Language Model Factuality Evaluation • Paper • 2510.07238 • Published Oct 8 • 14
Towards Personalized Deep Research: Benchmarks and Evaluations • Paper • 2509.25106 • Published Sep 29 • 29
Towards General Agentic Intelligence via Environment Scaling • Paper • 2509.13311 • Published Sep 16 • 71
ReCode: Updating Code API Knowledge with Reinforcement Learning • Paper • 2506.20495 • Published Jun 25 • 9