Evaluation
GPT-Fathom: Benchmarking Large Language Models to Decipher the Evolutionary Path towards GPT-4 and Beyond
Paper
• 2309.16583
• Published • 13
Prometheus: Inducing Fine-grained Evaluation Capability in Language Models
Paper
• 2310.08491
• Published • 57
SO-Bench: A Structural Output Evaluation of Multimodal LLMs
Paper
• 2511.21750
• Published • 6
LLM Swiss Round: Aggregating Multi-Benchmark Performance via Competitive Swiss-System Dynamics
Paper
• 2512.21010
• Published • 4
SkillsBench: Benchmarking How Well Agent Skills Work Across Diverse Tasks
Paper
• 2602.12670
• Published • 62
SWE-rebench V2: Language-Agnostic SWE Task Collection at Scale
Paper
• 2602.23866
• Published • 89
MiniAppBench: Evaluating the Shift from Text to Interactive HTML Responses in LLM-Powered Assistants
Paper
• 2603.09652
• Published • 15
Xpertbench: Expert Level Tasks with Rubrics-Based Evaluation
Paper
• 2604.02368
• Published • 12
OccuBench: Evaluating AI Agents on Real-World Professional Tasks via Language World Models
Paper
• 2604.10866
• Published • 66
A Benchmark for Interactive World Models with a Unified Action Generation Framework
Paper
• 2605.03941
• Published • 5
π-Bench: Evaluating Proactive Personal Assistant Agents in Long-Horizon Workflows
Paper
• 2605.14678
• Published • 102
MobileGym: A Verifiable and Highly Parallel Simulation Platform for Mobile GUI Agent Research
Paper
• 2605.26114
• Published • 57
MUSE-Autoskill: Self-Evolving Agents via Skill Creation, Memory, Management, and Evaluation
Paper
• 2605.27366
• Published • 19
ReportBench: Evaluating Deep Research Agents via Academic Survey Tasks
Paper
• 2508.15804
• Published • 15
Behavioral Fingerprinting of Large Language Models
Paper
• 2509.04504
• Published • 6
Statistical Methods in Generative AI
Paper
• 2509.07054
• Published • 11
CLUE: Non-parametric Verification from Experience via Hidden-State Clustering
Paper
• 2510.01591
• Published • 28
Decoding as Optimisation on the Probability Simplex: From Top-K to Top-P (Nucleus) to Best-of-K Samplers
Paper
• 2602.18292
• Published • 13