This is a collection of datasets used to evaluate language models on the task of ablation planning in empirical AI research.
- ai-coscientist/researcher-ablation-bench
- ai-coscientist/reviewer-ablation-bench
- ai-coscientist/researcher-ablation-judge-eval
- ai-coscientist/reviewer-ablation-judge-eval