Reading list on evaluation metrics, benchmarks, frameworks, datasets
- Quantifying Language Models' Sensitivity to Spurious Features in Prompt Design or: How I learned to start worrying about prompt formatting
  Paper • 2310.11324
- Flaw or Artifact? Rethinking Prompt Sensitivity in Evaluating LLMs
  Paper • 2509.01790
- POSIX: A Prompt Sensitivity Index For Large Language Models
  Paper • 2410.02185
- A Survey on Evaluation of Large Language Models
  Paper • 2307.03109