Lecture 07: Predictable Noise in LLM Benchmarks
Link to lecture recording on YouTube
Date: 2025-10-27
Speaker: Sida Wang
Speaker’s Social Profile: Website / Company Profile / Google Scholar / GitHub / LinkedIn / X (Twitter)
Education:
- Ph.D. in Computer Science, 2011-2017, Stanford University, advised by Prof. Christopher Manning and Prof. Percy Liang
- BASc in Engineering Science (Computer Option), 2006-2011, University of Toronto
Work:
Notes
Reflections on evaluation
Do 10% better on ImageNet, and no more doubts about deep learning
- necessary condition: no doubt about statistical significance
Do we similarly trust current LLM evaluations? Many papers report 2-10% improvements on HumanEval [1], or a few percent on SWE-bench [2]
- interpretation: a 2.5% gain on HumanEval (about 4 of its 164 problems) is not statistically significant
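A rough check of the claim above, as a minimal sketch rather than the speaker's own analysis. It assumes each question is an independent Bernoulli trial and that the two models are compared unpaired; the 50% baseline pass rate and the function names are hypothetical choices for illustration. A paired comparison on the same questions can give a tighter estimate.

```python
import math


def accuracy_standard_error(p: float, n: int) -> float:
    """Standard error of a pass rate p measured on n independent questions."""
    return math.sqrt(p * (1.0 - p) / n)


def two_sided_p_value(z: float) -> float:
    """Two-sided p-value for a z-score under the standard normal distribution."""
    return math.erfc(abs(z) / math.sqrt(2.0))


n = 164            # HumanEval size, from the notes
baseline = 0.50    # hypothetical baseline pass rate (assumption)
improved = 0.525   # baseline plus the 2.5% gain discussed above

se_single = accuracy_standard_error(baseline, n)

# Unpaired comparison: the variances of the two measured pass rates add.
se_diff = math.sqrt(
    accuracy_standard_error(baseline, n) ** 2
    + accuracy_standard_error(improved, n) ** 2
)
z = (improved - baseline) / se_diff
p_value = two_sided_p_value(z)

print(f"standard error of one score: {100 * se_single:.2f} points")
print(f"z = {z:.2f}, two-sided p = {p_value:.2f}")
```

Under these assumptions the standard error of a single score is already close to 4 points, so a 2.5-point gain falls well inside the noise.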
Test set sizes are now much smaller than MNIST:
- deep learning (measured and validated by MNIST and ImageNet) - MNIST: 10k, ImageNet: 100k, LLM multiple choice: MMLU ~10k
- generative test-based eval (generate a short program) - HumanEval: 164, MBPP+: 378, CruxEval: 800, DS-1000: 1000, DMC: 165, LiveCodeBench: ~1000, Aider: 225
- agent evals (often need 100k+ tokens, often hours) - (various versions of) SWE-bench: 2.3k, -Verified: 500, -Lite: 300, -Multimodal: 517, T-Bench: 80
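To put the sizes above on a common scale, the sketch below computes a worst-case standard error for each benchmark. It assumes independent questions and uses the bound sqrt(p(1-p)/n) <= 0.5/sqrt(n), which is attained at p = 0.5; the dictionary and function names are illustrative, and the sizes are the approximate figures quoted in the list.

```python
import math

# Approximate test-set sizes as quoted in the notes.
BENCHMARK_SIZES = {
    "MNIST": 10_000,
    "ImageNet": 100_000,
    "HumanEval": 164,
    "MBPP+": 378,
    "CruxEval": 800,
    "DS-1000": 1000,
    "SWE-bench": 2300,
    "SWE-bench Verified": 500,
    "SWE-bench Lite": 300,
    "T-Bench": 80,
}


def worst_case_standard_error(n: int) -> float:
    """Upper bound on the standard error of an accuracy over n independent questions."""
    return 0.5 / math.sqrt(n)


def report(sizes: dict[str, int]) -> dict[str, tuple[float, float]]:
    """Map each benchmark to (worst-case SE, 95% confidence half-width)."""
    rows = {}
    for name, n in sizes.items():
        se = worst_case_standard_error(n)
        rows[name] = (se, 1.96 * se)
    return rows


rows = report(BENCHMARK_SIZES)
# Print from least noisy to most noisy.
for name, (se, half_width) in sorted(rows.items(), key=lambda kv: kv[1][0]):
    n = BENCHMARK_SIZES[name]
    print(f"{name:20s} n={n:>7d}  SE<={100 * se:5.2f}%  95% half-width<={100 * half_width:5.2f}%")
```

With these numbers the bound drops from roughly 5.6 points at n = 80 to 0.5 points at n = 10k, which is the gap between the agentic benchmarks and MNIST-scale test sets.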
Each question is more informative
- to answer each question, a lot of text / code is generated and evaluated by tests. more informative than true / false and multiple choice?
- for humans, can ask a few good questions and get good information
Solve an open problem
- solving even one major open problem (a Millennium Prize problem, building AGI, etc.) is significant; no one will object that a sample size of 1 is not statistically significant
- many everyday agentic tasks are more economically important, if less intellectually demanding, e.g. fixing issues, finding bugs, vibe coding
Are small benchmarks not reliable?
A: small benchmarks are not reliable even if they contain some hard generative / agentic problems
B: solving a single hard problem could be significant already, so we can and should use small but very hard benchmarks
A lot of inconsistency: the worst models sometimes succeed on hard problems, while the best models still fail on easy problems
[Incomplete, work in progress]
References
- [1] Mark Chen et al. Evaluating Large Language Models Trained on Code. arXiv:2107.03374 [cs.LG]. 2021.
- [2] Carlos E. Jimenez et al. SWE-bench: Can Language Models Resolve Real-World GitHub Issues? arXiv:2310.06770 [cs.CL]. 2023.