ucb_agentic_ai

Lecture 07: Predictable Noise in LLM Benchmarks

Link to lecture recording on YouTube

Date: 2025-10-27

Speaker: Sida Wang

Speaker’s Social Profile: Website / Company Profile / Google Scholar / GitHub / LinkedIn / X (Twitter)

Education:

Ph.D. in Computer Science, 2011-2017, Stanford University, advised by Prof. Christopher Manning and Prof. Percy Liang
BASc in Engineering Science (Computer Option), 2006-2011, University of Toronto

Work:

Research Scientist, Meta

Notes

Reflections on evaluation

Do 10% better on ImageNet, and no more doubts about deep learning

necessary condition: no doubt about statistical significance

Do we similarly trust current LLM evaluations? Many papers report 2-10% improvements on HumanEval¹, or a few percent on SWE-bench²

interpretation: a 2.5% gain on HumanEval (about 4 of 164 problems) is not statistically significant
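A rough check of this interpretation, as a sketch only: assume each of the 164 HumanEval problems is an independent pass/fail trial and that the two models are scored independently (an unpaired comparison). The baseline pass rate of 0.80 is a hypothetical value chosen for illustration, not a number from the lecture.

```python
import math

def two_proportion_z(p_a: float, p_b: float, n: int) -> float:
    """z-statistic for the difference between two independent pass rates,
    each measured on n problems (pooled-variance two-proportion test)."""
    p_pool = (p_a + p_b) / 2                        # pooled pass rate (equal n)
    se = math.sqrt(2 * p_pool * (1 - p_pool) / n)   # std. error of the difference
    return (p_b - p_a) / se

n = 164                # HumanEval size, from the notes
p_a = 0.80             # hypothetical baseline pass rate (assumption)
p_b = p_a + 0.025      # the 2.5% improvement discussed above

z = two_proportion_z(p_a, p_b, n)
print(f"z = {z:.2f} (|z| > 1.96 needed for significance at the 5% level)")
```

Under these assumptions the z-statistic is well below 1.96, which is consistent with the note above. A paired analysis on the same problems can give a different standard error, so this is an illustration rather than the lecture's own calculation.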

Test set sizes are now much smaller than MNIST:

deep learning (measured and validated by MNIST and ImageNet) - MNIST: 10k, ImageNet: 100k, LLM multiple choice: MMLU ~10k
generative test-based eval (generate a short program) - HumanEval: 164, MBPP+: 378, CruxEval: 800, DS-1000: 1000, DMC: 165, LiveCodeBench: ~1000, Aider: 225
agent evals (often need 100k+ tokens, often hours) - (various versions of) SWE-bench: 2.3k, -Verified: 500, -Lite: 300, -Multimodal: 517, T-Bench: 80
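To make the effect of test set size concrete, the sketch below computes the half-width of a 95% confidence interval on a single accuracy measurement for a few of the sizes listed above. It assumes independent pass/fail questions and the normal approximation to the binomial, and uses the worst case p = 0.5; these are simplifying assumptions, not the lecture's analysis.

```python
import math

# Benchmark sizes taken from the list above.
SIZES = {
    "MNIST": 10_000,
    "HumanEval": 164,
    "SWE-bench Verified": 500,
    "T-Bench": 80,
}

def ci_half_width(n: int, p: float = 0.5, z: float = 1.96) -> float:
    """Half-width of a normal-approximation confidence interval for a
    pass rate p measured on n independent questions."""
    return z * math.sqrt(p * (1 - p) / n)

for name, n in SIZES.items():
    print(f"{name:20s} n={n:6d}  +/- {100 * ci_half_width(n):.1f} points")
```

With these assumptions the interval is about 1 point wide on each side for n = 10,000 but close to 8 points for n = 164, since the width shrinks only as 1/sqrt(n).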

Each question is more informative

to answer each question, a lot of text / code is generated and evaluated by tests; is that more informative than true / false or multiple choice?
for humans, can ask a few good questions and get good information

Solve an open problem

solving even one major open problem (Millennium Prize Problems, building AGI, etc.) is significant; no one will object that a sample size of 1 is not statistically significant
many daily agentic tasks are more economically important, if less intellectually demanding; e.g. fixing issues, finding bugs, vibe coding

Are small benchmarks not reliable?
A: small benchmarks are not reliable even if they contain some hard generative / agentic problems
B: solving a single hard problem could be significant already, so we can and should use small but very hard benchmarks

A lot of inconsistency: the worst models sometimes succeed on hard problems, while the best models still fail on easy problems

[Incomplete, work in progress]

References

1. Mark Chen et al. Evaluating Large Language Models Trained on Code. arXiv:2107.03374 [cs.LG]. 2021.
2. Carlos E. Jimenez et al. SWE-bench: Can Language Models Resolve Real-World GitHub Issues? arXiv:2310.06770 [cs.CL]. 2023.