ucb_agentic_ai

Lecture 07: Predictable Noise in LLM Benchmarks

Link to lecture recording on YouTube

Date: 2025-10-27

Speaker: Sida Wang

Speaker’s Social Profile: Website / Company Profile / Google Scholar / GitHub / LinkedIn / X (Twitter)

Education:

Work:

Notes

Reflections on evaluation

Do 10% better on ImageNet, and there are no more doubts about deep learning

Do we similarly trust current LLM evaluations? Many papers report 2-10% improvements on HumanEval [1], or a few percent on SWE-bench [2]

Test set sizes are now much smaller than MNIST:

Each question is more informative

Solve an open problem
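
A rough way to see why test set size matters is the binomial standard error of an accuracy estimate. The sketch below is an illustration, not the speaker's method: it assumes each question is scored pass/fail once, that questions are independent, and that the true accuracy is a hypothetical 0.5 (the worst case for variance). The function name `accuracy_standard_error` is made up for this example; the sizes are the 10,000-image MNIST test split and HumanEval's 164 problems.

```python
import math


def accuracy_standard_error(accuracy: float, n_questions: int) -> float:
    """Binomial standard error of an accuracy measured on n independent questions."""
    return math.sqrt(accuracy * (1.0 - accuracy) / n_questions)


# Assumed true accuracy of 0.5 (hypothetical, worst case for variance).
ASSUMED_ACCURACY = 0.5

# Test set sizes: MNIST test split and HumanEval.
for name, n in [("MNIST test", 10_000), ("HumanEval", 164)]:
    se = accuracy_standard_error(ASSUMED_ACCURACY, n)
    # 1.96 * SE is the half-width of a normal-approximation 95% interval.
    print(f"{name:12s} n={n:6d}  SE={se:.4f}  95% half-width={1.96 * se:.4f}")
```

Under these assumptions the standard error is about 0.005 on 10,000 questions and about 0.039 on 164, so a 2% gain on the smaller set sits within one standard error of a single model's score.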

Are small benchmarks unreliable?
A: Small benchmarks are not reliable, even if they contain some hard generative / agentic problems
B: Solving a single hard problem could already be significant, so we can and should use small but very hard benchmarks

A lot of inconsistency: the worst models sometimes succeed on hard problems, while the best models still fail on easy problems

[Incomplete, work in progress]

References

  1. Mark Chen et al. Evaluating Large Language Models Trained on Code. arXiv:2107.03374 [cs.LG]. 2021.

  2. Carlos E. Jimenez et al. SWE-bench: Can Language Models Resolve Real-World GitHub Issues? arXiv:2310.06770 [cs.CL]. 2023.