Link to lecture recording on YouTube
Date: 2025-10-20
Speaker: Noam Brown
Speaker’s Social Profile: Website / Google Scholar / GitHub / LinkedIn / X (Twitter)
Education:
Work:
Analogy between the trajectory of AlphaGo and LLMs
| Training Steps | AlphaGo | LLMs |
|---|---|---|
| 1. pre-train on high-quality human data | training on human Go games | training on large chunks of the Internet |
| 2. enable large-scale inference compute | Monte Carlo tree search | chain of thought |
| 3. recursive self-improvement (self-play) | self-play | missing: LLMs don’t have this piece yet |
Takeaway:
People’s intuition about self-play is largely overfit to two-player, zero-sum, perfect-information games such as Go and chess. Outside that class of games, many of the nice properties go away and self-play becomes much more difficult.
Who’s the better poker player?
In AI for games, “solving a game” typically means computing a minimax equilibrium. That is a strong assumption and not necessarily what we want. In games like chess and Go this is fine, because it ends up being the same thing; in games like poker the two can differ, and we will see that this becomes a very significant problem beyond two-player zero-sum games. It’s possible that what we really want is a population best response.
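To make the difference between a minimax equilibrium and a population best response concrete, here is a minimal sketch. It uses rock-paper-scissors rather than poker, which is my own choice of toy game and not an example from the lecture. The `POPULATION` distribution (opponents who over-play rock) is a made-up assumption, and all function names are hypothetical.

```python
# Toy illustration (assumed example, not from the lecture): rock-paper-scissors.
ACTIONS = ["rock", "paper", "scissors"]

# PAYOFF[i][j] is our payoff when we play action i and the opponent plays action j.
PAYOFF = [
    [0, -1, 1],   # rock
    [1, 0, -1],   # paper
    [-1, 1, 0],   # scissors
]


def expected_payoff(strategy, opponent):
    """Expected payoff of a mixed strategy against a mixed opponent strategy."""
    return sum(
        strategy[i] * opponent[j] * PAYOFF[i][j]
        for i in range(3)
        for j in range(3)
    )


def best_response(opponent):
    """Pure strategy (as a one-hot list) that maximizes payoff against `opponent`."""
    values = [sum(opponent[j] * PAYOFF[i][j] for j in range(3)) for i in range(3)]
    best = max(range(3), key=lambda i: values[i])
    return [1.0 if i == best else 0.0 for i in range(3)]


def exploitability(strategy):
    """How much a best-responding opponent wins against `strategy` (symmetric game)."""
    return expected_payoff(best_response(strategy), strategy)


# The minimax equilibrium of rock-paper-scissors is the uniform strategy.
MINIMAX = [1 / 3, 1 / 3, 1 / 3]

# Assumed population of human opponents that plays rock too often.
POPULATION = [0.5, 0.25, 0.25]

minimax_value = expected_payoff(MINIMAX, POPULATION)
population_br = best_response(POPULATION)
population_br_value = expected_payoff(population_br, POPULATION)

print("minimax vs population:", minimax_value)
print("best response vs population:", population_br_value)
print("exploitability of minimax:", exploitability(MINIMAX))
print("exploitability of best response:", exploitability(population_br))
```

Under these assumptions the uniform minimax strategy cannot be exploited but also only breaks even against the biased population, while the population best response (always paper) wins on average against that population and is itself fully exploitable by an opponent who switches to scissors. That trade-off is the reason "minimax equilibrium" and "what we want" can come apart.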
[Incomplete, work in progress]