ucb_agentic_ai

Lecture 06: Multi-Agent AI

Link to lecture recording on YouTube

Date: 2025-10-20

Speaker: Noam Brown

Speaker’s Social Profile: Website / Google Scholar / GitHub / LinkedIn / X (Twitter)

Education:

Work:

Notes

Analogy between the trajectory of AlphaGo and LLMs

| Training step | AlphaGo | LLMs |
| --- | --- | --- |
| 1. Pre-train on high-quality human data | Training on human Go games | Training on large chunks of the Internet |
| 2. Enable large-scale inference compute | Monte Carlo tree search | Chain of thought |
| 3. Recursive self-improvement (self-play) | Self-play | Don’t have that piece yet |

Takeaway:
People’s intuition about self-play is basically overfit to games like Go and chess, which are two-player, zero-sum, perfect-information games. Outside this class of games, many of the nice properties go away and self-play becomes much more difficult.

Who’s the better poker player?

In AI for games, “solving a game” typically means computing a minimax equilibrium. That is a strong assumption and not necessarily what we want. In games like chess and Go this is fine, because the minimax strategy ends up being the same thing as the strategy we actually want. In other games, such as poker, the two can diverge, and beyond two-player zero-sum games this becomes a very significant problem. What we really want may instead be a population best response: the strategy that does best against the opponents we actually face.
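The gap between the two objectives can be seen in a toy example. The sketch below uses rock-paper-scissors rather than poker. That choice, the `population` mix (opponents who over-play rock), and all function names are illustrative assumptions and do not come from the lecture. It compares the uniform minimax strategy with a best response to that population, both against the population and against a worst-case opponent.

```python
from fractions import Fraction

ACTIONS = ["rock", "paper", "scissors"]

# PAYOFF[i][j]: row player's payoff for playing ACTIONS[i] against ACTIONS[j].
PAYOFF = [
    [0, -1, 1],   # rock
    [1, 0, -1],   # paper
    [-1, 1, 0],   # scissors
]


def expected_value(strategy, opponent):
    """Expected payoff of a mixed strategy against a mixed opponent strategy."""
    return sum(
        strategy[i] * PAYOFF[i][j] * opponent[j]
        for i in range(3)
        for j in range(3)
    )


def worst_case_value(strategy):
    """Payoff against the most damaging opponent reply (a pure action suffices)."""
    return min(
        sum(strategy[i] * PAYOFF[i][j] for i in range(3)) for j in range(3)
    )


def best_response(opponent):
    """Pure strategy that maximizes expected payoff against a fixed opponent mix."""
    values = [sum(PAYOFF[i][j] * opponent[j] for j in range(3)) for i in range(3)]
    best = max(range(3), key=lambda i: values[i])
    return [Fraction(int(i == best)) for i in range(3)]


# Minimax equilibrium of rock-paper-scissors: play each action with probability 1/3.
minimax = [Fraction(1, 3)] * 3

# Hypothetical population of opponents that plays rock half of the time.
population = [Fraction(1, 2), Fraction(1, 4), Fraction(1, 4)]

exploiter = best_response(population)

print("minimax vs population:  ", expected_value(minimax, population))
print("exploiter vs population:", expected_value(exploiter, population))
print("minimax worst case:     ", worst_case_value(minimax))
print("exploiter worst case:   ", worst_case_value(exploiter))
```

Under these assumptions the minimax strategy cannot be beaten but also wins nothing from the flawed population, while the best response (always paper) profits from the population and loses every round to an opponent who always plays scissors.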

[Incomplete, work in progress]

References