ucb_agentic_ai

Lecture 06: Multi-Agent AI

Link to lecture recording on YouTube

Date: 2025-10-20

Speaker: Noam Brown

Speaker’s Social Profile: Website / Google Scholar / GitHub / LinkedIn / X (Twitter)

Education:

Ph.D. in Computer Science, 2014-2020, Carnegie Mellon University
M.S. in Robotics, 2012-2014, Carnegie Mellon University
Bachelor’s degree in Mathematics and Computer Science, 2005-2008, Rutgers University

Work:

Research Scientist, OpenAI

Notes

Analogy between the trajectory of AlphaGo and LLMs

Training Steps	AlphaGo	LLMs
1. pre-train on high-quality human data	training on human Go games	training on large chunks of the Internet
2. enable large-scale inference compute	Monte Carlo tree search	chain of thought
3. recursive self-improvement (self-play)	self-play	not available yet

Takeaway:
People’s intuition about self-play is largely overfit to two-player, zero-sum, perfect-information games such as Go and chess. Outside that class of games, many of the nice properties go away and self-play becomes much more difficult.
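
As a rough illustration of one such nice property, the sketch below runs fictitious play on rock-paper-scissors, a two-player zero-sum game. Fictitious play is a textbook self-play scheme swapped in here for illustration; the notes do not name it, and the payoff matrix, starting counts, and function names are all assumptions. Each player repeatedly best-responds to the opponent's past actions, and in two-player zero-sum games the empirical action frequencies are known to approach a minimax equilibrium (here, playing each action one third of the time).

```python
# Payoff to the player choosing row action a against column action b.
# Order of actions: rock, paper, scissors. (Illustrative assumption.)
ACTIONS = ["rock", "paper", "scissors"]
PAYOFF = [
    [0, -1, 1],   # rock
    [1, 0, -1],   # paper
    [-1, 1, 0],   # scissors
]


def best_response(opponent_counts):
    """Return the action index with the highest total payoff
    against everything the opponent has played so far."""
    values = [
        sum(PAYOFF[a][b] * opponent_counts[b] for b in range(3))
        for a in range(3)
    ]
    return values.index(max(values))  # ties go to the lowest index


def fictitious_play(rounds):
    """Run simultaneous fictitious play and return player 0's
    empirical action frequencies."""
    # counts[p][a] = how often player p has played action a.
    # Arbitrary starting point: both players have played rock once.
    counts = [[1, 0, 0], [1, 0, 0]]
    for _ in range(rounds):
        move0 = best_response(counts[1])
        move1 = best_response(counts[0])
        counts[0][move0] += 1
        counts[1][move1] += 1
    total = sum(counts[0])
    return [c / total for c in counts[0]]


if __name__ == "__main__":
    for name, freq in zip(ACTIONS, fictitious_play(30000)):
        print(f"{name}: {freq:.3f}")
```

The convergence guarantee this relies on is specific to two-player zero-sum games; the same loop run on a general-sum or many-player game has no such guarantee, which is one concrete sense in which the intuition does not transfer.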

Who’s the better poker player?

Option 1 (minimax equilibrium): someone who wins head-to-head against any other player over a large enough sample size
Option 2 (population best response): someone who makes more money playing poker than anyone else

In AI for games, “solving a game” typically means computing a minimax equilibrium. That is a strong assumption and not necessarily what we want. In games like chess and Go it is fine, because the two notions of “better” end up being the same thing. In a game like poker they can differ, and the gap becomes a very significant problem beyond two-player zero-sum games, where what we really want may be a population best response.
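
The difference between the two options can be sketched with a toy example. Rock-paper-scissors stands in for poker here, and the population that over-plays rock is invented purely for illustration; the function names are likewise assumptions. The minimax strategy cannot be beaten in expectation but also earns nothing from this population, while the best response to the population earns more against it at the cost of being exploitable.

```python
# Payoff to the player choosing row action a against column action b.
# Order of actions: rock, paper, scissors. (Illustrative assumption.)
PAYOFF = [
    [0, -1, 1],   # rock
    [1, 0, -1],   # paper
    [-1, 1, 0],   # scissors
]

# The three pure strategies, written as probability vectors.
PURE = [[1.0 if i == a else 0.0 for i in range(3)] for a in range(3)]


def expected_payoff(strategy, opponent):
    """Expected payoff of a mixed strategy against a mixed opponent."""
    return sum(
        strategy[a] * PAYOFF[a][b] * opponent[b]
        for a in range(3)
        for b in range(3)
    )


def worst_case_payoff(strategy):
    """Payoff against the most damaging pure counter-strategy."""
    return min(expected_payoff(strategy, p) for p in PURE)


def population_best_response(population):
    """Pure strategy that earns the most against the population's
    average strategy."""
    return max(PURE, key=lambda s: expected_payoff(s, population))


# Minimax equilibrium of rock-paper-scissors: uniform play.
MINIMAX = [1.0 / 3.0] * 3
# Hypothetical population that plays rock half of the time.
POPULATION = [0.5, 0.25, 0.25]

if __name__ == "__main__":
    exploiter = population_best_response(POPULATION)
    for name, s in [("minimax", MINIMAX), ("population BR", exploiter)]:
        print(name,
              "vs population:", expected_payoff(s, POPULATION),
              "worst case:", worst_case_payoff(s))
```

Under these assumptions the population best response is to always play paper: it profits against the rock-heavy population, but an opponent who always plays scissors beats it every time, whereas the uniform minimax strategy breaks even against everyone.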

[Incomplete, work in progress]

