Link to lecture recording on YouTube
Date: 2025-11-17
Speaker: Oriol Vinyals
Speaker’s Social Profile: Company Profile / Google Scholar / GitHub / LinkedIn / X (Twitter)
Education:
Work:
Goal of reinforcement learning: select actions to maximize future rewards
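"Future rewards" is commonly formalized as a discounted return, $G = \sum_{t} \gamma^{t} r_{t}$. The sketch below is a minimal illustration of that idea and is not taken from the lecture: the discount factor `gamma` and both function names are assumptions.

```python
def discounted_return(rewards, gamma=0.99):
    """Return sum_t gamma**t * rewards[t], accumulated from the last step backward."""
    g = 0.0
    for r in reversed(rewards):
        g = r + gamma * g
    return g


def greedy_action(estimated_returns):
    """Pick the index of the action with the highest estimated future return."""
    return max(range(len(estimated_returns)), key=lambda a: estimated_returns[a])
```

For example, `discounted_return([1, 1, 1], gamma=0.5)` weights the three rewards by 1, 0.5, and 0.25.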
A curriculum of games of increasing difficulty (complexity increases from top to bottom):
| Game | Information type | Players | Action space | Moves per game |
|---|---|---|---|---|
| Atari | Near-perfect | single player | 17 | 100s |
| Go | Perfect | multiplayer | 361 | 100s |
| StarCraft | Imperfect | multiplayer | ~$10^{26}$ | 1000s |
AlphaStar played non-anonymously against attendees using a standard gaming mouse and keyboard at BlizzCon (November 2019). Link to video of AlphaStar vs. Serral.
Pretraining: supervised learning - learning from human data
Post-training: reinforcement learning - happens in a massive multi-agent system
Autoregressive model: a small recurrent neural network structured so that the action's arguments are produced one after another, each by its own softmax
Fully autoregressive action head with 7 sub-heads: $p(x) = \prod_{i=1}^{n} p(x_{i} \mid x_{1}, \dots, x_{i-1})$
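The chain-rule factorization above can be sketched as a sampling loop: each sub-head produces a softmax over its own argument, conditioned on the arguments already chosen, and the log-probabilities add up to $\log p(x)$. This is a toy illustration only. The sub-head sizes and the `toy_logits` function are hypothetical stand-ins for the learned network, not AlphaStar's actual architecture.

```python
import math
import random


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def toy_logits(head_index, prefix, num_choices):
    """Hypothetical stand-in for a learned sub-head: logits depend on the prefix."""
    shift = sum(prefix) + head_index
    return [math.sin(shift + k) for k in range(num_choices)]


def sample_action(head_sizes, rng):
    """Sample x_1..x_n one argument at a time and accumulate log p(x)."""
    prefix, log_prob = [], 0.0
    for i, size in enumerate(head_sizes):
        probs = softmax(toy_logits(i, prefix, size))
        choice = rng.choices(range(size), weights=probs, k=1)[0]
        log_prob += math.log(probs[choice])
        prefix.append(choice)
    return prefix, log_prob


def log_prob_of(action, head_sizes):
    """Evaluate log p(x) for a given action with the same chain-rule product."""
    total = 0.0
    for i, size in enumerate(head_sizes):
        probs = softmax(toy_logits(i, action[:i], size))
        total += math.log(probs[action[i]])
    return total


# Seven sub-heads with made-up sizes, mirroring the "7 sub-heads" in the notes.
HEAD_SIZES = [3, 4, 2, 5, 3, 2, 4]
action, log_p = sample_action(HEAD_SIZES, random.Random(0))
```

Because every factor is a proper softmax, the probabilities of all possible argument tuples sum to 1, even though the full action space is never enumerated during sampling.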
[Incomplete, work in progress]