Link to lecture recording on YouTube
Date: 2025-09-22
Speaker: Yangqing Jia 贾扬清
Speaker’s Social Profile: Website / Google Scholar / GitHub / LinkedIn / X (Twitter)
Education:
Work:
The Chinese typewriter example: how can things be made more efficient? Today people type Chinese characters on a QWERTY keyboard
Conventional NLP: count the 2-3 words before the current word; n-grams of 4-5 words are probably the upper limit because the correlation matrix becomes too large as the number of dimensions grows
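The counting approach above can be sketched as follows. This is a minimal illustration, not code from the lecture: the function names, the toy corpus, and the 50,000-word vocabulary are all assumptions chosen to show why the count table grows so quickly with context length.

```python
from collections import Counter, defaultdict

def train_ngram(tokens, n):
    """Count how often each word follows each (n-1)-word context."""
    counts = defaultdict(Counter)
    for i in range(n - 1, len(tokens)):
        context = tuple(tokens[i - n + 1:i])
        counts[context][tokens[i]] += 1
    return counts

def predict_next(counts, context):
    """Return the most frequent continuation of a context, or None if unseen."""
    followers = counts.get(tuple(context))
    return followers.most_common(1)[0][0] if followers else None

def table_size(vocab_size, n):
    """Number of possible n-grams, i.e. cells in a dense count table."""
    return vocab_size ** n

# Toy corpus (hypothetical); "the cat" is followed by "sat" twice.
corpus = "the cat sat on the mat and the cat sat down".split()
model = train_ngram(corpus, 3)
print(predict_next(model, ["the", "cat"]))

# Assumed vocabulary of 50,000 words: the table grows exponentially in n.
for n in (2, 3, 5):
    print(n, table_size(50_000, n))
```

Each extra word of context multiplies the number of possible table cells by the vocabulary size, which is why counting-based models stop at short contexts.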
The latest models have context lengths of millions of tokens
Have not seen a plateauing effect yet
Healthy competition between open-source and closed-source models: closed-source models lead in absolute quality, but the gap is narrowing
There does not seem to be a “bubble”: consumption continues to grow, and not only for model training<sup>1</sup>
Historical analogies
| Timeline | Algorithm | Analogies |
|---|---|---|
| Nov 2022 | GPT (3.5) | AlexNet (structural innovation, freed researchers from using simple, handcrafted features) |
| Dec 2023 | Mixture of Experts (Mixtral 8x7B): massively sparse, sparsely activated model to improve efficiency and reduce the active model size per token | Ensemble learning, Inception, ResNet |
| Sep 2024 | Test-time scaling: having the model accumulate and reflect on its intermediate guesses to get a better result | Fully convolutional network (predict multiple times across the spatial domain and try to get a better result); multi-instance learning |
| Jan 2025 (and earlier) | Reinforcement Learning: a principled way to match the intention where it is difficult to define a loss function | General RL, GANs |
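To make the "sparsely activated" idea in the Mixture of Experts row concrete, here is a minimal sketch of top-k expert routing. It is not taken from the lecture or from any model's source code: the function names, the toy experts, and the gate weights are hypothetical. The only fact assumed about Mixtral 8x7B is that it routes each token to 2 of 8 experts per layer.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def moe_forward(x, gate_weights, experts, top_k=2):
    """Route input x to the top_k highest-scoring experts and mix their outputs."""
    # One gate score per expert: dot product of that expert's gate row with x.
    logits = [sum(w * xi for w, xi in zip(row, x)) for row in gate_weights]
    # Keep only the top_k experts; the others are never evaluated.
    chosen = sorted(range(len(experts)), key=lambda i: logits[i], reverse=True)[:top_k]
    weights = softmax([logits[i] for i in chosen])
    out = [0.0] * len(x)
    for w, i in zip(weights, chosen):
        y = experts[i](x)
        out = [o + w * yi for o, yi in zip(out, y)]
    return out, chosen

# Hypothetical toy experts: expert s simply scales its input by s.
toy_experts = [lambda x, s=s: [s * v for v in x] for s in (1, 2, 3, 4)]
toy_gates = [[0, 0], [5, 0], [1, 0], [5, 0]]  # illustrative gate weights
output, used = moe_forward([1.0, 0.0], toy_gates, toy_experts)
print(used, output)
```

Only the selected experts run for a given input, so the compute per token depends on `top_k` rather than on the total number of experts.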
“Perfect app experience is correlated, but independent from models”
“The biggest lesson that can be read from 70 years of AI research is that general methods that leverage computation are ultimately the most effective, and by a large margin”
– Professor Richard Sutton, The Bitter Lesson
Interpretation of the above:
The biggest lesson from 70 years of AI research is that a generalized approach (in today’s context, large language models, deep learning models etc.) that applies a large amount of numerical computation to a large amount of data seems to be the most effective.
| Timeline | Infra | Examples | Characteristics |
|---|---|---|---|
| 1970s | Scientific computing | NERSC | <ul><li>large cluster of scientific computing machines</li><li>used for large-scale physics / weather simulations</li></ul> |
| 1990s | Virtual private servers | | <ul><li>someone else takes care of the physical machines</li><li>people build applications on top</li><li>still a limited offering of software and applications on top of the machines</li></ul> |
| 2000s | Web service cloud | Amazon Web Services (AWS), Google Cloud Platform (GCP), Microsoft Azure | <ul><li>host a bunch of machines and do virtualization</li><li>put microservices or small workloads over there</li><li>does a good job at web serving, moving data like webpages / images / videos around</li></ul> |
| 2010s | Data cloud | Snowflake, Databricks | <ul><li>massive amount of data and distributed computation using SQL</li></ul> |
| 2020s | Need for AI Cloud | CoreWeave, Lambda, DigitalOcean | <ul><li>a lot of compute and communication over high performance, heterogeneous and cloud native infra</li><li>this has never been seen in the history of cloud computing</li></ul> |
Two parameters govern how we think about infrastructure: 1) data movement (IO); 2) numerical computation (compute)
| Service | IO vs. compute | Characteristics | Use and infra |
|---|---|---|---|
| Data compute | IO ≫ compute | <ul><li>abstractions like Spark or MapReduce to carry out one SQL query over a big distributed system</li><li>a lot of complicated distributed scheduling under the hood</li></ul> | <ul><li>easy to use</li><li>hard for infra</li></ul> |
| Web services | IO > compute | <ul><li>arbitrary code</li><li>embarrassingly parallel system, can scale up / down</li><li>read things off the disk and serve them to the customers</li></ul> | <ul><li>(kinda) easy to use</li><li>(kinda) easy for infra: if some machines go down, redo that chunk of compute on the next available MapReduce shard</li></ul> |
| AI compute | compute ≫ IO | <ul><li>arbitrary code</li><li>very distributed systems</li></ul> | <ul><li>(pretty) hard to use</li><li>(pretty) hard for infra: if one GPU machine goes down, need to restart the whole job</li></ul> |
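The IO-versus-compute contrast in the table can be made concrete with a rough arithmetic-intensity estimate (floating-point operations per byte moved). This is an idealised sketch, not a figure from the lecture: it assumes each matrix is moved exactly once, and the function names, element sizes, and the matrix size of 4096 are illustrative assumptions.

```python
def matmul_intensity(n, bytes_per_element=2):
    """FLOPs per byte moved for multiplying two n x n matrices."""
    flops = 2 * n ** 3                           # n^3 multiply-add pairs
    bytes_moved = 3 * n * n * bytes_per_element  # read A and B, write C once
    return flops / bytes_moved

def scan_intensity(n, bytes_per_element=8):
    """FLOPs per byte for summing n values, as in a SQL-style aggregate."""
    flops = n                                    # one addition per value
    bytes_moved = n * bytes_per_element          # each value read once
    return flops / bytes_moved

# Assumed sizes: a 4096 x 4096 matmul in 2-byte floats vs. a 1M-row column sum.
print(f"matmul: {matmul_intensity(4096):.1f} FLOPs/byte")
print(f"scan:   {scan_intensity(1_000_000):.3f} FLOPs/byte")
```

Under these assumptions the matrix multiplication does roughly n/3 operations per byte, so it becomes more compute-bound as it grows, while the scan stays at a fixed, small ratio and is bound by data movement.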
Conventional cloud value proposition no longer holds…
| Cloud | Software variety | Software workload | Supply chain flexibility | Supply chain interchangeability |
|---|---|---|---|---|
| Conventional cloud | complicated: many applications, middleware | varied: compute, storage, network, big data, database etc. | high: microservices work well on virtualization, can migrate workloads between different machines | high: CPU machines can be re-purposed; VMs can do many different jobs |
| AI cloud | simple: “AI frameworks” (PyTorch, TensorFlow etc.) and dependent libraries | unified: numerical computation, matrix multiplications | low: a bunch of GPU machines physically stuck together with high performance networks, hard to live migrate | low: good for numerical computation, but cannot really run web services etc. |
SemiAnalysis coined the term neocloud with a focus on AI-centric computation resources
AI workloads should not run directly on bare metal; Kubernetes is a nice abstraction, but it is normally aimed at site reliability engineers or operations teams
Efficiency in conflicting state:
Best practices for start-ups:
AI infra is different, but in the end a lot of the conventional wisdom of operating clusters and services starts coming back
e.g., as with conventional cloud services: when resuming a chatbot service after an outage, put a traffic regulator in front of the service so that only a limited amount of incoming traffic is admitted, then gradually bring the service back up to speed
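A traffic regulator of this kind could look like the sketch below. It is a hypothetical, simplified illustration rather than the speaker's implementation: the class name, the linear ramp, and the numbers (10 to 100 requests per second over 60 seconds) are all assumptions.

```python
class RampingLimiter:
    """Admit a growing number of requests per second while a service warms up."""

    def __init__(self, start_rps, target_rps, ramp_seconds):
        self.start_rps = start_rps
        self.target_rps = target_rps
        self.ramp_seconds = ramp_seconds

    def allowed_rps(self, elapsed):
        """Requests per second permitted `elapsed` seconds after the restart."""
        if elapsed >= self.ramp_seconds:
            return self.target_rps
        frac = max(elapsed, 0) / self.ramp_seconds
        # Linear ramp from start_rps to target_rps (an assumed policy).
        return self.start_rps + frac * (self.target_rps - self.start_rps)

    def admit(self, elapsed, served_this_second):
        """True if one more request fits under the current limit."""
        return served_this_second < self.allowed_rps(elapsed)

limiter = RampingLimiter(start_rps=10, target_rps=100, ramp_seconds=60)
for t in (0, 30, 60):
    print(t, limiter.allowed_rps(t))
```

Requests rejected by `admit` would typically be queued or asked to retry, so that caches and model replicas can warm up before the full load returns.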
| Type | Characteristics | Example |
|---|---|---|
| Mainframe | <ul><li>one bus to which CPUs, memory and storage can be attached</li><li>operates as if it were one single computer; CPUs and GPUs can access all the memory in that rack</li><li>nice to program</li><li>not that flexible</li></ul> | Cray-2 |
| Conventional cloud | <ul><li>small modular machines, each operating on its own</li><li>each machine cannot access another machine’s memory without asking for permission</li><li>individual machines can be shoveled in / out</li><li>serves microservices well</li></ul> | Open Compute Project (OCP) server design |
| AI compute | <ul><li>rack-level, nicely integrated set of servers that can do large distributed training</li><li>high-bandwidth switches able to directly access other machines’ memory without asking for permission</li><li>able to host large models, much easier to optimize (e.g., disaggregated prefilling and decoding, prompt caching)</li></ul> | DGX |
Summary by NVIDIA:
| Era | Timeline | Model size | Inference running on |
|---|---|---|---|
| Early AI era | ~2010 and before | <= 100M parameters; doubling every 20 months | CPUs |
| GPU AI era | ~2015 | ~100M to ~1B parameters; doubling every 6 months | 1 GPU |
| Multi-GPU AI era | ~2020-2025 | ~1B to multi-trillion parameters; doubling every 10 months | up to 8 GPUs |
| Age of AI reasoning at scale | 2025 onwards | drastic increase in compute for reasoning; expansion of distributed parallelism techniques | up to 72 GPUs |
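The doubling periods in the table translate into growth factors with simple arithmetic: growing for `months` at a doubling period of `doubling_months` multiplies the size by 2^(months / doubling_months). The sketch below applies that formula. The function name and the 5-year window are illustrative assumptions, and only the doubling periods come from the table.

```python
def growth_factor(months, doubling_months):
    """Multiplicative growth after `months` at a fixed doubling period."""
    return 2 ** (months / doubling_months)

# Doubling periods from the table; the 60-month window is an assumption.
eras = [("Early AI era", 20), ("GPU AI era", 6), ("Multi-GPU AI era", 10)]
for label, doubling in eras:
    print(f"{label}: x{growth_factor(60, doubling):.0f} over 5 years")
```

Over the same five years, a 20-month doubling period gives 8x growth, while a 6-month period gives 1024x, which is why the hardware needed for inference changes so quickly between eras.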