ucb_agentic_ai

Lecture 02: Evolution of System Designs from an AI Engineer’s Perspective

Link to lecture recording on YouTube

Date: 2025-09-22

Speaker: Yangqing Jia 贾扬清

Speaker’s Social Profile: Website / Google Scholar / GitHub / LinkedIn / X (Twitter)

Education:

Ph.D. in Computer Science, 2009-2013, University of California, Berkeley, advised by Prof. Trevor Darrell
Master in Control Science and Engineering, Tsinghua University
B.Eng. in Automation, Tsinghua University

Work:

VP, AI System Software, Nvidia
Founder, Lepton AI (acquired by Nvidia)

Notes

Demystify “LLM” and “AGI”

The Chinese typewriter example: how can typing be made more efficient? (Today, people type Chinese characters on a QWERTY keyboard.)

idea: the distance between the current character and the next character matters
analogy: defragmentation of spinning hard disk drive to physically organize the contents and store files into contiguous regions
practice: group characters that often appear together
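
The grouping idea above can be sketched in a few lines. This is only an illustration, not the layout method used on real typewriters: the one-dimensional tray, the greedy heuristic, and the function names `adjacent_pair_counts`, `greedy_layout` and `travel_cost` are all assumptions made for the example.

```python
from collections import Counter


def adjacent_pair_counts(text):
    """Count how often each unordered pair of distinct characters appears side by side."""
    counts = Counter()
    for a, b in zip(text, text[1:]):
        if a != b:
            counts[frozenset((a, b))] += 1
    return counts


def greedy_layout(text):
    """Order characters on a hypothetical 1-D tray so frequent neighbours sit close.

    Greedy heuristic: start from the most frequent character, then repeatedly
    append the unplaced character that co-occurs most with the last placed one.
    """
    pairs = adjacent_pair_counts(text)
    freq = Counter(text)
    layout = [freq.most_common(1)[0][0]]
    remaining = set(freq) - set(layout)
    while remaining:
        last = layout[-1]
        # Ties are broken by overall frequency, then by the character itself,
        # so the result is deterministic.
        nxt = max(remaining, key=lambda c: (pairs[frozenset((last, c))], freq[c], c))
        layout.append(nxt)
        remaining.remove(nxt)
    return layout


def travel_cost(text, layout):
    """Total distance moved along the tray while typing the text."""
    pos = {c: i for i, c in enumerate(layout)}
    return sum(abs(pos[a] - pos[b]) for a, b in zip(text, text[1:]))
```

Comparing `travel_cost` for the greedy layout against an arbitrary layout on the same text shows how much movement the grouping saves.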

Conventional NLP: count the 2-3 words before the current one; 4-grams or 5-grams are probably the upper limit because the multi-dimensional correlation matrix gets too large
Latest models have context lengths of millions of tokens
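
A quick calculation shows why the count table blows up. It assumes a dense table with one cell per possible n-gram, which is the worst case (practical n-gram models store only the n-grams they have seen), and the vocabulary size of 50,000 is a hypothetical figure chosen for the example.

```python
def ngram_table_entries(vocab_size, n):
    """Number of cells in a dense count table covering every possible n-gram."""
    return vocab_size ** n


vocab_size = 50_000  # hypothetical vocabulary size, for illustration only
for n in range(1, 6):
    print(f"n={n}: {ngram_table_entries(vocab_size, n):.2e} cells")
```

Each extra word of context multiplies the table by the vocabulary size, so the growth is exponential in n.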

New Algorithms Drive Continued Improvements

Have not seen a plateauing effect yet
Healthy competition between open-source and closed-source models: closed-source models lead in absolute quality, but the gap is narrowing
There does not seem to be a “bubble”: consumption continues to grow, and not only for model training¹

Historical analogies

Timeline	Algorithm	Analogies
Nov 2022	GPT (3.5)	AlexNet (structural innovation, freed researchers from using simple, handcrafted features)
Dec 2023	Mixture of Experts (Mixtral 8x7B): a massively sparse, sparsely activated model that improves efficiency by using only part of the model for each token	Ensemble learning, Inception, ResNet
Sep 2024	Test-time scaling: having the model accumulate and reflect on its intermediate guesses to get a better result	Fully convolutional network (predict multiple times across the spatial domain to get a better result), multi-instance learning
Jan 2025 (and earlier)	Reinforcement learning: a principled way to match the intention where it is difficult to define a loss function	General RL, GANs
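
As a rough illustration of the sparse activation in the Mixture of Experts row, the sketch below routes an input to the top-k experts and mixes their outputs. It is a toy: the experts are plain Python callables, the router scores are passed in by hand instead of being produced by a learned gate, and the names `softmax` and `moe_forward` are made up for the example.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def moe_forward(x, router_logits, experts, k=2):
    """Route input x to the top-k experts and mix their outputs.

    router_logits: one score per expert (assumed given here; a real model
        computes them with a learned gating layer).
    experts: list of callables; only the k selected ones are evaluated.
    """
    # Indices of the k highest-scoring experts.
    top = sorted(range(len(experts)), key=lambda i: router_logits[i], reverse=True)[:k]
    # Renormalize the scores over the selected experts only.
    weights = softmax([router_logits[i] for i in top])
    output = sum(w * experts[i](x) for w, i in zip(weights, top))
    return output, top
```

With eight experts and `k=2`, only a quarter of the expert computation runs per input, which is where the efficiency gain comes from.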

Applications: to-C (consumer) apps thrive; to-B (business) apps are hopeful and nascent

“Perfect app experience is correlated, but independent from models”

Consumer app landscape: highly fluid and competitive due to the continued improvement of foundation models
Prosumers’ willingness to pay drives revenue
Enterprise applications are in an interesting position: many are growing much faster than the more conventional ones, but not as fast as Cursor/Perplexity on the consumer side
There is still a lot of potential, especially in the more nascent enterprise market
Healthy synergy between model-building companies and application companies, both of which need cloud infrastructure to support massive scaling

AI Infra is the 3rd Pillar in IT Strategy

“The biggest lesson that can be read from 70 years of AI research is that general methods that leverage computation are ultimately the most effective, and by a large margin”

– Professor Richard Sutton, The Bitter Lesson

Interpretation of the above:
The biggest lesson that we have learned in the 70 years of AI research is that we can employ a generalized approach (in today’s context, large language models, deep learning models etc.). We deploy a large amount of numerical computation onto a large amount of data, and that seems to be the most effective way.

Timeline	Infra	Examples	Characteristics
1970s	Scientific computing	NERSC	<ul><li>large cluster of scientific computing machines</li><li>used for large-scale physics / weather simulations</li></ul>
1990s	Virtual private servers		<ul><li>someone else taking care of the physical machines</li><li>people build applications on top</li><li>still limited offering of software and applications on top of the machines</li></ul>
2000s	Web service cloud	Amazon Web Services (AWS) Google Cloud Platform (GCP) Microsoft Azure	<ul><li>host a bunch of machines and do virtualization</li><li>put microservices or small workloads over there</li><li>does a good job in web serving, moving data like webpages / images / videos around</li></ul>
2010s	Data cloud	Snowflake Databricks	<ul><li>massive amount of data and distributed computation using SQL</li></ul>
2020s	Need for AI Cloud	CoreWeave Lambda DigitalOcean	<ul><li>a lot of compute and communication over high performance, heterogeneous and cloud native infra</li><li>this has never been seen in the history of cloud computing</li></ul>

Two parameters govern how we think about infrastructure: 1) data movement (IO); 2) numerical computation (compute)

Service	IO vs. compute	Characteristics	Use and infra
Data compute	IO ≫ compute	<ul><li>abstractions like Spark or MapReduce to carry out one SQL query over a big distributed system</li><li>a lot of complicated distributed scheduling under the hood</li></ul>	<ul><li>easy to use</li><li>hard for infra</li></ul>
Web services	IO > compute	<ul><li>arbitrary code</li><li>embarrassingly parallel system, can scale up / down</li><li>read things off the disk and serve to the customers</li></ul>	<ul><li>(kinda) easy to use</li><li>(kinda) easy for infra: if some machines go down, redo that chunk of compute on the next available MapReduce shard</li></ul>
AI compute	compute ≫ IO	<ul><li>arbitrary code</li><li>very distributed systems</li></ul>	<ul><li>(pretty) hard to use</li><li>(pretty) hard for infra: if one GPU machine goes down, need to restart the whole job</li></ul>
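
The difference in failure handling in the last column can be put into numbers. This is a back-of-the-envelope sketch with hypothetical function names and inputs: it assumes an embarrassingly parallel job only redoes the failed chunk, while a tightly coupled training job rolls every machine back to its last checkpoint.

```python
def wasted_machine_hours_parallel(hours_per_chunk):
    """Independent chunks: only the chunk on the failed machine is recomputed."""
    return hours_per_chunk


def wasted_machine_hours_synchronous(num_machines, hours_since_checkpoint):
    """Tightly coupled job: all machines lose the work done since the last checkpoint."""
    return num_machines * hours_since_checkpoint


# Hypothetical numbers: a 0.5-hour chunk versus a 1,024-machine job
# that last checkpointed 0.5 hours before the failure.
print(wasted_machine_hours_parallel(0.5))
print(wasted_machine_hours_synchronous(1024, 0.5))
```

Under these assumptions a single failure costs the coupled job as many machine-hours as it has machines times the checkpoint gap, which is why checkpoint frequency and fast restarts matter so much for AI compute.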

Conventional cloud value proposition no longer holds…

Cloud	Software variety	Software workload	Supply chain flexibility	Supply chain interchangeability
Conventional cloud	complicated: many applications, middleware	varied: compute, storage, network, big data, database etc.	high: microservices work well on virtualization, can migrate workloads between different machines	high: CPU machines can be re-purposed; VMs can do many different jobs
AI cloud	simple: “AI frameworks” (PyTorch, TensorFlow etc.) and dependent libraries	unified: numerical computation, matrix multiplications	low: a bunch of GPU machines physically stuck together with high performance networks, hard to live migrate	low: good for numerical computation, but cannot really run web services etc.

SemiAnalysis coined the term neocloud with a focus on AI-centric computation resources

AI workloads should not run directly on bare metal; Kubernetes is a nice abstraction, but it is normally aimed at site reliability engineers or operations teams

Two kinds of efficiency are in conflict:

developer efficiency: want things to be simple and abstracted
infra efficiency: faulty GPU (35.3%), GPU HBM3 memory (17.2%) and software bug (12.9%) were the top root causes of interruptions during a 54-day period of Llama 3 405B pre-training

Best practices for start-ups:

multi-cloud supply chain management as GPUs are in short supply: figure out how to smoothly migrate jobs between different clouds
elasticity and utilization management: not wasting idle GPU hours
AI native platform to unify dev, training and inference; abstract away the Kubernetes jargon and focus on the training jobs (e.g., Ray developed at UC Berkeley)
build your own team around model and applications (e.g., Cursor uses standardized SaaS services to stabilize inference workloads)

AI infra is different, but in the end, a lot of the conventional wisdom of operating a cluster and services starts coming back
e.g., as with conventional cloud services, when resuming a chatbot service after an outage, put a traffic regulator in front of the service so that only a limited amount of incoming traffic is allowed, then gradually bring the service up to speed
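
The traffic regulator described above might look like the following minimal sketch. The class name `RampingAdmission`, the linear ramp, and the per-second accounting are assumptions made for illustration; production systems typically use a load balancer or gateway feature for this.

```python
class RampingAdmission:
    """Admit a growing number of requests per second after a restart."""

    def __init__(self, start_rps, target_rps, ramp_seconds):
        self.start_rps = start_rps
        self.target_rps = target_rps
        self.ramp_seconds = ramp_seconds

    def allowed_rps(self, seconds_since_restart):
        """Linearly interpolate the admitted rate from start_rps to target_rps."""
        if seconds_since_restart >= self.ramp_seconds:
            return self.target_rps
        frac = max(seconds_since_restart, 0) / self.ramp_seconds
        return self.start_rps + frac * (self.target_rps - self.start_rps)

    def admit(self, seconds_since_restart, requests_this_second):
        """Return (accepted, rejected) counts for one second of incoming traffic."""
        limit = int(self.allowed_rps(seconds_since_restart))
        accepted = min(requests_this_second, limit)
        return accepted, requests_this_second - accepted


# Hypothetical settings: start at 10 requests/s and reach 100 requests/s after 60 s.
regulator = RampingAdmission(start_rps=10, target_rps=100, ramp_seconds=60)
print(regulator.admit(seconds_since_restart=30, requests_this_second=80))
```

Rejected requests would normally be retried by clients with backoff, so the backlog drains gradually instead of hitting a cold service all at once.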

Hardware and Software Design: Back to the Future?

Type	Characteristics	Example
Mainframe	<ul><li>one bus to which CPUs, memory and storage can be attached</li><li>operates as if it is one single computer; CPUs and GPUs can access all the memory in that rack</li><li>nice to program</li><li>not that flexible</li></ul>	Cray-2
Conventional cloud	<ul><li>small modular machines, each operating on its own</li><li>each machine cannot access another machine’s memory without asking for permission</li><li>individual machines can be swapped in and out</li><li>serves microservices well</li></ul>	Open Compute Project (OCP) server design
AI compute	<ul><li>rack-level, tightly integrated set of servers that can do large distributed training</li><li>high-bandwidth switches let a machine directly access other machines’ memory without asking for permission</li><li>able to host large models, much easier to optimize (e.g., disaggregated prefilling and decoding, prompt caching)</li></ul>	DGX
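
To make the prompt caching mentioned in the last row more concrete, here is a toy sketch of reusing work for a shared prompt prefix. It is not how any particular serving system implements it: the class `PrefixCache` is hypothetical, and the cached "state" is a plain placeholder standing in for whatever a real system would store for the already-processed prefix.

```python
class PrefixCache:
    """Map prompt prefixes (as token tuples) to a previously computed state."""

    def __init__(self):
        self._store = {}

    def put(self, tokens, state):
        """Remember the state computed for this exact token sequence."""
        self._store[tuple(tokens)] = state

    def longest_prefix(self, tokens):
        """Return (matched_length, state) for the longest cached prefix of tokens.

        Returns (0, None) when no prefix is cached, in which case the whole
        prompt has to be processed from scratch.
        """
        for end in range(len(tokens), 0, -1):
            key = tuple(tokens[:end])
            if key in self._store:
                return end, self._store[key]
        return 0, None


cache = PrefixCache()
cache.put(["system", "you", "are", "helpful"], "state-for-system-prompt")
matched, state = cache.longest_prefix(["system", "you", "are", "helpful", "hi", "there"])
# Only the tokens after position `matched` still need to be processed.
```

The benefit grows with the share of traffic that starts with the same prefix, such as a long system prompt reused across requests.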

Summary by NVIDIA:

Era	Timeline	Model size	Inference running on
Early AI era	~2010 and before	≤ 100M parameters; doubling every 20 months	CPUs
GPU AI era	~2015	~100M to ~1B parameters; doubling every 6 months	1 GPU
Multi-GPU AI era	~2020-2025	~1B to multi-trillion parameters; doubling every 10 months	up to 8 GPUs
Age of AI reasoning at scale	2025 onwards	drastic increase in compute for reasoning; expansion of distributed parallelism techniques	up to 72 GPUs
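
The doubling periods in the table translate into very different growth over the same time span. The helper below is a simple illustration; the name `growth_factor` and the five-year horizon are assumptions chosen for the example, while the doubling periods are the ones listed above.

```python
def growth_factor(months, doubling_period_months):
    """How many times larger a quantity gets if it doubles every doubling_period_months."""
    return 2 ** (months / doubling_period_months)


horizon = 60  # five years, an arbitrary horizon for comparison
for period in (20, 10, 6):
    print(f"doubling every {period} months -> x{growth_factor(horizon, period):.0f} in {horizon} months")
# For example, 2 ** (60 / 6) = 2 ** 10 = 1024.
```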

Q&A after this

References

¹ Gartner: The 2025 Hype Cycle for Artificial Intelligence Goes Beyond GenAI
