ucb_agentic_ai

Lecture 02: Evolution of System Designs from an AI Engineer’s Perspective

Link to lecture recording on YouTube

Date: 2025-09-22

Speaker: Yangqing Jia 贾扬清

Speaker’s Social Profile: Website / Google Scholar / GitHub / LinkedIn / X (Twitter)

Education:

Work:

Notes

Demystify “LLM” and “AGI”

The Chinese typewriter example: how do we make things more efficient? Today people type Chinese characters on a QWERTY keyboard

Conventional NLP: predict a word from counts of the 2-3 words before it; 4-5-grams are probably the upper limit because the correlation matrix gets too large as dimensions are added
The latest models have context lengths of millions of tokens
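
To see why count-based models stall at a few words of context, it helps to count how many distinct word sequences a dense count table would have to cover. This is a rough sketch; the vocabulary size of 50,000 is an assumption for illustration, not a figure from the lecture.

```python
def ngram_table_entries(vocab_size: int, n: int) -> int:
    """Number of possible n-grams, i.e. cells in a dense count table."""
    return vocab_size ** n


VOCAB_SIZE = 50_000  # assumed vocabulary size, for illustration only

for n in range(1, 6):
    entries = ngram_table_entries(VOCAB_SIZE, n)
    print(f"{n}-gram: {entries:.2e} possible entries")
```

Each extra word of context multiplies the table by the vocabulary size, which is why practical count-based models stayed at short contexts.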

New Algorithms Drive Continued Improvements

Have not seen a plateauing effect yet
Healthy competition between open-source and closed-source models: closed-source models lead in absolute quality, but the gap is narrowing
There does not seem to be a “bubble”: consumption continues to grow, and not only for model training [1]

Historical analogies

| Timeline | Algorithm | Analogies |
| --- | --- | --- |
| Nov 2022 | GPT (3.5) | AlexNet (structural innovation, freed researchers from using simple, handcrafted features) |
| Dec 2023 | Mixture of Experts (Mixtral 8x7B): massively sparse, sparsely activated model to improve efficiency and decrease model size | Ensemble learning<br>Inception<br>ResNet |
| Sep 2024 | Test-time scaling: having the model accumulate and reflect on its intermediate guesses to get a better result | Fully convolutional network: predict multiple times across the spatial domain and try to get a better result<br>Multi-instance learning |
| Jan 2025 (and earlier) | Reinforcement learning: a principled way to match the intention where it is difficult to define a loss function | General RL<br>GANs |
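
The "sparsely activated" idea in the Mixture of Experts row can be sketched as top-k routing: a gate scores every expert, but only the best few actually run. This is a toy illustration, not Mixtral's implementation. The eight experts with two active per input mirror Mixtral 8x7B's published setup; the scalar "experts", the hard-coded gate scores, and the function names are all made up for the example (in a real model the gate scores come from a learned router).

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def moe_forward(x, gate_logits, experts, top_k=2):
    """Run only the top_k experts chosen by the gate and mix their outputs."""
    # Indices of the k experts with the largest gate scores.
    chosen = sorted(range(len(experts)),
                    key=lambda i: gate_logits[i], reverse=True)[:top_k]
    # Renormalize the gate weights over the chosen experts only.
    weights = softmax([gate_logits[i] for i in chosen])
    # Experts that were not chosen are never evaluated: this is the sparsity.
    output = sum(w * experts[i](x) for w, i in zip(weights, chosen))
    return output, chosen


# Eight toy experts; expert k simply multiplies its input by k.
experts = [lambda x, k=k: k * x for k in range(8)]
# Hypothetical gate scores for one input.
gate_logits = [0.1, 2.0, -1.0, 0.5, 2.0, 0.0, -0.5, 0.3]

y, used = moe_forward(3.0, gate_logits, experts)
print(f"experts used: {used}, output: {y}")
```

Only two of the eight experts do any work for this input, which is where the efficiency gain comes from.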

Applications: to-C (consumer) apps thrive; to-B (business) apps are hopeful and nascent

“Perfect app experience is correlated, but independent from models”

AI Infra is the 3rd Pillar in IT Strategy

“The biggest lesson that can be read from 70 years of AI research is that general methods that leverage computation are ultimately the most effective, and by a large margin”

– Professor Richard Sutton, The Bitter Lesson

Interpretation of the above:
The biggest lesson that we have learned in the 70 years of AI research is that we can employ a generalized approach (in today’s context, large language models, deep learning models etc.). We deploy a large amount of numerical computation onto a large amount of data, and that seems to be the most effective way.

| Timeline | Infra | Examples | Characteristics |
| --- | --- | --- | --- |
| 1970s | Scientific computing | NERSC | <ul><li>large cluster of scientific computing machines</li><li>used for large-scale physics / weather simulations</li></ul> |
| 1990s | Virtual private servers | | <ul><li>someone else taking care of the physical machines</li><li>people build applications on top</li><li>still limited offering of software and applications on top of the machines</li></ul> |
| 2000s | Web service cloud | Amazon Web Services (AWS)<br>Google Cloud Platform (GCP)<br>Microsoft Azure | <ul><li>host a bunch of machines and do virtualization</li><li>put microservices or small workloads over there</li><li>does a good job in web serving, moving data like webpages / images / videos around</li></ul> |
| 2010s | Data cloud | Snowflake<br>Databricks | <ul><li>massive amount of data and distributed computation using SQL</li></ul> |
| 2020s | Need for AI Cloud | CoreWeave<br>Lambda<br>DigitalOcean | <ul><li>a lot of compute and communication over high performance, heterogeneous and cloud native infra</li><li>this has never been seen in the history of cloud computing</li></ul> |

Two parameters govern how we think about infrastructure: 1) data movement (IO); 2) numerical computation (compute)

| Service | IO vs. compute | Characteristics | Use and infra |
| --- | --- | --- | --- |
| Data compute | IO ≫ compute | <ul><li>abstractions like Spark or MapReduce to carry out one SQL query over a big distributed system</li><li>a lot of complicated distributed scheduling under the hood</li></ul> | <ul><li>easy to use</li><li>hard for infra</li></ul> |
| Web services | IO > compute | <ul><li>arbitrary code</li><li>embarrassingly parallel system, can scale up / down</li><li>read things off the disk and serve to the customers</li></ul> | <ul><li>(kinda) easy to use</li><li>(kinda) easy for infra: if some machines go down, redo that chunk of compute on the next available MapReduce shard</li></ul> |
| AI compute | compute ≫ IO | <ul><li>arbitrary code</li><li>very distributed systems</li></ul> | <ul><li>(pretty) hard to use</li><li>(pretty) hard for infra: if one GPU machine goes down, need to restart the whole job</li></ul> |
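
The last row is worth a quick back-of-the-envelope check: if one machine failing stops the whole job, the job only survives when every machine survives. A minimal sketch, assuming independent failures and a made-up per-machine failure probability of 0.1% per day (neither assumption comes from the lecture):

```python
def synchronous_job_survival(p_machine_fail: float, n_machines: int) -> float:
    """Probability that all n machines stay up, assuming independent failures."""
    return (1.0 - p_machine_fail) ** n_machines


P_FAIL_PER_DAY = 0.001  # assumed per-machine daily failure probability

for n in (8, 64, 512, 4096):
    p_ok = synchronous_job_survival(P_FAIL_PER_DAY, n)
    print(f"{n:>5} machines: P(no restart in a day) = {p_ok:.3f}")
```

In an embarrassingly parallel web service the same failure only costs the work on the machine that went down, so the difficulty does not grow with cluster size in the same way.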

Conventional cloud value proposition no longer holds…

| Cloud | Software variety | Software workload | Supply chain flexibility | Supply chain interchangeability |
| --- | --- | --- | --- | --- |
| Conventional cloud | complicated: many applications, middleware | varied: compute, storage, network, big data, database etc. | high: microservices work well on virtualization, can migrate workloads between different machines | high: CPU machines can be re-purposed; VMs can do many different jobs |
| AI cloud | simple: “AI frameworks” (PyTorch, TensorFlow etc.) and dependent libraries | unified: numerical computation, matrix multiplications | low: a bunch of GPU machines physically stuck together with high performance networks, hard to live migrate | low: good for numerical computation, but cannot really run web services etc. |

SemiAnalysis coined the term neocloud with a focus on AI-centric computation resources

Workloads should not run directly on bare metal; Kubernetes is a nice abstraction, but it is normally aimed at site reliability engineers or operations teams

Efficiency in conflicting state:

Best practices for start-ups:

AI infra is different, but in the end, a lot of the conventional wisdom of operating a cluster and services starts coming back
e.g., as with conventional cloud services, when resuming a chatbot service after an outage, put a traffic regulator in front of the service so that only a certain amount of incoming traffic is allowed; the service can then be brought up to speed gradually
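
The traffic regulator described above could look roughly like the following. This is only a sketch of one possible design (a per-second admission cap that ramps up linearly); the class name, parameters, and numbers are hypothetical, and the lecture did not describe a specific implementation.

```python
import time


class RampUpRegulator:
    """Admit a capped number of requests per second; the cap grows linearly
    from start_rps to full_rps over ramp_seconds after the service resumes."""

    def __init__(self, start_rps, full_rps, ramp_seconds, clock=time.monotonic):
        self.start_rps = start_rps
        self.full_rps = full_rps
        self.ramp_seconds = ramp_seconds
        self.clock = clock          # injectable so the ramp can be simulated
        self.t0 = clock()           # moment the service came back
        self.window = 0             # index of the current one-second window
        self.admitted = 0           # requests admitted in the current window

    def current_limit(self):
        elapsed = self.clock() - self.t0
        frac = min(1.0, elapsed / self.ramp_seconds)
        return int(self.start_rps + frac * (self.full_rps - self.start_rps))

    def allow(self):
        """Return True if the request may pass, False if it should be shed."""
        now_window = int(self.clock() - self.t0)
        if now_window != self.window:
            self.window = now_window
            self.admitted = 0
        if self.admitted < self.current_limit():
            self.admitted += 1
            return True
        return False


# Simulated clock so the demo does not need to sleep.
fake_now = [0.0]
regulator = RampUpRegulator(start_rps=2, full_rps=10, ramp_seconds=10,
                            clock=lambda: fake_now[0])
for t in (0.0, 5.0, 20.0):
    fake_now[0] = t
    passed = sum(regulator.allow() for _ in range(20))
    print(f"t={t:>4}s: admitted {passed} of 20 requests")
```

Rejected requests would typically get a retry-later response, so the backlog from the outage drains without overwhelming a cold service.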

Hardware and Software Design: back to the Future?

| Type | Characteristics | Example |
| --- | --- | --- |
| Mainframe | <ul><li>one bus to which CPUs, memory and storage can be attached</li><li>operates as if it is one single computer; CPUs and GPUs are able to access all the memory in the rack</li><li>nice to program</li><li>not that flexible</li></ul> | Cray-2 |
| Conventional cloud | <ul><li>small modular machines, each operating on its own</li><li>each machine cannot access another machine’s memory without asking for permission</li><li>individual machines can be swapped in / out</li><li>serves microservices well</li></ul> | Open Compute Project (OCP) server design |
| AI compute | <ul><li>rack-level, nicely integrated set of servers that can do large distributed training</li><li>high-bandwidth switches let a machine directly access another machine’s memory without asking for permission</li><li>able to host large models, much easier to optimize (e.g., disaggregated prefilling and decoding, prompt caching)</li></ul> | DGX |
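
Prompt caching, mentioned in the AI compute row, can be illustrated with a toy prefix cache: if an earlier request already processed the same leading tokens, only the new suffix needs to be prefilled. This is a simplified sketch under stated assumptions. Real serving systems store attention key/value state, usually in blocks, whereas this hypothetical `PrefixCache` only records which prefixes have been seen.

```python
class PrefixCache:
    """Toy prompt cache that remembers which token prefixes were processed."""

    def __init__(self):
        self._seen = set()  # prefixes whose state is assumed to be stored

    def prefill(self, tokens):
        """Return (tokens_reused, tokens_computed) for this prompt."""
        tokens = tuple(tokens)
        reused = 0
        # Find the longest prefix that an earlier prompt already covered.
        for end in range(len(tokens), 0, -1):
            if tokens[:end] in self._seen:
                reused = end
                break
        # "Compute" the remaining tokens and remember the new prefixes.
        for end in range(reused + 1, len(tokens) + 1):
            self._seen.add(tokens[:end])
        return reused, len(tokens) - reused


cache = PrefixCache()
system_prompt = "you are a helpful assistant".split()
first = cache.prefill(system_prompt + "what is moe".split())
second = cache.prefill(system_prompt + "explain resnet".split())
print(f"first request:  reused={first[0]}, computed={first[1]}")
print(f"second request: reused={second[0]}, computed={second[1]}")
```

The second request skips the shared system prompt, which is the kind of saving that becomes easier when the cached state sits in memory every machine in the rack can reach.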

Summary by NVIDIA:

| Era | Timeline | Model size | Inference running on |
| --- | --- | --- | --- |
| Early AI era | ~2010 and before | <= 100M parameters<br>doubling every 20 months | CPUs |
| GPU AI era | ~2015 | ~100M to ~1B parameters<br>doubling every 6 months | 1 GPU |
| Multi-GPU AI era | ~2020-2025 | ~1B to multi-trillion parameters<br>doubling every 10 months | up to 8 GPUs |
| Age of AI reasoning at scale | 2025 onwards | drastic increase in compute for reasoning<br>expansion of distributed parallelism techniques | up to 72 GPUs |
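
To get a feel for what the different doubling periods mean, they can be converted into growth factors over a fixed span. A small sketch; the doubling periods are taken from the summary above, while the five-year comparison window is an assumption chosen only to make the rates comparable.

```python
def growth_factor(months: float, doubling_months: float) -> float:
    """Multiplicative growth after `months` given a fixed doubling period."""
    return 2.0 ** (months / doubling_months)


COMPARISON_MONTHS = 60  # assumed five-year window, for comparison only

for era, doubling in (("Early AI era", 20),
                      ("GPU AI era", 6),
                      ("Multi-GPU AI era", 10)):
    factor = growth_factor(COMPARISON_MONTHS, doubling)
    print(f"{era}: doubling every {doubling} months -> x{factor:.0f} in 5 years")
```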

Q&A after this

References

  1. Gartner: The 2025 Hype Cycle for Artificial Intelligence Goes Beyond GenAI