ucb_agentic_ai

Lecture 01: LLM Agents Overview

Link to lecture recording on YouTube

Date: 2025-09-15

Speaker: Yann Dubois

Speaker’s Social Profile: Website / Google Scholar / GitHub / LinkedIn / X (Twitter)

Education:

Ph.D. in Computer Science, 2021-2024, Stanford University, advised by Prof. Tatsunori Hashimoto and Prof. Percy Liang
M.Phil in Machine Learning and Machine Intelligence, 2018-2019, University of Cambridge
B.Sc. in Biomedical Engineering, 2014-2017, Federal Polytechnic School of Lausanne (École Polytechnique Fédérale de Lausanne)

Work:

Member of Technical Staff, OpenAI

Notes

General LLM Training Pipeline

Stage	Purpose	Task	Data	Duration	Compute cost	Bottleneck	Remarks
Pretraining	teach the model everything in the world	predict next word	any reasonable data on the Internet >10T tokens >20B unique webpages	months	>$10M	data and compute	key since GPT-2 (2019)
Reasoning RL (only for reasoning models)	teach the model to reason	think on questions with objective answers or ground truth, and answer correctly	~1M problems, any hard task with verifiable answer	weeks	>$1M	optimizing for objective truth	important since o1 (2024)
Classic post-training / Reinforcement Learning from Human Feedback (RLHF)	steer the model to be useful on real-world tasks	maximize human preference over answers	~100K problems	days	>$100K	quality of data and evaluation	important since ChatGPT (2022)

The speaker usually bundles the 2nd and 3rd stages together as post-training
(numbers are approximate, taken from different open-source projects)

Five things to consider when training an LLM

Model architecture: (variants of) transformers, mixture of experts (MoE)
Training algorithm / loss: what is being optimized for this architecture
Data and RL environment
Evaluation: knowing whether any progress is being made
Systems and infrastructure: making sure these runs can be scaled up

Most of academia focused on the first two until 2023. In reality, the last three matter most in practice

LLM Specializing Pipeline

Stage	Purpose	Data	Duration	Compute cost	Bottleneck
Prompting	interact with and specialize the model for the use case	0	hours	0	evaluation
Finetuning	second stage of post-training on domain-specific data	~10K-100K problems	days	~$10K-100K	quality of data and evaluation

More about Pre-training

Steps: tokenize → forward → predict probability of next token → sample → detokenize
(last two steps only happen at inference time)

Example: a simple language model: N-grams
in order to predict the word after “the grass is __”, take all occurrences of “the grass is __” on Wikipedia, and estimate the probability as:

\[P(X \mid \text{"the grass is"}) = \frac{\operatorname{count}(\text{"the grass is } X\text{"})}{\operatorname{count}(\text{"the grass is"})}\]

problems: 1) huge memory requirement; 2) cannot generalize, as most sentences / pieces of code are unique
solution: neural networks
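
The count-based estimate above can be sketched in a few lines. This is a minimal illustration, assuming a tiny made-up corpus in place of Wikipedia; the function names `train_ngram` and `next_word_probs` are hypothetical.

```python
from collections import Counter, defaultdict

def train_ngram(corpus, n=4):
    """Count how often each word follows each (n-1)-word context."""
    counts = defaultdict(Counter)
    for sentence in corpus:
        words = sentence.lower().split()
        for i in range(len(words) - n + 1):
            context = tuple(words[i:i + n - 1])
            counts[context][words[i + n - 1]] += 1
    return counts

def next_word_probs(counts, context):
    """P(X | context) = count(context followed by X) / count(context)."""
    followers = counts.get(tuple(context.lower().split()), Counter())
    total = sum(followers.values())
    return {word: c / total for word, c in followers.items()} if total else {}

# Hypothetical toy corpus standing in for Wikipedia.
corpus = [
    "the grass is green",
    "the grass is green",
    "the grass is wet",
    "the sky is blue",
]
counts = train_ngram(corpus)
probs = next_word_probs(counts, "the grass is")   # green: 2/3, wet: 1/3
unseen = next_word_probs(counts, "the lawn is")   # context never seen
```

The second query returns an empty dictionary: a context that never appeared in the corpus gets no prediction at all, which is the generalization problem noted above.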

Neural Language Models

sentence → split into tokens → map each token to a word embedding (vector representation of the word) → pass through a neural network (think of it as a nonlinear aggregator of these vectors) → output a vector representation of the context → linear layer → softmax → probability distribution over the next word → minimize cross-entropy loss by backpropagating and tuning the weights
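
The last steps of this pipeline (softmax, next-word distribution, cross-entropy loss) can be sketched as follows. The three-word vocabulary and the logit values are made up for illustration; in a real model the logits come from the final linear layer.

```python
import math

def softmax(logits):
    """Turn raw scores into a probability distribution (numerically stable)."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(logits, target_index):
    """Negative log-probability assigned to the true next token."""
    return -math.log(softmax(logits)[target_index])

def logit_gradient(logits, target_index):
    """Gradient of the loss w.r.t. the logits: softmax minus one-hot target."""
    probs = softmax(logits)
    return [p - (1.0 if i == target_index else 0.0) for i, p in enumerate(probs)]

# Hypothetical vocabulary and logits for the context "the grass is".
vocab = ["green", "wet", "blue"]
logits = [2.0, 1.0, 0.1]
target = vocab.index("green")

probs = softmax(logits)
loss = cross_entropy(logits, target)
grad = logit_gradient(logits, target)
```

The gradient is negative for the correct token and positive for the others, so a gradient step raises the logit of "green" and lowers the rest.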

Pre-training Data

Idea: use all of the clean internet
note: the internet is dirty and not representative of what we want to ship to users or optimize the model on

Practice:

download all of the internet (Common Crawl): 250 billion pages, > 1 PB (= 10⁶ GB), stored as WARC files
text extraction from HTML (computationally expensive to clean and extract data; challenges: how to deal with JavaScript, boilerplate code, math etc.)
filter undesirable content
deduplicate data (e.g., all headers/footers/menu in forums are the same); idea is to not train too many times on the exact same data
heuristic filtering: remove low quality documents (evidence: number of words, word length, outlier/dirty tokens)
model-based filter (one idea is distribution matching; e.g., every page referenced on Wikipedia likely to be high quality)
data mix; classify data categories (code/book/entertainment); re-weight domains using scaling law to get high downstream performance
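
Two of the steps above, exact deduplication and heuristic filtering, can be sketched as below. This is only an illustration: the thresholds (`min_words`, `min_mean_len`, `max_mean_len`) and the sample documents are assumptions, not values from the lecture or from any real pipeline.

```python
import hashlib

def dedupe(docs):
    """Keep the first copy of each document, comparing normalized text hashes."""
    seen, kept = set(), []
    for doc in docs:
        normalized = " ".join(doc.lower().split())
        key = hashlib.sha256(normalized.encode("utf-8")).hexdigest()
        if key not in seen:
            seen.add(key)
            kept.append(doc)
    return kept

def passes_heuristics(doc, min_words=8, min_mean_len=3.0, max_mean_len=10.0):
    """Drop documents that are too short or have an odd mean word length."""
    words = doc.split()
    if len(words) < min_words:
        return False
    mean_len = sum(len(w) for w in words) / len(words)
    return min_mean_len <= mean_len <= max_mean_len

# Hypothetical documents: a menu bar, a good page, its near-copy, and junk.
docs = [
    "Home | About | Contact",
    "The transformer architecture uses attention to mix information across tokens.",
    "the transformer  architecture uses attention to mix information across tokens.",
    "x x x x x x x x x x",
]
cleaned = [d for d in dedupe(docs) if passes_heuristics(d)]
```

Only the first copy of the transformer sentence survives: the menu bar is too short, the near-copy hashes to the same key after normalization, and the last document fails the mean-word-length check.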

also: keep a second distribution of higher-quality data (e.g., Wikipedia, arXiv); the idea is to fine-tune on this high-quality data after pretraining so that the model learns to be as good as possible

Data processing steps of the FineWeb¹ datasets

Steps	Size reduction
Simple document filtering (repetition, length etc.)	200T → 36T
Deduplicate (remove text duplicated >100 times)	36T → 20T
JavaScript, lorem ipsum removal	20T → 18.6T
Additional filters	18.5T → 15T

General observations:

aggregate accuracy improves with increasing number of training tokens
aggregate accuracy improves with data processing steps above

Mid-training Data

continuing pretraining to adapt the model to desired properties / higher quality data (<10% of pretraining, ~1T tokens)
data mix changes; e.g., more scientific, coding, multilingual data
during pre-training, do not want to train on large context length since that is computationally intensive; extend context length during mid-training
add desired formatting or instruction following
higher quality data kept for the end (e.g., first learn how to speak grammatically correctly, then learn the real meaning of text)
reasoning data that teaches the model how to think

Pre / mid-training Data: the key to practical LLM

A lot of research has been done, and more remains to be done:

how to process well and efficiently
whether to use synthetic data (e.g., generated by big models)
how much multimodal data to put in
how to balance domains

Secrecy due to 1) competitive dynamics: data pipelines are easy to replicate, so companies do not disclose what they have been training on; 2) copyright liability

Common academic datasets: C4 (150B tokens, 800 GB), Dolma² (3T tokens), The Pile³ (280B tokens), FineWeb¹ (15T tokens)

Compute

Empirical evidence: for any type of data and model, what matters most is how much compute is spent on training (both the amount of data put into the model and the size of the model)
performance can be predicted well from compute using scaling laws: the minimum loss achieved falls linearly with compute when both are plotted on log scales (i.e., a power law)⁴
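
A straight line in log-log space can be fitted with ordinary least squares and then extrapolated to a larger budget. The sketch below assumes synthetic points generated from a made-up power law (L = 20 · C^-0.05); these are not measurements from any real training run.

```python
import math

def fit_power_law(compute, loss):
    """Least-squares fit of log10(loss) = intercept + slope * log10(compute)."""
    xs = [math.log10(c) for c in compute]
    ys = [math.log10(l) for l in loss]
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
             / sum((x - mean_x) ** 2 for x in xs))
    intercept = mean_y - slope * mean_x
    return slope, intercept

def predict_loss(compute, slope, intercept):
    """Extrapolate the fitted line to a new compute budget."""
    return 10 ** (intercept + slope * math.log10(compute))

# Synthetic "small runs" drawn from a hypothetical law L = 20 * C^-0.05.
small_runs = [1e17, 1e18, 1e19, 1e20]          # training FLOPs
losses = [20 * c ** -0.05 for c in small_runs]

slope, intercept = fit_power_law(small_runs, losses)
extrapolated = predict_loss(1e23, slope, intercept)  # predicted loss of a big run
```

The fitted slope is the scaling rate and the intercept fixes the constant, which is why small, cheap runs can be used to forecast a large one.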

Scaling laws: tuning

old pipeline: tune hyperparameters on big models, pick the best among 20-30 runs, but each of them is only trained on ~ $\frac{1}{30}$ of the compute
new pipeline: tune hyperparameters at small scale for a short amount of time, extrapolate to see the performance at larger scale, majority of the compute can go for the full run of the final huge model

example: comparison between transformer and LSTM: transformers have better constant and scaling rate (slope); scaling rate matters more than constant since the question is how the model will perform if it is much larger

Insights from the Chinchilla⁵ paper: to allocate training resources optimally (model size vs. data), train on roughly 20 tokens per parameter; note that this does not consider inference cost, so in practice a larger ratio (>150:1) is used, which gives a smaller model that is cheaper to serve
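
The token-to-parameter ratio can be turned into a quick budget calculation. The sketch below additionally assumes the common approximation that training costs about 6 · N · D FLOPs for N parameters and D tokens; that estimate is not stated in these notes, and the function names are hypothetical.

```python
import math

def tokens_for(n_params, tokens_per_param=20):
    """Training tokens implied by a given tokens-per-parameter ratio."""
    return n_params * tokens_per_param

def training_flops(n_params, n_tokens):
    """Assumption: training compute is roughly 6 * N * D FLOPs."""
    return 6 * n_params * n_tokens

def compute_optimal_split(flops, tokens_per_param=20):
    """Solve 6 * N * (ratio * N) = flops for N, then D = ratio * N."""
    n_params = math.sqrt(flops / (6 * tokens_per_param))
    return n_params, tokens_per_param * n_params

n_params = 7e9
optimal_tokens = tokens_for(n_params)            # 20:1 ratio
overtrained_tokens = tokens_for(n_params, 150)   # inference-aware ratio
budget = training_flops(n_params, optimal_tokens)
n_back, d_back = compute_optimal_split(budget)
```

For a 7B-parameter model, the 20:1 ratio gives 140B tokens, while a 150:1 ratio gives about 1.05T tokens.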

“the only thing that matters in the long run is the leveraging of computation”

– Professor Richard Sutton, The Bitter Lesson

Do not spend time overcomplicating: do simple things and make sure that they scale.

Post-training

Method	Idea	Data collection	Data quantity	What to learn	Remarks	Problem
Supervised finetuning (SFT)	finetune the LLM with language modeling (“next word prediction”) of the desired answers (“supervised”)	ask humans⁶; use LLMs to scale data collection⁷; synthetic data generation is now a whole field of its own	~10k are sufficient for learning style and instruction following⁸; the knowledge of format is already in the pre-trained LLM, just need to specialize to one “type of user”	<ul><li>instruction following</li><li>desired format or style</li><li>tool use</li><li>early reasoning</li><li>anything where you can get good input / output pairs</li></ul>	either seen as<ul><li>a final stage for training</li><li>or a preparation for RL</li></ul>	SFT is behavior cloning of humans:<ol><li>bound by human abilities: humans may prefer things that they are unable to generate</li><li>hallucination: cloning a correct answer teaches the model to make up plausible-sounding references</li></ol>
Reinforcement learning (RL)	maximize desired behavior rather than clone it	what rewards to maximize:<ul><li>rule-based rewards</li><li>reward model trained to predict human preference (RLHF)</li><li>LLM as a judge</li></ul>			<ul><li>infra is key: sampling is a bottleneck since multiple outputs per problem are sampled</li><li>especially for agents: 1) long rollouts; 2) slow environment feedback</li><li>engines are collocated to avoid communication overhead</li></ul>
RL from human feedback (RLHF)	maximize human preference rather than clone their behavior	<ul><li>for each instruction: generate 2 answers from a pretty good model</li><li>ask labelers to select their preferred answer</li><li>finetune the model to generate more preferred answers (PPO or DPO)</li></ul>			challenges of human data<ul><li>slow and expensive: have to write extremely detailed rubrics to tell humans what is considered a good/bad answer</li><li>hard to focus on content correctness rather than form (e.g., length)⁹</li><li>annotator distribution shifts behavior (e.g. different views on many things)¹⁰</li><li>crowdsourcing ethics</li></ul>idea: replace human preferences with LLM preferences⁷
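
For the preference-finetuning step, the per-pair DPO objective can be written down directly. This is a sketch under stated assumptions: the log-probabilities are made-up numbers standing in for summed token log-probs of the preferred ("chosen") and other ("rejected") answer under the model being trained and under a frozen reference model, and `beta=0.1` is an arbitrary choice.

```python
import math

def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """-log sigmoid(beta * margin), where the margin measures how much more
    the policy favors the chosen answer than the reference model does."""
    margin = ((policy_chosen_logp - ref_chosen_logp)
              - (policy_rejected_logp - ref_rejected_logp))
    z = beta * margin
    # Numerically stable form of log(1 + exp(-z)).
    return max(-z, 0.0) + math.log1p(math.exp(-abs(z)))

# Hypothetical sequence log-probabilities for one labeled pair.
at_start = dpo_loss(-10.0, -12.0, -10.0, -12.0)   # policy equals reference
improved = dpo_loss(-8.0, -14.0, -10.0, -12.0)    # policy favors chosen more
worse = dpo_loss(-12.0, -10.0, -10.0, -12.0)      # policy favors rejected
```

When the policy equals the reference the loss is log 2; it drops as the policy shifts probability toward the answer the labeler preferred.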

Evaluation

Quantify progress towards desired task to:

identify improvements, what to change and what hyperparameters to select etc.
select models for a specific application

Type	Idea	Example	Challenge
Close-ended	automatically verify if problems have a few possible answers	MMLU¹²	<ul><li>sensitive to prompting: inconsistent answers from different ways of prompting</li><li>train & test contamination: model appears to be much better if trained on the eval</li></ul>
Open-ended	ask for annotator preference between answers	see below	<ul><li>large diversity of use cases¹³</li><li>hard to automate since tasks are open-ended</li></ul>
	human evaluation: have users interact (blinded) with two chatbots, rate which is better	LMArena¹⁴	<ul><li>costly</li><li>slow</li></ul>
	LLM evaluation - use an LLM instead of a human<ul><li>generate output by a baseline (a human or a model) and by the model to evaluate</li><li>ask another LLM which output is better</li><li>fraction of times the answer is better than the baseline - win rate</li></ul>an LLM can be a good judge: cheaper and highly correlated with LMArena	AlpacaEval
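
The win-rate metric is simple to compute once the judge's verdicts are collected. In the sketch below the verdict labels and the choice to count a tie as half a win are assumptions for illustration.

```python
def win_rate(judgments):
    """Fraction of comparisons won by the evaluated model (ties count 0.5)."""
    score = {"model": 1.0, "tie": 0.5, "baseline": 0.0}
    return sum(score[j] for j in judgments) / len(judgments)

# Hypothetical judge verdicts for four prompts.
judgments = ["model", "baseline", "model", "tie"]
rate = win_rate(judgments)   # (1 + 0 + 1 + 0.5) / 4 = 0.625
```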

System and Infrastructure

GPUs

Scaling is what matters, but everyone is bottlenecked by compute:

GPUs are expensive and scarce
physical limitations (e.g., communications between GPUs)

Importance of resource allocation and optimized pipelines

GPUs are massively parallel, optimized for throughput: same instruction applied on all threads but different inputs; fast matrix multiplication

compute has improved much faster over time than memory and communication; the bottleneck for GPUs is not performing the computation but keeping the processor fed with data¹⁵

memory hierarchy: closer to cores ⇒ faster but less memory; and vice versa

Metric (50% is great):

\[\text{Model FLOP Utilization (MFU)} = \frac{\text{observed throughput}}{\text{theoretical best for that GPU}}\]
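
MFU can be estimated from measured training throughput. The sketch below assumes the common approximation of about 6 FLOPs per parameter per training token (not stated in these notes), and the throughput and peak-FLOP numbers are made up rather than taken from any real GPU.

```python
def model_flop_utilization(tokens_per_second, n_params, peak_flops_per_second):
    """Observed model FLOP/s divided by the hardware's theoretical peak."""
    # Assumption: a training step costs roughly 6 * N FLOPs per token.
    observed_flops_per_second = 6 * n_params * tokens_per_second
    return observed_flops_per_second / peak_flops_per_second

# Hypothetical numbers: a 7B model at 10,000 tokens/s on a 1e15 FLOP/s device.
mfu = model_flop_utilization(
    tokens_per_second=10_000,
    n_params=7e9,
    peak_flops_per_second=1e15,
)
```

With these made-up inputs the utilization comes out at about 42%.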

Low precision

fewer bits ⇒ faster communication & lower memory consumption
for deep learning, exact numerical precision is not that important, except for exponentials, normalization and weight updates
- matrix multiplications can use bf16 instead of fp32
for training, use automatic mixed precision (AMP):
- weights stored in fp32, converted to bf16 before computation
- activations in bf16 ⇒ main memory gains
- (only) matrix multiplication in bf16 ⇒ speed gains
- gradients in bf16 ⇒ memory gains
- master weights updated in fp32 (full precision)
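
Why the weight update stays in fp32 can be seen by emulating the rounding in software. The sketch below is an illustration, not how any framework implements AMP: `to_bf16` and `to_fp32` are hypothetical helpers that round a Python float to the nearest bf16 / fp32 value (NaN, infinity and overflow are not handled), and the update size is made up.

```python
import struct

def to_fp32(x):
    """Round a Python float (fp64) to the nearest fp32 value."""
    return struct.unpack(">f", struct.pack(">f", x))[0]

def to_bf16(x):
    """Round to bf16: keep the top 16 bits of the fp32 pattern (nearest-even)."""
    bits = struct.unpack(">I", struct.pack(">f", x))[0]
    rounding_bias = 0x7FFF + ((bits >> 16) & 1)
    bits = ((bits + rounding_bias) >> 16) << 16
    return struct.unpack(">f", struct.pack(">I", bits & 0xFFFFFFFF))[0]

update = 0.001     # hypothetical small weight update
steps = 1000

# Weight kept only in bf16: every update is rounded away.
bf16_weight = 1.0
for _ in range(steps):
    bf16_weight = to_bf16(bf16_weight + update)

# Master weight kept in fp32: the updates accumulate.
master_weight = 1.0
for _ in range(steps):
    master_weight = to_fp32(master_weight + update)

compute_copy = to_bf16(master_weight)   # bf16 copy used for matrix multiplies
```

Near 1.0 the spacing between adjacent bf16 values is 2^-7 ≈ 0.0078, so an update of 0.001 never moves a bf16 weight, while the fp32 master weight ends close to 2.0.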

Operator fusion

Problem: 1) communication is slow; 2) every new PyTorch line moves variables to global memory (e.g. below)

import torch
x = torch.randn(1024)   # example input (hypothetical shape)
x1 = x.cos()   # read from x in global memory, write to x1
x2 = x1.cos()   # read from x1 in global memory, write to x2

Idea: kernel fusion - communicate once, do all the operations, and then communicate it back (e.g., torch.compile)

Tiling

idea: group and order threads to minimize (slow) global memory access, e.g., in matrix multiplication
compute matrix multiplications block by block (in tiles) so that each read is reused (~cache)

e.g., FlashAttention¹⁶ combines kernel fusion, tiling and re-computation (sometimes it’s cheaper to redo a computation than actually reading from memory)
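
The tiling idea for matrix multiplication can be sketched in plain Python. This only shows the loop structure: on a GPU each pair of tiles would be loaded once into fast on-chip memory and reused for every output element in the block, which pure Python cannot demonstrate. The tile size and matrices are arbitrary.

```python
def matmul_naive(a, b):
    """Reference implementation: one dot product per output element."""
    n, k, m = len(a), len(b), len(b[0])
    return [[sum(a[i][p] * b[p][j] for p in range(k)) for j in range(m)]
            for i in range(n)]

def matmul_tiled(a, b, tile=2):
    """Same result, computed one (tile x tile) block at a time."""
    n, k, m = len(a), len(b), len(b[0])
    c = [[0.0] * m for _ in range(n)]
    for i0 in range(0, n, tile):            # block of output rows
        for j0 in range(0, m, tile):        # block of output columns
            for p0 in range(0, k, tile):    # block of the shared dimension
                # The a-tile and b-tile touched below are what a GPU kernel
                # would keep in fast memory while filling this output block.
                for i in range(i0, min(i0 + tile, n)):
                    for j in range(j0, min(j0 + tile, m)):
                        acc = c[i][j]
                        for p in range(p0, min(p0 + tile, k)):
                            acc += a[i][p] * b[p][j]
                        c[i][j] = acc
    return c

a = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
b = [[1, 0, 2], [0, 1, 3], [4, 5, 6]]
same = matmul_tiled(a, b) == matmul_naive(a, b)
```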

Parallelization

Problem: big models can not fit on one GPU; want to use as many GPUs as possible to make training run fast
Idea: split memory and compute across GPUs
Background: to naively train a model with P billion parameters, at least 16*P GB of DRAM is needed

4*P GB for model weights (4 bytes (fp32) for every parameter)
2*4*P GB for the optimizer (e.g. Adam, need to store both the mean and variance of every parameter)
4*P GB for gradients (for backpropagation)

hence 112 GB needed for 7B model
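
The memory breakdown above as a small calculator. It covers only weights, Adam state and gradients in fp32, as in the notes; activations are not included. The function name is hypothetical.

```python
def naive_training_memory_gb(params_billion, bytes_per_value=4):
    """Memory (GB) for fp32 weights, Adam state (mean + variance), gradients."""
    weights = bytes_per_value * params_billion
    optimizer = 2 * bytes_per_value * params_billion
    gradients = bytes_per_value * params_billion
    return {
        "weights": weights,
        "optimizer": optimizer,
        "gradients": gradients,
        "total": weights + optimizer + gradients,
    }

memory_7b = naive_training_memory_gb(7)   # total: 112 GB
```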

Naive data parallelization (uses GPUs in parallel but gives no memory gains; a big model still does not fit):

copy model and optimizer on each GPU
split data, have each GPU working on the same model but different set of data
communicate and reduce (sum) the gradients
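
The three steps above can be simulated on one machine. The sketch uses a one-parameter linear model and a toy dataset (both made up); each "GPU" is just a list shard, and the all-reduce is a plain average, which matches the full-batch gradient when shards have equal size.

```python
def local_gradient(w, shard):
    """Mean gradient of the squared error (w*x - y)^2 over one data shard."""
    return sum(2 * (w * x - y) * x for x, y in shard) / len(shard)

def all_reduce_mean(values):
    """Stand-in for the collective that sums gradients and divides by #GPUs."""
    return sum(values) / len(values)

# Hypothetical dataset following y = 2x, split across two simulated GPUs.
data = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0), (4.0, 8.0)]
n_gpus = 2
shards = [data[i::n_gpus] for i in range(n_gpus)]

w = 0.0                                           # same weight copy on every GPU
local_grads = [local_gradient(w, shard) for shard in shards]
synced_grad = all_reduce_mean(local_grads)        # identical on every GPU
full_batch_grad = local_gradient(w, data)         # single-GPU reference

learning_rate = 0.01
w_after_step = w - learning_rate * synced_grad    # every replica stays in sync
```

Each replica still stores the full model and optimizer state, which is why this scheme gives speed but no memory savings.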

Split memory
idea: sharding - each GPU holds and updates only a subset of the weights, then communicates them to the others before the next step¹⁷

Model parallelism
problem: data parallelism only works if batch size $\geq$ number of GPUs
idea: have every GPU take care of applying specific parameters (rather than updating)

pipeline parallel: every GPU has access to different layers¹⁸
tensor parallel: split matrices (inside a layer / between GPUs)¹⁹

Architecture sparsity
idea: models are huge ⇒ not every data point needs to go through every parameter²⁰
e.g., Mixture-of-Experts (MoE): use a selector layer to have only some parameters “active” for some types of the data points; have different GPUs contain the parameters required for different data points
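
The selector-layer idea can be sketched as top-k routing. Everything here is illustrative: the experts are toy scalar functions, the router logits are fixed numbers rather than the output of a learned layer, and renormalizing the top-k probabilities is one common choice among several.

```python
import math

def softmax(logits):
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def top_k_route(router_logits, k=2):
    """Pick the k highest-scoring experts and renormalize their weights."""
    probs = softmax(router_logits)
    top = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    norm = sum(probs[i] for i in top)
    return [(i, probs[i] / norm) for i in top]

def moe_forward(x, router_logits, experts, k=2):
    """Weighted sum of the selected experts; the others are never evaluated."""
    return sum(w * experts[i](x) for i, w in top_k_route(router_logits, k))

calls = []  # records which experts actually ran

def make_expert(name, fn):
    def expert(x):
        calls.append(name)
        return fn(x)
    return expert

# Hypothetical experts and router scores for a single input.
experts = [
    make_expert("e0", lambda x: x + 1.0),
    make_expert("e1", lambda x: 2.0 * x),
    make_expert("e2", lambda x: -x),
    make_expert("e3", lambda x: x * x),
]
router_logits = [0.1, 3.0, -1.0, 2.0]
routes = top_k_route(router_logits)
y = moe_forward(3.0, router_logits, experts)
```

Only two of the four experts run for this input, so the parameters of the other two could live on different GPUs without being touched.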

References

1. Guilherme Penedo et al. The FineWeb Datasets: Decanting the Web for the Finest Text Data at Scale. arXiv:2406.17557 [cs.CL]. 2024.
2. Luca Soldaini et al. Dolma: an Open Corpus of Three Trillion Tokens for Language Model Pretraining Research. arXiv:2402.00159 [cs.CL]. 2024.
3. Leo Gao et al. The Pile: An 800GB Dataset of Diverse Text for Language Modeling. arXiv:2101.00027 [cs.CL]. 2020.
4. Jared Kaplan et al. Scaling Laws for Neural Language Models. arXiv:2001.08361 [cs.LG]. 2020.
5. Jordan Hoffmann et al. Training Compute-Optimal Large Language Models. arXiv:2203.15556 [cs.CL]. 2022.
6. Andreas Köpf et al. OpenAssistant Conversations – Democratizing Large Language Model Alignment. arXiv:2304.07327 [cs.CL]. 2023.
7. Yann Dubois et al. AlpacaFarm: A Simulation Framework for Methods that Learn from Human Feedback. arXiv:2305.14387 [cs.LG]. 2023.
8. Chunting Zhou et al. LIMA: Less Is More for Alignment. arXiv:2305.11206 [cs.CL]. 2023.
9. Prasann Singhal et al. A Long Way to Go: Investigating Length Correlations in RLHF. arXiv:2310.03716 [cs.CL]. 2023.
10. Shibani Santurkar et al. Whose Opinions Do Language Models Reflect? arXiv:2303.17548 [cs.CL]. 2023.
11. Nisan Stiennon et al. Learning to summarize from human feedback. arXiv:2009.01325 [cs.CL]. 2020.
12. Dan Hendrycks et al. Measuring Massive Multitask Language Understanding. arXiv:2009.03300 [cs.CY]. 2020.
13. Long Ouyang et al. Training language models to follow instructions with human feedback. arXiv:2203.02155 [cs.CL]. 2022.
14. Wei-Lin Chiang et al. Chatbot Arena: An Open Platform for Evaluating LLMs by Human Preference. arXiv:2403.04132 [cs.AI]. 2024.
15. Andrei Ivanov et al. Data Movement Is All You Need: A Case Study on Optimizing Transformers. arXiv:2007.00072 [cs.LG]. 2020.
16. Tri Dao et al. FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness. arXiv:2205.14135 [cs.LG]. 2022.
17. Samyam Rajbhandari et al. ZeRO: Memory Optimizations Toward Training Trillion Parameter Models. arXiv:1910.02054 [cs.LG]. 2019.
18. Yanping Huang et al. GPipe: Efficient Training of Giant Neural Networks using Pipeline Parallelism. arXiv:1811.06965 [cs.CV]. 2018.
19. Mohammad Shoeybi et al. Megatron-LM: Training Multi-Billion Parameter Language Models Using Model Parallelism. arXiv:1909.08053 [cs.CL]. 2019.
20. William Fedus, Jeff Dean, Barret Zoph. A Review of Sparse Expert Models in Deep Learning. arXiv:2209.01667 [cs.LG]. 2022.
