ucb_agentic_ai

Lecture 01: LLM Agents Overview

Link to lecture recording on YouTube

Date: 2025-09-15

Speaker: Yann Dubois

Speaker’s Social Profile: Website / Google Scholar / GitHub / LinkedIn / X (Twitter)

Education:

Work:

Notes

General LLM Training Pipeline

| Stage | Purpose | Task | Data | Duration | Compute cost | Bottleneck | Remarks |
|---|---|---|---|---|---|---|---|
| Pretraining | teach the model everything in the world | predict next word | any reasonable data on the Internet; >10T tokens; >20B unique webpages | months | >$10M | data and compute | key since GPT-2 (2019) |
| Reasoning RL (only for reasoning models) | teach the model to reason | think on questions with objective answers or ground truth, and answer correctly | ~1M problems; any hard task with verifiable answer | weeks | >$1M | optimizing for objective truth | important since o1 (2024) |
| Classic post-training / Reinforcement Learning from Human Feedback (RLHF) | steer the model to be useful on real-world tasks | maximize answer preferences of humans | ~100K problems | days | >$100K | quality of data and evaluation | important since ChatGPT (2022) |

The speaker usually bundles the second and third stages together as post-training.
(Numbers are approximate and drawn from different open-source projects.)

Five things to consider when training an LLM

Until 2023, most of academia focused on the first two. In practice, the last three are what matter.

LLM Specializing Pipeline

| Stage | Purpose | Data | Duration | Compute cost | Bottleneck |
|---|---|---|---|---|---|
| Prompting | interact and specialize the model for use case | 0 | hours | 0 | evaluation |
| Finetuning | 2nd stage of post-training to domain specific data | ~10K-100K problems | days | ~$10K-100K | quality of data and evaluation |

More about Pre-training

Steps: tokenize → forward → predict probability of next token → sample → detokenize
(last two steps only happen at inference time)

Example: a simple language model: N-grams
to predict the word after “the grass is __”, take all occurrences of “the grass is __” on Wikipedia, and estimate the probability as:

\[P(X \mid \text{"the grass is"}) = \frac{\text{count}(\text{"the grass is } X \text{"})}{\text{count}(\text{"the grass is"})}\]

problems: 1) huge memory requirement; 2) cannot generalize, as most sentences / pieces of code are unique
solution: neural networks
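The count-based estimate above can be sketched in a few lines. This is a minimal illustration on a made-up toy corpus, not the lecture's code; the function names `train_ngram` and `next_word_probs` are hypothetical. With `n=4`, the context is the previous three words.

```python
from collections import Counter, defaultdict

def train_ngram(tokens, n=4):
    """Count how often each word follows each (n-1)-word context."""
    counts = defaultdict(Counter)
    for i in range(len(tokens) - n + 1):
        context = tuple(tokens[i:i + n - 1])
        counts[context][tokens[i + n - 1]] += 1
    return counts

def next_word_probs(counts, context):
    """P(X | context) = count(context followed by X) / count(context)."""
    followers = counts.get(tuple(context))
    if not followers:
        return {}  # unseen context: a pure count model cannot generalize
    total = sum(followers.values())
    return {word: c / total for word, c in followers.items()}

# Toy corpus (made up for illustration).
corpus = ("the grass is green . the grass is green . "
          "the grass is wet . the sky is blue .").split()
model = train_ngram(corpus, n=4)
probs = next_word_probs(model, ["the", "grass", "is"])   # green: 2/3, wet: 1/3
unseen = next_word_probs(model, ["the", "sea", "is"])    # {} (never observed)
```

The empty result for the unseen context shows the generalization problem directly, and the table of counts grows with every distinct context, which is the memory problem.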

Neural Language Models

sentence → split into tokens → associate each token with a word embedding (vector representation of the word) → pass through a neural network (think of it as a nonlinear aggregator of these vectors) → output a vector representation of the context → linear layer → softmax → probability distribution over the next word → optimize the cross-entropy loss by backpropagating and tuning the weights
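The last steps of that chain (softmax, then cross-entropy against the true next word) can be written out directly. A minimal sketch with a hypothetical three-word vocabulary and made-up logits; a real model would produce the logits from the network and use an autodiff library for backpropagation.

```python
import math

def softmax(logits):
    """Turn raw scores into a probability distribution (shifted for numerical stability)."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(logits, target_index):
    """Negative log-probability assigned to the true next token."""
    return -math.log(softmax(logits)[target_index])

vocab = ["green", "wet", "blue"]   # hypothetical vocabulary
logits = [2.0, 1.0, 0.0]           # made-up output of the final linear layer
probs = softmax(logits)
loss_if_green = cross_entropy(logits, vocab.index("green"))
loss_if_blue = cross_entropy(logits, vocab.index("blue"))
```

The loss is small when the model already puts high probability on the true next word and large otherwise, which is the signal training uses to tune the weights.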

Pre-training Data

Idea: to use all of the clean internet
note: internet is dirty and not representative of what we want to ship to users or optimize model on

Practice:

  1. download all of the internet (Common Crawl): 250 billion pages, > 1 PB (= 10^6 GB), stored as WARC files
  2. text extraction from HTML (computationally expensive to clean and extract data; challenges: how to deal with JavaScript, boilerplate code, math etc.)
  3. filter undesirable content
  4. deduplicate data (e.g., all headers/footers/menu in forums are the same); idea is to not train too many times on the exact same data
  5. heuristic filtering: remove low quality documents (evidence: number of words, word length, outlier/dirty tokens)
  6. model-based filter (one idea is distribution matching; e.g., every page referenced on Wikipedia likely to be high quality)
  7. data mix; classify data categories (code/book/entertainment); re-weight domains using scaling law to get high downstream performance
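Steps 4 and 5 of the list above (deduplication and heuristic filtering) can be illustrated with a small sketch. The thresholds, the normalization, and the function names are assumptions chosen for the example; production pipelines use fuzzy deduplication and far more elaborate rules.

```python
import hashlib

def heuristic_ok(text, min_words=8, max_mean_word_len=12):
    """Cheap quality heuristics: enough words, and no abnormally long 'words'."""
    words = text.split()
    if len(words) < min_words:
        return False
    mean_len = sum(len(w) for w in words) / len(words)
    return mean_len <= max_mean_word_len

def dedup_and_filter(docs):
    """Drop exact duplicates (after light normalization), then apply the heuristics."""
    seen = set()
    kept = []
    for text in docs:
        normalized = " ".join(text.lower().split())
        digest = hashlib.sha256(normalized.encode("utf-8")).hexdigest()
        if digest in seen:
            continue  # already saw this document
        seen.add(digest)
        if heuristic_ok(normalized):
            kept.append(text)
    return kept

docs = [
    "Home | About | Contact",                                               # too short
    "The grass is green because chlorophyll absorbs red and blue light.",
    "the grass is green because   chlorophyll absorbs red and blue light.",  # duplicate
    " ".join(["zzzzzzzzzzzzzzzzzzzz"] * 10),                                # outlier tokens
]
cleaned = dedup_and_filter(docs)
```

Only the second document survives: the first is too short, the third is a duplicate once case and whitespace are normalized, and the fourth fails the word-length check.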

also: keep a second distribution of higher-quality data (e.g., Wikipedia, arXiv); the idea is to fine-tune or optimize on this high-quality data after pretraining so that the model learns to be as good as possible

Data processing steps of the FineWeb [1] datasets

| Steps | Size reduction |
|---|---|
| Simple document filtering (repetition, length etc.) | 200T → 36T |
| Deduplicate (remove text duplicated >100 times) | 36T → 20T |
| JavaScript, lorem ipsum removal | 20T → 18.6T |
| Additional filters | 18.5T → 15T |

General observations:

Mid-training Data

Pre / mid-training Data: the key to practical LLM

A lot of research has been done, and more remains to be done:

Secrecy is due to 1) competitive dynamics: disclosing the data makes a model easier to replicate, so companies do not want to say what they have been training on; 2) copyright liability

Common academic datasets: C4 (150B tokens, 800 GB), Dolma [2] (3T tokens), The Pile [3] (280B tokens), FineWeb [1] (15T tokens)

Compute

Empirical evidence: for any type of data and model, what matters most is how much compute is spent on training (both how much data is put into the model and the size of the model)
performance can be predicted well from compute using scaling laws: the minimum loss achieved decreases linearly with compute when both are plotted on log scales [4]

Scaling laws: tuning

example: comparing Transformers with LSTMs, Transformers have both a better constant and a better scaling rate (slope); the scaling rate matters more than the constant, since the question is how the model will perform when it is much larger
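Why the slope matters more than the constant can be seen by fitting a straight line in log-log space and extrapolating. The sketch below uses made-up power-law data for two hypothetical architectures, A (better constant) and B (better slope); the numbers are purely illustrative and are not measurements from the lecture or from the cited paper.

```python
import math

def fit_power_law(compute, loss):
    """Least-squares line in log-log space: log(loss) = intercept + slope * log(compute)."""
    xs = [math.log(c) for c in compute]
    ys = [math.log(l) for l in loss]
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
             / sum((x - mean_x) ** 2 for x in xs))
    intercept = mean_y - slope * mean_x
    return intercept, slope

def predict(intercept, slope, compute):
    """Extrapolate the fitted line to a new compute budget."""
    return math.exp(intercept + slope * math.log(compute))

# Synthetic small-scale runs (arbitrary compute units, made-up losses).
compute = [1e3, 1e4, 1e5, 1e6]
loss_a = [10.0 * c ** -0.05 for c in compute]   # better constant, worse slope
loss_b = [20.0 * c ** -0.08 for c in compute]   # worse constant, better slope

fit_a = fit_power_law(compute, loss_a)
fit_b = fit_power_law(compute, loss_b)
a_small, b_small = predict(*fit_a, 1e6), predict(*fit_b, 1e6)
a_large, b_large = predict(*fit_a, 1e12), predict(*fit_b, 1e12)
```

Within the measured range A has the lower loss, but the extrapolation to a much larger budget favors B, which is the situation the Transformer versus LSTM comparison describes.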

Insights from the Chinchilla paper [5]: to allocate training resources optimally (model size vs. data), use about 20 training tokens per parameter; note that this does not account for inference cost, so larger ratios (>150:1) are used in practice
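As a worked example of the 20:1 rule, the sketch below computes token budgets for a 7B-parameter model. The 6 FLOPs per parameter per token estimate of training cost is a widely used approximation and an assumption here; it is not stated in the notes.

```python
import math

def token_budget(n_params, tokens_per_param=20):
    """Training tokens implied by a tokens-per-parameter ratio."""
    return n_params * tokens_per_param

def training_flops(n_params, n_tokens):
    """Rough training cost, ASSUMING ~6 FLOPs per parameter per token."""
    return 6 * n_params * n_tokens

n_params = 7e9
chinchilla_tokens = token_budget(n_params)          # 20:1  -> 1.4e11 tokens
overtrained_tokens = token_budget(n_params, 150)    # 150:1 -> 1.05e12 tokens
chinchilla_flops = training_flops(n_params, chinchilla_tokens)
```

Training well past the 20:1 point costs more compute up front but yields a smaller model for a given quality, which is cheaper to serve.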

“the only thing that matters in the long run is the leveraging of computation”

– Professor Richard Sutton, The Bitter Lesson

Do not spend time overcomplicating things: do simple things and make sure that they scale.

More about Post-training

Problem: language modeling is not what we want - it does not assist users

Classic post-training (also called instruction following or alignment)
Reasoning: test-time scaling with more compute after training

Post-training Methods

Task: “alignment” - we want the LLM to follow user instructions and the designer’s desires (e.g., moderation)

Background:

Idea: finetune the pre-trained LLM on a small amount of desired data

Supervised finetuning (SFT)
Idea: finetune the LLM with language modeling (“next word prediction”) on the desired answers (“supervised”)
Data collection: ask humans [6]; use LLMs to scale data collection [7]; synthetic data generation is now a whole field of its own
Data quantity: ~10k examples are sufficient for learning style and instruction following [8]; the knowledge of format is already in the pre-trained LLM, which just needs to specialize to one “type of user”
What to learn: instruction following; desired format or style; tool use; early reasoning; anything where you can get good input / output pairs
Remarks: seen either as a final stage of training or as a preparation for RL
Problem: SFT is behavior cloning of humans:
  1. bound by human abilities: humans may prefer things that they are unable to generate
  2. hallucination: cloning a correct answer teaches the model to make up plausible-sounding references

Reinforcement learning (RL)
Idea: maximize desired behavior rather than clone it
Data collection (what rewards to maximize): rule-based rewards; a reward model trained to predict human preference (RLHF); LLM as a judge
Remarks: infra is key, since sampling is a bottleneck when multiple outputs per problem are sampled; this is especially true for agents, because of 1) long rollouts and 2) slow environment feedback; engines are colocated to avoid communication overhead

RL from human feedback (RLHF)
Idea: maximize human preference rather than clone human behavior
Data collection: for each instruction, generate 2 answers from a pretty good model; ask labelers to select their preferred answer; finetune the model to generate more preferred answers (PPO or DPO)
Challenges of human data: slow and expensive (extremely detailed rubrics have to be written to tell humans what counts as a good/bad answer); hard to focus on content correctness rather than form (e.g., length) [9]; annotator distribution shifts behavior (e.g., different views on many things) [10]; crowdsourcing ethics
Idea: replace human preferences with LLM preferences [7]
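To make "a reward model trained to predict human preference" concrete, here is a minimal sketch of the pairwise loss commonly used for such models (the Bradley-Terry formulation). That the lecture has exactly this loss in mind is an assumption; the reward values below are made up.

```python
import math

def preference_loss(reward_chosen, reward_rejected):
    """-log(sigmoid(r_chosen - r_rejected)), written in a numerically stable form."""
    margin = reward_chosen - reward_rejected
    return max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))

# Made-up reward-model scores for the answer the labeler preferred vs. the other one.
loss_agrees = preference_loss(2.0, 0.0)      # model ranks the preferred answer higher
loss_undecided = preference_loss(0.0, 0.0)   # equals log(2)
loss_disagrees = preference_loss(0.0, 2.0)   # model ranks the preferred answer lower
```

Minimizing this loss over many labeled pairs pushes the reward model to score preferred answers higher; the resulting scores can then serve as the reward that RL maximizes.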

SFT improves on pre-training, and reinforcement learning makes performance even better [11]

Evaluation

Quantify progress towards desired task to:

Close-ended
Idea: automatically verify answers when problems have only a few possible answers
Example: MMLU [12]
Challenges: sensitive to prompting (inconsistent answers from different ways of prompting); train & test contamination (the model appears to be much better if trained on the eval)

Open-ended
Idea: ask for annotator preference between answers
Challenges: large diversity of use cases [13]; hard to automate since tasks are open-ended

Open-ended, human evaluation
Idea: have users interact (blinded) with two chatbots and rate which is better
Example: LMArena [14]
Challenges: costly; slow

Open-ended, LLM evaluation
Idea: use an LLM instead of a human:
  1. generate an output with a baseline (a human or a model) and with the model to evaluate
  2. ask another LLM which output is better
  3. the win rate is how often the model's answer is judged better than the baseline's
an LLM can be a good judge: cheaper and highly correlated with LMArena
Example: AlpacaEval
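The win-rate metric from the LLM-evaluation row reduces to a simple ratio. A minimal sketch, assuming the judge's verdicts have already been collected as the strings `"model"` or `"baseline"` (a hypothetical encoding); the verdicts below are made up.

```python
def win_rate(verdicts):
    """Fraction of comparisons in which the judge preferred the evaluated model."""
    if not verdicts:
        raise ValueError("need at least one comparison")
    wins = sum(1 for v in verdicts if v == "model")
    return wins / len(verdicts)

# Made-up judge verdicts for four prompts.
verdicts = ["model", "baseline", "model", "model"]
rate = win_rate(verdicts)   # 3 wins out of 4 comparisons
```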

System and Infrastructure

GPUs

Scaling is what matters, but everyone is bottlenecked by compute:

Importance of resource allocation and optimized pipelines

GPUs are massively parallel, optimized for throughput: same instruction applied on all threads but different inputs; fast matrix multiplication

compute has improved much faster over time than memory and communication; the bottleneck for GPUs is not performing the computation but keeping the processors fed with data [15]

memory hierarchy: closer to cores ⇒ faster but less memory; and vice versa

Metric (50% is great):

\[\text{Model FLOP Utilization (MFU)} = \frac{\text{observed throughput}}{\text{theoretical best for that GPU}}\]
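A rough way to evaluate the ratio above for a training run: estimate the achieved FLOP/s from the token throughput and divide by the hardware peak. The 6 FLOPs per parameter per token estimate, the throughput, and the peak figure are all assumptions for illustration, not numbers from the lecture.

```python
def model_flop_utilization(n_params, tokens_per_second, peak_flops_per_second):
    """MFU = achieved FLOP/s / peak FLOP/s, ASSUMING ~6 FLOPs per parameter per token."""
    achieved = 6 * n_params * tokens_per_second
    return achieved / peak_flops_per_second

mfu = model_flop_utilization(
    n_params=7e9,                   # 7B-parameter model
    tokens_per_second=3500,         # hypothetical per-GPU training throughput
    peak_flops_per_second=312e12,   # assumed peak of the accelerator
)
```

With these made-up inputs the utilization comes out a little under 50%, which by the rule of thumb above would count as a good result.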

Low precision

Operator fusion

Problem: 1) communication is slow; 2) every new PyTorch line moves variables to global memory (e.g. below)

import torch
x = torch.randn(1024)   # example input (shape chosen arbitrarily)
x1 = x.cos()    # read x from global memory, write x1
x2 = x1.cos()   # read x1 from global memory, write x2

Idea: kernel fusion - communicate once, do all the operations, and then communicate it back (e.g., torch.compile)
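The saving from fusion can be illustrated without a GPU by counting memory traffic. The sketch below is a toy model in plain Python: `GlobalMemory` is a hypothetical stand-in that only counts element reads and writes, and it is not how `torch.compile` is implemented.

```python
import math

class GlobalMemory:
    """Toy stand-in for GPU global memory that counts element reads and writes."""
    def __init__(self):
        self.reads = 0
        self.writes = 0

    def load(self, array):
        self.reads += len(array)
        return list(array)

    def store(self, values):
        self.writes += len(values)
        return list(values)

def unfused_cos_cos(mem, x):
    """Two separate 'kernels': the intermediate x1 makes a round trip through memory."""
    x1 = mem.store([math.cos(v) for v in mem.load(x)])
    return mem.store([math.cos(v) for v in mem.load(x1)])

def fused_cos_cos(mem, x):
    """One fused 'kernel': read once, apply both operations, write once."""
    return mem.store([math.cos(math.cos(v)) for v in mem.load(x)])

x = [0.1 * i for i in range(1000)]
mem_unfused, mem_fused = GlobalMemory(), GlobalMemory()
y_unfused = unfused_cos_cos(mem_unfused, x)
y_fused = fused_cos_cos(mem_fused, x)
traffic_unfused = mem_unfused.reads + mem_unfused.writes   # 4000 element transfers
traffic_fused = mem_fused.reads + mem_fused.writes         # 2000 element transfers
```

Both versions compute the same values, but the fused one moves half as much data, which is what matters when the bottleneck is memory rather than arithmetic.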

Tiling

e.g., FlashAttention [16] combines kernel fusion, tiling and re-computation (sometimes it’s cheaper to redo a computation than to read the result from memory)
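The general idea behind tiling is to work on blocks small enough to stay in fast memory and to reuse them before moving on. The sketch below shows a tiled matrix multiplication in plain Python as an illustration of the loop structure only; it is not FlashAttention, and the tile size is arbitrary.

```python
def matmul_tiled(a, b, tile=2):
    """Multiply a (n x k) by b (k x m), visiting the operands one tile at a time."""
    n, k, m = len(a), len(b), len(b[0])
    out = [[0.0] * m for _ in range(n)]
    for i0 in range(0, n, tile):            # tile of output rows
        for j0 in range(0, m, tile):        # tile of output columns
            for k0 in range(0, k, tile):    # tile along the shared dimension
                for i in range(i0, min(i0 + tile, n)):
                    for j in range(j0, min(j0 + tile, m)):
                        acc = out[i][j]
                        for kk in range(k0, min(k0 + tile, k)):
                            acc += a[i][kk] * b[kk][j]
                        out[i][j] = acc
    return out

a = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
b = [[1, 0, 2], [0, 1, 0], [3, 0, 1]]
product = matmul_tiled(a, b, tile=2)
```

On a GPU each tile would be loaded into on-chip memory once and reused for every multiply-add inside the three inner loops, instead of being fetched from global memory each time.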

Parallelization

Problem: big models cannot fit on one GPU; we also want to use as many GPUs as possible to make training run fast
Idea: split memory and compute across GPUs
Background: naively training a model with P billion parameters needs at least 16*P GB of DRAM (16 bytes per parameter)

hence 112 GB needed for 7B model
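The 112 GB figure follows from 16 bytes per parameter. One common breakdown, which is an assumption here since the notes give only the total, is fp32 weights (4 bytes), fp32 gradients (4 bytes), and two Adam moment estimates (8 bytes).

```python
# ASSUMED breakdown of the 16 bytes per parameter; the notes only state the total.
BYTES_PER_PARAM = {
    "weights_fp32": 4,
    "gradients_fp32": 4,
    "adam_moments_fp32": 8,   # first and second moment, 4 bytes each
}

def naive_training_memory_gb(n_params):
    """Memory for weights, gradients and optimizer state (activations not included)."""
    return n_params * sum(BYTES_PER_PARAM.values()) / 1e9

memory_7b = naive_training_memory_gb(7e9)   # 112.0 GB
```

Activations come on top of this, so the real requirement is higher.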

Naive data parallelization (use parallel GPU but no memory gains, big model still does not fit):

  1. copy model and optimizer on each GPU
  2. split data, have each GPU working on the same model but different set of data
  3. communicate and reduce (sum) the gradients
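The three steps can be simulated on a single machine. The sketch below uses a one-parameter model `y = w * x` with squared error and made-up data; the sum over `local_grads` stands in for the all-reduce. It is a conceptual illustration, not a distributed implementation.

```python
def grad_sum(w, batch):
    """Sum over the batch of d/dw (w*x - y)^2 = 2 * (w*x - y) * x."""
    return sum(2 * (w * x - y) * x for x, y in batch)

def data_parallel_step(w, data, n_gpus, lr=0.01):
    """One SGD step where each simulated GPU holds a full copy of w and a data shard."""
    shards = [data[rank::n_gpus] for rank in range(n_gpus)]   # step 2: split the data
    local_grads = [grad_sum(w, shard) for shard in shards]    # each replica, same w
    total_grad = sum(local_grads)                             # step 3: reduce (sum)
    return w - lr * total_grad / len(data)                    # identical update everywhere

data = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0), (4.0, 8.0)]       # made-up (x, y) pairs
w_parallel = data_parallel_step(0.0, data, n_gpus=2)
w_single = data_parallel_step(0.0, data, n_gpus=1)
```

The two-GPU result matches the single-GPU result, which is the point of data parallelism: the same update, computed faster, but with every GPU still storing the whole model.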

Split memory
idea: sharding - each GPU updates a subset of the weights and holds them, then communicates them before the next step [17]
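A single-machine sketch of the sharding idea: each simulated GPU updates only its own slice of the weights, and the slices are then concatenated, which stands in for the communication step. The helper names and the plain SGD update are assumptions for illustration; this is not the ZeRO implementation.

```python
def shard_bounds(n_params, n_gpus, rank):
    """Contiguous slice [start, end) of the parameters owned by this rank."""
    per_gpu = -(-n_params // n_gpus)   # ceiling division
    start = min(rank * per_gpu, n_params)
    return start, min(start + per_gpu, n_params)

def sharded_sgd_step(weights, grads, n_gpus, lr=0.1):
    """Each rank updates only its shard; the shards are then gathered back together."""
    shards = []
    for rank in range(n_gpus):
        start, end = shard_bounds(len(weights), n_gpus, rank)
        shards.append([w - lr * g for w, g in zip(weights[start:end], grads[start:end])])
    return [w for shard in shards for w in shard]   # stands in for the all-gather

weights = [1.0, 2.0, 3.0, 4.0, 5.0]
grads = [1.0, 1.0, 1.0, 1.0, 1.0]
updated = sharded_sgd_step(weights, grads, n_gpus=2)
reference = [w - 0.1 * g for w, g in zip(weights, grads)]   # unsharded update
```

The result is identical to the unsharded update, but each GPU only needs to keep optimizer state for its own slice.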

Model parallelism
problem: data parallelism only works if batch size $\geq$ number of GPUs
idea: have every GPU take care of applying specific parameters (rather than updating)

Architecture sparsity
idea: models are huge ⇒ not every data point needs to go through every parameter [20]
e.g., Mixture-of-Experts (MoE): use a selector layer to have only some parameters “active” for some types of the data points; have different GPUs contain the parameters required for different data points
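A minimal sketch of the selector idea with top-1 routing: a linear router scores each expert, and only the best-scoring expert runs for a given input. The experts, router weights, and inputs are toy values; real MoE layers typically also scale the output by the gate probability and balance load across experts, which is omitted here.

```python
def route_top1(scores):
    """Index of the highest-scoring expert."""
    return max(range(len(scores)), key=lambda e: scores[e])

def moe_forward(x, router_weights, experts):
    """Score every expert with a linear router, then run only the selected one."""
    scores = [sum(w_i * x_i for w_i, x_i in zip(w, x)) for w in router_weights]
    chosen = route_top1(scores)
    return chosen, experts[chosen](x)

# Two toy "experts"; on real hardware these could live on different GPUs.
experts = [
    lambda x: [2 * v for v in x],   # expert 0
    lambda x: [v + 1 for v in x],   # expert 1
]
router_weights = [[1.0, 0.0], [0.0, 1.0]]   # made-up router parameters

first = moe_forward([3.0, 1.0], router_weights, experts)    # routed to expert 0
second = moe_forward([0.0, 5.0], router_weights, experts)   # routed to expert 1
```

Each input touches only one expert's parameters, so total parameter count can grow without a matching growth in compute per data point.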

References

  1. Guilherme Penedo et al. The FineWeb Datasets: Decanting the Web for the Finest Text Data at Scale. arXiv:2406.17557 [cs.CL]. 2024.

  2. Luca Soldaini et al. Dolma: an Open Corpus of Three Trillion Tokens for Language Model Pretraining Research. arXiv:2402.00159 [cs.CL]. 2024. 

  3. Leo Gao et al. The Pile: An 800GB Dataset of Diverse Text for Language Modeling. arXiv:2101.00027 [cs.CL]. 2020. 

  4. Jared Kaplan et al. Scaling Laws for Neural Language Models. arXiv:2001.08361 [cs.LG]. 2020. 

  5. Jordan Hoffmann et al. Training Compute-Optimal Large Language Models. arXiv:2203.15556 [cs.CL]. 2022. 

  6. Andreas Köpf et al. OpenAssistant Conversations – Democratizing Large Language Model Alignment. arXiv:2304.07327 [cs.CL]. 2023. 

  7. Yann Dubois et al. AlpacaFarm: A Simulation Framework for Methods that Learn from Human Feedback. arXiv:2305.14387 [cs.LG]. 2023.

  8. Chunting Zhou et al. LIMA: Less Is More for Alignment. arXiv:2305.11206 [cs.CL]. 2023. 

  9. Prasann Singhal et al. A Long Way to Go: Investigating Length Correlations in RLHF. arXiv:2310.03716 [cs.CL]. 2023. 

  10. Shibani Santurkar et al. Whose Opinions Do Language Models Reflect?. arXiv:2303.17548 [cs.CL]. 2023. 

  11. Nisan Stiennon et al. Learning to summarize from human feedback. arXiv:2009.01325 [cs.CL]. 2020. 

  12. Dan Hendrycks et al. Measuring Massive Multitask Language Understanding. arXiv:2009.03300 [cs.CY]. 2020. 

  13. Long Ouyang et al. Training language models to follow instructions with human feedback. arXiv:2203.02155 [cs.CL]. 2022. 

  14. Wei-Lin Chiang et al. Chatbot Arena: An Open Platform for Evaluating LLMs by Human Preference. arXiv:2403.04132 [cs.AI]. 2024. 

  15. Andrei Ivanov et al. Data Movement Is All You Need: A Case Study on Optimizing Transformers. arXiv:2007.00072 [cs.LG]. 2020. 

  16. Tri Dao et al. FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness. arXiv:2205.14135 [cs.LG]. 2022. 

  17. Samyam Rajbhandari et al. ZeRO: Memory Optimizations Toward Training Trillion Parameter Models. arXiv:1910.02054 [cs.LG]. 2019. 

  18. Yanping Huang et al. GPipe: Efficient Training of Giant Neural Networks using Pipeline Parallelism. arXiv:1811.06965 [cs.CV]. 2018. 

  19. Mohammad Shoeybi et al. Megatron-LM: Training Multi-Billion Parameter Language Models Using Model Parallelism. arXiv:1909.08053 [cs.CL]. 2019. 

  20. William Fedus, Jeff Dean, Barret Zoph. A Review of Sparse Expert Models in Deep Learning. arXiv:2209.01667 [cs.LG]. 2022.