Link to lecture recording on YouTube
Date: 2025-09-15
Speaker: Yann Dubois
Speaker’s Social Profile: Website / Google Scholar / GitHub / LinkedIn / X (Twitter)
Education:
Work:
| Stage | Purpose | Task | Data | Duration | Compute cost | Bottleneck | Remarks |
|---|---|---|---|---|---|---|---|
| Pretraining | teach the model everything in the world | predict next word | any reasonable data on the Internet (>10T tokens, >20B unique webpages) | months | >$10M | data and compute | key since GPT-2 (2019) |
| Reasoning RL (only for reasoning models) | teach the model to reason | think on questions with objective answers or ground truth, and answer correctly | ~1M problems, any hard task with verifiable answer | weeks | >$1M | optimizing for objective truth | important since o1 (2024) |
| Classic post-training / Reinforcement Learning from Human Feedback (RLHF) | steer the model to be useful on real-world tasks | maximize answer preferences of humans | ~100K problems | days | >$100K | quality of data and evaluation | important since ChatGPT (2022) |
The speaker usually bundles the 2nd and 3rd stages as post-training
(numbers are approximate from different open-source projects)
Most of academia was focusing on the first two until 2023. In reality, the last three matter in practice
| Stage | Purpose | Data | Duration | Compute cost | Bottleneck |
|---|---|---|---|---|---|
| Prompting | interact with the model and specialize it for a use case | 0 | hours | 0 | evaluation |
| Finetuning | 2nd stage of post-training on domain-specific data | ~10K-100K problems | days | ~$10K-100K | quality of data and evaluation |
Steps: tokenize → forward → predict probability of next token → sample → detokenize
(last two steps only happen at inference time)
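The steps above can be sketched end to end with a toy model. This is a minimal illustration, not how a production LLM is implemented: the vocabulary, the whitespace tokenizer, and the hard-coded logit table in `forward` are all hypothetical stand-ins for a real subword tokenizer and a real neural network.

```python
import math
import random

# Hypothetical toy vocabulary; real LLMs use subword tokenizers with far more entries.
VOCAB = ["<unk>", "the", "grass", "is", "green", "wet"]
TOKEN_TO_ID = {tok: i for i, tok in enumerate(VOCAB)}

def tokenize(text):
    """Whitespace tokenizer: map each word to an integer id (0 for unknown words)."""
    return [TOKEN_TO_ID.get(word, 0) for word in text.lower().split()]

def detokenize(ids):
    """Map ids back to text."""
    return " ".join(VOCAB[i] for i in ids)

def forward(ids):
    """Stand-in for the network: one logit per vocabulary entry.
    The made-up logits depend only on the last token."""
    table = {TOKEN_TO_ID["is"]: [-5.0, -2.0, -3.0, -4.0, 2.0, 1.0]}
    last = ids[-1] if ids else None
    return table.get(last, [0.0] * len(VOCAB))

def softmax(logits):
    """Turn logits into a probability distribution over the next token."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def generate_next(text, rng):
    """tokenize -> forward -> probabilities -> sample -> detokenize."""
    ids = tokenize(text)
    probs = softmax(forward(ids))
    next_id = rng.choices(range(len(VOCAB)), weights=probs, k=1)[0]
    return detokenize(ids + [next_id]), probs

completion, probs = generate_next("the grass is", random.Random(0))
```

During training only the first three steps run: the predicted distribution is compared with the true next token instead of being sampled from.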
Example: a simple language model: N-grams
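in order to predict the word after “the grass is __”, take all occurrences of “the grass is __” on Wikipedia, and predict the probability as:
\[P(\text{word} \mid \text{the grass is}) = \frac{\text{count}(\text{the grass is word})}{\text{count}(\text{the grass is})}\]
problems: 1) huge memory requirement; 2) cannot generalize, as most sentences / code snippets are unique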
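A count-based N-gram model fits in a few lines, which also makes both problems visible. The sketch below assumes a tiny made-up corpus in place of Wikipedia; `train_ngram` and `next_word_probs` are illustrative names, not from the lecture.

```python
from collections import Counter, defaultdict

def train_ngram(corpus, n=4):
    """For every (n-1)-word context, count how often each next word follows it."""
    counts = defaultdict(Counter)
    for sentence in corpus:
        words = sentence.lower().split()
        for i in range(len(words) - n + 1):
            context = tuple(words[i:i + n - 1])
            counts[context][words[i + n - 1]] += 1
    return counts

def next_word_probs(counts, context):
    """P(word | context) = count(context + word) / count(context)."""
    followers = counts.get(tuple(context.lower().split()), Counter())
    total = sum(followers.values())
    return {w: c / total for w, c in followers.items()} if total else {}

# Made-up corpus for illustration only.
corpus = [
    "the grass is green",
    "the grass is green",
    "the grass is wet",
    "the sky is blue",
]
counts = train_ngram(corpus)
seen = next_word_probs(counts, "the grass is")    # "green" twice as likely as "wet"
unseen = next_word_probs(counts, "the lawn is")   # empty: context never observed
```

The table of counts grows with every distinct context (the memory problem), and a context that never appeared verbatim gets no prediction at all (the generalization problem).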
solution: neural networks
sentence → split into tokens → associate each token with a word embedding (vector representation of the word) → pass through a neural network (think of it as a nonlinear aggregator of these vectors) → output a vector representation of the context → linear layer → softmax → probability distribution over the next word → minimize the cross-entropy loss via backpropagation to tune the weights
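The chain above can be made concrete with a deliberately tiny model. This is a sketch under strong assumptions: the "network" is just an average of embeddings followed by `tanh` (a stand-in for a real architecture), the sizes are arbitrary, and only the output layer is updated, using the known gradient of cross-entropy with respect to the logits.

```python
import math
import random

rng = random.Random(0)
VOCAB_SIZE, DIM = 6, 4  # arbitrary toy sizes

# Randomly initialised parameters: token embeddings and the output linear layer.
embeddings = [[rng.gauss(0, 1.0) for _ in range(DIM)] for _ in range(VOCAB_SIZE)]
W_out = [[rng.gauss(0, 0.1) for _ in range(DIM)] for _ in range(VOCAB_SIZE)]

def context_vector(token_ids):
    """Stand-in for the network: average the embeddings, then apply tanh."""
    mean = [sum(embeddings[t][d] for t in token_ids) / len(token_ids) for d in range(DIM)]
    return [math.tanh(v) for v in mean]

def next_token_probs(token_ids):
    """Context vector -> linear layer -> softmax."""
    h = context_vector(token_ids)
    logits = [sum(w * x for w, x in zip(row, h)) for row in W_out]
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps], h

def cross_entropy(token_ids, target_id):
    """Negative log-probability assigned to the true next token."""
    probs, _ = next_token_probs(token_ids)
    return -math.log(probs[target_id])

def sgd_step_output_layer(token_ids, target_id, lr=0.5):
    """One gradient step on W_out only; d(loss)/d(logit_k) = p_k - 1[k == target]."""
    probs, h = next_token_probs(token_ids)
    for k in range(VOCAB_SIZE):
        grad_logit = probs[k] - (1.0 if k == target_id else 0.0)
        for d in range(DIM):
            W_out[k][d] -= lr * grad_logit * h[d]

context, target = [1, 2, 3], 4  # hypothetical ids for "the grass is" -> "green"
loss_before = cross_entropy(context, target)
for _ in range(50):
    sgd_step_output_layer(context, target)
loss_after = cross_entropy(context, target)
```

In a real model backpropagation also updates the embeddings and every layer in between; the update rule for the last layer is the same idea.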
Idea: to use all of the clean internet
note: internet is dirty and not representative of what we want to ship to users or optimize model on
Practice:
also: keep a 2nd distribution of higher-quality data (e.g., Wikipedia, arXiv); the idea is to fine-tune or optimize on this high-quality data after pretraining so that the model learns to be as good as possible
Data processing steps of the FineWeb datasets [1]
| Steps | Size reduction |
|---|---|
| Simple document filtering (repetition, length etc.) | 200T → 36T |
| Deduplicate (remove text duplicated >100 times) | 36T → 20T |
| JavaScript, lorem ipsum removal | 20T → 18.6T |
| Additional filters | 18.5T → 15T |
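The first three rows of the table can be sketched as a small pipeline. Everything here is a simplification under stated assumptions: the thresholds are hypothetical, deduplication is exact-match on normalized text rather than the fuzzy matching used at web scale, and the boilerplate check is a plain substring test.

```python
from collections import Counter

def passes_simple_filters(doc, min_words=5, max_repeat_ratio=0.5):
    """Drop documents that are too short or dominated by one repeated word."""
    words = doc.split()
    if len(words) < min_words:
        return False
    most_common_count = Counter(words).most_common(1)[0][1]
    return most_common_count / len(words) <= max_repeat_ratio

def deduplicate(docs, max_copies=1):
    """Keep at most `max_copies` of each document (exact match after normalization)."""
    seen = Counter()
    kept = []
    for doc in docs:
        key = " ".join(doc.lower().split())
        seen[key] += 1
        if seen[key] <= max_copies:
            kept.append(doc)
    return kept

def contains_boilerplate(doc):
    """Crude check for JavaScript notices and lorem ipsum filler."""
    lowered = doc.lower()
    return "lorem ipsum" in lowered or "javascript" in lowered

def clean_corpus(docs):
    """Apply the stages in the same order as the table."""
    stage1 = [d for d in docs if passes_simple_filters(d)]
    stage2 = deduplicate(stage1)
    return [d for d in stage2 if not contains_boilerplate(d)]

# Made-up documents for illustration only.
docs = [
    "The grass is green after the rain today",
    "the grass is green after the rain today",
    "spam spam spam spam spam spam",
    "too short",
    "Please enable JavaScript to view this page properly",
    "lorem ipsum dolor sit amet consectetur",
]
cleaned = clean_corpus(docs)
```

Each stage is cheap per document, which matters when the input is hundreds of trillions of tokens.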
General observations:
A lot of research has been done, and more remains to be done:
Secrecy due to 1) competitive dynamics: easier to replicate hence companies do not want to tell what they have been training on; 2) copyright liability
Common academic datasets: C4 (150B tokens, 800 GB), Dolma [2] (3T tokens), The Pile [3] (280B tokens), FineWeb [1] (15T tokens)
Empirical evidence: for any type of data and model, what matters most is how much compute is spent on training (both the amount of data fed to the model and the size of the model)
performance can be predicted well from compute with scaling laws: the minimum loss achieved decreases linearly with compute when both are plotted on log scales [4]
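A straight line in log-log space is a power law, so a scaling law can be fitted with ordinary least squares on the logarithms and then extrapolated to a larger budget. The sketch below uses synthetic, made-up measurements generated from a known power law; the function names are illustrative, and real fits usually also include an irreducible-loss term that is omitted here.

```python
import math

def fit_power_law(compute, loss):
    """Least-squares line in log-log space: log(loss) = intercept + slope * log(compute)."""
    xs = [math.log(c) for c in compute]
    ys = [math.log(l) for l in loss]
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
             / sum((x - mean_x) ** 2 for x in xs))
    intercept = mean_y - slope * mean_x
    return slope, intercept

def predict_loss(compute, slope, intercept):
    """Evaluate the fitted line at a new compute budget."""
    return math.exp(intercept + slope * math.log(compute))

# Synthetic small-scale runs generated from loss = 10 * C^(-0.05) (made-up constants).
small_runs = [1e15, 1e16, 1e17, 1e18]
losses = [10 * c ** -0.05 for c in small_runs]

slope, intercept = fit_power_law(small_runs, losses)
extrapolated = predict_loss(1e21, slope, intercept)  # prediction for a 1000x larger run
```

The slope is the scaling rate and `exp(intercept)` is the constant; comparing two architectures amounts to comparing these two numbers.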
Scaling laws: tuning
example: comparison between transformer and LSTM: transformers have better constant and scaling rate (slope); scaling rate matters more than constant since the question is how the model will perform if it is much larger
Insights from the Chinchilla [5] paper: to allocate training resources optimally (model size vs. data), use about 20 training tokens per parameter (20:1); note that this does not consider inference cost, so larger ratios (>150:1) are used in practice
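The tokens-per-parameter rule turns a compute budget into a model size and a dataset size. The sketch below additionally assumes the common approximation that training costs about 6 FLOPs per parameter per token, which is not stated in these notes; with $C \approx 6ND$ and $D = rN$, it follows that $N = \sqrt{C / (6r)}$.

```python
import math

def compute_optimal_allocation(flops, tokens_per_param=20.0, flops_per_param_token=6.0):
    """Split a FLOP budget into (parameters, tokens) for a fixed tokens-per-parameter ratio.

    Assumes training FLOPs ~= flops_per_param_token * n_params * n_tokens.
    """
    n_params = math.sqrt(flops / (flops_per_param_token * tokens_per_param))
    n_tokens = tokens_per_param * n_params
    return n_params, n_tokens

budget = 1e23  # hypothetical training budget in FLOPs
chinchilla_params, chinchilla_tokens = compute_optimal_allocation(budget, tokens_per_param=20)
overtrained_params, overtrained_tokens = compute_optimal_allocation(budget, tokens_per_param=150)
```

For the same budget, the 150:1 setting yields a smaller model trained on more tokens, which is cheaper to serve at inference time.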
“the only thing that matters in the long run is the leveraging of computation”
– Professor Richard Sutton, The Bitter Lesson
Do not spend time overcomplicating: do simple things and make sure that they scale.
Problem: language modeling is not what we want - it does not assist users
Classic post-training (also called instruction following or alignment)
Reasoning: test-time scaling with more compute after training
Task: “alignment” - we want the LLM to follow user instructions and the designer’s desires (e.g., moderation)
Background:
Idea: finetune pre-trained LLM on a little desired data
| Method | Idea | Data collection | Data quantity | What to learn | Remarks | Problem |
|---|---|---|---|---|---|---|
| Supervised finetuning (SFT) | finetune the LLM with language modeling (“next word prediction”) of the desired answers (“supervised”) | ask humans [6]; use LLMs to scale data collection [7]; synthetic data generation is now a whole field of its own | ~10K examples are sufficient for learning style and instruction following [8]; the knowledge of the format is already in the pre-trained LLM, which just needs to specialize to one “type of user” | <ul><li>instruction following</li><li>desired format or style</li><li>tool use</li><li>early reasoning</li><li>anything where you can get good input / output pairs</li></ul> | either seen as<ul><li>a final stage for training</li><li>or a preparation for RL</li></ul> | SFT is behavior cloning of humans:<ol><li>bound by human abilities: humans may prefer things that they are unable to generate</li><li>hallucination: cloning even a correct answer can teach the model to make up plausible-sounding references</li></ol> |
| Reinforcement learning (RL) | maximize desired behavior rather than clone it | what rewards to maximize:<ul><li>rule-based rewards</li><li>reward model trained to predict human preference (RLHF)</li><li>LLM as a judge</li></ul> | <ul><li>infra is key: sampling is a bottleneck since multiple outputs per problem are sampled</li><li>especially for agents: 1) long rollouts; 2) slow environment feedback</li><li>engines are collocated to avoid communication overhead</li></ul> | |||
| RL from human feedback (RLHF) | maximize human preference rather than clone human behavior | <ul><li>for each instruction: generate 2 answers from a pretty good model</li><li>ask labelers to select their preferred answer</li><li>finetune the model to generate more preferred answers (PPO or DPO)</li></ul> | challenges of human data<ul><li>slow and expensive: have to write extremely detailed rubrics to tell humans what is considered a good/bad answer</li><li>hard to focus on content correctness rather than form (e.g., length) [9]</li><li>the annotator distribution shifts model behavior (e.g., different views on many things) [10]</li><li>crowdsourcing ethics</li></ul>idea: replace human preferences with LLM preferences [7] | | | |
SFT improves compared to pre-training, and reinforcement learning makes performance even better [11]
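To make "finetune the model to generate more preferred answers" concrete, here is a minimal sketch of the DPO objective for a single preference pair. It assumes the sequence log-probabilities under the model being trained and under a frozen reference model have already been computed; the numbers in the example are made up, and `beta` is a tunable hyperparameter.

```python
import math

def dpo_loss(logp_chosen, logp_rejected, ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """DPO loss for one (chosen, rejected) answer pair.

    loss = -log sigmoid(beta * [(logp_chosen - ref_logp_chosen)
                                - (logp_rejected - ref_logp_rejected)])
    """
    margin = beta * ((logp_chosen - ref_logp_chosen) - (logp_rejected - ref_logp_rejected))
    # -log(sigmoid(margin)) in a numerically stable form
    return max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))

# Made-up log-probabilities for illustration.
no_change = dpo_loss(-12.0, -15.0, -12.0, -15.0)          # policy equals reference
prefers_chosen = dpo_loss(-10.0, -18.0, -12.0, -15.0)     # policy moved toward the chosen answer
prefers_rejected = dpo_loss(-14.0, -13.0, -12.0, -15.0)   # policy moved toward the rejected answer
```

The loss only falls when the model raises the probability of the preferred answer relative to the reference more than it raises that of the rejected answer.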
Quantify progress towards desired task to:
| Type | Idea | Example | Challenge |
|---|---|---|---|
| Close-ended | automatically verify when problems have only a few possible answers | MMLU [12] | <ul><li>sensitive to prompting: inconsistent answers from different ways of prompting</li><li>train & test contamination: model appears to be much better if trained on the eval</li></ul> |
| Open-ended | ask for annotator preference between answers | see below | <ul><li>large diversity of use cases [13]</li><li>hard to automate since tasks are open-ended</li></ul> |
| | human evaluation: have users interact (blinded) with two chatbots and rate which is better | LMArena [14] | <ul><li>costly</li><li>slow</li></ul> |
| | LLM evaluation: use an LLM instead of a human<ul><li>generate output by a baseline (a human or a model) and by the model to evaluate</li><li>ask another LLM which output is better</li><li>win rate: the fraction of times the answer is better than the baseline</li></ul>an LLM can be a good judge: cheaper and highly correlated with LMArena | AlpacaEval | |
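The win rate in the last row is a simple aggregate over pairwise judgments. The sketch below assumes the judge's verdicts have already been collected as strings; the labels `"model"`, `"baseline"`, and `"tie"` are hypothetical, and counting a tie as half a win is one common convention rather than something stated in these notes.

```python
def win_rate(judgments):
    """Fraction of comparisons won by the evaluated model (ties count as half a win)."""
    if not judgments:
        raise ValueError("need at least one judgment")
    wins = sum(1.0 for j in judgments if j == "model")
    ties = sum(0.5 for j in judgments if j == "tie")
    return (wins + ties) / len(judgments)

# Made-up verdicts from a judge comparing model output against baseline output.
verdicts = ["model", "baseline", "model", "tie"]
rate = win_rate(verdicts)
```

A win rate of 0.5 means the model is indistinguishable from the baseline under this judge.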
Scaling is what matters, but everyone is bottlenecked by compute:
Importance of resource allocation and optimized pipelines
GPUs are massively parallel, optimized for throughput: same instruction applied on all threads but different inputs; fast matrix multiplication
compute has improved much faster over time than memory and communication; the bottleneck for GPUs is not performing the computation but keeping the processor fed with data [15]
memory hierarchy: closer to cores ⇒ faster but less memory; and vice versa
Metric (50% is great):
\[\text{Model FLOP Utilization (MFU)} = \frac{\text{observed throughput}}{\text{theoretical best for that GPU}}\]
Problem: 1) communication is slow; 2) every new PyTorch line moves variables to global memory (e.g. below)
```python
import torch
x = torch.randn(1024)  # example input tensor
x1 = x.cos()   # read from x in global memory, write to x1
x2 = x1.cos()  # read from x1 in global memory, write to x2
```
Idea: kernel fusion - communicate once, do all the operations, and then communicate it back (e.g., torch.compile)
e.g., FlashAttention [16] combines kernel fusion, tiling and re-computation (sometimes it’s cheaper to redo a computation than actually reading from memory)
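Whether optimizations like these pay off can be checked with the MFU metric defined earlier in this section. The sketch below is an estimate under assumptions not taken from these notes: observed throughput is approximated as 6 FLOPs per parameter per training token, and the peak figure is a placeholder that should be replaced with the datasheet value for the actual GPU.

```python
def model_flop_utilization(n_params, tokens_per_second, peak_flops_per_second,
                           flops_per_param_token=6.0):
    """MFU = observed training FLOP/s divided by the hardware's theoretical peak FLOP/s."""
    observed = flops_per_param_token * n_params * tokens_per_second
    return observed / peak_flops_per_second

# Hypothetical numbers: a 7B-parameter model processing 3000 tokens/s on one GPU
# whose peak is taken to be 312 TFLOP/s (placeholder value).
mfu = model_flop_utilization(n_params=7e9, tokens_per_second=3000,
                             peak_flops_per_second=312e12)
```

Comparing the result before and after a change such as kernel fusion shows how much of the gap to the hardware peak was closed.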
Problem: big models cannot fit on one GPU; we also want to use as many GPUs as possible to make training run fast
Idea: split memory and compute across GPUs
Background: to naively train a P-parameter model, at least 16 bytes of DRAM are needed per parameter (16*P bytes in total)
hence 112 GB is needed for a 7B model
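The 16 bytes per parameter can be broken down and checked with simple arithmetic. The breakdown below is an assumption based on the usual accounting for mixed-precision training with Adam (as in the ZeRO paper listed in the references), not something spelled out in these notes; activations are ignored. The optional `n_gpus_sharded` argument anticipates the sharding idea described further down, under the idealized assumption that every training state is split evenly.

```python
# Assumed per-parameter training state for mixed-precision Adam (bytes).
BYTES_PER_PARAM = {
    "fp16_weights": 2,
    "fp16_gradients": 2,
    "fp32_master_weights": 4,
    "fp32_adam_momentum": 4,
    "fp32_adam_variance": 4,
}

def training_memory_gb(n_params, n_gpus_sharded=1):
    """Memory per GPU in GB (1 GB = 1e9 bytes) for weights, gradients and optimizer state."""
    total_bytes = n_params * sum(BYTES_PER_PARAM.values())
    return total_bytes / n_gpus_sharded / 1e9

single_gpu = training_memory_gb(7e9)                        # 7B model on one GPU
sharded_8_gpus = training_memory_gb(7e9, n_gpus_sharded=8)  # evenly sharded over 8 GPUs
```

Under these assumptions the single-GPU figure matches the 112 GB quoted above, and even sharding over 8 GPUs brings the per-GPU requirement down to an eighth of that.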
Naive data parallelization (use parallel GPU but no memory gains, big model still does not fit):
Split memory
idea: sharding - each GPU updates and holds a subset of the weights, then communicates them before the next step [17]
Model parallelism
problem: data parallelism only works if batch size $\geq$ number of GPUs
idea: have every GPU take care of applying specific parameters (rather than updating)
Architecture sparsity
idea: models are huge ⇒ not every data point needs to go through every parameter [20]
e.g., Mixture-of-Experts (MoE): use a selector layer to have only some parameters “active” for some types of the data points; have different GPUs contain the parameters required for different data points
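The selector layer can be illustrated with top-k routing, one common way to choose the active experts. This is a toy sketch: the experts are hypothetical scalar functions instead of feed-forward networks, the router logits are given rather than computed by a learned layer, and everything runs on one device.

```python
import math

def top_k_route(router_logits, k=2):
    """Pick the k experts with the highest logits and softmax-normalize their weights."""
    top = sorted(range(len(router_logits)), key=lambda i: router_logits[i], reverse=True)[:k]
    m = max(router_logits[i] for i in top)
    exps = {i: math.exp(router_logits[i] - m) for i in top}
    total = sum(exps.values())
    return {i: e / total for i, e in exps.items()}

def moe_layer(x, experts, router_logits, k=2):
    """Run only the selected experts and mix their outputs by the routing weights."""
    weights = top_k_route(router_logits, k)
    output = sum(w * experts[i](x) for i, w in weights.items())
    return output, sorted(weights)

# Hypothetical experts; in a real model each one is a separate block of parameters.
experts = [lambda x: x, lambda x: 2 * x, lambda x: -x, lambda x: x * x]
output, active = moe_layer(3.0, experts, router_logits=[0.1, 2.0, -1.0, 2.0], k=2)
```

Only the experts listed in `active` are evaluated for this input, so each data point touches a fraction of the total parameters, and different experts can live on different GPUs.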
1. Guilherme Penedo et al. The FineWeb Datasets: Decanting the Web for the Finest Text Data at Scale. arXiv:2406.17557 [cs.CL]. 2024.
2. Luca Soldaini et al. Dolma: an Open Corpus of Three Trillion Tokens for Language Model Pretraining Research. arXiv:2402.00159 [cs.CL]. 2024.
3. Leo Gao et al. The Pile: An 800GB Dataset of Diverse Text for Language Modeling. arXiv:2101.00027 [cs.CL]. 2020.
4. Jared Kaplan et al. Scaling Laws for Neural Language Models. arXiv:2001.08361 [cs.LG]. 2020.
5. Jordan Hoffmann et al. Training Compute-Optimal Large Language Models. arXiv:2203.15556 [cs.CL]. 2022.
6. Andreas Köpf et al. OpenAssistant Conversations – Democratizing Large Language Model Alignment. arXiv:2304.07327 [cs.CL]. 2023.
7. Yann Dubois et al. AlpacaFarm: A Simulation Framework for Methods that Learn from Human Feedback. arXiv:2305.14387 [cs.LG]. 2023.
8. Chunting Zhou et al. LIMA: Less Is More for Alignment. arXiv:2305.11206 [cs.CL]. 2023.
9. Prasann Singhal et al. A Long Way to Go: Investigating Length Correlations in RLHF. arXiv:2310.03716 [cs.CL]. 2023.
10. Shibani Santurkar et al. Whose Opinions Do Language Models Reflect? arXiv:2303.17548 [cs.CL]. 2023.
11. Nisan Stiennon et al. Learning to summarize from human feedback. arXiv:2009.01325 [cs.CL]. 2020.
12. Dan Hendrycks et al. Measuring Massive Multitask Language Understanding. arXiv:2009.03300 [cs.CY]. 2020.
13. Long Ouyang et al. Training language models to follow instructions with human feedback. arXiv:2203.02155 [cs.CL]. 2022.
14. Wei-Lin Chiang et al. Chatbot Arena: An Open Platform for Evaluating LLMs by Human Preference. arXiv:2403.04132 [cs.AI]. 2024.
15. Andrei Ivanov et al. Data Movement Is All You Need: A Case Study on Optimizing Transformers. arXiv:2007.00072 [cs.LG]. 2020.
16. Tri Dao et al. FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness. arXiv:2205.14135 [cs.LG]. 2022.
17. Samyam Rajbhandari et al. ZeRO: Memory Optimizations Toward Training Trillion Parameter Models. arXiv:1910.02054 [cs.LG]. 2019.
18. Yanping Huang et al. GPipe: Efficient Training of Giant Neural Networks using Pipeline Parallelism. arXiv:1811.06965 [cs.CV]. 2018.
19. Mohammad Shoeybi et al. Megatron-LM: Training Multi-Billion Parameter Language Models Using Model Parallelism. arXiv:1909.08053 [cs.CL]. 2019.
20. William Fedus, Jeff Dean, Barret Zoph. A Review of Sparse Expert Models in Deep Learning. arXiv:2209.01667 [cs.LG]. 2022.