Link to lecture recording on YouTube
Date: 2025-09-29
Speaker: Jiantao Jiao 焦剑涛
Speaker’s Social Profile: Website / Google Scholar / GitHub / LinkedIn / X (Twitter)
Education:
Work:
| Model | Aligned with | Design goal |
|---|---|---|
| Earlier chat models | human aligned | provide interactions that maximize human preference |
| Agentic models | environment feedback aligned | provide interactions that maximize verifiable rewards (in addition to human preference) |
What more is needed to go from earlier chat models to agentic models
Observation: if the LLM is not strong enough in producing essentially right answers, it is very hard to build a reliable system.
Training data: get good verifiable training data to train these agents
Evaluation: verify that agents are intelligent, which requires defining what intelligence is
Training: the best way of feeding the verifiable data to the model so that the model’s intelligence actually improves
Environment
Tools
Verifier
Environment and verifier diversity is critical
Important to achieve good coverage of many different types of environments, tools and verifiers
Along with scale, verifier quality is also extremely important; make sure the verifier has few false positives and false negatives
example:
Build a very good pipeline so that, once solid coverage of training data is achieved, the model generalizes to environments it has not seen.
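One way to put a number on verifier quality is to audit the verifier against a small set of trusted labels. The sketch below is illustrative only: `audit_verifier`, `VerifierAudit`, and the sample data are hypothetical names and values, not something shown in the lecture.

```python
from dataclasses import dataclass


@dataclass
class VerifierAudit:
    false_positive_rate: float  # share of wrong attempts the verifier accepted
    false_negative_rate: float  # share of correct attempts the verifier rejected


def audit_verifier(verdicts, gold_labels):
    """Compare verifier verdicts with trusted (e.g. human-checked) labels.

    Both arguments are equal-length sequences of booleans, where True means
    "this attempt is correct".
    """
    if len(verdicts) != len(gold_labels):
        raise ValueError("verdicts and gold_labels must have the same length")

    fp = sum(v and not g for v, g in zip(verdicts, gold_labels))
    fn = sum(g and not v for v, g in zip(verdicts, gold_labels))
    negatives = sum(not g for g in gold_labels)
    positives = sum(bool(g) for g in gold_labels)

    return VerifierAudit(
        false_positive_rate=fp / negatives if negatives else 0.0,
        false_negative_rate=fn / positives if positives else 0.0,
    )


# Hypothetical audit set: six attempts with verifier verdicts and gold labels.
audit = audit_verifier(
    verdicts=[True, True, False, False, True, False],
    gold_labels=[True, False, False, True, True, False],
)
print(audit)
```

A false positive rewards a wrong trajectory and a false negative punishes a correct one, so both rates are worth tracking separately rather than as a single accuracy figure.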
Agents are expected to:
Off-the-Shelf Benchmarks (for 1, 2):
Agent Harness (for 4):
Focused “Unit-Test” Evaluations (for 3, 5):
Keep the definition of intelligence holistic: make sure we check all kinds of capabilities of LLMs
We should evaluate on:
How to quantify benchmark quality:
Significant community effort involved in ensuring the quality of benchmarks
What does “training” an agent mean
Core characteristics to focus on in the training process:
Training an agent means maximizing the model’s ability to use tools to achieve correctness
Train to imitate first and then to explore
| Step | Goal | Details | Examples |
|---|---|---|---|
| Supervised fine-tuning (SFT) | minimize non-meaningful attempts | imitate demonstration trajectories that achieve correctness | show LLM the actions to take (access a database, access an API etc.) and let LLM try to replicate |
| Reinforcement Learning (RL) | reinforce diverse, great attempts from the current model | let the model explore diverse trajectories, and reinforce trajectories that achieve correctness and discourage ones that do not; diversity accelerates the learning process | after SFT, at a stage where LLM is sometimes reasonably correct, or sometimes makes mistakes but not totally wrong; now let LLM learn better through exploring more by itself |
Why both SFT and RL?
SFT is just there to discourage meaningless and low-quality answers, so that we can sample meaningful trajectories during RL with tractable compute
RL is there to truly reinforce intelligence
we want to ensure we do not reduce the diversity of responses during RL by over-training with SFT on non-diverse examples
Good RL:
| Item | Idea | Example |
|---|---|---|
| Train for longer | maintain entropy balance, controlled model verbosity increase | saturation: every time the LLM makes an attempt, it produces the exact same answer; not good if it plateaus very quickly |
| Train on more difficult prompts and tasks | train on tasks and prompts that are related to the task at hand, but that have low (but NOT zero) chances of success for the current model (meaningfully difficult) | simple tasks can be solved well, hence not learning anything new |
| Train on more diverse responses, with high quality reward feedback | see as many different, meaningful answers per prompt as possible; scale compute to ensure we get good responses | if you only study a single subject, you will certainly not be able to answer exam questions in other subjects |
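The "low but NOT zero chance of success" criterion in the table can be sketched as a filter over sampled rollouts. This is a minimal illustration, assuming each prompt has already been attempted several times by the current model and scored 0/1 by a verifier; the function names, thresholds, and data are hypothetical.

```python
def pass_rate(outcomes):
    """Fraction of sampled attempts the verifier marked correct."""
    return sum(outcomes) / len(outcomes)


def select_meaningfully_difficult(prompt_outcomes, low=0.0, high=0.5):
    """Keep prompts whose pass rate is above `low` and at most `high`.

    `prompt_outcomes` maps a prompt id to a list of 0/1 verifier outcomes
    from the current model. The thresholds are illustrative assumptions.
    """
    return [
        prompt_id
        for prompt_id, outcomes in prompt_outcomes.items()
        if low < pass_rate(outcomes) <= high
    ]


rollouts = {
    "easy": [1, 1, 1, 1, 1, 1, 1, 1],               # always solved: nothing new to learn
    "hard_but_solvable": [0, 0, 1, 0, 0, 0, 0, 0],  # low but non-zero success
    "impossible": [0, 0, 0, 0, 0, 0, 0, 0],         # never solved: no reward signal
}
selected = select_meaningfully_difficult(rollouts)
print(selected)
```

Because the pass rate is measured on the current model, the selected set shifts as training progresses, so a filter like this would need to be re-run periodically.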
Important research topic at the moment:
As the number of training steps increases, the share of the total performance gain already achieved rises (76% at 200 steps, 93% at 800 steps) while the remaining entropy falls (27% at 200 steps, 6% at 800 steps) [1]
entropy: a measure of uncertainty or randomness in the model’s prediction of the next token (sampled from a distribution)
high entropy ⇒ able to get two very different trajectories if sampling twice
Without intervention (e.g., entropy or KL regularization), policy entropy is traded for reward predictably
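To make the definition of entropy above concrete, here is a small sketch that computes the Shannon entropy of a next-token distribution. The four-token vocabulary and the logit values are made up for illustration.

```python
import math


def softmax(logits):
    """Convert raw scores into a probability distribution."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def entropy(probs):
    """Shannon entropy in nats; zero-probability tokens contribute nothing."""
    return -sum(p * math.log(p) for p in probs if p > 0)


uniform = softmax([0.0, 0.0, 0.0, 0.0])  # undecided between 4 tokens
peaked = softmax([10.0, 0.0, 0.0, 0.0])  # almost always the same token

h_uniform = entropy(uniform)  # equals log(4), the maximum for 4 tokens
h_peaked = entropy(peaked)    # close to 0
print(h_uniform, h_peaked)
```

Sampling twice from the uniform distribution will often give two different tokens, while sampling twice from the peaked one will almost always give the same token, which is the saturation behavior described above.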
Question:
Two approaches people have taken in the literature:
1st approach: on-policy vs. off-policy [2]:
on-policy: whenever we ask the reward model or verifier to judge the quality of attempts, only judge attempts produced by the LLM itself
with the increase in training steps
2nd approach: balance the update strength [3]:
observation: need to decouple the two clipping bounds $\epsilon_{low}$ and $\epsilon_{high}$ (which clip the likelihood ratio around 1), and make $\epsilon_{high}$ larger
to encourage the model to increase the probabilities assigned to low probability tokens so the model is able to explore well
Add a loss to encourage entropy directly [4]: many ways of constructing the loss
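The decoupled clipping idea from the second approach can be sketched with a per-token PPO-style surrogate that takes separate lower and upper bounds. This is a simplified illustration, not the full training objective from the cited paper; the function name and the numeric values are assumptions chosen for the example.

```python
def clipped_objective(ratio, advantage, eps_low=0.2, eps_high=0.28):
    """Per-token PPO-style surrogate with separate lower and upper clip bounds.

    ratio     = pi_new(token) / pi_old(token)
    advantage = how much better than the baseline the sampled attempt was
    """
    clipped_ratio = min(max(ratio, 1.0 - eps_low), 1.0 + eps_high)
    return min(ratio * advantage, clipped_ratio * advantage)


# A token in a good attempt (advantage > 0) whose probability the update
# wants to raise by 50%.
symmetric = clipped_objective(1.5, 1.0, eps_low=0.2, eps_high=0.2)
decoupled = clipped_objective(1.5, 1.0, eps_low=0.2, eps_high=0.4)
print(symmetric, decoupled)
```

With a larger upper bound, the surrogate keeps rewarding probability increases for longer before the clip flattens it, which is what gives low-probability tokens more room to grow.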
Proxy for the correlation between the generation likelihoods of the sequence and the sequence quality (or the advantage):
\[\operatorname{Cov}_{y \sim \pi_{\theta}(\cdot|x)}\left(\log \pi_{\theta}(y|x),\ \pi_{\theta}(y|x) \cdot A(y,x)\right)\]
Ideally: the model is supposed to be very confident in generating the right answers, and uncertain in generating wrong answers
From the Entropy Mechanism paper [1]: the correlation stays very low at a low accuracy of 0.125, meaning the model is unable to generate the right answers very confidently
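For intuition, the covariance above can be evaluated exactly on a toy policy that chooses among three candidate answers. The probabilities and advantages below are invented for illustration, and the helper name `entropy_covariance` is hypothetical.

```python
import math


def entropy_covariance(probs, advantages):
    """Exact Cov_{y ~ pi}(log pi(y), pi(y) * A(y)) for a small discrete policy."""
    log_p = [math.log(p) for p in probs]
    weighted_adv = [p * a for p, a in zip(probs, advantages)]

    # Expectations are taken under the policy itself (y ~ pi).
    mean_log_p = sum(p * lp for p, lp in zip(probs, log_p))
    mean_weighted_adv = sum(p * wa for p, wa in zip(probs, weighted_adv))
    return sum(
        p * (lp - mean_log_p) * (wa - mean_weighted_adv)
        for p, lp, wa in zip(probs, log_p, weighted_adv)
    )


advantages = [1.0, -1.0, -1.0]  # only the first answer is correct

confident_and_right = entropy_covariance([0.8, 0.1, 0.1], advantages)
confident_and_wrong = entropy_covariance([0.1, 0.8, 0.1], advantages)
print(confident_and_right, confident_and_wrong)
```

In this toy setup the covariance is positive when the high-probability answer is the correct one and negative when the model is confident in a wrong answer, matching the "confident when right, uncertain when wrong" ideal.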
You should not expect the model to perform well if it is trained on datasets that are too easy, nor can you blindly throw in super hard prompts
From the Entropy Mechanism paper [1]:
Many approaches [5] to improve learning signal; e.g.,
GenSelect [6]: let LLM come up with multiple answers, and pick the best one
DeepConf [7]: track the generation confidence of trajectories; only do majority voting on the sampled answers that remain above a threshold of confidence
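The confidence-filtered voting idea can be sketched as follows. This is a simplified illustration and not the method from the cited paper: the confidence scores are hypothetical placeholders (how confidence is actually measured is defined in the paper), and the fallback to a plain vote is an added assumption.

```python
from collections import Counter


def confidence_filtered_vote(samples, threshold):
    """Majority vote over answers whose confidence is at least `threshold`.

    `samples` is a list of (answer, confidence) pairs. Falls back to a plain
    majority vote if no sample passes the filter (an assumption of this sketch).
    """
    kept = [answer for answer, conf in samples if conf >= threshold]
    if not kept:
        kept = [answer for answer, _ in samples]
    return Counter(kept).most_common(1)[0][0]


samples = [("42", 0.9), ("42", 0.8), ("17", 0.3), ("17", 0.2), ("17", 0.4)]

plain = confidence_filtered_vote(samples, threshold=0.0)     # "17" wins 3 to 2
filtered = confidence_filtered_vote(samples, threshold=0.5)  # only the two "42" samples remain
print(plain, filtered)
```

The example shows how a handful of low-confidence samples can outvote the confident ones under plain majority voting, and how a threshold changes the outcome.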
Community can solve everything as a whole:
| Step | Details |
|---|---|
| Have a collection where we curate environments and verifiers to scale up intelligence | crowdsource environments / evals / recipes from individual contributors / companies / researchers, to have diverse, high-quality signals to scale up RL in a single place |
| Figure out the best definition of what intelligence is | <ul><li>analyze the strengths and weaknesses of benchmarks</li><li>include it in the collection so everyone can use it during RL scale up to ensure we are measuring intelligence holistically</li></ul> |
| Release algorithms to train longer, on harder prompts, and with diverse responses | <ul><li>create new algorithms that are stable, meaningful, and result in diverse responses</li><li>add them to the collection so everyone can use them for all the environments we are curating</li></ul> |
Open questions:
Humans engage in many different environments, focus on some lessons more than others, learn both by exploring on their own and from teachers, and are very often evaluated for intelligence on various tasks. How to bridge the gap between human learning and system learning?
[1] Ganqu Cui et al. The Entropy Mechanism of Reinforcement Learning for Reasoning Language Models. arXiv:2505.22617 [cs.LG]. 2025.
[2] Yaru Hao et al. On-Policy RL with Optimal Reward Baseline. arXiv:2505.23585 [cs.LG]. 2025.
[3] Qiying Yu et al. DAPO: An Open-Source LLM Reinforcement Learning System at Scale. arXiv:2503.14476 [cs.LG]. 2025.
[4] Jujie He et al. Skywork Open Reasoner 1 Technical Report. arXiv:2505.22312 [cs.LG]. 2025.
[5] Jixiao Zhang, Chunsheng Zuo. GRPO-LEAD: A Difficulty-Aware Reinforcement Learning Approach for Concise Mathematical Reasoning in Language Models. arXiv:2504.09696 [cs.CL]. 2025.
[6] Shubham Toshniwal et al. GenSelect: A Generative Approach to Best-of-N. arXiv:2507.17797 [cs.LG]. 2025.
[7] Yichao Fu et al. Deep Think with Confidence. arXiv:2508.15260 [cs.LG]. 2025.