ucb_agentic_ai

Lecture 03: Post-Training Verifiable Agents

Link to lecture recording on YouTube

Date: 2025-09-29

Speaker: Jiantao Jiao 焦剑涛

Speaker’s Social Profile: Website / Google Scholar / GitHub / LinkedIn / X (Twitter)

Education:

Ph.D. in Electrical, Electronics and Communications Engineering, 2012-2018, Stanford University

Work:

Assistant Professor, Department of Electrical Engineering and Computer Science, University of California, Berkeley
Assistant Professor, Department of Statistics, University of California, Berkeley
Director of Research & Distinguished Scientist, Nvidia

Notes

How do “agentic” models differ from traditional LLMs

Model	Aligned with	Design goal
Earlier chat models	human aligned	provide interactions that maximize human preference
Agentic models	environment feedback aligned	provide interactions that maximize verifiable rewards (in addition to human preference)

What more is needed to go from earlier models to agentic models

conversation is not verifiable: there is often no single clear correct answer, so preference has to be modeled
conversation requires interaction with the user only
many agentic tasks have one (or a few) correct answers, so the model should not produce wrong answers
agentic tasks require interaction with many entities (the user and the environment) to understand the full state

Observation: if the LLM is not strong enough in producing essentially right answers, it is very hard to build a reliable system.

3 core steps to train the verifiable agentic model

Training data: get good verifiable training data to train these agents

Evaluation: verifiable agents are meant to be intelligent, so we need to define what intelligence is

challenge: we do not yet have enough strong verifiers to capture all the corner cases of intelligence

Training: find the best way to feed the verifiable data to the model so that the model’s intelligence actually improves

challenge: transforming the natural human feedback loop (do → make mistake → receive feedback → improve) into something implementable by an algorithm

More about training data:

Environment

models trained to consume tokens that inform the model about the environment’s current state and user’s intention
also includes system prompts already set up for different parts to work together
e.g., code repository; web browser

Tools

models trained to generate tokens that are consumed by software tools that provide additional relevant information to the model
the agent decomposes a task, figures out the right tools to call, changes the environment state (if necessary), and produces the answer
e.g., APIs; tools that change the environment’s state

Verifier

models trained to generate tokens that maximize the reward from the verifier
good to think about verifier as a vector, evaluating quality of the model from different perspectives
e.g., unit tests for code; math checker for proofs; DOM scripts for web tasks
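
The "verifier as a vector" idea can be sketched as a function that returns one score per perspective instead of a single number. This is a minimal illustration, not something shown in the lecture: the `verify` function, the three score names, and the JSON task with a `total` field are all hypothetical.

```python
import json


def verify(output: str, expected_total: int) -> dict[str, float]:
    """Score one model output along several independent dimensions."""
    scores = {"valid_json": 0.0, "has_total_field": 0.0, "correct_total": 0.0}
    try:
        data = json.loads(output)
    except json.JSONDecodeError:
        return scores  # unparseable output fails every check
    scores["valid_json"] = 1.0
    if isinstance(data, dict) and "total" in data:
        scores["has_total_field"] = 1.0
        scores["correct_total"] = float(data["total"] == expected_total)
    return scores
```

Keeping the components separate makes it possible to see whether a model fails on formatting or on correctness before the scores are combined into a single reward.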

Environment and verifier diversity is critical

Important to achieve good coverage of many different types of environments, tools and verifiers

Along with scale, verifier quality is also extremely important; make sure we minimize false positives and false negatives

if there are several ways to get the right ground truth, you want the verifier to reward them all
do not want to reward wrong answers

example:

how much change will I get back if I buy a drink for $0.5 and I give $1? 1/2, 2/4, …, are correct
more stringent, if the simplest form is needed, only 1/2 is correct
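
The change example can be turned into a small verifier sketch. The function names and the `require_simplest_form` flag are hypothetical; the point is only that the lenient mode rewards every equivalent fraction, the strict mode rewards only the reduced form, and neither rewards a wrong or malformed answer.

```python
from fractions import Fraction


def parse_fraction(text: str) -> tuple[int, int]:
    """Parse 'a/b' into the raw (numerator, denominator) without reducing."""
    num, den = text.strip().split("/")
    return int(num), int(den)


def verify_change(answer: str, expected: Fraction = Fraction(1, 2),
                  require_simplest_form: bool = False) -> bool:
    """Return True if `answer` should be rewarded."""
    try:
        num, den = parse_fraction(answer)
        value = Fraction(num, den)  # Fraction reduces automatically
    except (ValueError, ZeroDivisionError):
        return False  # malformed answers are never rewarded
    if value != expected:
        return False  # wrong value: avoid a false positive
    if require_simplest_form:
        # The digits as written must already be the reduced form.
        return (num, den) == (value.numerator, value.denominator)
    return True


print(verify_change("2/4"))                              # True
print(verify_change("2/4", require_simplest_form=True))  # False
```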

Getting good evaluation data

Build a very good pipeline to make sure that, once solid coverage of the training data is achieved, the model generalizes to environments it has not seen.

Agents are expected to:

1. work with many different tools, and not break when tools are new or modified
2. work for many different use cases
3. work under vaguely specified instructions, understand human intentions, and produce precisely formatted outputs
4. work under many software systems
5. be robust: behave like high-quality engineers who can debug many different issues

Off-the-Shelf Benchmarks (for 1, 2):

agentic coding: SWEBench, TerminalBench, AIDER
tool calling: τ-bench, BFCL, ComplexFuncBench
knowledge: HLE, BrowseComp

Agent Harness (for 4):

SWEBench: OpenHands, MiniSWE, OpenSWE, AIDER-SWEBench
HLE: with / without search harness

Focused “Unit-Test” Evaluations (for 3, 5):

structured output adherence: structEval
instruction following with conditional instructions: IFBench
long trajectories: NexusBench

Keep the definition of intelligence holistic, make sure we check all kinds of capabilities of LLMs

We should evaluate on:

many tasks and verifiers
many harnesses: let the agent explore each benchmark’s sandbox in different ways, and see how accuracy changes as the harnesses are swapped out
many tools within each task so that the agent understands how to use tools in different action spaces

How to quantify benchmark quality:

hardness
separability: benchmark able to separate the capabilities of LLMs
diversity: have good coverage

Significant community effort involved in ensuring the quality of benchmarks

Training well

What does “training” an agent mean

let the LLM agent attempt different actions in the environment
see which actions best get it to the final answer
whichever actions give the best results are the ones you want the model to reinforce and repeat

Core characteristics to focus on in the training process:

you want the attempts to be very different from one another, since you want the model to reinforce or discourage many different styles of attempts
each attempt costs compute to sample and process
- if the task is super easy, all attempts are correct and the model does not learn anything
- however, if all attempts fail, a lot of resources are spent, and you still need to figure out how to make the LLM learn
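
One way to see why all-correct and all-failed groups teach the model nothing is to subtract a baseline from the rewards. The sketch below assumes a simple group-mean baseline, which the notes do not specify; `centered_advantages` is a hypothetical helper name.

```python
def centered_advantages(rewards: list[float]) -> list[float]:
    """Subtract the group mean reward from each attempt's reward."""
    mean = sum(rewards) / len(rewards)
    return [r - mean for r in rewards]


all_correct = centered_advantages([1.0, 1.0, 1.0, 1.0])
all_failed = centered_advantages([0.0, 0.0, 0.0, 0.0])
mixed = centered_advantages([1.0, 0.0, 0.0, 0.0])

print(all_correct)  # [0.0, 0.0, 0.0, 0.0]
print(all_failed)   # [0.0, 0.0, 0.0, 0.0]
print(mixed)        # [0.75, -0.25, -0.25, -0.25]
```

Under this assumption only the mixed group produces a non-zero learning signal, even though all three groups cost the same compute to sample.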

Training an agent means maximizing the model’s ability to use tools to achieve correctness

Train to imitate first and then to explore

Step	Goal	Details	Examples
Supervised fine-tuning (SFT)	minimize non-meaningful attempts	imitate demonstration trajectories that achieve correctness	show LLM the actions to take (access a database, access an API etc.) and let LLM try to replicate
Reinforcement Learning (RL)	reinforce diverse, great attempts from the current model	let the model explore diverse trajectories, and reinforce trajectories that achieve correctness and discourage ones that do not; diversity accelerates the learning process	after SFT, the LLM is sometimes reasonably correct and sometimes makes mistakes without being totally wrong; now let it learn better by exploring more on its own

Why both SFT and RL?
SFT is just there to discourage meaningless and low-quality answers, so that we can sample meaningful trajectories during RL with tractable compute
RL is there to truly reinforce intelligence

demonstration samples should be diverse, otherwise the model may be unable to generate diverse attempts later in the RL stage
SFT should be light, otherwise the model may lose the ability to figure out answers on its own

we want to ensure we do not affect diversity of responses during RL by over SFT-ing on non-diverse examples

More about Reinforcement Learning (RL)

Good RL:

Item	Idea	Example
Train for longer	maintain entropy balance, controlled model verbosity increase	saturation: every time the LLM makes an attempt, it produces the exact same answer; not good if it plateaus very quickly
Train on more difficult prompts and tasks	train on tasks and prompts that are related to the task at hand, but that have low (but NOT zero) chances of success for the current model (meaningfully difficult)	simple tasks can be solved well, hence not learning anything new
Train on more diverse responses, with high quality reward feedback	see as many different, meaningful answers per prompt as possible; scale compute to ensure we get good responses	if you study only a single subject, you will certainly not be able to answer exam questions on other subjects
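
The "low (but NOT zero) chance of success" criterion in the table can be sketched as a filter on each prompt's empirical pass rate under the current model. The function name and the `low` / `high` thresholds are hypothetical choices for illustration, not values given in the lecture.

```python
def select_training_prompts(pass_rates: dict[str, float],
                            low: float = 0.0, high: float = 0.5) -> list[str]:
    """Keep prompts whose pass rate is above `low` and at most `high`."""
    return [prompt for prompt, rate in pass_rates.items() if low < rate <= high]


# Pass rate = fraction of sampled attempts that the verifier accepted.
pass_rates = {"easy": 1.0, "medium": 0.375, "hard": 0.125, "impossible": 0.0}
print(select_training_prompts(pass_rates))  # ['medium', 'hard']
```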

Important research topic at the moment:

current algorithmic formulations for some combinations of SFT and RL do not align very well with machine learning theory
more exploration in terms of different architectures, algorithms, and better ways of utilizing data (both pre-training and post-training data) is critical in the community
good to move closer to the theoretical principles while doing something implementable in practice
the speaker believes we have not yet figured out the right way to train these systems and deploy them in certain critical applications

Train longer

As the number of training steps increases, more of the total performance gain is realized (76% at 200 steps, 93% at 800 steps) and less of the entropy remains (27% at 200 steps, 6% at 800 steps)¹
entropy: a measure of uncertainty or randomness in the model’s prediction of the next token (sampled from a distribution)
high entropy ⇒ able to get two very different trajectories if sampling twice
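
As a small illustration of the definition above, the Shannon entropy of a next-token distribution can be computed directly. The two example distributions are made up: a peaked one, where sampling twice almost always gives the same token, and a uniform one, where samples differ often.

```python
import math


def entropy(probs: list[float]) -> float:
    """Shannon entropy (in nats) of a next-token distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


peaked = [0.97, 0.01, 0.01, 0.01]   # nearly deterministic next token
uniform = [0.25, 0.25, 0.25, 0.25]  # maximally uncertain over 4 tokens

print(entropy(peaked) < entropy(uniform))  # True
```

For a uniform distribution over 4 tokens the entropy equals log 4, the maximum possible for that vocabulary size.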

Without intervention (e.g., entropy or KL regularization), policy entropy is traded for reward predictably

Question:

Is it possible to produce high-quality answers while avoiding entropy collapse, so that later answers are correct, still diverse, and the model is still able to explore?
What is the right intervention to get a better trade-off between the entropy and reward curves?

Two approaches people have taken in the literature:

1st approach: on-policy vs. off-policy²:
on-policy: whenever we ask the reward model or verifier to judge the quality of attempts, only attempts produced by the LLM itself are judged

as the number of training steps increases:

rewards increase in a similar manner for both on-policy and off-policy algorithms ⇒ we do get a better and better policy
KL divergence is smaller for on-policy than for off-policy
entropy stays higher for on-policy than for off-policy

2nd approach: balance the update strength³:
observation: decouple the two clipping bounds on the likelihood ratio (normally a single $\epsilon$ on either side of 1) into $\epsilon_{low}$ and $\epsilon_{high}$, and make $\epsilon_{high}$ larger
this lets the model increase the probabilities assigned to low-probability tokens, so it is able to explore well
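
A minimal per-token sketch of a clipped surrogate with decoupled bounds, assuming the standard PPO-style clipped objective. The function name and the default values of `eps_low` and `eps_high` are illustrative assumptions, not settings stated in the lecture.

```python
def clipped_objective(ratio: float, advantage: float,
                      eps_low: float = 0.2, eps_high: float = 0.28) -> float:
    """PPO-style clipped surrogate for one token with decoupled clip bounds.

    `ratio` is pi_new(token) / pi_old(token); the lower bound is 1 - eps_low
    and the upper bound is 1 + eps_high.
    """
    clipped_ratio = min(max(ratio, 1.0 - eps_low), 1.0 + eps_high)
    # Take the pessimistic (smaller) of the unclipped and clipped terms.
    return min(ratio * advantage, clipped_ratio * advantage)


# A token with positive advantage whose probability has grown by 25%.
symmetric = clipped_objective(1.25, 1.0, eps_low=0.2, eps_high=0.2)
decoupled = clipped_objective(1.25, 1.0, eps_low=0.2, eps_high=0.28)
print(symmetric < decoupled)  # True
```

With the symmetric bound the ratio of 1.25 is clipped at 1.2, so the objective stops rewarding further increases for that token. With the larger upper bound the same ratio is still inside the clip range, which leaves more room for low-probability tokens to be pushed up.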

Add a loss to encourage entropy directly⁴: many ways of constructing the loss

Train on more difficult prompts and tasks

cannot just dump hard prompts into the model and expect it to work
need to ensure harder prompts still improve the model by ensuring the model’s confidence is meaningfully correlated with the reward

Proxy for the correlation between the generation likelihoods of the sequence and the sequence quality (or the advantage):

\[\operatorname{Cov}_{y \sim \pi_{\theta}(\cdot \mid x)}\big(\log \pi_{\theta}(y \mid x),\; \pi_{\theta}(y \mid x) \cdot A(y, x)\big)\]

Ideally: model is supposed to be very confident in generating the right answers, and uncertain in generating wrong answers
From the Entropy Mechanism¹ paper: the correlation stays very low at a low accuracy of 0.125, meaning the model is unable to generate the right answers confidently
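
The covariance proxy can be evaluated exactly for a toy policy over a handful of candidate answers. This is only a sketch: the three-answer policy and the ±1 advantages are invented, and `covariance_proxy` is a hypothetical helper that takes the expectation under the policy itself.

```python
import math


def covariance_proxy(probs: list[float], advantages: list[float]) -> float:
    """Cov_{y ~ pi}(log pi(y), pi(y) * A(y)) for a small discrete policy."""
    log_p = [math.log(p) for p in probs]
    weighted_adv = [p * a for p, a in zip(probs, advantages)]
    # Expectations are taken under the policy, so each term is weighted by p.
    mean_log_p = sum(p * lp for p, lp in zip(probs, log_p))
    mean_weighted = sum(p * w for p, w in zip(probs, weighted_adv))
    return sum(p * (lp - mean_log_p) * (w - mean_weighted)
               for p, lp, w in zip(probs, log_p, weighted_adv))


policy = [0.7, 0.2, 0.1]
confident_and_right = covariance_proxy(policy, [1.0, -1.0, -1.0])
confident_and_wrong = covariance_proxy(policy, [-1.0, 1.0, 1.0])
print(confident_and_right > 0 > confident_and_wrong)  # True
```

The sign is positive when the likely answers are the ones with high advantage, and negative when the model is confident in the wrong answers. A policy that is uniform over its answers gives a value of zero, since its log-likelihoods do not vary.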

You should not expect the model to perform well if it is trained on datasets that are too easy, nor can you blindly throw in super hard prompts
From Entropy Mechanism¹ paper:

when entropy is higher, final performance is better
validation accuracy is low if training only on GSM8K (an easy dataset)
validation accuracy is lower for the more difficult datasets (Eurus-RL-Data-Difficult vs. Eurus-RL-Data)

Many approaches⁵ to improve learning signal; e.g.,

more rewards if prompt is hard
reward not as much for simple prompt
penalize long answer if simple answer is needed
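
The three ideas above can be combined in a toy reward function. This is a hypothetical sketch and not the formula from any of the cited papers: `shaped_reward`, the pass-rate-based difficulty bonus, `token_budget`, and `length_penalty` are all assumptions made for illustration.

```python
def shaped_reward(correct: bool, pass_rate: float, num_tokens: int,
                  token_budget: int = 512, length_penalty: float = 0.5) -> float:
    """Hypothetical difficulty- and length-aware reward for one answer."""
    if not correct:
        return 0.0
    # Harder prompts (lower pass rate) earn a larger reward, in [1, 2].
    difficulty_bonus = 1.0 + (1.0 - pass_rate)
    # Penalize only the fraction of tokens beyond the budget, capped at 1.
    overflow = max(0, num_tokens - token_budget) / token_budget
    return difficulty_bonus * (1.0 - length_penalty * min(1.0, overflow))


print(shaped_reward(True, pass_rate=0.125, num_tokens=100))   # 1.875
print(shaped_reward(True, pass_rate=1.0, num_tokens=100))     # 1.0
print(shaped_reward(True, pass_rate=1.0, num_tokens=1024))    # 0.5
```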

Sample better

GenSelect⁶: let LLM come up with multiple answers, and pick the best one

DeepConf⁷: track the generation confidence of trajectories; only do majority voting on the sampled answers that remain above a threshold of confidence
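
The confidence-filtered voting idea can be sketched as follows. The sketch assumes each sampled answer already comes with a scalar confidence score; how that score is derived from the generation is not covered here, and `confident_majority_vote` and the sample values are hypothetical.

```python
from collections import Counter
from typing import Optional


def confident_majority_vote(samples: list[tuple[str, float]],
                            threshold: float) -> Optional[str]:
    """Majority vote over answers whose confidence is at least `threshold`."""
    kept = [answer for answer, confidence in samples if confidence >= threshold]
    if not kept:
        return None  # no trajectory was confident enough to vote
    return Counter(kept).most_common(1)[0][0]


samples = [("42", 0.9), ("41", 0.3), ("41", 0.2), ("41", 0.35), ("42", 0.8)]
print(confident_majority_vote(samples, threshold=0.0))  # 41
print(confident_majority_vote(samples, threshold=0.5))  # 42
```

In this made-up example, plain majority voting picks the answer that three low-confidence trajectories agree on, while filtering by confidence first lets the two high-confidence trajectories decide.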

Summary

The community as a whole can solve this together:

Step	Details
Have a collection where we curate environments and verifiers to scale up intelligence	crowdsource environments / evals / recipes from individual contributors / companies / researchers, to have diverse, high-quality signals to scale up RL in a single place
Figure out the best definition of what intelligence is	<ul><li>analyze the strengths and weaknesses of benchmarks</li><li>include it in the collection so everyone can use it during RL scale up to ensure we are measuring intelligence holistically</li></ul>
Release algorithms to train longer, on harder prompts, and with diverse responses	<ul><li>create new algorithms that are stable, meaningful, and result in diverse responses</li><li>add them to the collection so everyone can use them for all the environments we are curating</li></ul>

Open questions:
Humans engage in many different environments, focus on some lessons more than others, learn from both exploring by themselves and from teachers, and are very often evaluated for intelligence for various tasks. How to bridge the gap between human learning and system learning?

Much like humans learn from so many environments, we should have our models do the same. What is the right design for this collection of open-source environments, evaluations, and algorithms?
Much like humans focus on some lessons more than others, we should do the same. What is the right algorithm that allows us to do so without hurting stability?
Much like humans learn from both their own exploration and from teachers, we should do the same. What is the right balance? Can we do both at the same time?
Much like humans are evaluated for their intelligence, we should do the same. How can we compare different definitions of intelligence?

References

1. Ganqu Cui et al. The Entropy Mechanism of Reinforcement Learning for Reasoning Language Models. arXiv:2505.22617 [cs.LG]. 2025.
2. Yaru Hao et al. On-Policy RL with Optimal Reward Baseline. arXiv:2505.23585 [cs.LG]. 2025.
3. Qiying Yu et al. DAPO: An Open-Source LLM Reinforcement Learning System at Scale. arXiv:2503.14476 [cs.LG]. 2025.
4. Jujie He et al. Skywork Open Reasoner 1 Technical Report. arXiv:2505.22312 [cs.LG]. 2025.
5. Jixiao Zhang, Chunsheng Zuo. GRPO-LEAD: A Difficulty-Aware Reinforcement Learning Approach for Concise Mathematical Reasoning in Language Models. arXiv:2504.09696 [cs.CL]. 2025.
6. Shubham Toshniwal et al. GenSelect: A Generative Approach to Best-of-N. arXiv:2507.17797 [cs.LG]. 2025.
7. Yichao Fu et al. Deep Think with Confidence. arXiv:2508.15260 [cs.LG]. 2025.
