ucb_agentic_ai

Lecture 03: Post-Training Verifiable Agents

Link to lecture recording on YouTube

Date: 2025-09-29

Speaker: Jiantao Jiao 焦剑涛

Speaker’s Social Profile: Website / Google Scholar / GitHub / LinkedIn / X (Twitter)

Education:

Work:

Notes

How do “agentic” models differ from traditional LLMs?

| Model | Aligned with | Design goal |
| --- | --- | --- |
| Earlier chat models | human aligned | provide interactions that maximize human preference |
| Agentic models | environment feedback aligned | provide interactions that maximize verifiable rewards (in addition to human preference) |

What more is needed to go from earlier models to agentic models?

Observation: if the LLM is not strong enough in producing essentially right answers, it is very hard to build a reliable system.

3 core steps to train the verifiable agentic model

Training data: get good verifiable training data to train these agents

Evaluation: verifiable agents are supposed to be intelligent, so we need to define what intelligence is

Training: find the best way of feeding the verifiable data to the model so that the model’s intelligence actually improves

More about training data:

Environment

Tools

Verifier

Environment and verifier diversity is critical

Important to achieve good coverage of many different types of environments, tools and verifiers

Along with scale, verifier quality is also extremely important; make sure the verifier has few false positives and few false negatives
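One way to make the verifier-quality point concrete is to measure the verifier against a small set of attempts whose correctness is already known. The sketch below is a minimal illustration, not something shown in the lecture: the function name `verifier_error_rates` and the toy labels are assumptions.

```python
from typing import Dict, Sequence


def verifier_error_rates(verdicts: Sequence[bool], gold: Sequence[bool]) -> Dict[str, float]:
    """Compare verifier verdicts with trusted gold labels (hypothetical helper).

    verdicts[i] is True when the verifier accepted attempt i;
    gold[i] is True when attempt i is actually correct.
    """
    if len(verdicts) != len(gold):
        raise ValueError("verdicts and gold must have the same length")
    false_pos = sum(v and not g for v, g in zip(verdicts, gold))  # accepted but wrong
    false_neg = sum(g and not v for v, g in zip(verdicts, gold))  # rejected but right
    negatives = sum(not g for g in gold)
    positives = sum(bool(g) for g in gold)
    return {
        "false_positive_rate": false_pos / negatives if negatives else 0.0,
        "false_negative_rate": false_neg / positives if positives else 0.0,
    }


# Toy data (made up for illustration): five attempts judged by a verifier.
rates = verifier_error_rates(
    verdicts=[True, True, False, False, True],
    gold=[True, False, False, True, True],
)
print(rates)
```

A false positive rewards a wrong trajectory and a false negative punishes a correct one, so both rates are worth tracking for every verifier in the collection.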

example:

Getting good evaluation data

Build a very good pipeline to make sure that once solid coverage of training data is achieved, the model generalizes to environments it has not seen.

Agents are expected to:

  1. work with many different tools, and not break when tools are new or modified
  2. work for many different use cases
  3. work under vaguely specified instructions, understand human intentions, and produce precisely formatted outputs
  4. work under many software systems
  5. be robust: behave like high-quality engineers who are able to debug all kinds of issues

Off-the-Shelf Benchmarks (for 1, 2):

Agent Harness (for 4):

Focused “Unit-Test” Evaluations (for 3, 5):

Keep the definition of intelligence holistic, make sure we check all kinds of capabilities of LLMs

We should evaluate on:

How to quantify benchmark quality:

Significant community effort involved in ensuring the quality of benchmarks

Training well

What does “training” an agent mean?

Core characteristics to focus on in the training process:

Training an agent means maximizing the model’s ability to use tools to achieve correctness

Train to imitate first and then to explore

| Step | Goal | Details | Examples |
| --- | --- | --- | --- |
| Supervised fine-tuning (SFT) | minimize non-meaningful attempts | imitate demonstration trajectories that achieve correctness | show the LLM the actions to take (access a database, access an API, etc.) and let the LLM try to replicate them |
| Reinforcement Learning (RL) | reinforce diverse, great attempts from the current model | <ul><li>let the model explore diverse trajectories, reinforce trajectories that achieve correctness, and discourage ones that do not</li><li>diversity accelerates the learning process</li></ul> | after SFT, the LLM is at a stage where it is sometimes reasonably correct, or sometimes makes mistakes but is not totally wrong; now let the LLM learn better by exploring more by itself |

Why both SFT and RL?
SFT is just there to discourage meaningless and low-quality answers, so that we can sample meaningful trajectories during RL with tractable compute
RL is there to truly reinforce intelligence

We want to avoid hurting the diversity of responses during RL by over-SFT-ing on non-diverse examples

More about Reinforcement Learning (RL)

Good RL:

| Item | Idea | Example |
| --- | --- | --- |
| Train for longer | maintain entropy balance; keep the increase in model verbosity controlled | <ul><li>saturation: every time the LLM makes an attempt, it produces the exact same answers</li><li>not good if it plateaus very quickly</li></ul> |
| Train on more difficult prompts and tasks | train on tasks and prompts that are related to the task at hand, but that have low (but NOT zero) chances of success for the current model (meaningfully difficult) | simple tasks can already be solved well, so the model does not learn anything new from them |
| Train on more diverse responses, with high quality reward feedback | <ul><li>see as many different, meaningful answers per prompt as possible</li><li>scale compute to ensure we get good responses</li></ul> | if you only study a single subject, you certainly would not be able to answer questions in the exams of other subjects |
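The "meaningfully difficult" idea can be sketched as a filter on the empirical pass rate of the current model. This is an illustrative sketch only: the function names and the `max_rate` cutoff are assumptions, and the lecture does not prescribe specific thresholds.

```python
from typing import Dict, List


def pass_rate(outcomes: List[bool]) -> float:
    """Fraction of sampled attempts that the verifier marked correct."""
    return sum(outcomes) / len(outcomes)


def select_training_prompts(
    attempts: Dict[str, List[bool]], max_rate: float = 0.5
) -> List[str]:
    """Keep prompts the current model solves sometimes, but not often.

    max_rate is a hypothetical cutoff. Prompts with a zero pass rate are
    dropped because none of the sampled attempts is a correct one to reinforce.
    """
    return [
        prompt
        for prompt, outcomes in attempts.items()
        if 0.0 < pass_rate(outcomes) <= max_rate
    ]


# Toy verifier outcomes for 8 sampled attempts per prompt (made up).
attempts = {
    "easy": [True] * 8,                         # always solved: nothing new to learn
    "hard_but_solvable": [True] + [False] * 7,  # 1 of 8 attempts succeeds
    "impossible": [False] * 8,                  # never solved: no positive signal
}
selected = select_training_prompts(attempts)
print(selected)  # ['hard_but_solvable']
```

In practice the pass rates would be re-estimated as the model improves, since a prompt that is meaningfully difficult now can become easy later.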

Important research topic at the moment:

Train longer

As the number of training steps increases, the share of the performance gain achieved grows (76% at 200 steps, 93% at 800 steps) and the entropy left shrinks (27% at 200 steps, 6% at 800 steps) [1]
entropy: a measure of uncertainty or randomness in the model’s prediction of the next token (sampled from a distribution)
high entropy ⇒ able to get two very different trajectories if sampling twice
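As a small worked illustration of the entropy definition above, the following sketch computes the Shannon entropy of two made-up next-token distributions. The distributions are assumptions chosen only to contrast a saturated model with an uncertain one.

```python
import math
from typing import Sequence


def entropy(probs: Sequence[float]) -> float:
    """Shannon entropy (in nats) of a next-token distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


peaked = [0.97, 0.01, 0.01, 0.01]   # model almost always picks the same token
uniform = [0.25, 0.25, 0.25, 0.25]  # model is maximally uncertain over 4 tokens

print(entropy(peaked), entropy(uniform))
```

Sampling twice from `uniform` will often give different tokens, while sampling twice from `peaked` almost always repeats the same one, which is the saturation behavior described in the notes.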

Without intervention (e.g., an entropy loss or KL regularization), policy entropy is traded for reward predictably

Question:

Two approaches people have taken in the literature:

1st approach: on-policy vs. off-policy [2]:
on-policy: whenever we ask the reward model or verifier to judge the quality of attempts, it only judges attempts produced by the current LLM itself

with the increase in training steps

2nd approach: balance the update strength [3]:
observation: need to decouple the lower and upper clipping ranges $\epsilon_{low}$ and $\epsilon_{high}$ (which clip the likelihood ratio around 1), and make $\epsilon_{high}$ larger
this encourages the model to increase the probabilities assigned to low-probability tokens, so the model is able to explore well
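A minimal sketch of the decoupled clipping idea, written as a per-token PPO-style clipped surrogate. The function name and the default values of `eps_low` and `eps_high` are illustrative assumptions, not settings taken from the lecture.

```python
def clipped_surrogate(
    ratio: float, advantage: float, eps_low: float = 0.2, eps_high: float = 0.28
) -> float:
    """Per-token clipped objective with separate lower and upper clip ranges.

    ratio is the likelihood ratio pi_new(token) / pi_old(token);
    eps_low and eps_high are hypothetical defaults for illustration.
    """
    clipped_ratio = min(max(ratio, 1.0 - eps_low), 1.0 + eps_high)
    # Pessimistic (minimum) choice between the unclipped and clipped terms.
    return min(ratio * advantage, clipped_ratio * advantage)


# A token whose probability rose by 25% and that has a positive advantage.
symmetric = clipped_surrogate(1.25, 1.0, eps_low=0.2, eps_high=0.2)
decoupled = clipped_surrogate(1.25, 1.0, eps_low=0.2, eps_high=0.28)
print(symmetric, decoupled)
```

With the symmetric range the ratio of 1.25 is clipped at 1.2, so the token stops receiving gradient; with the larger upper range it is still inside the clip region and can keep increasing in probability.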

Add a loss to encourage entropy directly [4]: many ways of constructing the loss

Train on more difficult prompts and tasks

Proxy for the correlation between the generation likelihoods of the sequence and the sequence quality (or the advantage):

\[Cov_{y \sim \pi_{\theta}(\cdot|x)}(\log \pi_{\theta} (y|x), \pi_{\theta} (y|x) \cdot A(y,x))\]

Ideally: model is supposed to be very confident in generating the right answers, and uncertain in generating wrong answers
From the Entropy Mechanism paper [1]: the correlation stays very low at a low accuracy of 0.125, meaning the model is unable to generate the right answers very confidently
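To make the covariance proxy above concrete, the sketch below evaluates it exactly for a tiny discrete policy over three candidate responses. The probabilities and advantages are made-up values, and the function name is an assumption.

```python
import math
from typing import Sequence


def entropy_covariance(probs: Sequence[float], advantages: Sequence[float]) -> float:
    """Exact Cov_{y ~ pi}(log pi(y), pi(y) * A(y)) for a small discrete policy.

    probs must be strictly positive and sum to 1.
    """
    log_p = [math.log(p) for p in probs]
    weighted_adv = [p * a for p, a in zip(probs, advantages)]
    mean_log_p = sum(p * lp for p, lp in zip(probs, log_p))
    mean_wa = sum(p * wa for p, wa in zip(probs, weighted_adv))
    return sum(
        p * (lp - mean_log_p) * (wa - mean_wa)
        for p, lp, wa in zip(probs, log_p, weighted_adv)
    )


probs = [0.7, 0.2, 0.1]
# Confident and right: the most likely response has the positive advantage.
cov_confident_right = entropy_covariance(probs, [1.0, -1.0, -1.0])
# Confident and wrong: the most likely response has the negative advantage.
cov_confident_wrong = entropy_covariance(probs, [-1.0, 1.0, 1.0])
print(cov_confident_right, cov_confident_wrong)
```

The covariance is positive when the model is confident in the responses that earn high advantage and negative in the opposite case; values near zero correspond to the regime where likelihood says little about quality.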

You should not expect the model to perform well if it is trained on datasets that are too easy, nor can you blindly throw in super hard prompts
From the Entropy Mechanism paper [1]:

Many approaches [5] to improve learning signal; e.g.,

Sample better

GenSelect [6]: let LLM come up with multiple answers, and pick the best one

DeepConf [7]: track the generation confidence of trajectories; only do majority voting on the sampled answers that remain above a threshold of confidence
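The confidence-filtered voting described above can be sketched as follows. This is a simplified illustration, not the DeepConf implementation: the confidence scores are hypothetical per-trajectory numbers, and the fallback to plain majority voting when nothing passes the threshold is an assumption made here.

```python
from collections import Counter
from typing import List, Tuple


def confident_majority_vote(samples: List[Tuple[str, float]], threshold: float) -> str:
    """Majority vote over answers whose confidence is at or above `threshold`.

    Each sample is (answer, confidence). How confidence is computed is left
    open; here it is a hypothetical per-trajectory score.
    """
    kept = [answer for answer, conf in samples if conf >= threshold]
    if not kept:  # assumption: fall back to plain majority voting
        kept = [answer for answer, _ in samples]
    return Counter(kept).most_common(1)[0][0]


# Five sampled answers with made-up confidence scores.
samples = [("42", 0.9), ("41", 0.3), ("41", 0.2), ("41", 0.35), ("42", 0.8)]
print(confident_majority_vote(samples, threshold=0.5))  # 42
print(confident_majority_vote(samples, threshold=0.0))  # 41
```

In this toy case plain majority voting picks the low-confidence answer, while filtering by confidence first changes the outcome.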

Summary

Community can solve everything as a whole:

| Step | Details |
| --- | --- |
| Have a collection where we curate environments and verifiers to scale up intelligence | crowdsource environments / evals / recipes from individual contributors / companies / researchers, to have diverse, high-quality signals to scale up RL in a single place |
| Figure out the best definition of what intelligence is | <ul><li>analyze the strengths and weaknesses of benchmarks</li><li>include them in the collection so everyone can use them during RL scale-up, to ensure we are measuring intelligence holistically</li></ul> |
| Release algorithms to train longer, on harder prompts, and with diverse responses | <ul><li>create new algorithms that are stable, meaningful, and result in diverse responses</li><li>add them to the collection so everyone can use them for all the environments we are curating</li></ul> |

Open questions:
Humans engage in many different environments, focus on some lessons more than others, learn both from exploring by themselves and from teachers, and are very often evaluated for intelligence on various tasks. How can we bridge the gap between human learning and system learning?

References

  1. Ganqu Cui et al. The Entropy Mechanism of Reinforcement Learning for Reasoning Language Models. arXiv:2505.22617 [cs.LG]. 2025.

  2. Yaru Hao et al. On-Policy RL with Optimal Reward Baseline. arXiv:2505.23585 [cs.LG]. 2025. 

  3. Qiying Yu et al. DAPO: An Open-Source LLM Reinforcement Learning System at Scale. arXiv:2503.14476 [cs.LG]. 2025. 

  4. Jujie He et al. Skywork Open Reasoner 1 Technical Report. arXiv:2505.22312 [cs.LG]. 2025. 

  5. Jixiao Zhang, Chunsheng Zuo. GRPO-LEAD: A Difficulty-Aware Reinforcement Learning Approach for Concise Mathematical Reasoning in Language Models. arXiv:2504.09696 [cs.CL]. 2025. 

  6. Shubham Toshniwal et al. GenSelect: A Generative Approach to Best-of-N. arXiv:2507.17797 [cs.LG]. 2025. 

  7. Yichao Fu et al. Deep Think with Confidence. arXiv:2508.15260 [cs.LG]. 2025.