ucb_agentic_ai

Lecture 05: Some Challenges and Lessons from Training Agentic Models

Link to lecture recording on YouTube

Date: 2025-10-13

Speaker: Weizhu Chen

Speaker’s Social Profile: Company Profile / Google Scholar / GitHub / LinkedIn / X (Twitter)

Education:

Ph.D. in Computer Science, The Hong Kong University of Science and Technology

Work:

Technical Fellow and CVP, Microsoft

Notes

Important aspects of agentic training:

goal oriented
tool usage
planning and reasoning
user interaction

Lessons from experience in the industry

data
evaluation: in reinforcement learning, agentic model training comes down to how the grader is defined
efficiency environment

RL data:

verifiable
- math
- code
non-verifiable:
- open and subjective data: style, writing, safety
- rubrics: much more complicated than most people think
- data synthesis
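
To illustrate what "verifiable" means for math data, here is a minimal sketch of a reward function that checks a model answer against a reference. The names `normalize_answer` and `math_reward` and the matching rule are assumptions made for illustration; they are not from the lecture, and real graders usually need far more careful answer parsing.

```python
from fractions import Fraction


def normalize_answer(text: str) -> str:
    """Strip whitespace and a trailing period so '42.' and ' 42 ' compare equal."""
    return text.strip().rstrip(".").strip()


def math_reward(model_answer: str, reference: str) -> float:
    """Return 1.0 if the answer matches the reference, else 0.0 (hypothetical rule)."""
    a, b = normalize_answer(model_answer), normalize_answer(reference)
    try:
        # Compare numerically when both sides parse, so '1/2' equals '0.5'.
        return 1.0 if Fraction(a) == Fraction(b) else 0.0
    except (ValueError, ZeroDivisionError):
        # Fall back to exact string comparison for non-numeric answers.
        return 1.0 if a == b else 0.0
```

The reward is binary and needs no human judgment, which is what separates this category from the non-verifiable data listed above.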

Rubrics: scorable with steerability

it is insufficient to let human experts write the ground truth based only on their own opinions
it is no surprise to have 50 criteria for a single question, and each criterion needs to be highly detailed and interpretable
rubrics are very structured, which enables precise grading; when an open question is hard to measure, split it into several pieces that can each be measured
writing up all these criteria is an expensive exercise for experts, possibly taking hours, days, or even months
quality is more important than quantity
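
The idea of splitting an open question into separately measurable criteria can be sketched as a weighted checklist. The `Criterion` class, its `weight` field, and the aggregation rule below are hypothetical choices for illustration, not the speaker's grader; in practice each verdict would come from a human or a model judge.

```python
from dataclasses import dataclass


@dataclass
class Criterion:
    description: str  # detailed, interpretable statement of what is checked
    weight: float     # relative importance (hypothetical field)


def rubric_score(criteria: list[Criterion], passed: list[bool]) -> float:
    """Weighted fraction of criteria satisfied, in [0, 1]."""
    if len(criteria) != len(passed):
        raise ValueError("one verdict is required per criterion")
    total = sum(c.weight for c in criteria)
    if total <= 0:
        raise ValueError("total weight must be positive")
    earned = sum(c.weight for c, ok in zip(criteria, passed) if ok)
    return earned / total


rubric = [
    Criterion("States the main conclusion in the first paragraph", 2.0),
    Criterion("Cites at least one supporting fact", 1.0),
    Criterion("Avoids unsupported medical claims", 1.0),
]
score = rubric_score(rubric, [True, False, True])  # 3.0 / 4.0 = 0.75
```

Because every criterion is judged on its own, a low score can be traced back to the specific items that failed.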

Data efficiency: RLVR (Reinforcement Learning with Verifiable Rewards) with one example
observation: after ~100 steps, the accuracy of RLVR trained on a single example (~30%) approaches that of training on 1200 examples (~36%)

the power of exploration: not just memorization; RL explores the building blocks of math problems
extremely high data efficiency: 1 sample to figure out most building blocks
quality matters: high entropy encourages exploration; the example can’t be too hard (pass rate of 0) or too easy (no negative feedback)
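
The "not too hard, not too easy" condition can be checked by sampling several rollouts per example and looking at the pass rate. This is a minimal sketch under the assumption that each rollout yields a non-negative reward; the function names and the strict 0-to-1 cutoff are illustrative choices, not from the lecture.

```python
def pass_rate(rewards: list[float]) -> float:
    """Fraction of sampled rollouts that earned a positive reward."""
    if not rewards:
        raise ValueError("need at least one rollout")
    return sum(1 for r in rewards if r > 0) / len(rewards)


def is_informative(rewards: list[float]) -> bool:
    """Keep an example only if its rollouts include both successes and failures."""
    rate = pass_rate(rewards)
    return 0.0 < rate < 1.0
```

An example whose rollouts all fail or all succeed gives the same reward everywhere, so it carries no contrast for the policy to learn from.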

Data mix: curating high-quality data often does more for training than alchemy-style parameter tuning

Tips	Examples
Hard problems are usually more useful for powerful models	Put in more of the easier data at the beginning, then mix in more difficult data as training moves forward and the model becomes stronger
How good the data is also depends on the model	For a coding model, people also ask non-coding questions, such as writing a document based on the code
Combine real data and synthetic data; real data helps in real cases, while synthetic data can be formalized in multiple styles	Synthesize more data if the model is good at some categories but not others
Use powerful models as judges to generate more data
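
The first tip in the table (easier data early, harder data later) can be sketched as a schedule for the share of hard examples in each batch. The linear ramp, the `start` and `end` defaults, and the function names are assumptions chosen for illustration; the lecture does not prescribe a specific schedule.

```python
import random


def hard_fraction(step: int, total_steps: int,
                  start: float = 0.1, end: float = 0.7) -> float:
    """Linearly increase the share of hard examples from `start` to `end`."""
    if total_steps <= 0:
        raise ValueError("total_steps must be positive")
    progress = min(max(step / total_steps, 0.0), 1.0)
    return start + (end - start) * progress


def sample_batch(easy, hard, step, total_steps, batch_size, rng):
    """Draw a batch (with replacement) whose hard/easy split follows the schedule."""
    n_hard = round(batch_size * hard_fraction(step, total_steps))
    batch = rng.choices(hard, k=n_hard) + rng.choices(easy, k=batch_size - n_hard)
    rng.shuffle(batch)
    return batch


rng = random.Random(0)  # fixed seed so the sketch is reproducible
early = sample_batch(["easy"], ["hard"], 0, 100, 10, rng)
late = sample_batch(["easy"], ["hard"], 100, 100, 10, rng)
```

With these defaults the early batch is mostly easy items and the late batch is mostly hard ones; a schedule driven by the model's measured pass rate would be a natural alternative.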

[Incomplete, work in progress]
