Lecture 05: Some Challenges and Lessons from Training Agentic Models
Link to lecture recording on YouTube
Date: 2025-10-13
Speaker: Weizhu Chen
Speaker’s Social Profile: Company Profile / Google Scholar / GitHub / LinkedIn / X (Twitter)
Education:
- Ph.D. in Computer Science, The Hong Kong University of Science and Technology
Work:
- Technical Fellow and CVP, Microsoft
Notes
Important aspects of agentic training:
- goal oriented
- tool usage
- planning and reasoning
- user interaction
Lessons from experience in the industry
- data
- evaluation: in reinforcement learning, agentic model training largely comes down to how the grader is defined
- efficiency environment
RL data:
- verifiable
- non-verifiable:
- open and subjective data: style, writing, safety
- rubrics: much more complicated than most people think
- data synthesis
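To make the verifiable case concrete, here is a minimal sketch of a reward that can be checked automatically against a reference answer. The function names and the normalization rule are assumptions for illustration, not something shown in the lecture; real verifiers (unit tests, math checkers) are usually more involved.

```python
def normalize(answer: str) -> str:
    """Strip surrounding whitespace, lowercase, and drop trailing periods."""
    return answer.strip().lower().rstrip(".")


def verifiable_reward(model_answer: str, reference: str) -> float:
    """Return 1.0 if the normalized answers match exactly, else 0.0.

    Hypothetical helper: exact match is only one possible verifier.
    """
    return 1.0 if normalize(model_answer) == normalize(reference) else 0.0
```

Non-verifiable data (style, writing, safety) has no such reference to compare against, which is why it needs rubrics.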
Rubrics: scorable with steerability
- it is insufficient to just have human experts write ground truth based on their own opinions
- it is no surprise to have 50 criteria for one question, and each criterion needs to be highly detailed and interpretable
- very structured, enabling precise grading; when an open question is hard to measure, split it into several pieces that can each be measured
- an expensive exercise: experts may take hours, days, or even months to write up all these criteria
- quality is more important than quantity
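The idea of splitting an open question into separately measurable criteria can be sketched as a weighted checklist. This is a minimal illustration under assumptions: the `Criterion` class, the weights, and the example criteria are hypothetical, and in practice each True/False judgment would come from an expert or a judge model.

```python
from dataclasses import dataclass


@dataclass
class Criterion:
    description: str  # detailed, interpretable statement a grader can check
    weight: float     # relative importance (hypothetical, not from the lecture)


def rubric_score(criteria, judgments):
    """Return the weighted fraction of criteria satisfied, in [0, 1].

    judgments[i] is True if the response meets criteria[i].
    """
    if len(criteria) != len(judgments):
        raise ValueError("one judgment per criterion is required")
    total = sum(c.weight for c in criteria)
    if total <= 0:
        raise ValueError("weights must sum to a positive value")
    earned = sum(c.weight for c, ok in zip(criteria, judgments) if ok)
    return earned / total


# Hypothetical rubric for a writing task.
rubric = [
    Criterion("States the main claim in the first paragraph", 2.0),
    Criterion("Cites at least one concrete example", 1.0),
    Criterion("Avoids unsupported medical advice", 1.0),
]
score = rubric_score(rubric, [True, False, True])
print(score)  # 0.75
```

Because each criterion is judged on its own, the final score stays interpretable: it is possible to see exactly which criteria a response missed.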
Data efficiency: RLVR (Reinforcement Learning with Verifiable Rewards) with one example
- observation: after ~100 steps, the accuracy of RLVR with one example (~30%) gets close to that of training with 1200 examples (~36%)
- the power of exploration: not just memorization; RL explores the building blocks for math problems
- extremely high data efficiency: 1 sample to figure out most building blocks
- quality matters: high entropy to encourage exploration; can’t be too hard (pass rate is 0), or too easy (without negative feedback)
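The "not too hard, not too easy" condition can be sketched as a filter on per-example pass rates measured from a few rollouts. This is an assumption-laden illustration: the function names are hypothetical, and ranking by p(1 - p) is a proxy chosen here for "mixed outcomes", not the selection rule used in the lecture.

```python
def pass_rate(rewards):
    """Fraction of rollouts that received reward 1."""
    return sum(rewards) / len(rewards)


def is_informative(rewards):
    """True if the example is neither always failed nor always solved."""
    if not rewards:
        return False
    rate = pass_rate(rewards)
    return 0.0 < rate < 1.0


def select_examples(rollout_rewards):
    """Keep informative examples, most mixed outcomes first.

    rollout_rewards maps an example id to a list of 0/1 rewards.
    Ranking by p * (1 - p) is an assumption made for this sketch.
    """
    kept = [ex for ex, r in rollout_rewards.items() if is_informative(r)]

    def spread(ex):
        p = pass_rate(rollout_rewards[ex])
        return p * (1.0 - p)

    return sorted(kept, key=spread, reverse=True)
```

An example with pass rate 0 gives no positive signal and one with pass rate 1 gives no negative feedback, so both are dropped before ranking.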
Data mix: curating high-quality data often outperforms the alchemy of parameter tuning during training
| Tips | Examples |
| --- | --- |
| Hard problems are usually more useful for powerful models | put in more of the easier data at the beginning; mix in more difficult data as training moves forward and the model becomes stronger |
| The goodness of data is also model dependent | for a coding model, people also ask non-coding questions, such as writing a document based on the code |
| Combine the use of real data and synthetic data; real data help in real cases, while synthetic data can be formalized in multiple styles | synthesize more data if the model is good at some categories but not others |
| Use powerful models as judges to generate more data | |
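The first tip (easier data early, harder data later) can be sketched as a sampling schedule. This is only an illustration: the linear schedule, the 10% and 70% endpoints, and the function names are assumptions made here, not values from the lecture.

```python
import random


def hard_fraction(step, total_steps, start=0.1, end=0.7):
    """Share of hard examples at a given step, interpolated linearly.

    start and end are hypothetical endpoints chosen for this sketch.
    """
    if total_steps <= 0:
        raise ValueError("total_steps must be positive")
    progress = min(max(step / total_steps, 0.0), 1.0)
    return start + (end - start) * progress


def sample_batch(easy, hard, step, total_steps, batch_size, rng):
    """Draw a batch (with replacement) whose hard share follows the schedule.

    Assumes both pools are non-empty.
    """
    n_hard = round(batch_size * hard_fraction(step, total_steps))
    n_easy = batch_size - n_hard
    return rng.choices(hard, k=n_hard) + rng.choices(easy, k=n_easy)


rng = random.Random(0)
early = sample_batch(["e1", "e2"], ["h1", "h2"], 0, 100, 10, rng)
late = sample_batch(["e1", "e2"], ["h1", "h2"], 100, 100, 10, rng)
```

With these assumed endpoints, `early` contains one hard item out of ten and `late` contains seven. In practice the schedule would be tuned to how quickly the model improves.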
[Incomplete, work in progress]
References