
Lecture 04: Agent Evaluation & Project Overview

Link to lecture recording on YouTube
(this video is unlisted and is not part of the YouTube playlist)

Date: 2025-10-06

Speaker: Teaching Assistants of the course

Notes

Evaluation

systematic, repeatable measurement of models and agents
a structured way to measure performance across benchmarks and environments
grounds capability claims in reproducible evidence; helps assess risk

Evaluation is needed at every stage of building AI models:

training stage: metrics used to optimize the model: 1) a loss function for training; 2) validation loss to track performance
development / model selection stage: 1) evaluate how training went; 2) select the best model
deployment / publication stage: trustworthy, standardized, and reproducible evaluation
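
As a minimal sketch of the model selection stage, the snippet below picks the checkpoint with the lowest validation loss. The `Checkpoint` class and the loss values are made up for illustration and are not from the lecture.

```python
from dataclasses import dataclass


@dataclass
class Checkpoint:
    name: str
    train_loss: float  # the quantity optimized during training
    val_loss: float    # tracked on held-out data to monitor performance


def select_best(checkpoints: list[Checkpoint]) -> Checkpoint:
    """Return the checkpoint with the lowest validation loss."""
    return min(checkpoints, key=lambda c: c.val_loss)


# Illustrative numbers only: training loss keeps falling, but validation
# loss rises again at epoch 3, which suggests overfitting.
checkpoints = [
    Checkpoint("epoch-1", train_loss=0.90, val_loss=0.95),
    Checkpoint("epoch-2", train_loss=0.55, val_loss=0.70),
    Checkpoint("epoch-3", train_loss=0.30, val_loss=0.82),
]
best = select_best(checkpoints)
print(best.name)  # epoch-2
```

Selecting on validation loss rather than training loss is what keeps the overfit `epoch-3` checkpoint from being chosen.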

Why evaluation matters

enables fair comparisons across models and agents
guides safe and effective deployment decisions by exposing weaknesses and strengths
reliable evaluation is critical to developing effective and safe agents for real-world applications

LLM agents require more complex evaluation than LLMs¹

Type	Environment	Action
LLM eval	static	text-to-text systems
LLM agent eval	dynamic	extends text-to-text systems with planning, tool use, memory, and multi-step reasoning

Types of LLM (agents) eval:

Type	Characteristics	Example	Metrics	Limit
Close-ended tasks	<ul><li>limited number of potential answers</li><li>limited number of correct answers</li><li>enables automatic evaluation</li></ul>	<ul><li>sentiment analysis</li><li>entailment: SNLI</li><li>named entity recognition: CoNLL-2003</li><li>part-of-speech tagging: PTB</li></ul>	accuracy / precision / recall / F1 etc.	limited to certain easier tasks
Open-ended tasks	<ul><li>long generations with too many possible correct answers to enumerate</li><li>better and worse answers (not just right and wrong)</li></ul>	<ul><li>summarization: CNN-DM / Gigaword</li><li>translation: WMT</li><li>instruction-following: Chatbot Arena / AlpacaEval / MT-Bench</li></ul>	<ul><li>verifiable tasks (with an “oracle” or clear criteria to test correctness; e.g. math proof, code generation): build an eval following the criteria</li><li>non-verifiable tasks (tasks without clear test criteria or an objective ground-truth answer; e.g. storytelling, writing style adaptation): 1) human eval; 2) LLM-as-a-judge</li></ul>	human eval:<ul><li>slow</li><li>expensive</li><li>inter-annotator disagreement</li><li>intra-annotator disagreement across time</li><li>not reproducible</li></ul> LLM-as-a-judge (solutions in brackets):<ul><li>sometimes unreliable (cross-check with human eval)</li><li>output has randomness (repeat multiple times with the same LLM)</li><li>different LLMs have different biases (majority vote across different LLMs)</li><li>interpretability of scores is poor (try continuous instead of discrete scoring; score from multiple perspectives / set multiple rubrics)</li><li>sensitive to vague prompts (design detailed prompts; use chain-of-thought prompting or a reasoning mode)</li></ul>
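
Two of the LLM-as-a-judge mitigations in the table, repeating the same judge and taking a majority vote across judges, can be sketched as follows. This is only an illustration: the `Judge` interface, the 1 to 5 score scale, the pass threshold, and the three stub judges are all assumptions. A real setup would call actual LLMs where the stubs are.

```python
from collections import Counter
from statistics import mean
from typing import Callable

# Hypothetical judge interface: (task prompt, answer) -> integer rubric score, 1 to 5.
Judge = Callable[[str, str], int]


def repeated_score(judge: Judge, prompt: str, answer: str, n: int = 5) -> float:
    """Average n calls to the same judge to smooth out sampling randomness."""
    return mean(judge(prompt, answer) for _ in range(n))


def majority_verdict(judges: list[Judge], prompt: str, answer: str, threshold: int = 3) -> bool:
    """Each judge votes pass/fail; the answer passes on a strict majority."""
    votes = Counter(judge(prompt, answer) >= threshold for judge in judges)
    return votes[True] > len(judges) / 2


# Deterministic stand-ins for real LLM judges, each with a different bias.
def lenient_judge(prompt: str, answer: str) -> int:
    return 4


def strict_judge(prompt: str, answer: str) -> int:
    return 2


def length_biased_judge(prompt: str, answer: str) -> int:
    return 5 if len(answer) > 20 else 1


judges = [lenient_judge, strict_judge, length_biased_judge]
prompt = "What is the capital of France?"
answer = "Paris is the capital of France."
print(repeated_score(strict_judge, prompt, answer))
print(majority_verdict(judges, prompt, answer))
```

Because the stubs are deterministic, repetition has no effect here. With real judges sampled at non-zero temperature, averaging is what reduces the variance.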

Static vs. dynamic eval:

Type of Benchmarks	Characteristics	Advantages	Examples
Static	fixed test cases and metrics	enable direct, reproducible comparisons between models	ImageNet, GLUE, MMLU
Dynamic	data are continuously updated or periodically regenerated to keep pace with real-world data shifts	harder to overfit to or contaminate	DynaBench, LiveCodeBench

Taxonomy of agent evals:

evaluating a specific agent capability: planning and multi-step reasoning, function calling and tool use, self-reflection, memory
evaluating on a specific application: web, software engineering, scientific, conversational, safety, cybersecurity, legal, healthcare, finance
evaluating on a general set of applications

What is a good eval?²

task validity: success on the task reflects the target capability of the agent
outcome validity: the evaluation outcome faithfully reflects whether the agent actually succeeded at the task

Ways that eval can go wrong³

Issue	Solution
Data are noisy or biased	make sure the test data for eval are accurate and diverse enough
Not practical	think about the practitioners’ real needs, instead of testing on some toy examples
Eval can be gamed	find and close any shortcuts that let an agent score well without solving the task
Not challenging enough	design hard test cases to make sure agent is reliable

Considerations of constructing a good benchmark:

specify the goal of this benchmark and what to evaluate
clarify the task and the environment in which the agent runs (e.g. which tools it has access to)
decide how to evaluate the outcomes (e.g. define the metric); build a data collection pipeline to run the benchmark

Principles of a good benchmark:

covers relevant real-world applications
has different difficulty levels
resists contamination (data leakage that lets a model memorize the benchmark data) and saturation (a rapid rise in scores over a short period that makes the benchmark nearly obsolete)
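
One common guard against contamination is a date filter: evaluate a model only on tasks that became public after its knowledge cutoff. The sketch below assumes each task records a publication date. The `Task` class, the task IDs, and the dates are hypothetical.

```python
from dataclasses import dataclass
from datetime import date


@dataclass
class Task:
    task_id: str
    published: date  # when the underlying problem first became public


def uncontaminated(tasks: list[Task], knowledge_cutoff: date) -> list[Task]:
    """Keep only tasks published strictly after the model's knowledge cutoff."""
    return [t for t in tasks if t.published > knowledge_cutoff]


# Made-up tasks and cutoff, for illustration only.
tasks = [
    Task("old-problem", date(2023, 5, 1)),
    Task("borderline", date(2024, 6, 30)),
    Task("fresh-problem", date(2025, 2, 14)),
]
cutoff = date(2024, 6, 30)
print([t.task_id for t in uncontaminated(tasks, cutoff)])  # ['fresh-problem']
```

The comparison is strict, so a task published on the cutoff date itself is excluded, which errs on the side of caution.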

Benchmark	Goal	Task	Environment	Data generation	How to evaluate
CyberGym⁴	evaluate an agent’s cybersecurity capabilities by testing its ability to reproduce real-world vulnerabilities at scale	<ul><li>given a vulnerability description and the pre-patch codebase + executable</li><li>agents must generate a proof-of-concept (PoC) test that successfully triggers the vulnerability in the corresponding unpatched codebase</li></ul>	a containerized sandbox (bash interface, exec output) to run programs	<ul><li>built from the ARVO⁵ dataset and historical, real-world vulnerabilities found by OSS-Fuzz</li><li>reconstruct pre/post-patch commits & executables and include the ground-truth proof-of-concept; rephrase into concise vulnerability descriptions with the help of LLMs and manual inspection</li></ul>	<ul><li>execute the final proof-of-concept on pre-patch and post-patch builds; count it as a success if it a) triggers the target vulnerability only on the pre-patch build (reproduction), or b) triggers any vulnerability on the post-patch build (post-patch finding); report overall success rate</li><li>detection via runtime sanitizers (crash + stack trace), not subjective judging</li><li>a data contamination analysis evaluates vulnerability samples found after the LLM knowledge cutoff dates, which the models cannot have memorized</li></ul>
τ-bench⁶	evaluate an agent’s ability to reliably interact with users and APIs while consistently following complex, domain-specific policies	<ul><li>agents resolve a simulated user’s goal using API tools through a multi-turn, dynamic conversation</li><li>e.g. retail, airline customer service</li></ul>	single DB + agent tools; each domain provides <ul><li>a set of API tools</li><li>a specific policy document to follow</li><li>an LLM-only user simulator that does not need to use tools</li></ul>	<ul><li>manual design of schemas/APIs/policies</li><li>LM-assisted synthetic data generation (GPT-4 helps write the initial data-sampling code, which humans then verify and polish)</li><li>manual scenario authoring + iterative validation with many agent runs to ensure each task has a unique end-state outcome</li></ul>	<ul><li>evaluation is programmatic and verifiable</li><li>success is determined by comparing the final database state to the annotated goal state</li><li>report pass^1 (average success) and pass^k (all k independent and identically distributed runs succeed) to capture reliability/consistency; note that this differs from pass@k, where at least one of k runs must succeed</li></ul>
τ²-Bench⁷	shift from single-control to dual-control, modeled as a decentralized partially observable Markov decision process (Dec-POMDP)⁸: both agent and user act via tools in a shared world, which stresses coordination and guidance	<ul><li>users can also use tools to interact with the agent</li><li>the user is an LLM simulator constrained by its available set of tools and the observable state of the environment</li></ul>	extends τ-bench with 2 databases (Agent DB + User/Device DB) and separate toolsets for agent and user	a more complicated and comprehensive data creation pipeline: <ul><li>use an LLM-drafted Product Requirement Document (PRD) to guide the generation of code/mock DBs/unit tests along with the user DB and tools</li><li>perform programmatic compositional task creation from atomic subtasks with verification procedures (e.g. assertions, auto-verification)</li></ul>	additional categorical checks for different components or different steps of a given task e.g.<ul><li>environment assertions</li><li>communication assertions</li><li>natural language assertions</li><li>action assertions</li></ul>report pass^1 and pass^k
GDPval⁹	measure LLM performance on economically valuable, real-world knowledge-work tasks, comparing AI deliverables to those of industry experts across diverse occupations; a more AGI-relevant benchmark	the model produces a one-shot deliverable (e.g. document, slide deck, spreadsheet, diagram, media)	each task is a realistic work assignment with reference files / context (e.g. documents, data, assets)	<ul><li>tasks authored by vetted professionals (average 14 years of experience)</li><li>tasks pass a multi-step review (~5 rounds) plus LLM-based validation</li><li>prompts mirror day-to-day work and include attachments</li><li>gold deliverables are experts’ own solutions</li></ul>	<ul><li>blinded expert graders from the same occupations rank AI vs. human deliverables as better / as good as / worse, and also compare time / cost</li><li>a good example of a benchmark with low contamination risk that is hard to saturate, since tasks require domain experts and are tied to real-world, concrete work products</li></ul>
CRMArena¹⁰	evaluate LLM agents on professional Customer Relationship Management (CRM) workflows in a realistic enterprise sandbox	<ul><li>9 tasks across 3 personas (service agent, analyst, manager)</li><li>new case routing, knowledge Q&A, top issue identification, monthly trend analysis etc.</li></ul>	<ul><li>live Salesforce sandbox (named Simple Demo Org, SDO) with UI & API access</li><li>actions via SOQL/SOSL (SQL-like query languages) or function calls</li><li>rich enterprise schema (16 objects)</li></ul>	<ul><li>LLM synthesis on the Salesforce Service Cloud schema; introduce latent variables (e.g. agent skill, customer shopping habit to categorize the tags we want to measure) to create realistic causal patterns</li><li>mini-batch prompting to generate sample data → de-duplication (string match) + dual verification (both format and content) to ensure data quality before upload; LLM paraphrasing for query diversity</li></ul>	<ul><li>automatic metrics per task, with different metric types for different tasks, e.g. F1 for knowledge Q&A, exact match on ground-truth IDs for all other tasks; optional pass@k to report the agent’s multi-run reliability and consistency</li><li>also report efficiency: number of turns performed, number of tokens spent, cost in dollars</li></ul>
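
The multi-run metrics in the table come in two flavors that are easy to confuse. The pass@k metric from code-generation work asks whether at least one of k runs succeeds, while the reliability metric that the τ-bench paper writes as pass^k asks whether all k runs succeed. Given n recorded trials of a task with c successes, both can be estimated combinatorially. The sketch below uses the standard estimators, and the function names are mine, not from either paper.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate P(at least one of k sampled trials succeeds), given c successes in n trials."""
    return 1.0 - comb(n - c, k) / comb(n, k)


def pass_hat_k(n: int, c: int, k: int) -> float:
    """Estimate P(all k sampled trials succeed), given c successes in n trials."""
    return comb(c, k) / comb(n, k)


# One task, 4 trials, 3 of them successful (illustrative numbers).
n, c = 4, 3
print(pass_at_k(n, c, 1), pass_hat_k(n, c, 1))  # 0.75 0.75
print(pass_at_k(n, c, 2), pass_hat_k(n, c, 2))  # 1.0 0.5
```

At k = 1 the two metrics coincide with the plain success rate. As k grows, pass@k can only rise and pass^k can only fall, which is why pass^k is the one that exposes inconsistent agents. A benchmark-level score would average the per-task values.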

Green Agents

a green agent is a special hosting agent that determines the type of assessment and defines the specific tasks to be performed
the other agents involved are called participating agents, competition agents, or white agents

Green agents:

are specifically designed to serve as evaluators
act as the technicians at the repair shop, orchestrating the entire evaluation process
can interact with the platform through MCP, A2A or APIs; request permissions and resources from the platform and submit results to the platform

Responsibilities of the Green Agent:

preparing the environment for the benchmark to run
distributing test tasks to the participant white agents
collecting their results after the white agents finish
verifying that the environment ran correctly, and hence that the benchmark was performed correctly
gathering statistics and reporting back to the agent-based platform

The green agent dictates which metrics are measured and carries out the full procedure of running evaluations and deriving metrics.
The full assessment flow runs on the agent-based platform:

the platform first confirms that all agents (including the green agent) are online and reset
it then sends a task to the green agent, including the URLs of the participating agents to be tested
the green agent orchestrates the interaction: assigning tasks, supervising execution, managing tools or environments, and continuously reporting updates back to the platform
at the end, the green agent submits the metrics, which are defined by its implementation
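
The flow above can be sketched in a few lines. This is a toy, in-process illustration and not the AgentBeats or A2A API: the `GreenAgent` class, its methods, and the exact-match metric are all assumptions, and white agents are plain Python callables standing in for remote agents that would normally be reached through their URLs.

```python
from dataclasses import dataclass, field
from typing import Callable

# Stand-in for a remote participating agent: task prompt -> answer.
WhiteAgent = Callable[[str], str]


@dataclass
class GreenAgent:
    tasks: dict[str, str]  # task prompt -> expected answer
    updates: list[str] = field(default_factory=list)

    def report(self, message: str) -> None:
        """Stand-in for sending a progress update back to the platform."""
        self.updates.append(message)

    def run_assessment(self, white_agents: dict[str, WhiteAgent]) -> dict[str, float]:
        """Send every task to every white agent and return accuracy per agent."""
        metrics: dict[str, float] = {}
        for name, agent in white_agents.items():
            self.report(f"testing {name}")
            passed = 0
            for prompt, expected in self.tasks.items():
                answer = agent(prompt)  # distribute the task, collect the result
                passed += int(answer.strip() == expected)
            metrics[name] = passed / len(self.tasks)
            self.report(f"{name}: accuracy {metrics[name]:.2f}")
        return metrics


# Two toy white agents: one echoes the prompt, one looks answers up.
answers = {"2+2": "4", "capital of France": "Paris"}
green = GreenAgent(tasks=answers)
metrics = green.run_assessment({
    "echo": lambda prompt: prompt,
    "lookup": lambda prompt: answers.get(prompt, ""),
})
print(metrics)  # {'echo': 0.0, 'lookup': 1.0}
```

The metric here is defined entirely inside `run_assessment`, which mirrors the point that the green agent's implementation decides what gets measured.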

Components of the green agent:

a dataset of test tasks
a predefined testing process (e.g., which agent to test first, in what order tasks are sent, and how tools are provided for the green agent and white agent)
the environment where the tasks run: access to additional tools or MCP modules required for the task, along with instructions for the white agents on how to use them
AgentBeats provides a prompt-based toolkit to help developers quickly spin up a prototype green agent, making it easy to get started
for more rigorous or complex testing (e.g., strict workflows or custom environments), developers can also hand-code the logic of a green agent
ultimately, as long as it complies with the A2A protocol, any web service can serve as a green agent

Two types of class projects for building green agents:

Type	Goal	Details
Integrating an existing benchmark	adapt an existing benchmark (already published / tested) and integrate it as a green agent in AgentBeats; largely reuse existing evaluation metrics or rubrics	<ol><li>integration</li><li>benchmark quality analysis</li><li>correction and expansion</li></ol>
Building a new benchmark	create new benchmarks (no existing source)	<ul><li>realistic daily tasks to showcase agentic reasoning</li><li>tasks should reflect useful, real-world scenarios</li><li>evaluation: automatic or lightweight human checks</li></ul>

Step-by-step checklist:

Step	Details	Example
Choose the task to evaluate on		ticket-booking agent
Design the environment	<ul><li>tools that agent can interact with</li><li>actions that agent can make</li><li>environment feedback to the agent after each action</li></ul>	<ul><li>tools: web browser or app for ticket booking</li><li>actions: mouse clicking, keyboard typing, or app API</li><li>environment feedback: new webpage pops up after clicking</li></ul>
Design the metrics		<ul><li>success rate of booking</li><li>price of ticket</li><li>whether ticket satisfies user’s requirements</li></ul>
Design test cases	<ul><li>think about different scenarios</li><li>design test cases of white agents succeeding / failing in different ways</li><li>include as many edge cases as possible</li></ul>	white agent:<ul><li>successfully books the ticket</li><li>books the wrong / more expensive ticket</li><li>fails to find the website for booking tickets</li><li>etc.</li></ul>
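
To make the ticket-booking example concrete, the sketch below scores a handful of white-agent episodes with the three metrics from the checklist. The `Episode` record, its fields, and the prices are hypothetical. A real green agent would fill these in by inspecting the environment after each run.

```python
from dataclasses import dataclass
from statistics import mean
from typing import Optional


@dataclass
class Episode:
    booked: bool                # did the white agent complete a booking?
    price: Optional[float]      # ticket price, None if nothing was booked
    meets_requirements: bool    # does the booked ticket match the user's request?


def summarize(episodes: list[Episode]) -> dict[str, Optional[float]]:
    """Compute booking success rate, average ticket price, and requirement satisfaction rate."""
    prices = [e.price for e in episodes if e.booked and e.price is not None]
    return {
        "booking_success_rate": sum(e.booked for e in episodes) / len(episodes),
        "avg_price": mean(prices) if prices else None,
        "requirement_rate": sum(e.booked and e.meets_requirements for e in episodes) / len(episodes),
    }


# One episode per outcome listed in the checklist (made-up prices).
episodes = [
    Episode(booked=True, price=120.0, meets_requirements=True),    # books the right ticket
    Episode(booked=True, price=120.0, meets_requirements=False),   # books the wrong ticket
    Episode(booked=True, price=180.0, meets_requirements=False),   # books a more expensive ticket
    Episode(booked=False, price=None, meets_requirements=False),   # never finds the booking site
]
summary = summarize(episodes)
print(summary)
```

Reporting the three numbers separately matters here: an agent can book something in 75% of episodes while satisfying the user in only 25%.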

References

1. Asaf Yehudai et al. Survey on Evaluation of LLM-based Agents. arXiv:2503.16416 [cs.AI]. 2025.
2. Yuxuan Zhu et al. Establishing Best Practices for Building Rigorous Agentic Benchmarks. arXiv:2507.02825 [cs.AI]. 2025.
3. Maria Eriksson et al. Can We Trust AI Benchmarks? An Interdisciplinary Review of Current Issues in AI Evaluation. arXiv:2502.06559 [cs.AI]. 2025.
4. Zhun Wang et al. CyberGym: Evaluating AI Agents’ Real-World Cybersecurity Capabilities at Scale. arXiv:2506.02548 [cs.CR]. 2025.
5. Xiang Mei et al. ARVO: Atlas of Reproducible Vulnerabilities for Open Source Software. arXiv:2408.02153 [cs.CR]. 2024.
6. Shunyu Yao et al. τ-bench: A Benchmark for Tool-Agent-User Interaction in Real-World Domains. arXiv:2406.12045 [cs.AI]. 2024.
7. Victor Barres et al. τ²-Bench: Evaluating Conversational Agents in a Dual-Control Environment. arXiv:2506.07982 [cs.AI]. 2025.
8. Daniel S Bernstein, Shlomo Zilberstein, Neil Immerman. The Complexity of Decentralized Control of Markov Decision Processes. arXiv:1301.3836 [cs.AI]. 2013.
9. Tejal Patwardhan et al. GDPval: Evaluating AI Model Performance on Real-World Economically Valuable Tasks. arXiv:2510.04374 [cs.LG]. 2025.
10. Kung-Hsiang Huang et al. CRMArena: Understanding the Capacity of LLM Agents to Perform Professional CRM Tasks in Realistic Environments. arXiv:2411.02305 [cs.CL]. 2024.
