ucb_agentic_ai

Lecture 04: Agent Evaluation & Project Overview

Link to lecture recording on YouTube
(this video is unlisted and is not part of the YouTube playlist)

Date: 2025-10-06

Speaker: Teaching Assistants of the course

Notes

Evaluation

Evaluation is needed at every step of building AI models:

Why evaluation matters

LLM agents require more complex evaluation than plain LLMs<sup>1</sup>

| Type | Environment | Action |
| --- | --- | --- |
| LLM eval | static | text-to-text systems |
| LLM agent eval | dynamic | extend them with planning, tool-use, memory, and multi-step reasoning |

Types of LLM (agents) eval:

| Type | Characteristics | Examples | Metrics | Limitations |
| --- | --- | --- | --- | --- |
| Close-ended tasks | <ul><li>limited number of potential answers</li><li>limited number of correct answers</li><li>enables automatic evaluation</li></ul> | <ul><li>sentiment analysis</li><li>entailment: SNLI</li><li>named entity recognition: CoNLL-2003</li><li>part-of-speech tagging: PTB</li></ul> | accuracy / precision / recall / F1 etc. | limited to certain easier tasks |
| Open-ended tasks | <ul><li>long generations with too many possible correct answers to enumerate</li><li>better and worse answers (not just right and wrong)</li></ul> | <ul><li>summarization: CNN-DM / Gigaword</li><li>translation: WMT</li><li>instruction-following: Chatbot Arena / AlpacaEval / MT-Bench</li></ul> | <ul><li>verifiable tasks (having an “oracle” or clear criteria to test correctness; e.g. math proof, code generation): build an eval following the criteria</li><li>non-verifiable tasks (tasks without clear test criteria or objective ground-truth answer; e.g. storytelling, writing style adaptation): 1) human eval; 2) LLM-as-a-judge</li></ul> | human eval:<ul><li>slow</li><li>expensive</li><li>inter-annotator disagreement</li><li>intra-annotator disagreement across time</li><li>not reproducible</li></ul>LLM-as-a-judge (solutions in brackets):<ul><li>sometimes unreliable (cross-check with human eval)</li><li>output has randomness (repeat multiple times with the same LLM)</li><li>different LLMs have different biases (majority vote between different LLMs)</li><li>interpretability of scores is poor (try continuous instead of discrete scoring; score from multiple perspectives / set multiple rubrics)</li><li>sensitive to vague prompts (design detailed prompts; use chain-of-thought prompts or reasoning mode)</li></ul> |
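
Two of the LLM-as-a-judge mitigations listed above (repeating the same judge to smooth out randomness, and taking a majority vote across different judges) can be sketched in a few lines. This is a minimal illustration under stated assumptions, not a prescribed implementation: the `Judge` callable type, the 0.5 pass threshold, and the three stub judges are all hypothetical; a real judge would call an LLM.

```python
from collections import Counter
from statistics import mean
from typing import Callable, Sequence

# Hypothetical judge interface: takes a candidate answer, returns a score in [0, 1].
Judge = Callable[[str], float]


def repeated_score(judge: Judge, answer: str, n_repeats: int = 3) -> float:
    """Average several calls to the same judge to smooth out sampling randomness."""
    return mean(judge(answer) for _ in range(n_repeats))


def majority_verdict(
    judges: Sequence[Judge],
    answer: str,
    threshold: float = 0.5,  # assumed pass/fail cut-off
    n_repeats: int = 3,
) -> bool:
    """Each judge votes pass/fail on its averaged score; the majority decides.

    A tie (possible with an even number of judges) counts as a fail.
    """
    votes = [repeated_score(j, answer, n_repeats) >= threshold for j in judges]
    counts = Counter(votes)
    return counts[True] > counts[False]


# Stub judges standing in for real LLM calls (deterministic for the demo).
def lenient(answer: str) -> float:
    return 0.9


def strict(answer: str) -> float:
    return 0.4


def length_biased(answer: str) -> float:
    # Mimics a judge that is biased toward longer answers.
    return min(len(answer) / 20, 1.0)


verdict = majority_verdict([lenient, strict, length_biased], "a short answer")
print("accepted" if verdict else "rejected")
```

Using an odd number of judges avoids ties; cross-checking a sample of verdicts against human ratings, as the table suggests, remains necessary because a majority of biased judges is still biased.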

Static vs. dynamic eval:

| Type of Benchmarks | Characteristics | Advantages | Examples |
| --- | --- | --- | --- |
| Static | fixed test cases and metrics | enable direct, reproducible comparisons between models | ImageNet, GLUE, MMLU |
| Dynamic | continuously update or periodically re-generate data to stay relevant with real-world data shifts | harder to overfit or be contaminated | DynaBench, LiveCodeBench |
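
One simple way a dynamic benchmark can resist contamination is to evaluate a model only on items that became public after its knowledge cutoff. The sketch below illustrates that idea only; it is not the actual pipeline of DynaBench or LiveCodeBench, and the `BenchmarkItem` class, the item IDs, the dates, and the cutoff are all hypothetical.

```python
from dataclasses import dataclass
from datetime import date
from typing import List, Sequence


@dataclass(frozen=True)
class BenchmarkItem:
    item_id: str
    published: date  # date the item first became public


def fresh_items(items: Sequence[BenchmarkItem], knowledge_cutoff: date) -> List[BenchmarkItem]:
    """Keep only items published strictly after the model's knowledge cutoff."""
    return [item for item in items if item.published > knowledge_cutoff]


# Hypothetical item pool and cutoff date.
pool = [
    BenchmarkItem("task-001", date(2023, 5, 1)),
    BenchmarkItem("task-002", date(2024, 9, 15)),
    BenchmarkItem("task-003", date(2025, 2, 3)),
]
cutoff = date(2024, 6, 30)

eval_set = fresh_items(pool, cutoff)
print([item.item_id for item in eval_set])
```

The trade-off is the one the table names: because the evaluation set differs per model and changes over time, scores are harder to compare directly than on a static benchmark.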

Taxonomy of agent evals:

What is a good eval?<sup>2</sup>

Ways that eval can go wrong<sup>3</sup>

| Issue | Solution |
| --- | --- |
| Data are noisy or biased | make sure the test data for eval are accurate and diverse enough |
| Not practical | think about the practitioners’ real needs, instead of testing on toy examples |
| Eval can be gamed | look for and close any shortcuts that your eval may allow |
| Not challenging enough | design hard test cases to make sure the agent is reliable |

Considerations of constructing a good benchmark:

Principles of a good benchmark:

| Benchmark | Goal | Task | Environment | Data generation | How to evaluate |
| --- | --- | --- | --- | --- | --- |
| CyberGym<sup>4</sup> | evaluate an agent’s cybersecurity capabilities by testing its ability to reproduce real-world vulnerabilities at a large scale | <ul><li>given a vulnerability description and the pre-patch codebase + executable</li><li>agents must generate a proof-of-concept (PoC) test that successfully triggers the vulnerability in the corresponding unpatched codebase</li></ul> | a containerized sandbox (bash interface, exec output) to run programs | <ul><li>built from the ARVO<sup>5</sup> dataset and historical, real-world vulnerabilities found by OSS-Fuzz</li><li>reconstruct pre/post-patch commits & executables and include the ground-truth proof-of-concept; rephrase into concise vulnerability descriptions with the help of LLMs and manual inspection</li></ul> | <ul><li>execute the final proof-of-concept on pre-patch and post-patch builds; count it as a success if it a) triggers the target vulnerability only on the pre-patch build (reproduction), or b) triggers any vulnerability on the post-patch build (post-patch finding); report overall success rate</li><li>detection via runtime sanitizers (crash + stack trace), not subjective judging</li><li>a data contamination analysis is performed by evaluating on vulnerability samples found after the LLMs’ knowledge cutoff dates, which the agents cannot have memorized</li></ul> |
| τ-bench<sup>6</sup> | evaluate an agent’s ability to reliably interact with users and APIs while consistently following complex, domain-specific policies | <ul><li>agents resolve a simulated user’s goal using API tools through a multi-turn, dynamic conversation</li><li>e.g. retail, airline customer service</li></ul> | single DB + agent tools<br>each domain provides <ul><li>a set of API tools</li><li>a specific policy document to follow</li><li>an LLM-only user simulator that does not need to use tools</li></ul> | <ul><li>manual design of schemas/APIs/policies</li><li>LM-assisted synthetic data generation (GPT-4 helps produce the initial sampling code, which humans verify and polish)</li><li>manual scenario authoring + iterative validation with many agent runs to ensure each task has a unique end-state outcome</li></ul> | <ul><li>evaluation is programmatic and verifiable</li><li>success is determined by comparing the final database state to the annotated goal state</li><li>report pass^1 (average success) and pass^k (all k independent and identically distributed runs succeed, unlike the usual pass@k, which needs only one success) to capture reliability/consistency</li></ul> |
| τ²-Bench<sup>7</sup> | shifts from single-control to dual-control, modeled as a decentralized partially observable Markov decision process (Dec-POMDP)<sup>8</sup>: both agent and user act via tools in a shared world, stressing coordination and guidance | <ul><li>users will also be able to use tools to interact with the agent</li><li>user is an LLM simulator constrained by the available set of tools and the observable state of the environment</li></ul> | in addition to the τ-bench setup, adds 2 databases (Agent DB + User/Device DB) and separate toolsets for agent and user | a more complicated and comprehensive data creation pipeline: <ul><li>use an LLM-drafted Product Requirement Document (PRD) to guide generation of code, mock DBs, and unit tests, along with the user DB and tools</li><li>perform programmatic compositional task creation from atomic subtasks with security procedures (e.g. assertions, auto-verification)</li></ul> | additional categorical checks for different components or different steps of a given task, e.g.<ul><li>environment assertions</li><li>communication assertions</li><li>natural language assertions</li><li>action assertions</li></ul>report pass^1 and pass^k |
| GDPval<sup>9</sup> | measure LLM performance on economically valuable, real-world knowledge-work tasks, comparing AI deliverables to those of industry experts across diverse occupations;<br>a more AGI-relevant benchmark | models produce a one-shot deliverable (e.g. document, slide deck, spreadsheet, diagram, media) | each task is a realistic work assignment with reference files / context (e.g. documents, data, assets) | <ul><li>tasks authored by vetted professionals (average 14 years of experience)</li><li>each task passes a multi-step review (~5 rounds) plus LLM-based validation</li><li>prompts mirror day-to-day work and include attachments</li><li>gold deliverables are the experts’ own solutions</li></ul> | <ul><li>blinded expert graders from the same occupations rank AI vs. human deliverables as better / as good as / worse, and also compare time / cost</li><li>a good example of a benchmark with low contamination risk that is hard to saturate, since tasks require domain experts and are tied to concrete, real-world work products</li></ul> |
| CRMArena<sup>10</sup> | evaluate LLM agents on professional Customer Relationship Management (CRM) workflows in a realistic enterprise sandbox | <ul><li>9 tasks across 3 personas (service agent, analyst, manager)</li><li>new case routing, knowledge Q&A, top issue identification, monthly trend analysis etc.</li></ul> | <ul><li>live Salesforce sandbox (named Simple Demo Org, SDO) with UI & API access</li><li>actions via SOQL/SOSL (SQL-like query tools) or function calls</li><li>rich enterprise schema (16 objects)</li></ul> | <ul><li>LLM synthesis on the Salesforce Service Cloud schema; introduce latent variables (e.g. agent skill, customer shopping habit, used to categorize the tags we want to measure) to create realistic causal patterns</li><li>mini-batch prompting to generate sample data $\rightarrow$ de-duplication (string match) + dual verification (both format and content) to ensure data quality before upload; LLM paraphrasing for query diversity</li></ul> | <ul><li>automatic metrics per task, with different metric types for different tasks, e.g. F1 for knowledge Q&A, exact match on ground-truth IDs for all other tasks; optional pass@k to report the agent’s multi-run reliability and consistency</li><li>also report efficiency: number of turns performed, number of tokens spent, cost in dollars</li></ul> |
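
The reliability metric described for τ-bench requires all k independent runs of a task to succeed. The τ-bench paper writes this as pass^k, which differs from the more common pass@k (at least one of k runs succeeds). The sketch below contrasts the two using the standard combinatorial estimators, assuming each task was run `n` times with `c` successes. The function names and the example success counts are hypothetical.

```python
from math import comb
from statistics import mean


def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate the chance that at least one of k runs succeeds.

    n = runs performed for the task, c = successful runs, k <= n.
    """
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("need 0 <= c <= n and 1 <= k <= n")
    return 1.0 - comb(n - c, k) / comb(n, k)


def pass_hat_k(n: int, c: int, k: int) -> float:
    """Estimate the chance that all k runs succeed (written pass^k)."""
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("need 0 <= c <= n and 1 <= k <= n")
    return comb(c, k) / comb(n, k)


# Hypothetical results: number of successes out of 4 runs for three tasks.
n_runs = 4
successes_per_task = [4, 3, 1]

for k in (1, 2, 4):
    at_k = mean(pass_at_k(n_runs, c, k) for c in successes_per_task)
    hat_k = mean(pass_hat_k(n_runs, c, k) for c in successes_per_task)
    print(f"k={k}: pass@k={at_k:.3f}  pass^k={hat_k:.3f}")
```

For k = 1 the two metrics coincide (both equal the average success rate). As k grows, pass@k can only rise while pass^k can only fall, which is why pass^k is the better fit when the question is whether an agent behaves consistently across repeated runs.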

Green Agents


Responsibilities of the Green Agent:

The green agent decides which metrics are measured and carries out the full procedure of running evaluations and deriving those metrics
The full assessment flow is run on the agent-based platform

Components of the green agent:

Two types of class projects for building green agents:

| Type | Goal | Details |
| --- | --- | --- |
| Integrating an existing benchmark | adapt an existing benchmark (already published / tested) and integrate it as a green agent in AgentBeats | largely reuse existing evaluation metrics or rubrics<ol><li>integration</li><li>benchmark quality analysis</li><li>correction and expansion</li></ol> |
| Building a new benchmark | create new benchmarks (no existing source) | <ul><li>realistic daily tasks to showcase agentic reasoning</li><li>tasks should reflect useful, real-world scenarios</li><li>evaluation: automatic or lightweight human checks</li></ul> |

Step-by-step checklist:

| Step | Details | Example |
| --- | --- | --- |
| Choose the task to evaluate on | | ticket-booking agent |
| Design the environment | <ul><li>tools that agent can interact with</li><li>actions that agent can make</li><li>environment feedback to the agent after each action</li></ul> | <ul><li>tools: web browser or app for ticket booking</li><li>actions: mouse clicking, keyboard typing, or app API</li><li>environment feedback: new webpage pops up after clicking</li></ul> |
| Design the metrics | | success rate of booking<br>price of ticket<br>whether ticket satisfies user’s requirements |
| Design test cases | <ul><li>think about different scenarios</li><li>design test cases of white agents succeeding / failing in different ways</li><li>including as many edge cases as possible</li></ul> | white agent:<br>successfully books the ticket<br>books the wrong / more expensive ticket<br>fails to find the website for booking tickets<br>etc. |
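
The ticket-booking example in the checklist can be turned into a small scoring routine. The sketch below is only one possible shape for the metric step, under stated assumptions: the `BookingOutcome` record, the metric names, and the three sample outcomes (one per white-agent behavior listed in the checklist) are hypothetical and not part of any AgentBeats API.

```python
from dataclasses import dataclass
from typing import Dict, Optional, Sequence


@dataclass(frozen=True)
class BookingOutcome:
    case: str                 # short description of the test case
    booked: bool              # did the white agent complete a booking?
    price: Optional[float]    # ticket price, None if nothing was booked
    meets_requirements: bool  # does the ticket satisfy the user's requirements?


def score(outcomes: Sequence[BookingOutcome]) -> Dict[str, Optional[float]]:
    """Compute the three example metrics from the checklist."""
    if not outcomes:
        raise ValueError("need at least one outcome")
    n = len(outcomes)
    # Assumes every booked outcome carries a price.
    booked = [o for o in outcomes if o.booked]
    return {
        "booking_rate": len(booked) / n,
        "requirement_rate": sum(o.booked and o.meets_requirements for o in outcomes) / n,
        "mean_price": sum(o.price for o in booked) / len(booked) if booked else None,
    }


# One hypothetical outcome per behavior listed in the checklist.
outcomes = [
    BookingOutcome("books the correct ticket", True, 120.0, True),
    BookingOutcome("books a more expensive ticket", True, 180.0, False),
    BookingOutcome("cannot find the booking website", False, None, False),
]

metrics = score(outcomes)
print(metrics)
```

Keeping `booking_rate` and `requirement_rate` separate makes the second failure mode visible: an agent that books the wrong ticket still raises the booking rate, so a single success number would hide it.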

References

  1. Asaf Yehudai et al. Survey on Evaluation of LLM-based Agents. arXiv:2503.16416 [cs.AI]. 2025. 

  2. Yuxuan Zhu et al. Establishing Best Practices for Building Rigorous Agentic Benchmarks. arXiv:2507.02825 [cs.AI]. 2025. 

  3. Maria Eriksson et al. Can We Trust AI Benchmarks? An Interdisciplinary Review of Current Issues in AI Evaluation. arXiv:2502.06559 [cs.AI]. 2025. 

  4. Zhun Wang et al. CyberGym: Evaluating AI Agents’ Real-World Cybersecurity Capabilities at Scale. arXiv:2506.02548 [cs.CR]. 2025. 

  5. Xiang Mei et al. ARVO: Atlas of Reproducible Vulnerabilities for Open Source Software. arXiv:2408.02153 [cs.CR]. 2024. 

  6. Shunyu Yao et al. τ-bench: A Benchmark for Tool-Agent-User Interaction in Real-World Domains. arXiv:2406.12045 [cs.AI]. 2024. 

  7. Victor Barres et al. τ²-Bench: Evaluating Conversational Agents in a Dual-Control Environment. arXiv:2506.07982 [cs.AI]. 2025. 

  8. Daniel S Bernstein, Shlomo Zilberstein, Neil Immerman. The Complexity of Decentralized Control of Markov Decision Processes. arXiv:1301.3836 [cs.AI]. 2013. 

  9. Tejal Patwardhan et al. GDPval: Evaluating AI Model Performance on Real-World Economically Valuable Tasks. arXiv:2510.04374 [cs.LG]. 2025. 

  10. Kung-Hsiang Huang et al. CRMArena: Understanding the Capacity of LLM Agents to Perform Professional CRM Tasks in Realistic Environments. arXiv:2411.02305 [cs.CL]. 2024.