Link to lecture recording on YouTube
(this video is unlisted and is not part of the YouTube playlist)
Date: 2025-10-06
Speaker: Teaching Assistants of the course
Evaluation is needed at every step of building AI models:
Why evaluation matters
LLM agents require more complex evaluation than standalone LLMs<sup>1</sup>
| Type | Environment | Action |
|---|---|---|
| LLM eval | static | text-to-text systems |
| LLM agent eval | dynamic | text-to-text systems extended with planning, tool use, memory, and multi-step reasoning |
Types of LLM (agents) eval:
| Type | Characteristics | Example | Metrics | Limit |
|---|---|---|---|---|
| Close-ended tasks | <ul><li>limited number of potential answers</li><li>limited number of correct answers</li><li>enables automatic evaluation</li></ul> | <ul><li>sentiment analysis</li><li>entailment: SNLI</li><li>named entity recognition: CoNLL-2003</li><li>part-of-speech tagging: PTB</li></ul> | accuracy / precision / recall / F1 etc. | limited to certain easier tasks |
| Open-ended tasks | <ul><li>long generations with too many possible correct answers to enumerate</li><li>better and worse answers (not just right and wrong)</li></ul> | <ul><li>summarization: CNN-DM / Gigaword</li><li>translation: WMT</li><li>instruction-following: Chatbot Arena / AlpacaEval / MT-Bench</li></ul> | <ul><li>verifiable tasks (having an “oracle” or clear criteria to test correctness; e.g. math proof, code generation): build an eval following the criteria</li><li>non-verifiable tasks (no clear test criteria or objective ground-truth answer; e.g. storytelling, writing style adaptation): 1) human eval; 2) LLM-as-a-judge</li></ul> | human eval:<ul><li>slow</li><li>expensive</li><li>inter-annotator disagreement</li><li>intra-annotator disagreement over time</li><li>not reproducible</li></ul> LLM-as-a-judge (mitigations in brackets):<ul><li>sometimes unreliable (cross-check with human eval)</li><li>output is random (repeat with the same LLM multiple times)</li><li>different LLMs have different biases (majority vote across different LLMs)</li><li>scores are hard to interpret (try continuous instead of discrete scoring; score from multiple perspectives / set multiple rubrics)</li><li>sensitive to vague prompts (design detailed prompts; use chain-of-thought prompting or reasoning mode)</li></ul> |
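
Two of the LLM-as-a-judge mitigations in the table (repeating the same judge several times and taking a majority vote across different judges) can be sketched as follows. This is a minimal illustration, not a reference implementation: the `Judge` callable type and the stub judges are hypothetical stand-ins for real LLM API calls, and the label names are made up.

```python
from collections import Counter
from statistics import mean
from typing import Callable, Sequence

# Hypothetical judge interface: (prompt, answer) -> verdict label.
# A real judge would wrap an LLM call; these notes do not prescribe one.
Judge = Callable[[str, str], str]


def majority_label(labels: Sequence[str]) -> str:
    """Return the most common label; ties go to the label seen first."""
    return Counter(labels).most_common(1)[0][0]


def judge_with_votes(prompt: str, answer: str,
                     judges: Sequence[Judge], repeats: int = 3) -> str:
    """Repeat each judge to smooth out sampling randomness,
    then take a majority vote across judges to dampen per-model bias."""
    per_judge = []
    for judge in judges:
        samples = [judge(prompt, answer) for _ in range(repeats)]
        per_judge.append(majority_label(samples))
    return majority_label(per_judge)


def agreement_with_humans(judge_labels: Sequence[str],
                          human_labels: Sequence[str]) -> float:
    """Fraction of items where the judge verdict matches the human label
    (the cross-check with human eval)."""
    return mean(1.0 if j == h else 0.0
                for j, h in zip(judge_labels, human_labels))


# Deterministic stub judges, used only to exercise the aggregation logic.
def strict_judge(prompt: str, answer: str) -> str:
    return "bad"


def lenient_judge(prompt: str, answer: str) -> str:
    return "good"


verdict = judge_with_votes("Summarize the article.", "A short summary.",
                           [strict_judge, lenient_judge, lenient_judge])
```

Using an odd number of judges and repeats avoids most ties; the tie-breaking rule here (first label seen) is an arbitrary choice.
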
Static vs. dynamic eval:
| Type of Benchmarks | Characteristics | Advantages | Examples |
|---|---|---|---|
| Static | fixed test cases and metrics | enable direct, reproducible comparisons between models | ImageNet, GLUE, MMLU |
| Dynamic | continuously update or periodically re-generate data to stay relevant with real-world data shifts | harder to overfit or be contaminated | DynaBench, LiveCodeBench |
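
One way a dynamic benchmark can limit contamination is to score a model only on items released after that model's training cutoff. The sketch below illustrates the idea under assumed names: the `Problem` record, its fields, and the dates are hypothetical and are not taken from any particular benchmark's code.

```python
from dataclasses import dataclass
from datetime import date
from typing import List, Sequence


@dataclass(frozen=True)
class Problem:
    problem_id: str   # hypothetical identifier
    released: date    # date the problem first became public


def contamination_free_subset(problems: Sequence[Problem],
                              model_cutoff: date) -> List[Problem]:
    """Keep only problems published strictly after the model's training cutoff,
    so the model cannot have seen them during training."""
    return [p for p in problems if p.released > model_cutoff]


# Made-up pool of problems and cutoff date, for illustration only.
pool = [
    Problem("p1", date(2023, 11, 2)),
    Problem("p2", date(2024, 6, 1)),
    Problem("p3", date(2024, 9, 15)),
]
fresh = contamination_free_subset(pool, model_cutoff=date(2024, 6, 1))
```

The trade-off is comparability: two models with different cutoffs are scored on different subsets unless both are restricted to the later window.
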
Taxonomy of agent evals:
What is a good eval?<sup>2</sup>
Ways that eval can go wrong<sup>3</sup>
| Issue | Solution |
|---|---|
| Data are noisy or biased | make sure the test data for eval are accurate and diverse enough |
| Not practical | think about the practitioners’ real needs, instead of testing on some toy examples |
| Eval can be gamed | look for and close any shortcut that lets an agent score well without actually solving the task |
| Not challenging enough | design hard test cases, so that a high score indicates a reliable agent |
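
A cheap way to probe the "eval can be gamed" issue is to run trivial baseline agents through the eval and flag any that score above zero. The sketch below assumes a hypothetical task format (`prompt` / `expected` keys) and two made-up checkers; it only illustrates the sanity check and is not a complete audit.

```python
from typing import Callable, Dict, List

Task = Dict[str, str]                  # assumed keys: "prompt", "expected"
Agent = Callable[[str], str]           # maps a prompt to an output string
Checker = Callable[[Task, str], bool]  # decides whether an output passes


def success_rate(agent: Agent, tasks: List[Task], checker: Checker) -> float:
    if not tasks:
        raise ValueError("need at least one task")
    return sum(checker(t, agent(t["prompt"])) for t in tasks) / len(tasks)


def find_shortcuts(tasks: List[Task], checker: Checker,
                   baselines: Dict[str, Agent],
                   tolerance: float = 0.0) -> Dict[str, float]:
    """Return the trivial baselines whose success rate exceeds `tolerance`.
    A non-empty result suggests the checker can be gamed."""
    rates = {name: success_rate(agent, tasks, checker)
             for name, agent in baselines.items()}
    return {name: rate for name, rate in rates.items() if rate > tolerance}


def lax_checker(task: Task, output: str) -> bool:
    # Substring match: passes if the expected answer appears anywhere.
    return task["expected"] in output.lower()


def strict_checker(task: Task, output: str) -> bool:
    # Exact match after normalization.
    return output.strip().lower() == task["expected"]


tasks = [
    {"prompt": "Is 7 prime? Answer yes or no.", "expected": "yes"},
    {"prompt": "Is 8 prime? Answer yes or no.", "expected": "no"},
]
baselines = {
    "empty": lambda prompt: "",        # does nothing
    "hedge": lambda prompt: "yes no",  # emits every possible answer
}

gamed_by = find_shortcuts(tasks, lax_checker, baselines)
not_gamed = find_shortcuts(tasks, strict_checker, baselines)
```

Here the substring checker is fully gamed by the "hedge" baseline, while the exact-match checker rejects both baselines.
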
Considerations of constructing a good benchmark:
Principles of a good benchmark:
| Benchmark | Goal | Task | Environment | Data generation | How to evaluate |
|---|---|---|---|---|---|
| CyberGym<sup>4</sup> | evaluate an agent’s cybersecurity capabilities by testing its ability to reproduce real-world vulnerabilities at a large scale | <ul><li>given a vulnerability description and the pre-patch codebase + executable</li><li>agents must generate a proof-of-concept (PoC) test that successfully triggers the vulnerability in the corresponding unpatched codebase</li></ul> | a containerized sandbox (bash interface, exec output) to run programs | <ul><li>built from the ARVO<sup>5</sup> dataset and historical, real-world vulnerabilities found by OSS-Fuzz</li><li>reconstruct pre/post-patch commits & executables and include the ground-truth proof-of-concept; rephrase into concise vulnerability descriptions with the help of LLMs and manual inspection</li></ul> | <ul><li>execute the final proof-of-concept on pre-patch and post-patch builds; count it as a success if it a) triggers the target vulnerability only on the pre-patch build (reproduction), or b) triggers any vulnerability on the post-patch build (post-patch finding); report the overall success rate</li><li>detection via runtime sanitizers (crash + stack trace), not subjective judging</li><li>a data contamination analysis evaluates vulnerability samples found after the LLMs’ knowledge cutoff dates, which the agents cannot have memorized from training data</li></ul> |
| τ-bench<sup>6</sup> | evaluate an agent’s ability to reliably interact with users and APIs while consistently following complex, domain-specific policies | <ul><li>agents resolve a simulated user’s goal using API tools through a multi-turn, dynamic conversation</li><li>e.g. retail, airline customer service</li></ul> | single DB + agent tools; each domain provides:<ul><li>a set of API tools</li><li>a specific policy document to follow</li><li>an LLM-only user simulator that does not need to use tools</li></ul> | <ul><li>manual design of schemas/APIs/policies</li><li>LM-assisted synthetic data generation (GPT-4 helps write the initial sampling code, which humans verify and polish)</li><li>manual scenario authoring + iterative validation with many agent runs to ensure each task has a unique end-state outcome</li></ul> | <ul><li>evaluation is programmatic and verifiable</li><li>success is determined by comparing the final database state to the annotated goal state</li><li>report pass@1 (average success) and pass^k (all k independent and identically distributed runs succeed) to capture reliability/consistency</li></ul> |
| τ²-Bench<sup>7</sup> | shifts from single-control to dual-control, modeled as a decentralized partially observable Markov decision process (Dec-POMDP)<sup>8</sup>: both agent and user act via tools in a shared world, stressing coordination and guidance | <ul><li>users can also use tools while interacting with the agent</li><li>the user is an LLM simulator constrained by its available set of tools and the observable state of the environment</li></ul> | in addition to the τ-bench setup, adds 2 databases (Agent DB + User/Device DB) and separate toolsets for agent and user | a more complicated and comprehensive data creation pipeline: <ul><li>use an LLM-drafted Product Requirement Document (PRD) to guide the generation of code/mock DBs/unit tests along with the user DB and tools</li><li>create tasks programmatically by composing atomic subtasks, with built-in checks (e.g. assertions, auto-verification)</li></ul> | additional categorical checks for different components or different steps of a given task, e.g.<ul><li>environment assertions</li><li>communication assertions</li><li>natural language assertions</li><li>action assertions</li></ul>report pass@1 and pass^k |
| GDPval<sup>9</sup> | measure LLM performance on economically valuable, real-world knowledge-work tasks, comparing AI deliverables to those of industry experts across diverse occupations; a more AGI-relevant benchmark | the model produces a one-shot deliverable (e.g. document, slide deck, spreadsheet, diagram, media) | each task is a realistic work assignment with reference files / context (e.g. documents, data, assets) | <ul><li>tasks authored by vetted professionals (average 14 years of experience)</li><li>pass a multi-step review (~5 rounds) plus LLM-based validation</li><li>prompts mirror day-to-day work and include attachments</li><li>gold deliverables are experts’ own solutions</li></ul> | <ul><li>blinded expert graders from the same occupations rate AI vs. human deliverables as better / as good as / worse, and also compare time / cost</li><li>a good example of a benchmark with low contamination risk that is hard to saturate, since tasks require domain experts and are tied to real-world, concrete work products</li></ul> |
| CRMArena<sup>10</sup> | evaluate LLM agents on professional Customer Relationship Management (CRM) workflows in a realistic, enterprise sandbox | <ul><li>9 tasks across 3 personas (service agent, analyst, manager)</li><li>new case routing, knowledge Q&A, top issue identification, monthly trend analysis etc.</li></ul> | <ul><li>live Salesforce sandbox (named Simple Demo Org, SDO) with UI & API access</li><li>actions via SOQL/SOSL (SQL-like query languages) or function calls</li><li>rich enterprise schema (16 objects)</li></ul> | <ul><li>LLM synthesis on the Salesforce Service Cloud schema; introduce latent variables (e.g. agent skill, customer shopping habit, used to categorize the tags we want to measure) to create realistic causal patterns</li><li>mini-batch prompting to generate sample data $\rightarrow$ de-duplication (string match) + dual verification (both format and content) to ensure data quality before upload; LLM paraphrasing for query diversity</li></ul> | <ul><li>automatic metrics per task, with different metric types for different tasks, e.g. F1 for knowledge Q&A, exact match on ground-truth IDs for all other tasks; optional pass@k to report the agent’s multi-run reliability and consistency</li><li>also report efficiency: number of turns performed, number of tokens spent, cost in dollars</li></ul> |
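
The multi-run metrics in the table can be estimated from n recorded trials per task, of which c succeeded. The sketch below contrasts two estimators: the "all k runs succeed" reliability metric, written pass^k in the τ-bench paper, and the more common pass@k ("at least one of k runs succeeds"). The function names and the example counts are illustrative assumptions, not code from any of the benchmarks.

```python
from math import comb
from statistics import mean
from typing import Sequence


def _check(n: int, c: int, k: int) -> None:
    if not (0 <= c <= n and 1 <= k <= n):
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")


def pass_hat_k(n: int, c: int, k: int) -> float:
    """Chance that k trials drawn without replacement from the n recorded
    trials are ALL successes: C(c, k) / C(n, k)."""
    _check(n, c, k)
    return comb(c, k) / comb(n, k)


def pass_at_k(n: int, c: int, k: int) -> float:
    """Chance that AT LEAST ONE of k drawn trials is a success:
    1 - C(n - c, k) / C(n, k)."""
    _check(n, c, k)
    return 1.0 - comb(n - c, k) / comb(n, k)


def benchmark_pass_hat_k(successes: Sequence[int], n: int, k: int) -> float:
    """Average the per-task estimate over all tasks in the benchmark."""
    return mean(pass_hat_k(n, c, k) for c in successes)


# Made-up results: three tasks, 4 trials each, with 4, 3 and 0 successes.
successes_per_task = [4, 3, 0]
reliability_k2 = benchmark_pass_hat_k(successes_per_task, n=4, k=2)
```

For k = 1 both estimators reduce to the plain success rate c / n. As k grows, pass^k can only fall while pass@k can only rise, which is why the former is the stricter measure of consistency.
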
Green Agents
Responsibilities of the Green Agent:
The green agent decides which metrics are measured and carries out the full procedure of running evaluations and deriving those metrics
The full assessment flow runs on the agent-based platform
Components of the green agent:
Two types of class projects for building green agents:
| Type | Goal | Details |
|---|---|---|
| Integrating an existing benchmark | <ul><li>adapt an existing benchmark (already published / tested) and integrate it as a green agent in AgentBeats</li><li>largely reuse existing evaluation metrics or rubrics</li></ul> | <ol><li>integration</li><li>benchmark quality analysis</li><li>correction and expansion</li></ol> |
| Building a new benchmark | create new benchmarks (no existing source) | <ul><li>realistic daily tasks to showcase agentic reasoning</li><li>tasks should reflect useful, real-world scenarios</li><li>evaluation: automatic or lightweight human checks</li></ul> |
Step-by-step checklist:
| Step | Details | Example |
|---|---|---|
| Choose the task to evaluate on | ticket-booking agent | |
| Design the environment | <ul><li>tools that agent can interact with</li><li>actions that agent can make</li><li>environment feedback to the agent after each action</li></ul> | <ul><li>tools: web browser or app for ticket booking</li><li>actions: mouse clicking, keyboard typing, or app API</li><li>environment feedback: new webpage pops up after clicking</li></ul> |
| Design the metrics | <ul><li>success rate of booking</li><li>price of ticket</li><li>whether ticket satisfies user’s requirements</li></ul> | |
| Design test cases | <ul><li>think about different scenarios</li><li>design test cases of white agents succeeding / failing in different ways</li><li>include as many edge cases as possible</li></ul> | white agent:<ul><li>successfully books the ticket</li><li>books the wrong / more expensive ticket</li><li>fails to find the website for booking tickets</li><li>etc.</li></ul> |
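
The checklist can be made concrete with a toy version of the ticket-booking example. Everything in this sketch is hypothetical: the `Ticket` and `BookingCase` records, the metric names, and the three stub white agents (one that succeeds, one that books the wrong ticket, one that gives up) are illustrative assumptions, and the environment is reduced to an in-memory inventory instead of a browser or app.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional


@dataclass(frozen=True)
class Ticket:
    ticket_id: str
    destination: str
    price: float


@dataclass(frozen=True)
class BookingCase:
    destination: str         # what the user asked for
    max_price: float         # the user's budget
    inventory: List[Ticket]  # what the environment offers


# A white agent sees the case and returns a ticket id, or None if it gives up.
WhiteAgent = Callable[[BookingCase], Optional[str]]


def evaluate(agent: WhiteAgent,
             cases: List[BookingCase]) -> Dict[str, Optional[float]]:
    """Compute the three example metrics: booking success rate,
    requirement satisfaction rate, and mean price of booked tickets."""
    booked, satisfied, prices = 0, 0, []
    for case in cases:
        choice = agent(case)
        ticket = next((t for t in case.inventory if t.ticket_id == choice), None)
        if ticket is None:  # nothing booked, or an unknown ticket id
            continue
        booked += 1
        prices.append(ticket.price)
        if ticket.destination == case.destination and ticket.price <= case.max_price:
            satisfied += 1
    return {
        "booking_rate": booked / len(cases),
        "requirement_rate": satisfied / len(cases),
        "mean_price": sum(prices) / len(prices) if prices else None,
    }


def careful_agent(case: BookingCase) -> Optional[str]:
    valid = [t for t in case.inventory
             if t.destination == case.destination and t.price <= case.max_price]
    return min(valid, key=lambda t: t.price).ticket_id if valid else None


def first_listed_agent(case: BookingCase) -> Optional[str]:
    # Books whatever is listed first, ignoring the user's requirements.
    return case.inventory[0].ticket_id if case.inventory else None


def gives_up_agent(case: BookingCase) -> Optional[str]:
    return None


cases = [
    BookingCase("SFO", 300.0, [Ticket("a", "LAX", 100.0),
                               Ticket("b", "SFO", 350.0),
                               Ticket("c", "SFO", 250.0)]),
    BookingCase("JFK", 200.0, [Ticket("d", "JFK", 180.0)]),
]
report = {agent.__name__: evaluate(agent, cases)
          for agent in (careful_agent, first_listed_agent, gives_up_agent)}
```

Note that `first_listed_agent` gets a lower mean price than `careful_agent` while satisfying fewer requests, which shows why price alone is not a sufficient metric.
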
1. Asaf Yehudai et al. Survey on Evaluation of LLM-based Agents. arXiv:2503.16416 [cs.AI]. 2025.
2. Yuxuan Zhu et al. Establishing Best Practices for Building Rigorous Agentic Benchmarks. arXiv:2507.02825 [cs.AI]. 2025.
3. Maria Eriksson et al. Can We Trust AI Benchmarks? An Interdisciplinary Review of Current Issues in AI Evaluation. arXiv:2502.06559 [cs.AI]. 2025.
4. Zhun Wang et al. CyberGym: Evaluating AI Agents’ Real-World Cybersecurity Capabilities at Scale. arXiv:2506.02548 [cs.CR]. 2025.
5. Xiang Mei et al. ARVO: Atlas of Reproducible Vulnerabilities for Open Source Software. arXiv:2408.02153 [cs.CR]. 2024.
6. Shunyu Yao et al. τ-bench: A Benchmark for Tool-Agent-User Interaction in Real-World Domains. arXiv:2406.12045 [cs.AI]. 2024.
7. Victor Barres et al. τ²-Bench: Evaluating Conversational Agents in a Dual-Control Environment. arXiv:2506.07982 [cs.AI]. 2025.
8. Daniel S Bernstein, Shlomo Zilberstein, Neil Immerman. The Complexity of Decentralized Control of Markov Decision Processes. arXiv:1301.3836 [cs.AI]. 2013.
9. Tejal Patwardhan et al. GDPval: Evaluating AI Model Performance on Real-World Economically Valuable Tasks. arXiv:2510.04374 [cs.LG]. 2025.
10. Kung-Hsiang Huang et al. CRMArena: Understanding the Capacity of LLM Agents to Perform Professional CRM Tasks in Realistic Environments. arXiv:2411.02305 [cs.CL]. 2024.