Link to lecture recording on YouTube
Date: 2024-09-16
Speaker: Shunyu Yao 姚顺雨
Speaker’s Social Profile: Website / Google Scholar / GitHub / LinkedIn / X (Twitter)
Education:
Work:
Agent: an intelligent system that interacts with some environment
Define “agent” by defining “intelligence” and “environment”
| Type | Characteristics | Examples | Details |
|---|---|---|---|
| Level 1: text agent | uses text action and observation | ELIZA (1966): text agent via rule design<br>LSTM-DQN<sup>1</sup> (2015): text agent via reinforcement learning (RL) | ELIZA: domain specific, requires manual design<br>LSTM-DQN: domain specific, requires scalar reward signals and extensive training (a feature of RL) |
| Level 2: LLM agent | uses LLM to act | SayCan, Language Planner | promise of LLMs: generality and few-shot learning<sup>2</sup><br>training: next-token prediction on massive text corpora<br>inference: (few-shot) prompting for various tasks |
| Level 3: reasoning agent | uses LLM to reason to act | ReAct<sup>3</sup>, AutoGPT | see below |
GPT-3 marked the beginning of the LLM era; people then started to explore LLMs across different tasks:
The paradigms of reasoning and acting started to converge, and people began to build reasoning agents
Answering questions may require
Retrieval-augmented generation (RAG) for knowledge: think of retrieval as a search engine; the retriever pulls the relevant information from the corpus, then appends it to the context of the language model
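The retrieve-then-append step can be sketched as follows. This is a minimal illustration, not the lecture's implementation: the corpus, the word-overlap scoring, and the names `retrieve` and `build_prompt` are all assumptions, and a real retriever would typically use a search index or dense embeddings.

```python
import re
from collections import Counter

# Hypothetical toy corpus; a real system would index a much larger collection.
CORPUS = [
    "ReAct interleaves reasoning traces and actions in a language model.",
    "ELIZA was a rule-based text agent built in 1966.",
    "AlphaGo is a deep reinforcement learning agent that plays Go.",
]


def tokenize(text: str) -> list[str]:
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def retrieve(query: str, corpus: list[str], k: int = 1) -> list[str]:
    """Return the k passages that share the most word occurrences with the query."""
    query_counts = Counter(tokenize(query))

    def score(passage: str) -> int:
        return sum((query_counts & Counter(tokenize(passage))).values())

    return sorted(corpus, key=score, reverse=True)[:k]


def build_prompt(query: str, corpus: list[str], k: int = 1) -> str:
    """Append the retrieved passages to the context handed to the language model."""
    context = "\n".join(f"- {p}" for p in retrieve(query, corpus, k))
    return f"Context:\n{context}\n\nQuestion: {query}\nAnswer:"


print(build_prompt("When was ELIZA built?", CORPUS))
```

The prompt returned by `build_prompt` is what would be sent to the model; the model call itself is left out on purpose.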
What if both knowledge and reasoning are needed? Ideas:
Solutions to question answering are scattered: people come up with a separate solution for each benchmark.
Can we have a simple and unifying solution? We need a higher-level abstraction beyond individual tasks or methods
| Paradigm | Pros | Cons |
|---|---|---|
| CoT | intuitive, flexible, and general way to augment test-time compute and to think for longer at inference time to solve complex questions | lack of external knowledge and tools |
| Paradigm of acting (RAG / Retrieval / tool use) | flexible and general to augment knowledge, computation and feedback | lack of reasoning |
ReAct: a new paradigm of agents that reason and act; synergy of reasoning and acting, simple and intuitive to use, general across domains
ReAct beyond question answering: many tasks can be turned into text games<sup>6,7</sup>
| Type | Action space | Details |
|---|---|---|
| Traditional agents | action space A defined by the environment | <ul><li>external feedback $o_{t}$</li><li>agent context $c_{t} = (o_{1}, a_{1}, o_{2}, a_{2}, \dots , o_{t})$</li><li>agent action $a_{t} \sim \pi(a \mid c_{t}) \in A$</li></ul> |
| ReAct | action space $\hat{A} = A \cup \mathcal{L}$ augmented by reasoning | <ul><li>$\hat{a}_{t} \in \mathcal{L}$ can be any language sequence</li><li>agent context $c_{t+1} = (c_{t}, \hat{a}_{t}, a_{t}, o_{t+1})$</li><li>$\hat{a}_{t} \in \mathcal{L}$ only updates internal context</li></ul> |
Reasoning agent: reasoning is an internal action for agents
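The augmented action space above can be sketched as a loop in which a thought only extends the context, while an external action is sent to the environment and produces an observation. This is a toy illustration under stated assumptions: `FACTS`, `env_step`, and `scripted_policy` are hypothetical stand-ins, and the scripted policy replaces the LLM call that a real agent would make.

```python
from typing import Callable, Optional

# Hypothetical toy environment: a lookup "search engine" over a single fact.
FACTS = {"capital of france": "Paris"}


def env_step(action: str) -> str:
    """Execute an external action from A and return the next observation."""
    if action.startswith("search[") and action.endswith("]"):
        return FACTS.get(action[len("search["):-1].lower(), "no result")
    return "invalid action"


def scripted_policy(context: list[str]) -> str:
    """Stand-in for the LLM policy; a real agent would prompt a model with the context."""
    seen_observation = any(c.startswith("Observation:") for c in context)
    if context[-1].startswith("Thought:"):
        # After a thought, emit an external action.
        return "Action: finish[Paris]" if seen_observation else "Action: search[capital of France]"
    if seen_observation:
        return "Thought: The observation answers the question."
    return "Thought: I need to look up the capital of France."


def react(question: str, policy: Callable[[list[str]], str],
          max_steps: int = 8) -> tuple[Optional[str], list[str]]:
    """Run the thought/action/observation loop and return (answer, full context)."""
    context = [f"Question: {question}"]
    for _ in range(max_steps):
        step = policy(context)
        context.append(step)
        if step.startswith("Thought:"):
            continue  # reasoning is an internal action: only the context changes
        action = step[len("Action: "):]
        if action.startswith("finish["):
            return action[len("finish["):-1], context
        context.append(f"Observation: {env_step(action)}")
    return None, context


answer, trace = react("What is the capital of France?", scripted_policy)
print(answer)
print("\n".join(trace))
```

Note that thoughts never reach `env_step`: they change what the policy sees next, which is how reasoning influences later actions.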
| Short-term memory | Long-term memory |
|---|---|
| <ul><li>append-only</li><li>limited context</li><li>limited attention</li><li>does not persist over new tasks</li></ul> | <ul><li>read and write</li><li>stores experience, knowledge, skills…</li><li>persists over new experience</li></ul> |
Reflexion<sup>8</sup>: reflect on failure or success, keep track of the experience in long-term memory, then try to do better next time
task → trajectory → evaluation (internal / external) → reflection → next trajectory
Traditional form of reinforcement learning: get a scalar reward (sparse signal) after an action, then backpropagate the reward to update the weights of the policy (credit assignment)
Reflexion (“verbal” RL): 1) the feedback is not a scalar reward: it can be a code execution result, text, etc.; 2) learning does not happen by gradient descent: the agent learns by updating a long-term memory of task knowledge, which affects the future behavior of the policy
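The task → trajectory → evaluation → reflection loop can be sketched on a toy task. Everything here is an assumption made for illustration: the number-guessing task, and the `actor`, `evaluate`, and `reflect` stubs, which stand in for the LLM calls a real system would make. The point is only that feedback is text and that learning happens by writing to a memory, with no weight updates.

```python
import re
from typing import Optional


def actor(task: str, memory: list[str]) -> int:
    """Stand-in for the LLM policy: pick a guess consistent with past reflections."""
    low, high = 1, 100
    for note in memory:
        match = re.match(r"Guess (\d+) was too (low|high)", note)
        if match:
            value = int(match.group(1))
            if match.group(2) == "low":
                low = max(low, value + 1)
            else:
                high = min(high, value - 1)
    return (low + high) // 2


def evaluate(guess: int, secret: int) -> str:
    """Return textual feedback instead of a scalar reward."""
    if guess == secret:
        return "correct"
    return "too low" if guess < secret else "too high"


def reflect(guess: int, feedback: str) -> str:
    """Turn the feedback into a lesson stored for future trials."""
    direction = "higher" if feedback == "too low" else "lower"
    return f"Guess {guess} was {feedback}; next time try a {direction} number."


def reflexion(secret: int, max_trials: int = 10) -> tuple[Optional[int], int, list[str]]:
    """Repeat trials, writing one reflection to long-term memory after each failure."""
    memory: list[str] = []
    for trial in range(1, max_trials + 1):
        guess = actor("guess the secret number", memory)
        feedback = evaluate(guess, secret)
        if feedback == "correct":
            return guess, trial, memory
        memory.append(reflect(guess, feedback))
    return None, max_trials, memory


guess, trials, memory = reflexion(secret=37)
print(guess, trials)
print(memory)
```

The `actor` has no trainable parameters; its behavior changes across trials only because `memory` grows.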
Voyager<sup>9</sup>: a procedural memory of code-based skills
Idea: add learned skills to a skill library; next time, retrieve the relevant skill instead of starting from scratch
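A skill library of this kind might look like the sketch below. The `SkillLibrary` class, the `solve` helper, and the word-overlap lookup are hypothetical simplifications chosen to keep the example self-contained; they are not Voyager's actual retrieval mechanism.

```python
from typing import Callable, Optional

Skill = Callable[[], str]


class SkillLibrary:
    """Procedural memory: named, described, reusable pieces of code."""

    def __init__(self) -> None:
        self._skills: dict[str, tuple[str, Skill]] = {}

    def add(self, name: str, description: str, fn: Skill) -> None:
        self._skills[name] = (description, fn)

    def find(self, task: str) -> Optional[Skill]:
        """Return the skill whose description shares the most words with the task, if any."""
        task_words = set(task.lower().split())
        best, best_overlap = None, 0
        for description, fn in self._skills.values():
            overlap = len(task_words & set(description.lower().split()))
            if overlap > best_overlap:
                best, best_overlap = fn, overlap
        return best


def solve(task: str, library: SkillLibrary,
          write_skill: Callable[[str], tuple[str, str, Skill]]) -> str:
    """Reuse a stored skill when one matches; otherwise write a new one and store it."""
    skill = library.find(task)
    if skill is None:
        name, description, skill = write_skill(task)  # the expensive from-scratch path
        library.add(name, description, skill)
    return skill()
```

In use, `write_skill` would be the step where a model writes and verifies new code; after the first success, later calls to `solve` with a similar task hit the library instead.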
Generative agents<sup>10</sup>: episodic memory of experience
Idea: each agent keeps a log of events and later consults the log to decide what to work on
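An event log with a simple recall step can be sketched as follows. The `Event` and `EpisodicMemory` names and the ranking rule (word overlap with the query, ties broken by recency) are assumptions made for this example, not the scoring used in the paper.

```python
from dataclasses import dataclass, field


@dataclass
class Event:
    time: int  # hypothetical integer timestamp
    text: str


@dataclass
class EpisodicMemory:
    events: list[Event] = field(default_factory=list)

    def record(self, time: int, text: str) -> None:
        """Append an event to the log."""
        self.events.append(Event(time, text))

    def recall(self, query: str, now: int, k: int = 2) -> list[str]:
        """Rank events by word overlap with the query, breaking ties by recency."""
        query_words = set(query.lower().split())

        def score(event: Event) -> tuple[int, int]:
            overlap = len(query_words & set(event.text.lower().split()))
            return (overlap, -(now - event.time))

        ranked = sorted(self.events, key=score, reverse=True)
        return [event.text for event in ranked[:k]]


memory = EpisodicMemory()
memory.record(1, "Met Alice at the cafe")
memory.record(2, "Started writing the party invitation")
memory.record(3, "Alice asked about the party")
print(memory.recall("plan the party", now=4))
```

The recalled events would then be placed in the agent's context when it decides what to do next.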
Think of the language model as a form of long-term memory; improve yourself by:
Cognitive architectures for language agents (CoALA)<sup>11</sup>: express any agent by
This research<sup>11</sup> also discussed:
What distinguishes external environment vs. internal memory? e.g.,
What distinguishes long-term vs. short-term memory? e.g., is a context of 10 million tokens considered long-term memory?
A very minimal history of agents:

| Timeline | Era | Examples |
|---|---|---|
| 1960s - 1990s | Symbolic AI agent | SHRDLU, expert systems, cognitive architectures, Deep Blue… |
| 1990s - 2000s | “AI winter” | |
| 2010s onwards | (Deep) RL agent | Atari-DQN, AlphaGo, OpenAI Five, MuZero… |
| 2020s onwards | LLM agent | |
Difference: the kind of representation the agent uses to get from the observation to the action
| Type | Mapping |
|---|---|
| Symbolic agents | map observations into a set of logical expressions |
| Deep RL agents | map observations into some kind of embedding |
Symbolic state or neural embedding
Open-ended natural language
Digital automation (e.g., filing reports on SAP Concur, coding experiments in VS Code, exploring papers on arXiv): tremendous practical value, but little progress
Underlying research challenge:
The history of LLM
Examples: WebShop<sup>12</sup>, WebArena<sup>13</sup>, SWE-bench<sup>14</sup>, ChemCrow<sup>15</sup>
Some lessons for research:
| Stage | Details | Research example |
|---|---|---|
| Training | instead of just prompting, models should be trained specifically for agentic behavior using trajectory data (e.g., self-evaluation thoughts) that is rarely found on the open internet | FireAct<sup>16</sup> |
| Interface | environments should be redesigned specifically for agents (Human-Computer-Agent Interface) | SWE-agent<sup>17</sup> |
| Robustness / human-in-the-loop | developing agents that can interact effectively with “humans-in-the-loop,” such as simulated users who do not provide all information upfront | |
| Benchmark | future benchmarks must move beyond “pass@k” (solving a task once out of many tries) toward 100% reliability, especially for high-consequence roles like customer service | τ-bench<sup>18</sup> |
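The gap between "once out of many tries" and reliability can be made concrete with two estimators computed from the same trial counts. This is a sketch under stated assumptions: `pass_at_k` is the combinatorial estimator commonly used in code-generation evaluation, and `pass_hat_k` is the "all k sampled trials succeed" counterpart, which I take to match the reliability-style metric this row has in mind. Both function names are hypothetical, and both assume `n >= k`.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate P(at least one of k sampled trials succeeds), given c successes in n trials."""
    if n - c < k:
        return 1.0  # every size-k sample must contain at least one success
    return 1.0 - comb(n - c, k) / comb(n, k)


def pass_hat_k(n: int, c: int, k: int) -> float:
    """Estimate P(all k sampled trials succeed), given c successes in n trials."""
    return comb(c, k) / comb(n, k)


# A task solved in 6 of 8 trials looks perfect under pass@4 but weak under the all-succeed metric.
print(pass_at_k(8, 6, 4))
print(pass_hat_k(8, 6, 4))
```

With `k = 1` the two estimators coincide at `c / n`; they diverge as `k` grows, which is why an agent can score well on pass@k and still be unreliable.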
1. Karthik Narasimhan, Tejas Kulkarni, Regina Barzilay. Language Understanding for Text-based Games Using Deep Reinforcement Learning. arXiv:1506.08941 [cs.CL]. 2015.
2. Tom B. Brown et al. Language Models are Few-Shot Learners. arXiv:2005.14165 [cs.CL]. 2020.
3. Shunyu Yao et al. ReAct: Synergizing Reasoning and Acting in Language Models. arXiv:2210.03629 [cs.CL]. 2022.
4. Aaron Parisi, Yao Zhao, Noah Fiedel. TALM: Tool Augmented Language Models. arXiv:2205.12255 [cs.CL]. 2022.
5. Timo Schick et al. Toolformer: Language Models Can Teach Themselves to Use Tools. arXiv:2302.04761 [cs.CL]. 2023.
6. Mohit Shridhar et al. ALFWorld: Aligning Text and Embodied Environments for Interactive Learning. arXiv:2010.03768 [cs.CL]. 2020.
7. Wenlong Huang et al. Inner Monologue: Embodied Reasoning through Planning with Language Models. arXiv:2207.05608 [cs.RO]. 2022.
8. Noah Shinn et al. Reflexion: Language Agents with Verbal Reinforcement Learning. arXiv:2303.11366 [cs.AI]. 2023.
9. Guanzhi Wang et al. Voyager: An Open-Ended Embodied Agent with Large Language Models. arXiv:2305.16291 [cs.AI]. 2023.
10. Joon Sung Park et al. Generative Agents: Interactive Simulacra of Human Behavior. arXiv:2304.03442 [cs.HC]. 2023.
11. Theodore R. Sumers et al. Cognitive Architectures for Language Agents. arXiv:2309.02427 [cs.AI]. 2023.
12. Shunyu Yao et al. WebShop: Towards Scalable Real-World Web Interaction with Grounded Language Agents. arXiv:2207.01206 [cs.CL]. 2022.
13. Shuyan Zhou et al. WebArena: A Realistic Web Environment for Building Autonomous Agents. arXiv:2307.13854 [cs.AI]. 2023.
14. Carlos E. Jimenez et al. SWE-bench: Can Language Models Resolve Real-World GitHub Issues? arXiv:2310.06770 [cs.CL]. 2023.
15. Andres M Bran et al. ChemCrow: Augmenting large-language models with chemistry tools. arXiv:2304.05376 [physics.chem-ph]. 2023.
16. Baian Chen et al. FireAct: Toward Language Agent Fine-tuning. arXiv:2310.05915 [cs.CL]. 2023.
17. John Yang et al. SWE-agent: Agent-Computer Interfaces Enable Automated Software Engineering. arXiv:2405.15793 [cs.SE]. 2024.
18. Shunyu Yao et al. τ-bench: A Benchmark for Tool-Agent-User Interaction in Real-World Domains. arXiv:2406.12045 [cs.AI]. 2024.