ucb_agentic_ai

Lecture 02: LLM Agents: Brief History and Overview

Link to lecture recording on YouTube

Date: 2024-09-16

Speaker: Shunyu Yao 姚顺雨

Speaker’s Social Profile: Website / Google Scholar / GitHub / LinkedIn / X (Twitter)

Education:

Work:

Notes

What are LLM agents

Agent: an intelligent system that interacts with some environment

A brief history of LLM agents

Define “agent” by defining “intelligence” and “environment”

| Type | Characteristics | Examples | Details |
| --- | --- | --- | --- |
| Level 1: text agent | uses text actions and observations | ELIZA (1966): text agent via rule design<br>LSTM-DQN [1] (2015): text agent via reinforcement learning (RL) | ELIZA: domain specific, requires manual design<br>LSTM-DQN: domain specific, requires scalar reward signals and extensive training (a feature of RL) |
| Level 2: LLM agent | uses an LLM to act | SayCan, Language Planner | promise of LLMs: generality and few-shot learning [2]<br>training: next-token prediction on massive text corpora<br>inference: (few-shot) prompting for various tasks |
| Level 3: reasoning agent | uses an LLM to reason to act | ReAct [3], AutoGPT | see below |

GPT-3 marked the beginning of the LLM era; people then started exploring LLMs across different tasks:

The paradigms of reasoning and acting then started to converge, which led to building reasoning agents

Example task: question answering

Answering questions may require

Retrieval-augmented generation (RAG) for knowledge: think of the retriever as a search engine; it pulls the relevant information from a corpus, then appends it to the context of the language model

Tool use [4, 5]
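The retrieve-then-append idea behind RAG can be sketched in a few lines. This is only an illustration under stated assumptions: the word-overlap scoring, the toy corpus, and the names `retrieve` and `build_prompt` are hypothetical, and a real system would use a proper retriever and pass the prompt to an LLM.

```python
import re


def tokenize(text):
    """Lowercase a string and split it into a set of word tokens."""
    return set(re.findall(r"\w+", text.lower()))


def retrieve(query, corpus, k=1):
    """Toy retriever (assumption): rank documents by word overlap with the query."""
    query_tokens = tokenize(query)
    ranked = sorted(
        corpus,
        key=lambda doc: len(query_tokens & tokenize(doc)),
        reverse=True,
    )
    return ranked[:k]


def build_prompt(query, corpus, k=1):
    """Append the retrieved passages to the context given to the language model."""
    passages = retrieve(query, corpus, k)
    context = "\n".join(f"- {p}" for p in passages)
    return f"Context:\n{context}\n\nQuestion: {query}\nAnswer:"


# Hypothetical three-document corpus built from facts in these notes.
corpus = [
    "ELIZA was created in 1966 as a rule-based text agent.",
    "LSTM-DQN is a text agent trained with reinforcement learning.",
    "GPT-3 showed that language models can learn from a few examples.",
]

prompt = build_prompt("When was ELIZA created?", corpus)
print(prompt)
```

The language model never sees the whole corpus, only the passages the retriever selected, which is why the quality of retrieval bounds the quality of the answer.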

What if both knowledge and reasoning are needed? ideas:

Solutions to question answering are scattered: people come up with a separate solution for each benchmark.
Can we have a simple and unifying solution? We need a higher-level abstraction beyond individual tasks or methods

| | Pros | Cons |
| --- | --- | --- |
| CoT | intuitive, flexible and general way to augment test-time compute and to think for longer during inference time to solve complex questions | lack of external knowledge and tools |
| Paradigm of acting (RAG / Retrieval / tool use) | flexible and general to augment knowledge, computation and feedback | lack of reasoning |

ReAct: a new paradigm of agents that reason and act; synergy of reasoning and acting, simple and intuitive to use, general across domains

ReAct beyond question answering: many tasks can be turned into text games [6, 7]

| Type | Action space | Details |
| --- | --- | --- |
| Traditional agents | action space $A$ defined by the environment | <ul><li>external feedback $o_{t}$</li><li>agent context $c_{t} = (o_{1}, a_{1}, o_{2}, a_{2}, \dots , o_{t})$</li><li>agent action $a_{t} \sim \pi(a \mid c_{t}) \in A$</li></ul> |
| ReAct | action space $\hat{A} = A \cup \mathcal{L}$ augmented by reasoning | <ul><li>$\hat{a}_{t} \in \mathcal{L}$ can be any language sequence</li><li>agent context $c_{t+1} = (c_{t}, \hat{a}_{t}, a_{t}, o_{t+1})$</li><li>$\hat{a}_{t} \in \mathcal{L}$ only updates internal context</li></ul> |

Reasoning agent: reasoning is an internal action for agents
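The augmented action space can be illustrated with a minimal loop in which a thought is appended to the context but never sent to the environment, while an action produces a new observation. This is a sketch, not the ReAct implementation: `ToyEnv`, its `search[...]` / `finish[...]` actions, and `scripted_policy` (a stand-in for an LLM call) are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Step:
    kind: str  # "task", "thought", "action", or "observation"
    text: str


class ToyEnv:
    """Hypothetical environment supporting search[<key>] and finish[<answer>]."""

    FACTS = {"ReAct": "ReAct was published in 2022."}

    def step(self, action):
        name, _, arg = action.partition("[")
        arg = arg.rstrip("]")
        if name == "search":
            return self.FACTS.get(arg, "No result."), False
        if name == "finish":
            return f"Episode ended with answer: {arg}", True
        return "Invalid action.", False


def scripted_policy(context):
    """Stand-in for an LLM: maps the context to a (thought, action) pair."""
    observations = [s.text for s in context if s.kind == "observation"]
    if not observations:
        return "I should look up ReAct first.", "search[ReAct]"
    return "The observation mentions 2022, so I can answer.", "finish[2022]"


def react_loop(env, policy, task, max_steps=5):
    context = [Step("task", task)]
    for _ in range(max_steps):
        thought, action = policy(context)
        # Reasoning is an internal action: it only updates the context.
        context.append(Step("thought", thought))
        # An external action is sent to the environment and yields feedback.
        observation, done = env.step(action)
        context.append(Step("action", action))
        context.append(Step("observation", observation))
        if done:
            break
    return context


trajectory = react_loop(ToyEnv(), scripted_policy, "When was ReAct published?")
for step in trajectory:
    print(f"{step.kind}: {step.text}")
```

Each pass through the loop extends the context with a thought, an action, and an observation, mirroring $c_{t+1} = (c_{t}, \hat{a}_{t}, a_{t}, o_{t+1})$.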

Memory

| Short-term memory | Long-term memory |
| --- | --- |
| <ul><li>append-only</li><li>limited context</li><li>limited attention</li><li>does not persist over new tasks</li></ul> | <ul><li>read and write</li><li>stores experience, knowledge, skills…</li><li>persists over new experience</li></ul> |

Reflexion [8]: reflect on failure or success, keep track of the experience as long-term memory, then try to do better next time
task → trajectory → evaluation (internal / external) → reflection → next trajectory

Traditional reinforcement learning: the agent receives a scalar reward (a sparse signal) after an action, and the reward is propagated back to update the weights of the policy (credit assignment)
Reflexion (“verbal” RL): 1) the feedback is not a scalar reward but, e.g., a code execution result or text; 2) learning does not happen by gradient descent but by updating a long-term memory of task knowledge, which affects the future behavior of the policy
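The task → trajectory → evaluation → reflection → next trajectory cycle can be sketched as a loop over trials. Everything here is a hypothetical stand-in rather than the Reflexion implementation: `attempt` plays the role of the LLM actor, `evaluate` returns text instead of a scalar reward, and `reflect` writes a one-line lesson into long-term memory.

```python
def attempt(task, reflections):
    """Hypothetical actor: pick the first candidate not ruled out by memory."""
    for candidate in task["candidates"]:
        if f"avoid {candidate}" not in reflections:
            return candidate
    return None


def evaluate(task, answer):
    """Evaluation returns text feedback, not a scalar reward."""
    if answer == task["solution"]:
        return "passed"
    return f"failed: {answer} is wrong"


def reflect(answer, feedback):
    """Turn the feedback into a verbal lesson for future trials."""
    return f"avoid {answer}"


def reflexion(task, max_trials=5):
    reflections = []  # long-term memory; no model weights are updated
    for trial in range(1, max_trials + 1):
        answer = attempt(task, reflections)
        feedback = evaluate(task, answer)
        if feedback == "passed":
            return answer, trial, reflections
        reflections.append(reflect(answer, feedback))
    return None, max_trials, reflections


task = {"candidates": ["a", "b", "c"], "solution": "c"}
answer, trials, memory = reflexion(task)
print(answer, trials, memory)
```

The policy itself never changes between trials; only the memory it conditions on grows, which is the sense in which the learning is "verbal".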

Voyager [9]: a procedural memory of code-based skills
Idea: add skills to the skill library; pull the skill next time instead of trying from scratch
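A skill library of this kind can be sketched as a dictionary from skill names to executable code. This is an illustrative sketch only, not Voyager's implementation: `SkillLibrary`, `solve`, and the hard-coded `HYPOTHETICAL_LLM_OUTPUT` (standing in for code an LLM would write) are assumptions.

```python
class SkillLibrary:
    """Procedural memory: named skills stored as executable code."""

    def __init__(self):
        self._skills = {}

    def add(self, name, source):
        # Execute the source in a fresh namespace and keep the function it defines.
        namespace = {}
        exec(source, namespace)
        self._skills[name] = namespace[name]

    def get(self, name):
        return self._skills.get(name)


# Stand-in for code an LLM would generate when a skill is missing (assumption).
HYPOTHETICAL_LLM_OUTPUT = {
    "double": "def double(x):\n    return 2 * x\n",
}


def solve(task_name, arg, library, log):
    skill = library.get(task_name)
    if skill is None:
        # No stored skill yet: write it from scratch, then save it for next time.
        log.append(f"wrote {task_name}")
        library.add(task_name, HYPOTHETICAL_LLM_OUTPUT[task_name])
        skill = library.get(task_name)
    else:
        # Skill already in the library: pull it instead of starting over.
        log.append(f"reused {task_name}")
    return skill(arg)


library, log = SkillLibrary(), []
first = solve("double", 3, library, log)
second = solve("double", 5, library, log)
print(first, second, log)
```

Running generated code with `exec` is acceptable in a toy example with fixed strings; anything executing real model output would need sandboxing.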

Generative agents [10]: episodic memory of experience
Idea: each agent keeps a log of events; look at the log and decide what to work on later
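An event log of this kind can be sketched as an append-only list with a simple recall function. The keyword match and recency ordering used here are simplifying assumptions for illustration, and the names `Event` and `EpisodicMemory` are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Event:
    time: int
    description: str


class EpisodicMemory:
    """Append-only log of events with keyword-based recall (an assumption)."""

    def __init__(self):
        self.log = []

    def record(self, time, description):
        self.log.append(Event(time, description))

    def recall(self, keyword, k=2):
        # Keep events that mention the keyword, most recent first.
        matches = [e for e in self.log if keyword.lower() in e.description.lower()]
        return sorted(matches, key=lambda e: e.time, reverse=True)[:k]


memory = EpisodicMemory()
memory.record(1, "Woke up and made coffee")
memory.record(2, "Promised Sam to review the paper draft")
memory.record(3, "Had lunch")
memory.record(4, "Sam sent the paper draft")

recalled = memory.recall("paper")
for event in recalled:
    print(event.time, event.description)
```

An agent would pass the recalled events into its context before deciding what to work on next.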

Think of the language model as a form of long-term memory; improve yourself by:

  1. changing the parameters of the neural network, or
  2. writing some piece of code or language in the long-term memory
    • if we think of both the neural network and text corpora as forms of long-term memory, we get a unified abstraction of learning
    • we then have an agent with the power of reasoning over a special form of short-term memory called context

Cognitive architectures for language agents (CoALA) [11]: express any agent by

This research [11] also discussed:

What distinguishes external environment vs. internal memory? e.g.,

What distinguishes long-term vs. short-term memory? e.g., is a context of 10 million tokens considered long-term memory?

How are reasoning agents different from previous agents?

A very minimal history of agents:

| Timeline | Era | Examples |
| --- | --- | --- |
| 1960s - 1990s | Symbolic AI agent | SHRDLU, Expert System, Cognitive architecture, DeepBlue… |
| 1990s - 2000s | “AI winter” | |
| 2010s onwards | (Deep) RL agent | Atari-DQN, AlphaGo, OpenAI Five, MuZero… |
| 2020s onwards | LLM agent | |

Difference: the kind of representation used to get from the observation to the action

| Type | Mapping |
| --- | --- |
| Symbolic agents | map observations into a set of logical expressions |
| Deep RL agents | map observations into some kind of embedding |

Symbolic state or neural embedding

Open-ended natural language

Digital automation (e.g., filing reports on SAP Concur, coding experiments in VS Code, exploring papers on arXiv): tremendous practical value, but little progress
Underlying research challenge:

The history of LLM

Examples: WebShop [12], WebArena [13], SWE-Bench [14], ChemCrow [15]

Some lessons for research:

Future directions of LLM agents

| Stage | Details | Research example |
| --- | --- | --- |
| Training | instead of just prompting, models should be trained specifically for agentic behavior using trajectory data (e.g., self-evaluation thoughts) that is rarely found on the open internet | FireAct [16] |
| Interface | environments should be redesigned specifically for agents (Human-Computer-Agent Interface) | SWE-agent [17] |
| Robustness<br>human-in-the-loop | developing agents that can interact effectively with “humans-in-the-loop,” such as simulated users who do not provide all information upfront | |
| Benchmark | future benchmarks must move beyond “pass@k” (solving a task once out of many tries) toward 100% reliability, especially for high-consequence roles like customer service | τ-bench [18] |
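The gap between "solved once out of many tries" and "reliable every time" can be made concrete with a small calculation. This sketch assumes that each try succeeds independently with the same probability `p`, which is a simplification; the function names are hypothetical.

```python
def pass_at_k(p, k):
    """Probability that at least one of k independent tries succeeds."""
    return 1 - (1 - p) ** k


def all_k_pass(p, k):
    """Probability that all k independent tries succeed (a reliability view)."""
    return p ** k


# Assumed example: an agent that succeeds 90% of the time on a single try.
p, k = 0.9, 8
print(f"at least one success in {k} tries: {pass_at_k(p, k):.4f}")
print(f"success on all {k} tries:          {all_k_pass(p, k):.4f}")
```

Under these assumptions the first number is close to 1 while the second is about 0.43, so an agent that looks nearly perfect under pass@k can still fail more than half of the time when it must succeed on every one of eight interactions.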

References

  1. Karthik Narasimhan, Tejas Kulkarni, Regina Barzilay. Language Understanding for Text-based Games Using Deep Reinforcement Learning. arXiv:1506.08941 [cs.CL]. 2015. 

  2. Tom B. Brown et al. Language Models are Few-Shot Learners. arXiv:2005.14165 [cs.CL]. 2020. 

  3. Shunyu Yao et al. ReAct: Synergizing Reasoning and Acting in Language Models. arXiv:2210.03629 [cs.CL]. 2022. 

  4. Aaron Parisi, Yao Zhao, Noah Fiedel. TALM: Tool Augmented Language Models. arXiv:2205.12255 [cs.CL]. 2022. 

  5. Timo Schick et al. Toolformer: Language Models Can Teach Themselves to Use Tools. arXiv:2302.04761 [cs.CL]. 2023. 

  6. Mohit Shridhar et al. ALFWorld: Aligning Text and Embodied Environments for Interactive Learning. arXiv:2010.03768 [cs.CL]. 2020. 

  7. Wenlong Huang et al. Inner Monologue: Embodied Reasoning through Planning with Language Models. arXiv:2207.05608 [cs.RO]. 2022. 

  8. Noah Shinn et al. Reflexion: Language Agents with Verbal Reinforcement Learning. arXiv:2303.11366 [cs.AI]. 2023. 

  9. Guanzhi Wang et al. Voyager: An Open-Ended Embodied Agent with Large Language Models. arXiv:2305.16291 [cs.AI]. 2023. 

  10. Joon Sung Park et al. Generative Agents: Interactive Simulacra of Human Behavior. arXiv:2304.03442 [cs.HC]. 2023. 

  11. Theodore R. Sumers et al. Cognitive Architectures for Language Agents. arXiv:2309.02427 [cs.AI]. 2023.

  12. Shunyu Yao et al. WebShop: Towards Scalable Real-World Web Interaction with Grounded Language Agents. arXiv:2207.01206 [cs.CL]. 2022. 

  13. Shuyan Zhou et al. WebArena: A Realistic Web Environment for Building Autonomous Agents. arXiv:2307.13854 [cs.AI]. 2023. 

  14. Carlos E. Jimenez et al. SWE-bench: Can Language Models Resolve Real-World GitHub Issues? arXiv:2310.06770 [cs.CL]. 2023.

  15. Andres M Bran et al. ChemCrow: Augmenting large-language models with chemistry tools. arXiv:2304.05376 [physics.chem-ph]. 2023. 

  16. Baian Chen et al. FireAct: Toward Language Agent Fine-tuning. arXiv:2310.05915 [cs.CL]. 2023. 

  17. John Yang et al. SWE-agent: Agent-Computer Interfaces Enable Automated Software Engineering. arXiv:2405.15793 [cs.SE]. 2024. 

  18. Shunyu Yao et al. τ-bench: A Benchmark for Tool-Agent-User Interaction in Real-World Domains. arXiv:2406.12045 [cs.AI]. 2024.