ucb_agentic_ai

Lecture 02: LLM Agents: Brief History and Overview

Link to lecture recording on YouTube

Date: 2024-09-16

Speaker: Shunyu Yao 姚顺雨

Speaker’s Social Profile: Website / Google Scholar / GitHub / LinkedIn / X (Twitter)

Education:

Ph.D. in Computer Science, 2019-2024, Princeton University
Bachelor’s degree in Computer Science, 2015-2019, Tsinghua University

Work:

Research Scientist, OpenAI

Notes

What are LLM agents

Agent: an intelligent system that interacts with some environment

physical environments: robot, autonomous car, …
digital environments: DQN for Atari, Siri, AlphaGo, …
humans as environments: chatbot

A brief history of LLM agents

in the recent context of LLMs
in the much older context of agents

Define “agent” by defining “intelligence” and “environment”

Type	Characteristics	Examples	Details
Level 1: text agent	uses text action and observation	<ul><li>ELIZA (1966): text agent via rule design</li><li>LSTM-DQN¹ (2015): text agent via reinforcement learning (RL)</li></ul>	<ul><li>ELIZA: domain specific, requires manual design</li><li>LSTM-DQN: domain specific, requires scalar reward signals and extensive training (a feature of RL)</li></ul>
Level 2: LLM agent	uses LLM to act	SayCan, Language Planner	<ul><li>promise of LLMs: generality and few-shot learning²</li><li>training: next-token prediction on massive text corpora</li><li>inference: (few-shot) prompting for various tasks</li></ul>
Level 3: reasoning agent	uses LLM to reason to act	ReAct³, AutoGPT	see below

GPT-3 marked the beginning of the LLM era; people then started to explore LLMs across different tasks:

reasoning tasks: symbolic question & answer, chain-of-thought (CoT), self-consistency etc.
acting tasks (grounding, tool use etc.): game, robotics, RAG etc.

The paradigms of reasoning and acting then started to converge, and people began to build reasoning agents:

new applications / tasks / benchmarks: web browsing, software engineering, scientific discovery
new methods: memory, learning, planning, multi-agent etc.

Example task: question answering

Answering questions may require

reasoning (e.g., complex statement)
knowledge (e.g., after LLM knowledge cut-off date)
computation (e.g., prime factorization of a large number)

Retrieval-augmented generation (RAG) for knowledge: think of the retriever as a search engine; it pulls the relevant information from a corpus and appends it to the context of the language model
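The retrieve-then-append idea can be sketched in a few lines. This is a minimal illustration, not how any production RAG system works: the word-overlap scorer stands in for a real retriever, and the names `score`, `retrieve` and `build_prompt` as well as the prompt layout are assumptions made for this example.

```python
from collections import Counter
from typing import List


def score(query: str, passage: str) -> int:
    """Toy relevance score: number of shared lowercase tokens."""
    q = Counter(query.lower().split())
    p = Counter(passage.lower().split())
    return sum((q & p).values())


def retrieve(query: str, corpus: List[str], k: int = 1) -> List[str]:
    """Return the k passages with the highest overlap score."""
    return sorted(corpus, key=lambda p: score(query, p), reverse=True)[:k]


def build_prompt(query: str, corpus: List[str], k: int = 1) -> str:
    """Append the retrieved passages to the context given to the language model."""
    context = "\n".join(retrieve(query, corpus, k))
    return f"Context:\n{context}\n\nQuestion: {query}\nAnswer:"


corpus = ["The Eiffel Tower is in Paris.", "Mount Fuji is in Japan."]
prompt = build_prompt("Where is the Eiffel Tower?", corpus)
```

The resulting `prompt` would then be sent to the language model; a real retriever would typically use a search index or embeddings instead of token overlap.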

Tool-use⁴⸴⁵

What if both knowledge and reasoning are needed? Ideas from prior work:

Interleaving Retrieval with Chain-of-Thought Reasoning for Knowledge-Intensive Multi-Step Questions (IRCoT)
Measuring and Narrowing the Compositionality Gap in Language Models (self-ask)

Solutions to question answering are scattered: people come up with a separate solution for each benchmark.
Can we have a simple, unifying solution? We need a higher-level abstraction beyond individual tasks or methods.

	Pros	Cons
CoT	intuitive, flexible and general way to spend more test-time compute (think for longer at inference time) on complex questions	lacks external knowledge and tools
Paradigm of acting (RAG / retrieval / tool use)	flexible and general way to augment knowledge, computation and feedback	lacks reasoning

ReAct: a new paradigm of agents that reason and act; synergy of reasoning and acting, simple and intuitive to use, general across domains

reasoning: update internal belief
acting: obtain external feedback

ReAct beyond question answering: many tasks can be turned into text games⁶⸴⁷

Type	Action space	Details
Traditional agents	action space A defined by the environment	<ul><li>external feedback $o_{t}$</li><li>agent context $c_{t} = (o_{1}, a_{1}, o_{2}, a_{2}, \dots , o_{t})$</li><li>agent action $a_{t} \sim \pi(a \mid c_{t}) \in A$</li></ul>
ReAct	action space $\hat{A} = A \cup \mathcal{L}$ augmented by reasoning	<ul><li>$\hat{a}_{t} \in \mathcal{L}$ can be any language sequence</li><li>agent context $c_{t+1} = (c_{t}, \hat{a}_{t}, a_{t}, o_{t+1})$</li><li>$\hat{a}_{t} \in \mathcal{L}$ only updates internal context</li></ul>
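The augmented action space above can be sketched as a loop in which thoughts only extend the context, while environment actions also produce an observation. This is a minimal sketch under stated assumptions: `scripted_policy` is a hard-coded stand-in for an LLM, `toy_env` is a made-up environment, and the `search[...]` / `finish[...]` action strings are chosen for illustration only.

```python
from typing import Callable, List, Tuple

Step = Tuple[str, str]  # (kind, text), kind is "observation", "thought" or "action"


def react_loop(policy: Callable[[List[Step]], Step],
               env_step: Callable[[str], str],
               first_obs: str,
               max_steps: int = 10) -> List[Step]:
    """Run a ReAct-style loop and return the full context."""
    context: List[Step] = [("observation", first_obs)]
    for _ in range(max_steps):
        kind, text = policy(context)
        context.append((kind, text))
        if kind == "thought":
            continue  # reasoning only updates the internal context
        if text.startswith("finish"):
            break  # terminal action, no further observation
        context.append(("observation", env_step(text)))  # external feedback
    return context


def scripted_policy(context: List[Step]) -> Step:
    """Hypothetical stand-in for an LLM: replays a fixed script."""
    script = [
        ("thought", "I need to look up the capital of France."),
        ("action", "search[capital of France]"),
        ("thought", "The observation says the capital is Paris."),
        ("action", "finish[Paris]"),
    ]
    taken = sum(1 for kind, _ in context if kind != "observation")
    return script[taken]


def toy_env(action: str) -> str:
    """Hypothetical environment that only understands search actions."""
    if action.startswith("search"):
        return "Paris is the capital of France."
    return "unknown action"


trajectory = react_loop(scripted_policy, toy_env,
                        "Question: What is the capital of France?")
```

In a real agent, `policy` would prompt an LLM with the serialized context and parse its output into a thought or an action.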

Reasoning agent: reasoning is an internal action for agents

Memory

Short-term memory	Long-term memory
<ul><li>append-only</li><li>limited context</li><li>limited attention</li><li>does not persist over new tasks</li></ul>	<ul><li>read and write</li><li>stores experience, knowledge, skills…</li><li>persists over new experience</li></ul>

Reflexion⁸: reflect on failure or success, keep track of the experience as a long-term memory, then try to be better next time
task → trajectory → evaluation (internal / external) → reflection → next trajectory

traditional form of reinforcement learning: get a scalar reward (a sparse signal) after an action, then backpropagate the reward to update the weights of the policy (credit assignment)
Reflexion (“verbal” RL): 1) the feedback is not a scalar reward but a code execution result, text etc.; 2) learning is not done by gradient descent but by updating a long-term memory of task knowledge, which affects the future behavior of the policy
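The task → trajectory → evaluation → reflection cycle can be sketched as follows. This is an illustrative toy, not the Reflexion implementation: `toy_attempt`, `toy_evaluate` and `toy_reflect` are hypothetical stand-ins for the LLM actor, the evaluator and the self-reflection step, and the "task" is simply to produce a correct `add` function.

```python
from typing import Callable, List, Optional, Tuple

Solution = Callable[[int, int], int]


def reflexion_loop(attempt: Callable[[List[str]], Solution],
                   evaluate: Callable[[Solution], Tuple[bool, str]],
                   reflect: Callable[[str], str],
                   max_trials: int = 3) -> Tuple[Optional[Solution], List[str]]:
    """Retry a task, storing verbal reflections instead of updating weights."""
    memory: List[str] = []  # long-term memory of reflections
    for _ in range(max_trials):
        trajectory = attempt(memory)            # act, conditioned on memory
        success, feedback = evaluate(trajectory)  # non-scalar, textual feedback
        if success:
            return trajectory, memory
        memory.append(reflect(feedback))        # "learning" = writing to memory
    return None, memory


def toy_attempt(memory: List[str]) -> Solution:
    """Hypothetical actor: wrong on the first try, correct once a reflection exists."""
    if not memory:
        return lambda a, b: a - b
    return lambda a, b: a + b


def toy_evaluate(fn: Solution) -> Tuple[bool, str]:
    """Hypothetical evaluator: a single unit test with a textual result."""
    got = fn(2, 3)
    return got == 5, f"add(2, 3) returned {got}, expected 5"


def toy_reflect(feedback: str) -> str:
    """Hypothetical reflection step: turn feedback into advice for the next trial."""
    return f"Previous attempt failed ({feedback}); use addition next time."


solution, reflections = reflexion_loop(toy_attempt, toy_evaluate, toy_reflect)
```

Note that no parameters change between trials; only the contents of `memory` differ, which is the sense in which the learning is "verbal".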

Voyager⁹: a procedural memory of code-based skills
Idea: add skills to the skill library; pull the skill next time instead of trying from scratch
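A skill library of this kind can be sketched as a store of reusable callables. The sketch below is a simplification under stated assumptions: skills are looked up by exact task name, whereas a real system would need some form of similarity-based retrieval, and `SkillLibrary`, `solve` and `write_skill` are hypothetical names.

```python
from typing import Callable, Dict, Optional

Skill = Callable[[], str]


class SkillLibrary:
    """Procedural long-term memory: maps a task name to executable code."""

    def __init__(self) -> None:
        self._skills: Dict[str, Skill] = {}

    def add(self, name: str, skill: Skill) -> None:
        self._skills[name] = skill

    def get(self, name: str) -> Optional[Skill]:
        return self._skills.get(name)


def solve(task: str, library: SkillLibrary,
          write_skill: Callable[[str], Skill]) -> str:
    """Reuse a stored skill if one exists; otherwise write one and store it."""
    skill = library.get(task)
    if skill is None:
        skill = write_skill(task)  # expensive path: solve from scratch
        library.add(task, skill)
    return skill()


calls = {"count": 0}


def write_skill(task: str) -> Skill:
    """Hypothetical stand-in for an LLM writing new skill code."""
    calls["count"] += 1
    return lambda: f"done: {task}"


library = SkillLibrary()
first = solve("craft wooden pickaxe", library, write_skill)
second = solve("craft wooden pickaxe", library, write_skill)
```

The second call reuses the stored skill, so `write_skill` runs only once.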

Generative agents¹⁰: episodic memory of experience
Idea: each agent keeps a log of events; look at the log and decide what to work on later
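An event log with retrieval can be sketched as below. The scoring combines recency, importance and relevance, which are the three signals the Generative Agents paper describes; the concrete formula, the word-overlap relevance, the hand-assigned importance values and the names `Event` and `EpisodicMemory` are assumptions made for this toy example.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Event:
    time: int          # when the event was logged
    text: str          # natural-language description
    importance: float  # assumed to be given, in [0, 1]


class EpisodicMemory:
    """Append-only log of events with a simple scored retrieval."""

    def __init__(self) -> None:
        self.events: List[Event] = []

    def record(self, time: int, text: str, importance: float) -> None:
        self.events.append(Event(time, text, importance))

    def retrieve(self, query: str, now: int, k: int = 2,
                 decay: float = 0.9) -> List[Event]:
        """Return the k events with the highest combined score."""
        query_words = set(query.lower().split())

        def score(event: Event) -> float:
            recency = decay ** (now - event.time)
            relevance = len(query_words & set(event.text.lower().split()))
            return recency + event.importance + relevance

        return sorted(self.events, key=score, reverse=True)[:k]


memory = EpisodicMemory()
memory.record(1, "ate breakfast", 0.1)
memory.record(2, "planned a party with Maria", 0.8)
memory.record(3, "watered plants", 0.1)
top = memory.retrieve("party planning", now=4)
```

The retrieved events would then be placed in the agent's context when it decides what to do next.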

Think of the language model itself as a form of long-term memory; an agent can improve itself by:

changing the parameters of the neural network, or
writing some piece of code or language into the long-term memory

If both the neural network and the text corpora are treated as forms of long-term memory, we get a unified abstraction of learning.
We then have an agent with the power of reasoning over a special form of short-term memory called context.

Cognitive architectures for language agents (CoALA)¹¹: express any agent by

memory: where the information is stored
action space: what the agent can do
decision-making procedure: how the agent chooses which action to take

The CoALA paper¹¹ also discusses:

What distinguishes external environment vs. internal memory? e.g.,

if an agent opens a Google Doc and writes something there, is that a form of long-term memory, or an action that changes the external environment?
if an agent retrieves knowledge from an archive of the Internet, is that an action, or a retrieval from long-term memory?

What distinguishes long-term vs. short-term memory? e.g., is a context of 10 million tokens considered long-term memory?

How are reasoning agents different from previous agents?

Difference: the kind of representation used to get from the observation to the action

Type	Mapping
Symbolic agents	map observations into a set of logical expressions
Deep RL agents	map observations into some kind of embedding
Reasoning agents	map observations into open-ended natural language

Symbolic state or neural embedding

intensive efforts to design or train
task specific, hard to generalize

Open-ended natural language

rich priors from LLMs
inference-time scalable
general and generalizable

Digital automation (e.g., filing reports on SAP Concur, coding experiments in VS Code, exploring papers on arXiv): tremendous practical value, but little progress so far
Underlying research challenge:

reasoning over real-world language (and other modalities); e.g., when writing code, the paradigm of sequence-to-sequence mapping is not enough
decision making over open-ended actions and long horizons

The history of LLMs

on one side, the math keeps getting better and better
on the other side, equally important, the tasks are becoming more practical and more scalable

Examples: WebShop¹², WebArena¹³, SWE-Bench¹⁴, ChemCrow¹⁵

Some lessons for research:

simplicity and generality
need both:
- thinking in abstraction
- familiarity with tasks (not task-specific methods)
learning history and other subjects helps

Future directions of LLM agents

Stage	Details	Research example
Training	instead of just prompting, models should be trained specifically for agentic behavior using trajectory data (e.g., self-evaluation thoughts) that is rarely found on the open internet	FireAct¹⁶
Interface	environments should be redesigned specifically for agents (Human-Computer-Agent Interface)	SWE-agent¹⁷
Robustness and human-in-the-loop	developing agents that can interact effectively with humans in the loop, such as simulated users who do not provide all information upfront
Benchmark	future benchmarks must move beyond “pass@k” (solving a task at least once in k tries) toward consistent reliability (solving it on every try), especially for high-consequence roles like customer service	τ-bench¹⁸
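The gap between "solved at least once" and "solved every time" can be made concrete with two estimators computed from n recorded trials of a task, of which c succeeded. `pass_at_k` is the standard unbiased estimator used in the code-generation literature; `pass_hat_k` follows my reading of the all-k-trials-succeed metric in τ-bench¹⁸ and should be treated as an assumption rather than the benchmark's reference code.

```python
from math import comb


def _check(n: int, c: int, k: int) -> None:
    """Validate that 0 <= c <= n and 1 <= k <= n."""
    if not (0 <= c <= n and 1 <= k <= n):
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")


def pass_at_k(n: int, c: int, k: int) -> float:
    """Probability that at least one of k sampled trials succeeds."""
    _check(n, c, k)
    return 1.0 - comb(n - c, k) / comb(n, k)


def pass_hat_k(n: int, c: int, k: int) -> float:
    """Probability that all k sampled trials succeed."""
    _check(n, c, k)
    return comb(c, k) / comb(n, k)


# An agent that succeeds on 5 of 10 trials looks strong under pass@k
# but weak under the all-trials metric as k grows.
at_least_once = pass_at_k(n=10, c=5, k=2)
every_time = pass_hat_k(n=10, c=5, k=2)
```

For k = 1 the two metrics coincide; as k grows, `pass_at_k` rises toward 1 while `pass_hat_k` falls toward 0 unless the agent is reliable on every trial.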

References

1. Karthik Narasimhan, Tejas Kulkarni, Regina Barzilay. Language Understanding for Text-based Games Using Deep Reinforcement Learning. arXiv:1506.08941 [cs.CL]. 2015.
2. Tom B. Brown et al. Language Models are Few-Shot Learners. arXiv:2005.14165 [cs.CL]. 2020.
3. Shunyu Yao et al. ReAct: Synergizing Reasoning and Acting in Language Models. arXiv:2210.03629 [cs.CL]. 2022.
4. Aaron Parisi, Yao Zhao, Noah Fiedel. TALM: Tool Augmented Language Models. arXiv:2205.12255 [cs.CL]. 2022.
5. Timo Schick et al. Toolformer: Language Models Can Teach Themselves to Use Tools. arXiv:2302.04761 [cs.CL]. 2023.
6. Mohit Shridhar et al. ALFWorld: Aligning Text and Embodied Environments for Interactive Learning. arXiv:2010.03768 [cs.CL]. 2020.
7. Wenlong Huang et al. Inner Monologue: Embodied Reasoning through Planning with Language Models. arXiv:2207.05608 [cs.RO]. 2022.
8. Noah Shinn et al. Reflexion: Language Agents with Verbal Reinforcement Learning. arXiv:2303.11366 [cs.AI]. 2023.
9. Guanzhi Wang et al. Voyager: An Open-Ended Embodied Agent with Large Language Models. arXiv:2305.16291 [cs.AI]. 2023.
10. Joon Sung Park et al. Generative Agents: Interactive Simulacra of Human Behavior. arXiv:2304.03442 [cs.HC]. 2023.
11. Theodore R. Sumers et al. Cognitive Architectures for Language Agents. arXiv:2309.02427 [cs.AI]. 2023.
12. Shunyu Yao et al. WebShop: Towards Scalable Real-World Web Interaction with Grounded Language Agents. arXiv:2207.01206 [cs.CL]. 2022.
13. Shuyan Zhou et al. WebArena: A Realistic Web Environment for Building Autonomous Agents. arXiv:2307.13854 [cs.AI]. 2023.
14. Carlos E. Jimenez et al. SWE-bench: Can Language Models Resolve Real-World GitHub Issues? arXiv:2310.06770 [cs.CL]. 2023.
15. Andres M Bran et al. ChemCrow: Augmenting large-language models with chemistry tools. arXiv:2304.05376 [physics.chem-ph]. 2023.
16. Baian Chen et al. FireAct: Toward Language Agent Fine-tuning. arXiv:2310.05915 [cs.CL]. 2023.
17. John Yang et al. SWE-agent: Agent-Computer Interfaces Enable Automated Software Engineering. arXiv:2405.15793 [cs.SE]. 2024.
18. Shunyu Yao et al. τ-bench: A Benchmark for Tool-Agent-User Interaction in Real-World Domains. arXiv:2406.12045 [cs.AI]. 2024.
