Lecture 07: Multimodal Agents - From Perception to Action
Link to lecture recording on YouTube
Date: 2025-03-17
Speaker: Caiming Xiong
Speaker’s social profile: Website / Google Scholar / GitHub / LinkedIn / X (Twitter)
Education:
- Ph.D. in Computer Science and Engineering, 2008-2014, State University of New York at Buffalo
- B.S. and M.S. in Computer Science, 2001-2007, Huazhong University of Science and Technology
Work:
- SVP, AI Research & Applied Research, Salesforce
Notes
Frontier foundation models are powerful and improving rapidly, on some tasks even surpassing human performance
Multimodal agents (e.g., coding agents, web agents, physical agents):
- computer tasks often involve multiple apps and interfaces
- powered by advancements in large vision-language-action models (VLA-Ms)
- make digital interactions more accessible and vastly increase human productivity
Environment / Benchmark: should be reconfigurable and expandable
Data: diverse modalities, large-scale, covering a wide range of tasks
Model / System: unified vision-language-reasoning-action model, and long-context inference
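A minimal sketch of how the three pieces above could fit together as a perception-to-action loop. It was not shown in the lecture. Every name here (`Observation`, `Action`, `Task`, `Environment`, `scripted_policy`, `run_episode`) is hypothetical and chosen only for illustration. The environment is "reconfigurable and expandable" only in the narrow sense that new tasks can be registered without changing its code, and the scripted policy is a stand-in for a real vision-language-action model.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple


@dataclass
class Observation:
    screenshot: bytes  # stand-in for pixels (vision modality)
    text: str          # stand-in for instruction / UI text (language modality)


@dataclass
class Action:
    kind: str          # e.g. "click", "type", "done"
    argument: str = ""


@dataclass
class Task:
    instruction: str
    is_success: Callable[[List[Action]], bool]  # checker over the action history


class Environment:
    """Toy environment: tasks are registered at runtime (assumed design)."""

    def __init__(self) -> None:
        self._tasks: Dict[str, Task] = {}
        self._current: Task | None = None
        self._history: List[Action] = []

    def register_task(self, name: str, task: Task) -> None:
        self._tasks[name] = task

    def reset(self, name: str) -> Observation:
        self._current = self._tasks[name]
        self._history = []
        return Observation(b"", f"{self._current.instruction} | last action: none")

    def step(self, action: Action) -> Tuple[Observation, bool]:
        assert self._current is not None, "call reset() first"
        self._history.append(action)
        obs = Observation(
            b"", f"{self._current.instruction} | last action: {action.kind}"
        )
        return obs, action.kind == "done"

    def evaluate(self) -> bool:
        assert self._current is not None, "call reset() first"
        return self._current.is_success(self._history)


def scripted_policy(obs: Observation) -> Action:
    """Placeholder for a model: type once, then declare the task done."""
    if "last action: type" in obs.text:
        return Action("done")
    return Action("type", "hello")


def run_episode(env: Environment, name: str,
                policy: Callable[[Observation], Action],
                max_steps: int = 5) -> bool:
    obs = env.reset(name)
    for _ in range(max_steps):
        obs, done = env.step(policy(obs))  # perceive, act, observe again
        if done:
            break
    return env.evaluate()


env = Environment()
env.register_task(
    "type_hello",
    Task("Type hello",
         lambda h: any(a.kind == "type" and a.argument == "hello" for a in h)),
)
env.register_task(
    "click_ok",
    Task("Click OK",
         lambda h: any(a.kind == "click" and a.argument == "OK" for a in h)),
)
print(run_episode(env, "type_hello", scripted_policy))
print(run_episode(env, "click_ok", scripted_policy))
```

The scripted policy solves the first task and fails the second, which is the point of keeping the task set expandable: a new task exposes what the current policy cannot do.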
[Incomplete, work in progress]
References