Link to lecture recording on YouTube
Date: 2025-01-27
Speaker: Xinyun Chen 陈昕昀
Large Language Model (LLM) Agent Framework
Agentic workflow facilitates complex tasks:
LLM agent applications: code generation, computer use, personal assistants, robotics, etc.
domains: education, legal, finance, healthcare, cybersecurity, etc.
Rapid progress of reasoning models since Sep 2024; great improvement between versions of the same model; most impressive performance on competitive math and coding
Topics covered in this course:
Highlights of LLMs in 2024: the advancement of reasoning models [1]
Existing models before OpenAI o1 achieved <25% accuracy if no special inference-time techniques were used
OpenAI o3 model achieves [2]
What is different for o1 model?
Shared core idea of the breakthrough in reasoning models:
need some way to trigger the large language model to generate long chain-of-thought (CoT) before it concludes with the final solution
Focus of this lecture: inference-time techniques for scaling token budget
Idea: use more token budget to generate a single solution
Before the advancement in post-training techniques, standard prompting performance is poor on reasoning benchmarks
Issue: standard few-shot exemplars only provide information on the final solution format, but not the rationale to derive the solution
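The difference between a standard exemplar and a chain-of-thought exemplar can be sketched as two prompt formats. This is a minimal illustration, not the exact template from any paper: the exemplar is modeled on the tennis-ball example from the chain-of-thought prompting paper, while `build_few_shot_prompt` and the test question are hypothetical names and data introduced here.

```python
# Exemplar modeled on the tennis-ball example from the chain-of-thought
# prompting paper (Wei et al.); the dictionary layout is an assumption.
EXEMPLAR = {
    "question": (
        "Roger has 5 tennis balls. He buys 2 more cans of tennis balls. "
        "Each can has 3 tennis balls. How many tennis balls does he have now?"
    ),
    "rationale": (
        "Roger started with 5 balls. 2 cans of 3 tennis balls each is "
        "6 tennis balls. 5 + 6 = 11."
    ),
    "answer": "11",
}


def build_few_shot_prompt(exemplars, test_question, with_rationale):
    """Format exemplars as Q/A pairs, optionally including the rationale."""
    blocks = []
    for ex in exemplars:
        if with_rationale:
            # CoT exemplar: show how the answer is derived.
            answer = f"{ex['rationale']} The answer is {ex['answer']}."
        else:
            # Standard exemplar: only the final answer format is visible.
            answer = f"The answer is {ex['answer']}."
        blocks.append(f"Q: {ex['question']}\nA: {answer}")
    blocks.append(f"Q: {test_question}\nA:")
    return "\n\n".join(blocks)


# Hypothetical test question, used only for illustration.
TEST_QUESTION = "A baker has 3 trays of 4 muffins and sells 5. How many muffins are left?"

standard_prompt = build_few_shot_prompt([EXEMPLAR], TEST_QUESTION, with_rationale=False)
cot_prompt = build_few_shot_prompt([EXEMPLAR], TEST_QUESTION, with_rationale=True)
print(cot_prompt)
```

The two prompts differ only in the exemplar answers, which is the point: the model sees the derivation, not just the output format.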
CoT performance improves more significantly with the increase of model size
Better models benefit more from CoT generation; a drastic improvement on reasoning performance when the model reaches a certain scale
These experiments (from papers in 2022) used pre-trained-only LLMs; with recent post-training techniques, the scaling curves may look different, but the main conclusions still hold [3, 4]
Zero-shot CoT: elicit CoT generation with an instruction
“Let’s think step by step” triggers CoT generation without exemplars
Zero-shot CoT significantly outperforms standard zero-shot prompting, especially on harder tasks (e.g., math, symbolic reasoning) [5]
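A minimal sketch of zero-shot CoT as a two-stage prompt: the first call elicits the reasoning with the trigger sentence, the second asks for the final answer given that reasoning. The `llm` argument is an assumed callable that maps a prompt string to a completion, and `fake_llm` is a stand-in so the sketch runs without a model; the exact answer-extraction wording is also an assumption.

```python
from typing import Callable

TRIGGER = "Let's think step by step."


def zero_shot_cot(question: str, llm: Callable[[str], str]) -> str:
    # Stage 1: elicit a reasoning chain using only the trigger instruction.
    reasoning_prompt = f"Q: {question}\nA: {TRIGGER}"
    reasoning = llm(reasoning_prompt)
    # Stage 2: condition on the generated reasoning and ask for the answer.
    answer_prompt = f"{reasoning_prompt} {reasoning}\nTherefore, the answer is"
    return llm(answer_prompt).strip()


def fake_llm(prompt: str) -> str:
    """Stand-in for a real model call (hypothetical canned outputs)."""
    if prompt.endswith("Therefore, the answer is"):
        return " 7"
    return "There are 3 + 4 = 7 apples."


print(zero_shot_cot("I have 3 apples and get 4 more. How many apples do I have?", fake_llm))
```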
Issue: zero-shot CoT performance is still worse than few-shot CoT
Question: how to improve CoT performance without manually labeling exemplars?
Analogical prompting [6]: prompt the LLM to first recall relevant exemplars, before solving the test problem
Benefit:
Motivation from human analogical reasoning: humans are not explicitly given demonstrations every time for a new task; instead, humans intrinsically recall from past relevant experience
“We look for a formerly solved problem which is linked to our present one by GENERALIZATION, SPECIALIZATION or ANALOGY”
– Professor George Polya, How to Solve It
Besides exemplars, the LLM can also self-generate high-level knowledge, which complements the problems with broader insights
e.g., for a coding problem: first instruct the model to self-generate knowledge (a high-level tutorial on the algorithms needed) → the model identifies the main algorithms and produces examples with corresponding code → the model returns to solve the initial problem
Analogical prompting not only outperforms zero-shot CoT, but also outperforms manual few-shot CoT
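The recall-then-solve structure described above can be sketched as a single prompt template. The wording below is a paraphrase written for illustration, not the prompt from the paper, and `analogical_prompt` with its parameters is a hypothetical helper.

```python
def analogical_prompt(problem: str, num_exemplars: int = 3, with_knowledge: bool = False) -> str:
    """Build one prompt that asks the model to self-generate exemplars first."""
    steps = []
    if with_knowledge:
        # Optional: self-generated high-level knowledge before the exemplars.
        steps.append(
            "## Tutorial\n"
            "Identify the core concepts or algorithms needed and write a short tutorial on them."
        )
    steps.append(
        "## Relevant problems\n"
        f"Recall {num_exemplars} relevant and distinct problems. "
        "For each one, describe the problem and explain its solution."
    )
    steps.append(
        "## Solve the initial problem\n"
        "Using the material above, solve the initial problem step by step."
    )
    return f"# Problem\n{problem}\n\n# Instructions\n" + "\n\n".join(steps)


print(analogical_prompt("Find the longest increasing subsequence of an array.", with_knowledge=True))
```

No labeled exemplars are passed in: the only input is the test problem, and everything else is generated by the model in one pass.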
Which instructions work for CoT generation? [5]
LLMs for prompt engineering [7]:
LLM as the optimizer to iteratively improve the prompt [8], in addition to proposing instructions and performing search and mutation
core idea: instruct the LLM to leverage the past optimization trajectory, represented as sorted (solution, score) pairs; remove older pairs if the context is not long enough
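A rough sketch of how such a meta-prompt could be assembled from the trajectory. The function name, the `max_pairs` cap (a simple stand-in for a context-length limit), the prompt wording, and the scores are all made up for illustration.

```python
def build_meta_prompt(task_description, trajectory, max_pairs=3):
    """trajectory: list of (instruction, score) in the order they were tried."""
    # Drop the oldest pairs when the trajectory no longer fits the budget.
    kept = trajectory[-max_pairs:]
    # Show the remaining pairs sorted by score, best last.
    ranked = sorted(kept, key=lambda pair: pair[1])
    history = "\n\n".join(f"text: {text}\nscore: {score}" for text, score in ranked)
    return (
        f"{task_description}\n\n"
        "Previous instructions with their scores, sorted from worst to best:\n\n"
        f"{history}\n\n"
        "Write a new instruction that differs from the ones above and scores higher."
    )


# Made-up trajectory; the scores are not real benchmark results.
trajectory = [
    ("Answer the question.", 41),
    ("Solve the problem carefully.", 55),
    ("Let's think step by step.", 68),
    ("Break the problem into steps and check each one.", 63),
]
meta_prompt = build_meta_prompt("Improve the instruction for grade-school math problems.", trajectory)
print(meta_prompt)
```

In an optimization loop, the LLM's proposed instruction would be scored on a held-out set and appended to `trajectory` before the next round.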
Chain-of-thought prompting: variable computation of the thought process adapting to tasks of different difficulty levels; more complex problems → more reasoning steps
Reasoning strategies enabled by CoT: decomposition, planning
We can explicitly instruct the LLM with the desired reasoning strategies for problem solving
The best practice to interact with LLMs evolves over time
Idea: scale the inference-time compute by increasing the width and sampling multiple branches in the solution space
What is missing so far?
Self-consistency [13]: select the response with the most consistent final answer; the selection is based only on the final answer, so the reasoning paths do not need to match across sampled responses
Self-consistency performance scales much better than probability-based ranking (sample-and-rank baseline: select the response with the highest log probability), unless the model is trained to be a good verifier
Self-consistency scales with more samples; the sampling method needs to ensure response diversity, e.g., by using a high temperature, nucleus sampling, etc.
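The selection step of self-consistency amounts to a majority vote over extracted final answers. The sketch below assumes the responses have already been sampled (for example with a non-zero temperature) and that each one ends with "The answer is N"; the regular expression and the sample responses are illustrative assumptions.

```python
import re
from collections import Counter


def extract_answer(response):
    """Pull the final numeric answer out of a response, or None if absent."""
    match = re.search(r"The answer is\s*(-?\d+(?:\.\d+)?)", response)
    return match.group(1) if match else None


def self_consistency(responses):
    """Return the most frequent final answer; reasoning paths are ignored."""
    answers = [a for a in map(extract_answer, responses) if a is not None]
    if not answers:
        return None
    return Counter(answers).most_common(1)[0][0]


# Hypothetical sampled responses with different reasoning paths.
samples = [
    "She has 16 - 3 - 4 = 9 eggs left, so she earns 9 * 2 = 18. The answer is 18.",
    "16 - 7 = 9 eggs remain; 9 eggs at $2 each is $18. The answer is 18.",
    "She sells 16 - 3 = 13 eggs for 13 * 2 = 26. The answer is 26.",
]
print(self_consistency(samples))
```

The first two samples reason differently but agree on the final answer, so they outvote the third.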
Example: consistency-based code selection in AlphaCode [14]: a stage of filtering & clustering to select a small set of candidates from a large set of potential solutions
Competitive programming problem:
Clustering by execution on generated inputs: group programs by their execution results, then select code based on the consistency of those results
Results on Codeforces: 1) clustering provides additional performance gain over filtering only; 2) still a gap from the oracle selection
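A simplified sketch of the clustering step only (the filtering on example tests is omitted): candidate programs are grouped by their outputs on generated inputs, and one program is taken from each of the largest clusters. Candidates are modeled here as Python callables with hypothetical names; this is not AlphaCode's actual implementation.

```python
from collections import defaultdict


def cluster_by_behavior(programs, test_inputs):
    """Group program names by the tuple of outputs they produce."""
    clusters = defaultdict(list)
    for name, fn in programs.items():
        outputs = []
        for x in test_inputs:
            try:
                outputs.append(repr(fn(x)))
            except Exception as exc:
                # A crash is treated as just another observable behavior.
                outputs.append(f"error:{type(exc).__name__}")
        clusters[tuple(outputs)].append(name)
    # Largest clusters first; ties keep insertion order.
    return sorted(clusters.values(), key=len, reverse=True)


def select_candidates(programs, test_inputs, k):
    """Pick one representative from each of the k largest clusters."""
    return [cluster[0] for cluster in cluster_by_behavior(programs, test_inputs)[:k]]


# Hypothetical candidates for "return the square of x".
candidates = {
    "a": lambda x: x * x,
    "b": lambda x: x ** 2,
    "c": lambda x: abs(x) * abs(x),
    "d": lambda x: 2 * x,
    "e": lambda x: x * abs(x),
}
print(select_candidates(candidates, test_inputs=[-2, 0, 3], k=2))
```

Programs `a`, `b`, and `c` behave identically on the inputs and form the largest cluster, so only one of them consumes a submission slot.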
Universal self-consistency [15]: ask the LLM itself to perform consistency-based selection, instead of relying on an answer-extraction step
Intuition: it will be hard for the model to judge the answer correctness by itself, but consistency should be a simpler criterion to measure
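A minimal sketch of the selection prompt and the parsing of the model's choice. The prompt wording is a paraphrase for illustration, and `usc_prompt` and `parse_selection` are hypothetical helpers; the actual model call is left out.

```python
import re


def usc_prompt(question, responses):
    """Concatenate all sampled responses and ask for the most consistent one."""
    listed = "\n\n".join(f"Response {i}: {r}" for i, r in enumerate(responses, start=1))
    return (
        f"Question: {question}\n\n"
        f"I have generated the following responses to the question:\n\n{listed}\n\n"
        "Evaluate these responses. Select the most consistent response based on "
        "majority consensus. Start your answer with "
        '"The most consistent response is Response X".'
    )


def parse_selection(reply, num_responses):
    """Return the 0-based index of the chosen response, or None if unparseable."""
    match = re.search(r"Response (\d+)", reply)
    if match and 1 <= int(match.group(1)) <= num_responses:
        return int(match.group(1)) - 1
    return None


prompt = usc_prompt("Summarize the article.", ["Summary A", "Summary B", "Summary A again"])
print(parse_selection("The most consistent response is Response 2", num_responses=3))
```

Because no answer extraction is needed, the same prompt works for free-form outputs such as summaries or code, where exact-match voting does not apply.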
Improve further over consistency-based selection [16, 17]: train LLMs to be the ranker, and hope this ranker can perform better than the simple consistency criterion
Two types of LLM-based verifiers / reward models:
(Strong) LLM-based verifiers outperform consistency-based selection; PRM scales better with more samples compared to ORM and majority voting baseline
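Best-of-N selection with the two verifier types can be sketched as follows. An outcome reward model (ORM) assigns one score to the whole response, while a process reward model (PRM) scores every step; aggregating step scores by their product is one common choice and is an assumption here. All scores are made-up placeholders for verifier outputs.

```python
import math


def prm_solution_score(step_scores):
    """Aggregate per-step PRM scores into one solution score (product)."""
    return math.prod(step_scores)


def best_of_n(candidates, score_fn):
    """Return the candidate with the highest verifier score."""
    return max(candidates, key=score_fn)


# Hypothetical candidates with placeholder verifier outputs.
candidates = [
    {"answer": "12", "orm_score": 0.7, "step_scores": [0.9, 0.9, 0.3]},
    {"answer": "15", "orm_score": 0.6, "step_scores": [0.9, 0.8, 0.9]},
]

orm_choice = best_of_n(candidates, lambda c: c["orm_score"])
prm_choice = best_of_n(candidates, lambda c: prm_solution_score(c["step_scores"]))
print(orm_choice["answer"], prm_choice["answer"])
```

In this toy case the PRM penalizes the first candidate for one weak step, so the two verifiers pick different responses.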
So far, response selection happens only after a full response has been generated; this does not fully utilize a step-wise scorer
LLM + Tree Search [18]: prioritize the exploration of more promising partial solutions
example: game of 24, at each step
Voting-based state evaluation: LLM votes multiple times, then selects the majority vote as the final choice
The original paper uses basic BFS / DFS algorithms; more advanced search algorithms (e.g., Monte-Carlo Tree Search) can be integrated
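The propose, evaluate, and prune loop of a BFS-style tree search can be sketched on the game of 24. In the paper both proposing and evaluating are done by the LLM; here `expand` enumerates arithmetic moves and `evaluate` is a brute-force stand-in for the LLM's voted state evaluation, so this only illustrates the search skeleton. All function names and the beam width are assumptions.

```python
from fractions import Fraction
from itertools import combinations


def expand(state):
    """Propose successors by combining two remaining numbers with one operation."""
    nums, steps = state
    children = []
    for i, j in combinations(range(len(nums)), 2):
        a, b = nums[i], nums[j]
        rest = [n for k, n in enumerate(nums) if k not in (i, j)]
        options = [(a + b, f"{a}+{b}"), (a * b, f"{a}*{b}"),
                   (a - b, f"{a}-{b}"), (b - a, f"{b}-{a}")]
        if b != 0:
            options.append((a / b, f"{a}/{b}"))
        if a != 0:
            options.append((b / a, f"{b}/{a}"))
        for value, text in options:
            children.append((rest + [value], steps + [f"{text}={value}"]))
    return children


def reachable(nums):
    """Exhaustively check whether 24 can still be reached from these numbers."""
    if len(nums) == 1:
        return nums[0] == 24
    return any(reachable(child) for child, _ in expand((nums, [])))


def evaluate(state):
    # Stand-in for the LLM evaluator: 1.0 if promising, 0.0 otherwise.
    return 1.0 if reachable(state[0]) else 0.0


def tree_search_bfs(numbers, beam_width=5):
    frontier = [([Fraction(n) for n in numbers], [])]
    while frontier and len(frontier[0][0]) > 1:
        candidates = [child for state in frontier for child in expand(state)]
        candidates.sort(key=evaluate, reverse=True)  # most promising first
        frontier = candidates[:beam_width]           # prune the rest
    for nums, steps in frontier:
        if nums[0] == 24:
            return steps
    return None


print(tree_search_bfs([4, 9, 10, 13]))
```

Swapping `evaluate` for a learned or prompted scorer, and the beam loop for DFS or MCTS, changes the search strategy without touching `expand`.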
Idea: increase the depth to reach the final solution
Even the best LLMs still make (sometimes obvious) mistakes; on the other hand, humans also tend to make (sometimes trivial) mistakes at first thought
Sampling multiple solutions can reduce mistakes from a single prediction, but it is still suboptimal because there is no feedback loop to correct the mistakes after a complete solution is generated
Inference-time self-improvement: LLM iteratively improves its own response for the given task, which aligns better with the human error-correction process
Reflexion [19] and self-refine [20]: two steps after generating each solution
Reflexion improves on tasks with effective evaluation heuristics (e.g., ALFWorld)
External evaluation gives the answer correctness at each reflection step (e.g., HotPotQA)
Self-debugging [21] is a natural workflow for code generation
Code execution provides natural external feedback: humans often debug better with an IDE
| Feedback formats | Details |
|---|---|
| Simple | a short universal feedback for all wrong code |
| Unit test feedback | include the execution results |
| Code explanation | line-by-line explanation of the implementation |
| Trace | line-by-line simulation of the execution trace |
Self-debugging consistently boosts the performance across different LLMs; more informative feedback further improves the debugging performance
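A minimal sketch of a self-debugging loop with unit test feedback: generate code, execute it against the tests, and feed the failure messages back into the next generation. The `generate` callable stands in for the LLM, and `fake_model` is a hypothetical stub that fixes its bug once it sees any feedback; running model-written code with `exec` would need sandboxing in practice.

```python
def run_unit_tests(code, tests, func_name):
    """Execute the code and return a list of human-readable failure messages."""
    namespace = {}
    try:
        exec(code, namespace)
        fn = namespace[func_name]
    except Exception as exc:
        return [f"code failed to load: {exc!r}"]
    failures = []
    for args, expected in tests:
        try:
            got = fn(*args)
        except Exception as exc:
            failures.append(f"{func_name}{args} raised {exc!r}")
            continue
        if got != expected:
            failures.append(f"{func_name}{args} returned {got!r}, expected {expected!r}")
    return failures


def self_debug(generate, tests, func_name, max_turns=3):
    """Iterate generate -> execute -> feedback until the tests pass."""
    feedback, code = None, ""
    for _ in range(max_turns):
        code = generate(feedback)
        failures = run_unit_tests(code, tests, func_name)
        if not failures:
            return code, True
        feedback = "\n".join(failures)  # unit test feedback for the next turn
    return code, False


def fake_model(feedback):
    """Stand-in for an LLM: buggy first attempt, corrected after feedback."""
    if feedback is None:
        return "def add(a, b):\n    return a - b\n"
    return "def add(a, b):\n    return a + b\n"


TESTS = [((1, 2), 3), ((0, 0), 0)]
final_code, passed = self_debug(fake_model, TESTS, "add")
print(passed)
```

The other feedback formats in the table would replace the `feedback` string, for example with a model-written explanation or a simulated execution trace.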
How does self-correction work for QA-style reasoning tasks?
Self-correction without oracle feedback [22] hurts the reasoning performance
General-purpose feedback prompt variants do not improve the performance: editing the feedback prompt affects the self-correction behavior (tendency to keep the initial response), but none of the variants significantly improves over the initial performance
Multi-agent debate [23] does not improve over self-consistency
Putting everything together: how to balance the inference budget for generating multiple samples, in parallel or sequentially?
This is mostly a model-specific and task-specific empirical question, depending on the model’s self-reflection and correction abilities
Overall conclusion from research [24]:
Model size is another factor for optimizing inference cost [25]:
The best practice to interact with an LLM should be adapted according to its capabilities
“One thing that should be learned from the bitter lesson is the great power of general purpose methods, of methods that continue to scale with increased computation even as the available computation becomes very great”
“We want AI agents that can discover like we can, not which contain what we have discovered. Building in our discoveries only makes it harder to see how the discovering process can be done”
– Professor Richard Sutton, The Bitter Lesson
[1] OpenAI Research: Learning to reason with LLMs
[2] Blog post: OpenAI o3 Breakthrough
[3] Jason Wei et al. Chain-of-Thought Prompting Elicits Reasoning in Large Language Models. arXiv:2201.11903 [cs.CL]. 2022.
[4] Maxwell Nye et al. Show Your Work: Scratchpads for Intermediate Computation with Language Models. arXiv:2112.00114 [cs.LG]. 2021.
[5] Takeshi Kojima et al. Large Language Models are Zero-Shot Reasoners. arXiv:2205.11916 [cs.CL]. 2022.
[6] Michihiro Yasunaga et al. Large Language Models as Analogical Reasoners. arXiv:2310.01714 [cs.LG]. 2023.
[7] Yongchao Zhou et al. Large Language Models Are Human-Level Prompt Engineers. arXiv:2211.01910 [cs.LG]. 2022.
[8] Chengrun Yang et al. Large Language Models as Optimizers. arXiv:2309.03409 [cs.LG]. 2023.
[9] Denny Zhou et al. Least-to-Most Prompting Enables Complex Reasoning in Large Language Models. arXiv:2205.10625 [cs.AI]. 2022.
[10] Andrew Drozdov et al. Compositional Semantic Parsing with Large Language Models. arXiv:2209.15003 [cs.CL]. 2022.
[11] Daniel Keysers et al. Measuring Compositional Generalization: A Comprehensive Method on Realistic Data. arXiv:1912.09713 [cs.LG]. 2019.
[12] Pei Zhou et al. Self-Discover: Large Language Models Self-Compose Reasoning Structures. arXiv:2402.03620 [cs.AI]. 2024.
[13] Xuezhi Wang et al. Self-Consistency Improves Chain of Thought Reasoning in Language Models. arXiv:2203.11171 [cs.CL]. 2022.
[14] Yujia Li et al. Competition-Level Code Generation with AlphaCode. arXiv:2203.07814 [cs.PL]. 2022.
[15] Xinyun Chen et al. Universal Self-Consistency for Large Language Model Generation. arXiv:2311.17311 [cs.CL]. 2023.
[16] Karl Cobbe et al. Training Verifiers to Solve Math Word Problems. arXiv:2110.14168 [cs.LG]. 2021.
[17] Hunter Lightman et al. Let’s Verify Step by Step. arXiv:2305.20050 [cs.LG]. 2023.
[18] Shunyu Yao et al. Tree of Thoughts: Deliberate Problem Solving with Large Language Models. arXiv:2305.10601 [cs.CL]. 2023.
[19] Noah Shinn et al. Reflexion: Language Agents with Verbal Reinforcement Learning. arXiv:2303.11366 [cs.AI]. 2023.
[20] Aman Madaan et al. Self-Refine: Iterative Refinement with Self-Feedback. arXiv:2303.17651 [cs.CL]. 2023.
[21] Xinyun Chen et al. Teaching Large Language Models to Self-Debug. arXiv:2304.05128 [cs.CL]. 2023.
[22] Jie Huang et al. Large Language Models Cannot Self-Correct Reasoning Yet. arXiv:2310.01798 [cs.CL]. 2023.
[23] Yilun Du et al. Improving Factuality and Reasoning in Language Models through Multiagent Debate. arXiv:2305.14325 [cs.CL]. 2023.
[24] Charlie Snell et al. Scaling LLM Test-Time Compute Optimally can be More Effective than Scaling Model Parameters. arXiv:2408.03314 [cs.LG]. 2024.
[25] Yangzhen Wu et al. Inference Scaling Laws: An Empirical Analysis of Compute-Optimal Inference for Problem-Solving with Language Models. arXiv:2408.00724 [cs.AI]. 2024.