ucb_agentic_ai

Lecture 01: Inference-Time Techniques for LLM Reasoning

Link to lecture recording on YouTube

Date: 2025-01-27

Speaker: Xinyun Chen 陈昕昀

Speaker’s social profile: Website / Google Scholar / GitHub / LinkedIn / X (Twitter)

Education:

Ph.D. in Computer Science, 2017-2022, University of California, Berkeley, advised by Prof. Dawn Song
B.S. (ACM Honored Class), 2013-2017, Shanghai Jiao Tong University

Work:

Staff Research Scientist, Google DeepMind

Overview of the Course

Large Language Model (LLM) Agent Framework

the brain of the agent is LLM, which performs reasoning and planning to take action
action to interact with the environment
receive feedback and revise its internal memory so it can better plan for the next step
solving real-world tasks typically involves a trial-and-error process
utilize tools and external information retrieval

Agentic workflow facilitates complex tasks:

task decomposition
allocation of subtasks to specialized modules
division of labor for project collaboration
multi-agent generation inspires better responses

LLM agent applications: code generation, computer use, personal assistants, robotics, etc.
domains: education, legal, finance, healthcare, cybersecurity etc.

Rapid progress of reasoning models since Sep 2024; great improvement between versions of the same model; most impressive performance on competitive math and coding

Topics covered in this course:

fundamental reasoning techniques: inference-time scaling, training techniques, search and planning
LLMs for software engineering: code generation and verification, web applications
LLMs for mathematics: fundamental training techniques, autoformalization and theorem proving
agentic workflow design, real-world applications
safety and ethics

Notes

Highlights of LLMs in 2024: the advancement of reasoning models¹

Existing models before OpenAI o1 achieved <25% accuracy if no special inference-time techniques were used

OpenAI o3 model achieves²

performance matching average human annotators (Mechanical Turk workers) with a budget of ~$20 per task
87.5% accuracy on ARC-AGI with ~$1000 inference time cost

What is different about the o1 model?

LLMs before o1 directly generate a response when a user query is entered
the o1 interface includes a thought process; it takes different amounts of time depending on the difficulty of the task
demo: Gemini 2.0 Thinking model is able to figure out changing 9 to 6 by rotating the ball, in order to get the sum of 30

Shared core idea of the breakthrough in reasoning models:
need some way to trigger the large language model to generate long chain-of-thought (CoT) before it concludes with the final solution

few-shot CoT prompting³
instruction prompting
instruction tuning
reinforcement learning

Focus of this lecture: inference-time techniques for scaling token budget

Introduction to Basic Prompting Techniques

Idea: use more token budget to generate a single solution

Before the advancement in post-training techniques, standard prompting performance is poor on reasoning benchmarks
Issue: standard few-shot exemplars only provide information on the final solution format, but not the rationale to derive the solution

CoT performance improves more significantly with the increase of model size
Better models benefit more from CoT generation; a drastic improvement on reasoning performance when the model reaches a certain scale
These experiments (from papers in 2022) used pre-trained only LLMs; with the recent post-training techniques, they may show different scaling curves but the main conclusions still hold³⸴⁴
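The difference between standard few-shot prompting and few-shot CoT prompting comes down to whether each exemplar carries its rationale. A minimal sketch follows; the `Exemplar` class, the `few_shot_prompt` helper, and the sample questions are hypothetical names and data chosen for illustration, not the exact prompts used in the cited papers.

```python
from dataclasses import dataclass

@dataclass
class Exemplar:
    question: str
    rationale: str  # intermediate reasoning that leads to the answer
    answer: str

def few_shot_prompt(exemplars, test_question, with_rationale):
    """Format exemplars with or without rationales, then append the test question."""
    blocks = []
    for ex in exemplars:
        reasoning = f"{ex.rationale} " if with_rationale else ""
        blocks.append(f"Q: {ex.question}\nA: {reasoning}The answer is {ex.answer}.")
    blocks.append(f"Q: {test_question}\nA:")
    return "\n\n".join(blocks)

# Illustrative exemplar (assumed data, in the style of grade-school math problems).
demo = Exemplar(
    question="Roger has 5 tennis balls. He buys 2 more cans of tennis balls. "
             "Each can has 3 tennis balls. How many tennis balls does he have now?",
    rationale="Roger started with 5 balls. 2 cans of 3 tennis balls each is "
              "6 tennis balls. 5 + 6 = 11.",
    answer="11",
)
test_q = "A baker has 23 rolls, sells 20, and bakes 6 more. How many rolls are there now?"

standard = few_shot_prompt([demo], test_q, with_rationale=False)  # answer format only
cot = few_shot_prompt([demo], test_q, with_rationale=True)        # rationale + answer
print(cot)
```

The standard prompt shows the model only the final answer format, while the CoT prompt also shows how the answer was derived.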

Zero-shot CoT: elicit CoT generation with an instruction
“Let’s think step by step” triggers CoT generation without exemplars
Zero-shot CoT significantly outperforms standard zero-shot prompting, especially on harder tasks (e.g., math, symbolic reasoning)⁵
Issue: zero-shot CoT performance is still worse than few-shot CoT
Question: how to improve CoT performance without manually labeling exemplars?
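Zero-shot CoT needs no exemplars, only the trigger instruction. A minimal sketch, assuming a simple "Q: / A:" layout (the `build_prompt` helper and the layout are illustrative choices):

```python
def build_prompt(question: str, use_cot: bool = True) -> str:
    """Build a zero-shot prompt; optionally append the CoT trigger phrase."""
    prompt = f"Q: {question}\nA:"
    if use_cot:
        # The instruction alone elicits step-by-step reasoning.
        prompt += " Let's think step by step."
    return prompt

print(build_prompt("If I have 3 apples and eat one, how many are left?"))
```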

Analogical prompting⁶: prompt the LLM to first recall relevant exemplars, before solving the test problem
Benefit:

exemplars are self-generated by LLMs, no manual labeling
exemplars are tailored to individual problems: model generates different exemplars which are more relevant to the topics

Motivation from human analogical reasoning: humans are not explicitly given demonstrations every time for a new task; instead, humans intrinsically recall from past relevant experience

“We look for a formerly solved problem which is linked to our present one by GENERALIZATION, SPECIALIZATION or ANALOGY”

– Professor George Polya, How to Solve It

Besides exemplars, the LLM can also self-generate high-level knowledge, which complements the problems with broader insights
e.g., for a coding problem, first instruct the model to self-generate some knowledge (higher level tutorial about the algorithms needed) → model to identify the main algorithms and come up with the examples with corresponding code → model goes back to solve initial problem

Analogical prompting not only outperforms zero-shot CoT, but also outperforms manual few-shot CoT
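The recall-then-solve structure of analogical prompting can be sketched as a prompt template. The wording below is a hypothetical paraphrase, not the exact prompt from the paper; `analogical_prompt` and its parameters are assumed names.

```python
def analogical_prompt(problem: str, num_exemplars: int = 3,
                      with_knowledge: bool = False) -> str:
    """Ask the model to self-generate exemplars (and optionally knowledge) first."""
    sections = [f"# Problem\n{problem}", "# Instructions"]
    if with_knowledge:
        # Optional high-level knowledge step, e.g. a tutorial on the algorithms needed.
        sections.append("## Tutorial\nIdentify the core concepts or algorithms "
                        "needed and write a short tutorial on them.")
    sections.append(f"## Relevant problems\nRecall {num_exemplars} relevant and "
                    "distinct problems. For each, describe the problem and "
                    "explain its solution.")
    sections.append("## Solve the initial problem\nNow solve the problem stated at the top.")
    return "\n\n".join(sections)

print(analogical_prompt("Find the longest increasing subsequence of an array.",
                        with_knowledge=True))
```

Everything happens in a single model call: the self-generated exemplars sit in the context by the time the model reaches the final section.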

Which instructions work for CoT generation?⁵

current LLMs are sensitive to prompt design
there is no clear principle of how to write optimal prompts
question: how to reduce the manual work for writing prompts?

LLMs for prompt engineering⁷:

high-level idea: let LLM propose prompts, and hope it will perform even better than human written prompts
proposal generation: leverage the LLM to generate initial instructions given the task description
score each instruction based on the prediction correctness on a small validation set (~100 problems)
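The propose-and-score steps above can be sketched as follows. The `llm` argument is any callable from prompt to text; `toy_llm` is a hypothetical stand-in for a real model, and exact-match scoring is an assumption about the task.

```python
def score_instruction(instruction, validation_set, llm):
    """Fraction of validation problems answered correctly under this instruction."""
    correct = 0
    for question, gold in validation_set:
        prediction = llm(f"{instruction}\nQ: {question}\nA:")
        correct += prediction.strip() == gold
    return correct / len(validation_set)

def select_best_instruction(candidates, validation_set, llm):
    """Return (score, instruction) for the highest-scoring candidate."""
    scored = [(score_instruction(c, validation_set, llm), c) for c in candidates]
    return max(scored, key=lambda pair: pair[0])

def toy_llm(prompt):
    # Stand-in model: adds correctly only when told to reason step by step.
    a, b = [int(tok) for tok in prompt.split("Q: ")[1].split("\n")[0].split("+")]
    return str(a + b) if "step by step" in prompt else str(a)

validation = [("1+2", "3"), ("4+5", "9"), ("0+7", "7")]
candidates = ["Answer the question.", "Let's work this out step by step."]
best_score, best_instruction = select_best_instruction(candidates, validation, toy_llm)
print(best_score, best_instruction)
```

In practice the candidate list would itself come from an LLM prompted with the task description, and the validation set would hold on the order of 100 problems.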

LLM as the optimizer to iteratively improve the prompt⁸, in addition to proposing instructions and performing search and mutation
core idea: instruct the LLM to leverage the past optimization trajectory, represented as sorted (solution, score) pairs; remove older ones if the context is not long enough

two LLMs, one as optimizer and the other as evaluator
- optimizer: responsible for proposing a new instruction
- evaluator: evaluates the accuracy or performance of a given instruction
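A minimal sketch of the optimizer/evaluator loop, assuming both are plain callables. The meta-prompt wording, the function names, and the toy stand-ins are hypothetical; the truncation rule (drop the oldest pairs first) follows the note above, and other rules are possible.

```python
def build_meta_prompt(task, trajectory, max_pairs=20):
    """Show past (instruction, score) pairs, sorted so the best appears last."""
    recent = trajectory[-max_pairs:]                    # drop the oldest pairs first
    ordered = sorted(recent, key=lambda pair: pair[1])  # ascending by score
    history = "\n".join(f"text: {ins}\nscore: {score:.2f}" for ins, score in ordered)
    return (f"Task: {task}\n\n"
            "Previous instructions with their scores, sorted from worst to best:\n"
            f"{history}\n\n"
            "Write a new instruction that differs from the ones above and scores higher.")

def optimize(task, seed, optimizer, evaluator, steps=3, max_pairs=20):
    """Alternate between proposing (optimizer) and scoring (evaluator)."""
    trajectory = [(seed, evaluator(seed))]
    for _ in range(steps):
        candidate = optimizer(build_meta_prompt(task, trajectory, max_pairs))
        trajectory.append((candidate, evaluator(candidate)))
    return max(trajectory, key=lambda pair: pair[1])

# Toy stand-ins: a scripted optimizer and a word-count "evaluator".
proposals = iter(["Think carefully.",
                  "Let's think step by step.",
                  "Break the problem into parts and solve each part in order."])
toy_optimizer = lambda meta_prompt: next(proposals)
toy_evaluator = lambda instruction: min(1.0, len(instruction.split()) / 10)

best = optimize("grade-school math", "Solve.", toy_optimizer, toy_evaluator)
print(best)
```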

Chain-of-thought prompting: variable computation of the thought process adapting to tasks of different difficulty levels; more complex problems → more reasoning steps
Reasoning strategies enabled by CoT: decomposition, planning

We can explicitly instruct the LLM with the desired reasoning strategies for problem solving

Least-to-most prompting⁹: easy-to-hard generalization by explicitly telling the model what is the best practice to decompose the original problem
dynamic least-to-most prompting¹⁰⸴¹¹: dynamic selection of exemplars for each subproblem
self-discover¹²: instruct the LLM to compose task-specific reasoning structures without manual labeling of exemplars for every single problem
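Least-to-most prompting can be sketched as two stages: decompose the problem, then solve the subproblems in order while feeding earlier answers back into the context. `decompose` and `solve` stand for LLM calls; the toy versions and the sample problem below are hypothetical.

```python
def least_to_most(problem, decompose, solve):
    """Solve subproblems from easiest to hardest, accumulating answers in the context."""
    context = problem
    answer = None
    for sub in decompose(problem):
        answer = solve(f"{context}\nQ: {sub}\nA:")
        context = f"{context}\nQ: {sub}\nA: {answer}"  # later steps see earlier answers
    return answer, context

# Toy stand-ins for the two LLM calls.
problem = ("It takes Amy 4 minutes to climb to the top of a slide and 1 minute "
           "to slide down. The slide closes in 15 minutes.")
subquestions = ["How long does each trip take?",
                "How many times can she slide before it closes?"]
canned = {subquestions[0]: "4 + 1 = 5 minutes.",
          subquestions[1]: "15 / 5 = 3 times."}

def toy_decompose(_problem):
    return subquestions

def toy_solve(prompt):
    last_question = prompt.rsplit("Q: ", 1)[1].split("\n")[0]
    return canned[last_question]

answer, context = least_to_most(problem, toy_decompose, toy_solve)
print(context)
```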

The best practice to interact with LLMs evolves over time

Search and Selection from Multiple Candidates

Idea: scale the inference-time compute by increasing the width and sampling multiple branches in the solution space

What is missing so far?

should not limit the LLM to generate only one solution per problem
exploring multiple branches allows the LLM to recover from mistakes in a single generation
- generate multiple candidate solutions per problem
- generate multiple potential next reasoning steps given the current (partial) thought
challenge: how to select the best response from multiple candidates?
- in most cases, we do not have an oracle scorer at inference time

Self-consistency¹³: select the response with the most consistent final answer; the selection is only based on the final answer, the reasoning paths do not need to be the same across different sampled responses

Self-consistency performance scales much better than probability-based ranking (sample-and-rank baseline: select the response with the highest log probability), unless the model is trained to be a good verifier

Self-consistency using sampling scales with more samples; the sampling method needs to ensure the response diversity, e.g., using a high temperature, nucleus sampling etc.

beam search: keep top k paths with the highest probabilities in the decoding process
ensemble baselines: apply greedy decoding for all prompt variants of a problem
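Self-consistency as described above reduces to a majority vote over extracted final answers. A minimal sketch; the answer-extraction regex and the sample responses are assumptions for illustration.

```python
import re
from collections import Counter

def extract_answer(response):
    """Pull the final numeric answer out of a response, or None if absent."""
    match = re.search(r"answer is\s*(-?\d+(?:\.\d+)?)", response)
    return match.group(1) if match else None

def self_consistency(responses):
    """Return the most common final answer; reasoning paths may differ."""
    answers = [a for a in map(extract_answer, responses) if a is not None]
    if not answers:
        return None
    return Counter(answers).most_common(1)[0][0]

# Sampled responses (e.g. drawn at a high temperature); two different paths agree.
samples = [
    "She has 16 - 3 - 4 = 9 eggs left and sells them for $2 each, so the answer is 18.",
    "3 + 4 = 7 eggs are used. 16 - 7 = 9. 9 * 2 = 18. The answer is 18.",
    "She sells 16 - 3 = 13 eggs, so the answer is 26.",
]
print(self_consistency(samples))
```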

Example: consistency-based code selection in AlphaCode¹⁴: a filtering and clustering stage selects a small set of candidates from a large set of potential solutions

Competitive programming problem:

long and complicated text description
a few input-output pairs as test cases
code needs to pass both given and held-out test cases

Clustering by execution on generated inputs: select code based on the consistency of execution results

train a model to generate new test inputs for these problems
execute sampled programs on all generated inputs
cluster all programs with the same outputs together, assuming all programs in the same cluster are semantically equivalent if the generated inputs are diverse and of high quality
sample one program from each of the 10 largest clusters
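The clustering steps above can be sketched as follows. This is a simplified illustration, not AlphaCode's implementation: programs are plain Python callables, the test inputs are given rather than model-generated, and the first member of each cluster is taken instead of a random sample.

```python
from collections import defaultdict

def run(program, x):
    """Execute one program on one input; errors become part of the behavior."""
    try:
        return repr(program(x))
    except Exception as exc:
        return f"error:{type(exc).__name__}"

def cluster_by_behavior(programs, test_inputs):
    """Group programs whose outputs agree on every test input, largest group first."""
    clusters = defaultdict(list)
    for program in programs:
        signature = tuple(run(program, x) for x in test_inputs)
        clusters[signature].append(program)
    return sorted(clusters.values(), key=len, reverse=True)

def pick_candidates(programs, test_inputs, k=10):
    """Take one program from each of the k largest clusters."""
    return [cluster[0] for cluster in cluster_by_behavior(programs, test_inputs)[:k]]

# Three ways to double a number and one incorrect variant.
programs = [lambda x: x * 2, lambda x: x + x, lambda x: x ** 2, lambda x: 2 * x]
inputs = [0, 1, 2, 3]
clusters = cluster_by_behavior(programs, inputs)
print([len(c) for c in clusters])
```

Note that `x ** 2` agrees with doubling on inputs 0 and 2, which is why the generated inputs need to be diverse for cluster membership to approximate semantic equivalence.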

Results on Codeforces: 1) clustering provides additional performance gain over filtering only; 2) still a gap from the oracle selection

Universal self-consistency¹⁵: ask the LLM to perform consistency-based selection instead of having the answer extraction process

improve the performance over the baseline on open-ended generation (summarization, QA), where the original self-consistency is not directly applicable
match self-consistency performance on math reasoning and coding; does not require answer extraction and code execution
performance is bounded by the long-context capability

Intuition: it will be hard for the model to judge the answer correctness by itself, but consistency should be a simpler criterion to measure

give an instruction to the model and ask it to select the most consistent response based on the majority consensus
it also needs to take a look at all the candidate responses
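A minimal sketch of universal self-consistency: all candidates go into one prompt and the model is asked to pick the most consistent one. The prompt wording is a hypothetical paraphrase, and `usc_prompt` and `parse_selection` are assumed helper names.

```python
import re

def usc_prompt(question, responses):
    """Concatenate all candidate responses and ask for the most consistent one."""
    numbered = "\n\n".join(f"Response {i}: {r}" for i, r in enumerate(responses, 1))
    return (f"I have generated the following responses to the question: {question}\n\n"
            f"{numbered}\n\n"
            "Evaluate these responses. Select the most consistent response based on "
            "majority consensus. Start your answer with "
            "\"The most consistent response is Response X\".")

def parse_selection(reply, num_responses):
    """Map the model's reply to a 0-based index; fall back to the first response."""
    match = re.search(r"Response (\d+)", reply)
    if match and 1 <= int(match.group(1)) <= num_responses:
        return int(match.group(1)) - 1
    return 0

prompt = usc_prompt("Summarize the article.", ["Summary A", "Summary B", "Summary A'"])
print(prompt)
```

Because every candidate must fit in a single prompt, the number of samples is limited by the context window, which is why performance is bounded by long-context capability.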

Improve further over consistency-based selection¹⁶⸴¹⁷: train LLMs to be the ranker, and hope this ranker can perform better than the simple consistency criterion
Two types of LLM-based verifiers / reward models:

Outcome-supervised Reward Model (ORM): verify at the solution level
Process-supervised Reward Model (PRM): verify at the step level for each solution

(Strong) LLM-based verifiers outperform consistency-based selection; PRM scales better with more samples compared to ORM and majority voting baseline
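The two verifier types differ in what they score during best-of-N selection. In this sketch a solution is a list of step strings; the scoring functions are toy stand-ins for trained reward models, and aggregating step scores by product (or minimum) is one common choice, assumed here rather than taken from the lecture.

```python
import math

def best_of_n_orm(solutions, orm_score):
    """ORM: one score per complete solution."""
    return max(solutions, key=orm_score)

def prm_solution_score(steps, prm_step_score, aggregate="prod"):
    """PRM: score each step given its prefix, then aggregate into a solution score."""
    scores = [prm_step_score(steps[:i + 1]) for i in range(len(steps))]
    if not scores:
        return 0.0
    return math.prod(scores) if aggregate == "prod" else min(scores)

def best_of_n_prm(solutions, prm_step_score, aggregate="prod"):
    return max(solutions, key=lambda s: prm_solution_score(s, prm_step_score, aggregate))

# Toy reward models (hypothetical): penalize the step with the arithmetic slip.
def toy_step_score(prefix):
    return 0.2 if "7 * 8 = 54" in prefix[-1] else 0.9

def toy_orm_score(solution):
    return 0.8 if solution[-1].endswith("56") else 0.3

solutions = [["7 * 8 = 54", "Answer: 54"], ["7 * 8 = 56", "Answer: 56"]]
print(best_of_n_prm(solutions, toy_step_score))
```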

So far, response selection happens only after a full response has been generated; this does not fully utilize a step-wise scorer
LLM + Tree Search¹⁸: prioritize the exploration of more promising partial solutions
example: game of 24, at each step

thought generation: prompt the LLM to propose possible next thinking steps (select two numbers to perform calculation)
thought evaluation: prompt the LLM to evaluate how promising the current state is (decide whether it is possible to reach the final number of 24 after these manipulations)

Voting-based state evaluation: LLM votes multiple times, then selects the majority vote as the final choice
The original paper uses basic BFS / DFS algorithms; more advanced search algorithms (e.g., Monte-Carlo Tree Search) can be integrated
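A minimal sketch of breadth-first tree search on the game of 24. To keep it runnable without a model, both LLM calls are replaced by stand-ins, which is a strong assumption: `propose` enumerates every arithmetic step instead of asking the LLM for a few, and `evaluate` uses exhaustive lookahead instead of the LLM's judgment of how promising a state is. All names here are hypothetical.

```python
from itertools import combinations

TARGET = 24

def propose(state):
    """Thought generation stand-in: combine any two numbers with one operation."""
    numbers, steps = state
    children = []
    for i, j in combinations(range(len(numbers)), 2):
        a, b = numbers[i], numbers[j]
        rest = [n for k, n in enumerate(numbers) if k not in (i, j)]
        options = [(a + b, f"{a} + {b}"), (a * b, f"{a} * {b}"),
                   (a - b, f"{a} - {b}"), (b - a, f"{b} - {a}")]
        if b != 0:
            options.append((a / b, f"{a} / {b}"))
        if a != 0:
            options.append((b / a, f"{b} / {a}"))
        for value, expr in options:
            children.append((tuple(rest + [value]), steps + (f"{expr} = {value}",)))
    return children

def is_goal(state):
    numbers, _ = state
    return len(numbers) == 1 and abs(numbers[0] - TARGET) < 1e-6

def evaluate(state):
    """Thought evaluation stand-in: 1.0 if 24 is still reachable, else 0.0."""
    if len(state[0]) == 1:
        return 1.0 if is_goal(state) else 0.0
    return max(evaluate(child) for child in propose(state))

def tot_bfs(root, breadth=5, max_depth=3):
    """Keep only the `breadth` most promising partial solutions at each level."""
    frontier = [root]
    for _ in range(max_depth):
        candidates = [child for state in frontier for child in propose(state)]
        solved = [c for c in candidates if is_goal(c)]
        if solved:
            return solved[0]
        frontier = sorted(candidates, key=evaluate, reverse=True)[:breadth]
    return None

solution = tot_bfs(((4, 9, 10, 13), ()))
print(solution[1])
```

With an LLM evaluator the scores are noisy, which is where voting over multiple evaluations and a wider beam become useful.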

Iterative Self-improvement

Idea: increase the depth to reach the final solution

Even the best LLMs still make (sometimes obvious) mistakes; on the other hand, humans also tend to make (sometimes trivial) mistakes at first thought

Sampling multiple solutions can reduce mistakes from a single prediction, but it is still suboptimal because there is no feedback loop to correct the mistakes after a complete solution is generated

Inference-time self-improvement: the LLM iteratively improves its own response for the given task, which aligns better with the human error-correction process

Reflexion¹⁹ and self-refine²⁰: two steps after generating each solution

LLM generates feedback on its output; use external evaluation when available
LLM self-refines its output based on both internal feedback and external evaluation

Reflexion improves on tasks with effective evaluation heuristics (e.g., ALFWorld)
External evaluation gives the answer correctness at each reflection step (e.g., HotPotQA)
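The generate, feedback, refine cycle can be sketched as a loop. All callables stand for LLM calls or an external checker; the names, the stopping rule, and the toy task are assumptions for illustration, not the exact procedure of either paper.

```python
def refine_loop(task, generate, critique, revise, external_eval=None, max_rounds=3):
    """Iteratively improve a solution using self-feedback and optional external evaluation."""
    solution = generate(task)
    for _ in range(max_rounds):
        report = None
        if external_eval is not None:
            passed, report = external_eval(solution)
            if passed:
                break  # stop as soon as the external check succeeds
        feedback = critique(task, solution, report)   # LLM-generated feedback
        solution = revise(task, solution, feedback)   # LLM revises its output
    return solution

# Toy stand-ins: the "solution" is a number that must reach 3.
toy_generate = lambda task: 0
toy_eval = lambda s: (s == 3, f"got {s}, want 3")
toy_critique = lambda task, s, report: f"{report}; increase by 1"
toy_revise = lambda task, s, feedback: s + 1

print(refine_loop("reach 3", toy_generate, toy_critique, toy_revise,
                  external_eval=toy_eval, max_rounds=5))
```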

Self-debugging²¹ is a natural workflow for code generation
Code execution provides natural external feedback: humans often debug better with an IDE

Feedback formats	Details
Simple	a short universal feedback for all wrong code
Unit test feedback	include the execution results
Code explanation	line-by-line explanation of the implementation
Trace	line-by-line simulation of the execution trace

Self-debugging consistently boosts the performance across different LLMs; more informative feedback further improves the debugging performance
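A minimal sketch of self-debugging with unit test feedback. The feedback wording, the helper names, and the toy `llm` are hypothetical. The sketch runs candidate code with `exec` for brevity; model-generated code should be executed in a sandbox in practice.

```python
def run_unit_tests(code, func_name, tests):
    """Run (args, expected) test cases; return a list of failure messages."""
    namespace = {}
    try:
        exec(code, namespace)  # assumption: trusted toy code; sandbox real model output
        func = namespace[func_name]
    except Exception as exc:
        return [f"code failed to load: {type(exc).__name__}: {exc}"]
    failures = []
    for args, expected in tests:
        try:
            got = func(*args)
        except Exception as exc:
            failures.append(f"{func_name}{args} raised {type(exc).__name__}: {exc}")
            continue
        if got != expected:
            failures.append(f"{func_name}{args} returned {got!r}, expected {expected!r}")
    return failures

def self_debug(code, func_name, tests, llm, max_turns=3):
    """Feed execution results back to the model until the tests pass."""
    for _ in range(max_turns):
        failures = run_unit_tests(code, func_name, tests)
        if not failures:
            return code
        feedback = ("The code above fails these unit tests:\n"
                    + "\n".join(failures) + "\nPlease fix the code.")
        code = llm(f"{code}\n\n{feedback}")
    return code

buggy = "def add(a, b):\n    return a - b\n"
fixed = "def add(a, b):\n    return a + b\n"
tests = [((1, 2), 3), ((0, 0), 0)]
toy_llm = lambda prompt: fixed  # stand-in: always returns the corrected program

result = self_debug(buggy, "add", tests, toy_llm)
print(run_unit_tests(buggy, "add", tests))
```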

How does self-correction work for QA-style reasoning tasks?

some prior work shows improvement with self-correction, but using an oracle verifier
in practice, the oracle verifier will not be available in most use cases, especially in reasoning tasks
how do LLMs perform without such very explicit and accurate external feedback?

Self-correction without oracle feedback²² hurts the reasoning performance

oracle: utilize the ground truth answer for correction
without oracle feedback, LLMs need to judge the response correctness themselves
LLMs can wrongly judge the correctness of their predictions, leading to worse performance after self-correction

General-purpose feedback prompt variants do not improve the performance: editing the feedback prompt affects the self-correction behavior (tendency to keep the initial response), but none of the variants significantly improves over the initial performance

Multi-agent debate²³ does not improve over self-consistency

have the model generate several responses in parallel once
prompt the LLM to review multiple responses and give an updated one
recall: self-consistency selects the response with the most common final answer
without a good evaluator, multi-agent debate does not effectively utilize the token budget; self-consistency scales better than multi-agent debate when the number of responses and the budget are kept the same

Utilize Different Combinations of These Techniques

Putting everything together: how to balance the inference budget for generating multiple samples, in parallel or sequentially?
This is mostly a model-specific and task-specific empirical question, depending on the model’s self-reflection and correction abilities

Overall conclusion from research²⁴:

for simple problems, the model benefits more from self-correction because it is better able to tell whether the current solution is wrong and how to revise it
for harder problems, the best trade-off lies in the middle, combining some parallel generation with sequential revision

Model size is another factor for optimizing inference cost²⁵:

with the same FLOPs budget, we can sample more solutions from a lighter model
the optimal model size can differ across inference budgets
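The first point can be made concrete with a rough estimate. The sketch assumes the common approximation that generating one token costs about 2 FLOPs per model parameter; the model sizes and token counts are illustrative, not figures from the lecture.

```python
def samples_within_budget(flops_budget: int, num_params: int, tokens_per_sample: int) -> int:
    """Number of full samples affordable, assuming ~2 * params FLOPs per generated token."""
    flops_per_sample = 2 * num_params * tokens_per_sample
    return flops_budget // flops_per_sample

# Budget equal to one 512-token sample from a hypothetical 70B-parameter model.
budget = 2 * 70_000_000_000 * 512

print(samples_within_budget(budget, 70_000_000_000, 512))  # large model
print(samples_within_budget(budget, 7_000_000_000, 512))   # model 10x smaller
```

Whether the extra samples from the smaller model are worth more than one sample from the larger model depends on the selection method and the task, which is why the optimal model size can change with the budget.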

The best practice to interact with an LLM should be adapted according to its capabilities

“One thing that should be learned from the bitter lesson is the great power of general purpose methods, of methods that continue to scale with increased computation even as the available computation becomes very great”

“We want AI agents that can discover like we can, not which contain what we have discovered. Building in our discoveries only makes it harder to see how the discovering process can be done”

– Professor Richard Sutton, The Bitter Lesson

References

1. OpenAI Research: Learning to reason with LLMs
2. Blog post: OpenAI o3 Breakthrough
3. Jason Wei et al. Chain-of-Thought Prompting Elicits Reasoning in Large Language Models. arXiv:2201.11903 [cs.CL]. 2022.
4. Maxwell Nye et al. Show Your Work: Scratchpads for Intermediate Computation with Language Models. arXiv:2112.00114 [cs.LG]. 2021.
5. Takeshi Kojima et al. Large Language Models are Zero-Shot Reasoners. arXiv:2205.11916 [cs.CL]. 2022.
6. Michihiro Yasunaga et al. Large Language Models as Analogical Reasoners. arXiv:2310.01714 [cs.LG]. 2023.
7. Yongchao Zhou et al. Large Language Models Are Human-Level Prompt Engineers. arXiv:2211.01910 [cs.LG]. 2022.
8. Chengrun Yang et al. Large Language Models as Optimizers. arXiv:2309.03409 [cs.LG]. 2023.
9. Denny Zhou et al. Least-to-Most Prompting Enables Complex Reasoning in Large Language Models. arXiv:2205.10625 [cs.AI]. 2022.
10. Andrew Drozdov et al. Compositional Semantic Parsing with Large Language Models. arXiv:2209.15003 [cs.CL]. 2022.
11. Daniel Keysers et al. Measuring Compositional Generalization: A Comprehensive Method on Realistic Data. arXiv:1912.09713 [cs.LG]. 2019.
12. Pei Zhou et al. Self-Discover: Large Language Models Self-Compose Reasoning Structures. arXiv:2402.03620 [cs.AI]. 2024.
13. Xuezhi Wang et al. Self-Consistency Improves Chain of Thought Reasoning in Language Models. arXiv:2203.11171 [cs.CL]. 2022.
14. Yujia Li et al. Competition-Level Code Generation with AlphaCode. arXiv:2203.07814 [cs.PL]. 2022.
15. Xinyun Chen et al. Universal Self-Consistency for Large Language Model Generation. arXiv:2311.17311 [cs.CL]. 2023.
16. Karl Cobbe et al. Training Verifiers to Solve Math Word Problems. arXiv:2110.14168 [cs.LG]. 2021.
17. Hunter Lightman et al. Let’s Verify Step by Step. arXiv:2305.20050 [cs.LG]. 2023.
18. Shunyu Yao et al. Tree of Thoughts: Deliberate Problem Solving with Large Language Models. arXiv:2305.10601 [cs.CL]. 2023.
19. Noah Shinn et al. Reflexion: Language Agents with Verbal Reinforcement Learning. arXiv:2303.11366 [cs.AI]. 2023.
20. Aman Madaan et al. Self-Refine: Iterative Refinement with Self-Feedback. arXiv:2303.17651 [cs.CL]. 2023.
21. Xinyun Chen et al. Teaching Large Language Models to Self-Debug. arXiv:2304.05128 [cs.CL]. 2023.
22. Jie Huang et al. Large Language Models Cannot Self-Correct Reasoning Yet. arXiv:2310.01798 [cs.CL]. 2023.
23. Yilun Du et al. Improving Factuality and Reasoning in Language Models through Multiagent Debate. arXiv:2305.14325 [cs.CL]. 2023.
24. Charlie Snell et al. Scaling LLM Test-Time Compute Optimally can be More Effective than Scaling Model Parameters. arXiv:2408.03314 [cs.LG]. 2024.
25. Yangzhen Wu et al. Inference Scaling Laws: An Empirical Analysis of Compute-Optimal Inference for Problem-Solving with Language Models. arXiv:2408.00724 [cs.AI]. 2024.
