Large language models (LLMs) demonstrate impressive reasoning abilities, but translating that reasoning into real-world actions remains challenging. In particular, it is unclear how to provably complete a given task with a minimal number of interactions with the external environment, e.g., through an internal mechanism of reasoning. To this end, we propose a principled framework with provable regret guarantees to orchestrate reasoning and acting, which we call "reason for future, act for now" (RAFA). Specifically, we design a prompt template for reasoning that learns from the memory buffer and plans a future trajectory over a long horizon ("reason for future"). At each step, the LLM agent takes the initial action of the planned trajectory ("act for now"), stores the collected feedback in the memory buffer, and reinvokes the reasoning routine to replan the future trajectory from the new state.
The key idea is to cast reasoning in LLMs as learning and planning in Bayesian adaptive Markov decision processes (MDPs). Correspondingly, we prompt LLMs to form an updated posterior of the unknown environment from the memory buffer (learning) and to generate an optimal trajectory for multiple future steps that maximizes a value function (planning). The learning and planning subroutines are performed in an "in-context" manner to emulate the actor-critic update for MDPs. Our theoretical analysis proves that this novel combination of long-term reasoning and short-term acting achieves a √T regret. In particular, the regret bound highlights an intriguing interplay between the prior knowledge obtained through pretraining and the uncertainty reduction achieved by reasoning and acting. Our empirical validation shows that RAFA outperforms various existing frameworks and achieves nearly perfect scores on a few benchmarks. By incorporating "classical" MDP techniques, RAFA introduces the first autonomous LLM agent with provable regret guarantees. Notably, the LLMs do not themselves function as actors, critics, or learned world models, but rather as an internal mechanism of reasoning that improves them iteratively.
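To make the √T claim concrete, the display below is a minimal sketch of the standard cumulative-regret notion over T interaction steps; the precise definition, the episodic structure, and the problem-dependent factors hidden in the Õ(·) are spelled out in the paper.

```latex
% Sketch of the regret over T steps: the cumulative gap between the value of the
% optimal policy \pi^\star under the true (unknown) environment and the value of
% the policy \pi_t that the agent executes at step t (constants and problem-dependent
% factors omitted; see the paper for the exact statement).
\mathfrak{R}(T)
\;=\;
\sum_{t=0}^{T-1} \Bigl( V^{\pi^\star}(s_t) - V^{\pi_t}(s_t) \Bigr)
\;\le\;
\widetilde{\mathcal{O}}\bigl(\sqrt{T}\bigr).
```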
At the t-th step of RAFA (Algorithm 1), the LLM agent invokes the reasoning routine, which learns from the memory buffer and plans a future trajectory over a long horizon ("reason for future" in Line 6), takes the initial action of the planned trajectory ("act for now" in Line 7), and stores the collected feedback (state, action, and reward) in the memory buffer (Line 8). Upon the state transition of the external environment, the LLM agent reinvokes the reasoning routine to replan another future trajectory from the new state (Line 6 following Line 9). To ensure learning and planning stability, we impose a switching condition (Line 10) that decides whether to incorporate the newest chunk of history into the information state, which the reasoning routine uses as context. For different concrete settings, we use different implementations of the LLM learner-planner; please check our paper for more details.
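For intuition, here is a minimal Python sketch of the outer loop described above. It is not the interface of our codebase: `llm_learner_planner`, `switching_condition`, and the gym-style `env` are illustrative placeholders.

```python
# Minimal sketch of the RAFA outer loop. All names below (llm_learner_planner,
# switching_condition, env) are illustrative placeholders, not the paper's exact API.

def rafa(env, llm_learner_planner, switching_condition, horizon, T):
    memory = []        # full buffer of (state, action, reward) feedback
    info_state = []    # the chunk of history currently exposed to the LLM as context
    state = env.reset()
    for t in range(T):
        # "Reason for future": learn from the information state and plan a trajectory.
        planned_trajectory = llm_learner_planner(info_state, state, horizon)
        # "Act for now": execute only the initial action of the planned trajectory.
        action = planned_trajectory[0]
        next_state, reward, done = env.step(action)
        memory.append((state, action, reward))
        # Switching condition: only occasionally fold new feedback into the
        # information state, which stabilizes learning and planning.
        if switching_condition(memory, info_state):
            info_state = list(memory)
        state = next_state
        if done:
            state = env.reset()
    return memory
```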
Our empirical validation shows that RAFA outperforms various existing frameworks in interactive decision-making tasks, including ALFWorld, BlocksWorld, Game of 24, and a new benchmark based on Tic-Tac-Toe. On a few of these benchmarks, it achieves nearly perfect scores.
Game of 24 is a mathematical puzzle whose goal is to obtain 24 from four natural numbers through basic arithmetic operations. RAFA uses the beam-search planner (Algorithm 4 in our paper) on Game of 24.
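The sketch below illustrates the flavor of a beam-search planner for this task: keep the top-k partial solutions at each depth, scored by an LLM value prompt. The callables `propose_next_steps` and `score_state` are hypothetical placeholders; the exact prompts and scoring in Algorithm 4 may differ.

```python
# Hedged sketch of a beam-search planner in the style of Algorithm 4.
# propose_next_steps and score_state stand in for LLM prompts that propose
# candidate arithmetic steps and score the remaining numbers, respectively.

def beam_search_plan(initial_numbers, propose_next_steps, score_state,
                     beam_width=5, depth=3):
    # Each beam entry is (remaining_numbers, steps_taken_so_far).
    beam = [(initial_numbers, [])]
    for _ in range(depth):  # four numbers collapse to one in three operations
        candidates = []
        for numbers, steps in beam:
            for new_numbers, step in propose_next_steps(numbers):
                candidates.append((new_numbers, steps + [step]))
        # Keep the top-k candidates ranked by the LLM value estimate.
        candidates.sort(key=lambda c: score_state(c[0]), reverse=True)
        beam = candidates[:beam_width]
    return beam  # surviving trajectories; the agent executes the first step of the best one
```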
ALFWorld is an interactive environment for embodied agent simulations, encompassing 134 household tasks in six categories. RAFA uses the tree-search planner (Algorithm 2 in our paper) on ALFWorld.
BlocksWorld contains tasks that require rearranging blocks into specified configurations. RAFA uses the MCTS planner (Algorithm 5 in our paper) on BlocksWorld.
Tic-Tac-Toe is a competitive game in which the X and O players take turns placing marks on a three-by-three grid. RAFA uses the MCTS planner (Algorithm 5 in our paper) on Tic-Tac-Toe.
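As a rough illustration of the MCTS planner used for BlocksWorld and Tic-Tac-Toe, the sketch below runs standard UCB-based tree search in which LLM-backed callables (`llm_propose_actions`, `llm_simulate`, `llm_evaluate`) stand in for proposing actions, simulating transitions, and scoring leaf states. These names and roles are placeholders; the exact prompts in Algorithm 5 may differ.

```python
import math
import random

# Rough sketch of an MCTS planner in the spirit of Algorithm 5. The LLM-backed
# callables (llm_propose_actions, llm_simulate, llm_evaluate) are placeholders.

class Node:
    def __init__(self, state, parent=None):
        self.state, self.parent = state, parent
        self.children = {}          # action -> Node
        self.visits, self.value = 0, 0.0

def ucb(child, parent_visits, c=1.4):
    # Upper-confidence bound used for selection; unvisited children are tried first.
    if child.visits == 0:
        return float("inf")
    return child.value / child.visits + c * math.sqrt(math.log(parent_visits) / child.visits)

def mcts_plan(root_state, llm_propose_actions, llm_simulate, llm_evaluate, n_iters=50):
    root = Node(root_state)
    for _ in range(n_iters):
        node = root
        # Selection: descend by UCB while the current node is fully expanded.
        while node.children and all(a in node.children for a in llm_propose_actions(node.state)):
            node = max(node.children.values(), key=lambda ch: ucb(ch, node.visits))
        # Expansion: add one untried action proposed by the LLM.
        untried = [a for a in llm_propose_actions(node.state) if a not in node.children]
        if untried:
            action = random.choice(untried)
            child = Node(llm_simulate(node.state, action), parent=node)
            node.children[action] = child
            node = child
        # Evaluation and backpropagation: score the leaf, then propagate to the root.
        value = llm_evaluate(node.state)
        while node is not None:
            node.visits += 1
            node.value += value
            node = node.parent
    # Return the root action with the highest visit count.
    return max(root.children.items(), key=lambda kv: kv[1].visits)[0]
```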
@article{liu2023reason,
  title={Reason for Future, Act for Now: A Principled Framework for Autonomous LLM Agents with Provable Sample Efficiency},
  author={Liu, Zhihan and Hu, Hao and Zhang, Shenao and Guo, Hongyi and Ke, Shuqi and Liu, Boyi and Wang, Zhaoran},
  journal={arXiv preprint arXiv:2309.17382},
  year={2023}
}