Abstract
Large language models (LLMs) have demonstrated remarkable language proficiency, but they face challenges when solving interactive tasks independently. Existing methods either rely on gradient access, which is often unavailable in state-of-the-art LLMs such as GPT-4, or require diverse, high-quality in-context demonstrations. In this study, we propose LLM-PO, a novel approach that enables LLMs to address these tasks without gradient access or extensive demonstrations. The key idea is to maintain a text-based plan and ask the LLM to reflect on the pros and cons of the current plan based on experience collected with it, update the plan accordingly, and gather further experience with the new plan. Experiments on HotpotQA demonstrate that LLM-PO achieves success rates higher than or on par with in-context learning (ICL) baselines while requiring lower inference cost.
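The abstract outlines an iterative reflect-and-update loop; below is a minimal sketch of that loop, based only on the description above. All names here (`llm`, `run_episode`, the prompt wording, the fixed round count) are hypothetical placeholders, not the paper's actual interface or prompts.

```python
# Sketch of the plan-refinement loop described in the abstract.
# Assumes a black-box LLM (no gradient access) exposed as a prompt -> completion
# function, and an environment hook that runs the task under a given plan and
# returns a textual experience log. Both are stand-ins, not the paper's API.
from typing import Callable


def llm_po_loop(
    llm: Callable[[str], str],          # black-box LLM: prompt in, completion out
    run_episode: Callable[[str], str],  # acts under a plan, returns an experience log
    init_plan: str,
    n_rounds: int = 5,                  # arbitrary budget; the paper's criterion is unspecified
) -> str:
    """Iteratively refine a text-based plan from collected experience."""
    plan = init_plan
    for _ in range(n_rounds):
        # 1. Collect experience by acting under the current plan.
        experience = run_episode(plan)
        # 2. Ask the LLM to reflect on the plan's pros and cons given that experience.
        reflection = llm(
            f"Current plan:\n{plan}\n\nCollected experience:\n{experience}\n\n"
            "List the pros and cons of this plan in light of the experience."
        )
        # 3. Ask the LLM to rewrite the plan based on the reflection,
        #    then loop back to collect experience with the new plan.
        plan = llm(
            f"Current plan:\n{plan}\n\nReflection:\n{reflection}\n\n"
            "Write an improved plan."
        )
    return plan
```

Because the plan is plain text and both steps are ordinary completions, the loop needs neither gradients nor curated demonstrations, matching the constraints the abstract targets.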
URL
https://arxiv.org/abs/2305.15064