Abstract
This study introduces a new long-form database question answering dataset designed to evaluate how Large Language Models (LLMs) interact with a SQL interpreter. The task requires LLMs to strategically generate multiple SQL queries to retrieve sufficient data from a database, to reason over the acquired context, and to synthesize the results into a comprehensive analytical narrative. Our findings show that this task poses a significant challenge even for the state-of-the-art GPT-4 model. We propose and evaluate two interaction strategies and provide a fine-grained analysis of the individual stages within the interaction. A key finding is the identification of two primary bottlenecks hindering effective interaction: the capacity for planning and the ability to generate multiple SQL queries. To address the challenge of accurately assessing answer quality, we introduce a multi-agent evaluation framework that simulates the academic peer-review process, improving the precision and reliability of our evaluations. This framework allows for a more nuanced understanding of the strengths and limitations of current LLMs in complex retrieval and reasoning tasks.
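To make the interaction concrete, below is a minimal sketch of the LLM-SQL interpreter loop the abstract describes: the model iteratively plans and issues SQL queries, accumulates the retrieved evidence, and finally synthesizes it into a long-form answer. The `ask_model` stub, the prompt wording, and the `DONE` stopping convention are illustrative assumptions, not the paper's actual protocol.

```python
import sqlite3

def ask_model(prompt: str) -> str:
    # Hypothetical stand-in for an LLM API call (e.g., to GPT-4).
    raise NotImplementedError("plug in an LLM API call here")

def answer_question(db_path: str, question: str, max_turns: int = 5) -> str:
    conn = sqlite3.connect(db_path)
    context = []  # accumulated (query, rows) evidence
    for _ in range(max_turns):
        # Planning step: ask the model for its next retrieval query.
        sql = ask_model(
            f"Question: {question}\nEvidence so far: {context}\n"
            "Write the next SQL query, or reply DONE if you have enough data."
        )
        if sql.strip() == "DONE":
            break
        rows = conn.execute(sql).fetchall()  # run the query via the interpreter
        context.append((sql, rows))
    # Synthesis step: turn the retrieved evidence into a narrative answer.
    return ask_model(
        f"Question: {question}\nEvidence: {context}\n"
        "Write a comprehensive analytical narrative answer."
    )
```

The two bottlenecks the paper identifies map directly onto this loop: deciding what to retrieve next (planning) and producing a correct sequence of SQL queries (generation).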
URL
https://arxiv.org/abs/2311.09721