Abstract
This study introduces a new long-form database question answering dataset designed to evaluate how Large Language Models (LLMs) interact with a SQL interpreter. The task requires LLMs to strategically generate multiple SQL queries to retrieve sufficient data from a database, to reason over the acquired context, and to synthesize the results into a comprehensive analytical narrative. Our findings show that this task poses a significant challenge even for the state-of-the-art GPT-4 model. We propose and evaluate two interaction strategies and provide a fine-grained analysis of the individual stages within the interaction. A key finding is the identification of two primary bottlenecks hindering effective interaction: the capacity for planning and the ability to generate multiple SQL queries. To address the challenge of accurately assessing answer quality, we introduce a multi-agent evaluation framework that simulates the academic peer-review process, improving the precision and reliability of our evaluations. This framework allows for a more nuanced understanding of the strengths and limitations of current LLMs in complex retrieval and reasoning tasks.
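To make the interaction concrete, below is a minimal sketch of the LLM-SQL interpreter loop the abstract describes: the model iteratively plans and issues SQL queries, accumulates the retrieved evidence, and finally synthesizes it into a long-form answer. The `ask_model` stub, the prompt wording, and the `DONE` stopping convention are illustrative assumptions, not the paper's actual protocol.

```python
import sqlite3

def ask_model(prompt: str) -> str:
    # Hypothetical stand-in for an LLM API call (e.g., to GPT-4).
    raise NotImplementedError("plug in an LLM API call here")

def answer_question(db_path: str, question: str, max_turns: int = 5) -> str:
    conn = sqlite3.connect(db_path)
    context = []  # accumulated (query, rows) evidence
    for _ in range(max_turns):
        # Planning step: ask the model for its next retrieval query.
        sql = ask_model(
            f"Question: {question}\nEvidence so far: {context}\n"
            "Write the next SQL query, or reply DONE if you have enough data."
        )
        if sql.strip() == "DONE":
            break
        rows = conn.execute(sql).fetchall()  # run the query via the interpreter
        context.append((sql, rows))
    # Synthesis step: turn the retrieved evidence into a narrative answer.
    return ask_model(
        f"Question: {question}\nEvidence: {context}\n"
        "Write a comprehensive analytical narrative answer."
    )
```

The two bottlenecks the paper identifies map directly onto this loop: deciding what to retrieve next (planning) and producing a correct sequence of SQL queries (generation).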
URL
https://arxiv.org/abs/2311.09721