Abstract
Table Question Answering (TQA) aims at composing an answer to a question based on tabular data. While prior research has shown that TQA models lack robustness, the underlying cause and nature of this issue remain largely unclear, which poses a significant obstacle to the development of robust TQA systems. In this paper, we formalize three major desiderata for a fine-grained evaluation of the robustness of TQA systems. They should (i) answer questions regardless of alterations in table structure, (ii) base their responses on the content of relevant cells rather than on biases, and (iii) demonstrate robust numerical reasoning capabilities. To investigate these aspects, we create and publicly release a novel TQA evaluation benchmark in English. Our extensive experimental analysis reveals that none of the examined state-of-the-art TQA systems consistently excels in all three aspects. Our benchmark is a crucial instrument for monitoring the behavior of TQA systems and paves the way for the development of robust TQA systems.
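As an illustration of desideratum (i), the following minimal sketch measures how often a system's answer stays the same when the rows or columns of a table are reordered. It is not the paper's benchmark or evaluation code. The function names (`shuffle_rows`, `shuffle_columns`, `structure_consistency`), the two toy models, and the toy table are all hypothetical and exist only for illustration.

```python
import random
from typing import Callable, Dict, List

# Assumption: a table is a list of rows, each row a dict keyed by column header.
Table = List[Dict[str, str]]
Model = Callable[[Table, str], str]


def shuffle_rows(table: Table, seed: int) -> Table:
    """Return a copy of the table with its rows in a seeded random order."""
    rng = random.Random(seed)
    rows = list(table)
    rng.shuffle(rows)
    return rows


def shuffle_columns(table: Table, seed: int) -> Table:
    """Return a copy of the table with its columns in a seeded random order."""
    if not table:
        return []
    rng = random.Random(seed)
    headers = list(table[0].keys())
    rng.shuffle(headers)
    return [{h: row[h] for h in headers} for row in table]


def structure_consistency(model: Model, table: Table, question: str,
                          n_variants: int = 10) -> float:
    """Fraction of reordered tables on which the answer equals the original answer."""
    reference = model(table, question)
    variants = []
    for seed in range(n_variants):
        variants.append(shuffle_rows(table, seed))
        variants.append(shuffle_columns(table, seed))
    unchanged = sum(model(v, question) == reference for v in variants)
    return unchanged / len(variants)


def lookup_model(table: Table, question: str) -> str:
    """Hypothetical stand-in for a TQA system: finds the row whose city is named."""
    for row in table:
        if row["city"] in question:
            return row["population"]
    return ""


def first_row_model(table: Table, question: str) -> str:
    """Hypothetical brittle stand-in: always reads the first row, ignoring the question."""
    return table[0]["population"]


# Toy table with made-up values, for illustration only.
toy_table = [
    {"city": "Alpha", "population": "10"},
    {"city": "Beta", "population": "20"},
    {"city": "Gamma", "population": "30"},
]
toy_question = "What is the population of Beta?"

if __name__ == "__main__":
    print("lookup_model:", structure_consistency(lookup_model, toy_table, toy_question))
    print("first_row_model:", structure_consistency(first_row_model, toy_table, toy_question))
```

A score of 1.0 means the answer never changed under reordering. The check only measures consistency with the answer on the original table, not correctness, so a real evaluation would also compare against gold answers.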
URL
https://arxiv.org/abs/2404.18585