While impressive performance has been achieved on the task of Answer Sentence Selection (AS2) for English, the same does not hold for languages that lack large labeled datasets. In this work, we propose Cross-Lingual Knowledge Distillation (CLKD) from a strong English AS2 teacher as a method to train AS2 models for low-resource languages without the need for labeled data in the target language. To evaluate our method, we introduce 1) Xtr-WikiQA, a translation-based WikiQA dataset for 9 additional languages, and 2) TyDi-AS2, a multilingual AS2 dataset with over 70K questions spanning 8 typologically diverse languages. We conduct extensive experiments on Xtr-WikiQA and TyDi-AS2 with multiple teachers, diverse monolingual and multilingual pretrained language models (PLMs) as students, and both monolingual and multilingual training. The results demonstrate that CLKD either outperforms or rivals even supervised fine-tuning with the same amount of labeled data and a combination of machine translation and the teacher model. Our method can potentially enable stronger AS2 models for low-resource languages, while TyDi-AS2 can serve as the largest multilingual AS2 dataset for further studies in the research community.
https://arxiv.org/abs/2305.16302
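A minimal sketch of the cross-lingual distillation setup described above, assuming the teacher scores English (question, candidate) pairs and the multilingual student is trained to match those soft scores on parallel target-language pairs; the linear scorers and random features below are hypothetical stand-ins for the actual PLMs. Because the student only imitates the teacher's probabilities, no labeled data in the target language is needed.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)

# Hypothetical stand-ins: in practice these would be an English PLM teacher and a
# multilingual PLM student, each producing one relevance logit per (question, candidate) pair.
teacher = nn.Linear(16, 1)   # frozen English AS2 teacher
student = nn.Linear(16, 1)   # multilingual student to be trained
optimizer = torch.optim.Adam(student.parameters(), lr=1e-3)

# Toy "features": row i of en_feats and tgt_feats represent the same QA pair in English
# and in the target language (e.g., obtained via parallel or machine-translated data).
en_feats = torch.randn(8, 16)
tgt_feats = en_feats + 0.1 * torch.randn(8, 16)

for step in range(100):
    with torch.no_grad():
        teacher_probs = torch.sigmoid(teacher(en_feats))      # soft labels, no human annotation
    student_logits = student(tgt_feats)
    loss = F.binary_cross_entropy_with_logits(student_logits, teacher_probs)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

print(f"final distillation loss: {loss.item():.4f}")
```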
Leveraging external knowledge to enhance the reasoning ability is crucial for commonsense question answering. However, the existing knowledge bases heavily rely on manual annotation, which unavoidably causes deficiencies in the coverage of world-wide commonsense knowledge. Accordingly, the knowledge bases fail to be flexible enough to support reasoning over diverse questions. Recently, large-scale language models (LLMs) have dramatically improved the intelligence in capturing and leveraging knowledge, which opens up a new way to address the issue of eliciting knowledge from language models. We propose a Unified Facts Obtaining (UFO) approach. UFO turns LLMs into knowledge sources and produces relevant facts (knowledge statements) for the given question. We first develop a unified prompt consisting of demonstrations that cover different aspects of commonsense and different question styles. On this basis, we instruct the LLMs to generate question-related supporting facts for various commonsense questions via prompting. After fact generation, we apply a dense retrieval-based fact selection strategy to choose the best-matched fact. The selected fact is then fed into the answer inference model along with the question. Notably, due to the design of unified prompts, UFO can support reasoning in various commonsense aspects (including general commonsense, scientific commonsense, and social commonsense). Extensive experiments on the CommonsenseQA 2.0, OpenBookQA, QASC, and Social IQA benchmarks show that UFO significantly improves the performance of the inference model and outperforms manually constructed knowledge sources.
https://arxiv.org/abs/2305.16048
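A minimal sketch of the dense retrieval-based fact selection step described above: the question and the LLM-generated facts are embedded, and the best-matched fact is chosen by cosine similarity. The encoder (a sentence-transformers MiniLM model) and the toy question and facts are illustrative assumptions, not the paper's exact setup.

```python
from sentence_transformers import SentenceTransformer, util

encoder = SentenceTransformer("all-MiniLM-L6-v2")

question = "Can a penguin fly south for the winter?"
generated_facts = [
    "Penguins are flightless birds that swim instead of fly.",
    "Many birds migrate south before winter begins.",
    "Winter in the Southern Hemisphere starts in June.",
]

q_emb = encoder.encode(question, convert_to_tensor=True)
f_emb = encoder.encode(generated_facts, convert_to_tensor=True)
scores = util.cos_sim(q_emb, f_emb)[0]            # similarity of the question to each fact
best_fact = generated_facts[int(scores.argmax())]

# The selected fact is concatenated with the question and fed to the answer inference model.
print(best_fact)
```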
The integration of multi-document pre-training objectives into language models has resulted in remarkable improvements in multi-document downstream tasks. In this work, we propose extending this idea by pre-training a generic multi-document model from a novel cross-document question answering pre-training objective. To that end, given a set (or cluster) of topically-related documents, we systematically generate semantically-oriented questions from a salient sentence in one document and challenge the model, during pre-training, to answer these questions while "peeking" into other topically-related documents. In a similar manner, the model is also challenged to recover the sentence from which the question was generated, again while leveraging cross-document information. This novel multi-document QA formulation directs the model to better recover cross-text informational relations, and introduces a natural augmentation that artificially increases the pre-training data. Further, unlike prior multi-document models that focus on either classification or summarization tasks, our pre-training objective formulation enables the model to perform tasks that involve both short text generation (e.g., QA) and long text generation (e.g., summarization). Following this scheme, we pre-train our model -- termed QAmden -- and evaluate its performance across several multi-document tasks, including multi-document QA, summarization, and query-focused summarization, yielding improvements of up to 7% and significantly outperforming zero-shot GPT-3.5 and GPT-4.
https://arxiv.org/abs/2305.15387
Recent studies show that sentence-level extractive QA, i.e., based on Answer Sentence Selection (AS2), is outperformed by Generation-based QA (GenQA) models, which generate answers using the top-k answer sentences ranked by AS2 models (a la retrieval-augmented generation style). In this paper, we propose a novel training paradigm for GenQA using supervision from automatic QA evaluation models (GAVA). Specifically, we propose three strategies to transfer knowledge from these QA evaluation models to a GenQA model: (i) augmenting training data with answers generated by the GenQA model and labelled by GAVA (either statically, before training, or (ii) dynamically, at every training epoch); and (iii) using the GAVA score for weighting the generator loss during the learning of the GenQA model. We evaluate our proposed methods on two academic and one industrial dataset, obtaining a significant improvement in answering accuracy over the previous state of the art.
https://arxiv.org/abs/2305.15344
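A minimal sketch of strategy (iii) above, weighting the generator loss of the GenQA model with per-example scores from an automatic QA evaluation model (GAVA). The tensors and the simple multiplicative weighting are illustrative assumptions; only the loss-shaping idea is shown.

```python
import torch
import torch.nn.functional as F

def weighted_generator_loss(lm_logits, target_ids, gava_scores, pad_id=0):
    """
    lm_logits:    (batch, seq_len, vocab) decoder logits for generated answers
    target_ids:   (batch, seq_len) reference answer token ids
    gava_scores:  (batch,) correctness scores in [0, 1] from the QA evaluation model
    """
    # Token-level cross-entropy, kept per example.
    ce = F.cross_entropy(
        lm_logits.transpose(1, 2), target_ids, ignore_index=pad_id, reduction="none"
    )                                             # (batch, seq_len)
    mask = (target_ids != pad_id).float()
    per_example = (ce * mask).sum(dim=1) / mask.sum(dim=1).clamp(min=1)
    # Examples the evaluator judges as good answers contribute more to the update.
    return (gava_scores * per_example).mean()

# Toy usage with random tensors.
logits = torch.randn(2, 5, 100)
targets = torch.randint(1, 100, (2, 5))
scores = torch.tensor([0.9, 0.2])
print(weighted_generator_loss(logits, targets, scores))
```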
We identify two crucial limitations in the evaluation of the recent parallel-integrated method Parallel Context Windows (PCW), which extends the maximum context lengths of language models, e.g., 2048 for LLaMA, by harnessing window-wise attention and positional embedding techniques. We first show that a simple yet strong baseline, the weighted sum ensemble, is missing for in-context few-shot classification. Moreover, on more challenging Chain-of-Thought (CoT) reasoning (e.g., HotpotQA), PCW presents unexpected deterioration in the form of question miscomprehension and false inference. Based on our findings, we suggest that the existing PCW design may not guarantee sufficient improvement and practicality for handling lengthy documents in real-world applications. More community effort should be devoted to enabling language models' long-context understanding ability.
https://arxiv.org/abs/2305.15262
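A minimal sketch of the weighted sum ensemble baseline referred to above for in-context few-shot classification: the query is scored under each demonstration window separately, and the per-window class probabilities are combined with a weighted sum. The uniform default weights are an assumption.

```python
import numpy as np

def weighted_sum_ensemble(per_window_probs, weights=None):
    """
    per_window_probs: (n_windows, n_classes) class probabilities, one row per
                      demonstration window the query was scored with.
    weights:          optional (n_windows,) weights; uniform if omitted.
    """
    probs = np.asarray(per_window_probs, dtype=float)
    if weights is None:
        weights = np.full(probs.shape[0], 1.0 / probs.shape[0])
    combined = weights @ probs            # weighted sum over windows
    return int(np.argmax(combined)), combined

# Toy example: three demonstration windows, two classes.
pred, combined = weighted_sum_ensemble([[0.6, 0.4], [0.3, 0.7], [0.55, 0.45]])
print(pred, combined)
```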
In this work, we analyse the role of output vocabulary for text-to-text (T2T) models on the task of SPARQL semantic parsing. We perform experiments within the context of knowledge graph question answering (KGQA), where the task is to convert questions in natural language to the SPARQL query language. We observe that the query vocabulary is distinct from human vocabulary. Language Models (LMs) are predominantly trained for human language tasks, and hence, if the query vocabulary is replaced with a vocabulary more attuned to the LM tokenizer, the performance of models may improve. We carry out carefully selected vocabulary substitutions on the queries and find absolute gains in the range of 17% on the GrailQA dataset.
https://arxiv.org/abs/2305.15108
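A minimal sketch of the vocabulary substitution idea: SPARQL-specific tokens are mapped to strings better attuned to the T2T model's tokenizer before fine-tuning and mapped back after generation. The substitution table below is illustrative only, not the set of substitutions selected in the paper.

```python
SUBSTITUTIONS = {
    "SELECT": "select",
    "DISTINCT": "distinct",
    "WHERE": "where",
    "{": " [ ",
    "}": " ] ",
    "?x": " var_x ",
    "?y": " var_y ",
}

def to_model_vocab(query: str) -> str:
    """Rewrite a SPARQL query into a tokenizer-friendlier surface form."""
    for src, dst in SUBSTITUTIONS.items():
        query = query.replace(src, dst)
    return " ".join(query.split())

def from_model_vocab(text: str) -> str:
    """Invert the substitution on the model's generated output."""
    for src, dst in SUBSTITUTIONS.items():
        text = text.replace(dst.strip(), src)
    return " ".join(text.split())

query = "SELECT DISTINCT ?x WHERE { ?x wdt:P31 wd:Q5 }"
encoded = to_model_vocab(query)
print(encoded)
print(from_model_vocab(encoded))
```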
Advances in Large Language Models (LLMs) have inspired a surge of research exploring their expansion into the visual domain. While recent models exhibit promise in generating abstract captions for images and conducting natural conversations, their performance on text-rich images leaves room for improvement. In this paper, we propose the Contrastive Reading Model (Cream), a novel neural architecture designed to enhance the language-image understanding capability of LLMs by capturing intricate details typically overlooked by existing methods. Cream integrates vision and auxiliary encoders, complemented by a contrastive feature alignment technique, resulting in a more effective understanding of textual information within document images. Our approach, thus, seeks to bridge the gap between vision and language understanding, paving the way for more sophisticated Document Intelligence Assistants. Rigorous evaluations across diverse tasks, such as visual question answering on document images, demonstrate the efficacy of Cream as a state-of-the-art model in the field of visual document understanding. We provide our codebase and newly-generated datasets at this https URL
https://arxiv.org/abs/2305.15080
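A hedged sketch of a contrastive feature-alignment objective of the kind described for Cream, where features from the vision encoder and an auxiliary encoder of the same document image are pulled together and mismatched pairs are pushed apart. This InfoNCE-style formulation and the temperature value are assumptions, not necessarily the paper's exact loss.

```python
import torch
import torch.nn.functional as F

def contrastive_alignment_loss(vision_feats, aux_feats, temperature=0.07):
    """vision_feats, aux_feats: (batch, dim); row i of each comes from the same image."""
    v = F.normalize(vision_feats, dim=-1)
    a = F.normalize(aux_feats, dim=-1)
    logits = v @ a.t() / temperature              # (batch, batch) similarity matrix
    targets = torch.arange(v.size(0))             # matching pairs lie on the diagonal
    return 0.5 * (F.cross_entropy(logits, targets) + F.cross_entropy(logits.t(), targets))

print(contrastive_alignment_loss(torch.randn(4, 32), torch.randn(4, 32)))
```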
Large language models (LLMs) have demonstrated remarkable language proficiency, but they face challenges when solving interactive tasks independently. Existing methods either rely on gradient access, which is often inaccessible in state-of-the-art LLMs like GPT-4, or necessitate diverse and high-quality in-context demonstrations. In this study, we propose LLM-PO, a novel approach that enables LLMs to address these tasks without gradient access or extensive demonstrations. The key idea is to maintain a text-based plan and ask LLMs to reflect on pros and cons of the current plan based on experience collected with it, to update the plan, and to collect more experiences with the new plan. Experiments on HotpotQA demonstrate that LLM-PO achieves higher or on par success rates compared to in-context learning (ICL) baselines while requiring less inference cost.
https://arxiv.org/abs/2305.15064
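A minimal sketch of the LLM-PO loop described above: keep a text-based plan, collect experience by acting with it, ask the LLM to reflect on the plan's pros and cons, and update the plan. call_llm and run_episode are hypothetical stubs standing in for an LLM API and the interactive environment; only the gradient-free control flow is illustrated.

```python
def call_llm(prompt: str) -> str:
    # Stub: replace with a real LLM API call (e.g., a chat-completion request).
    return "(LLM output for: " + prompt.splitlines()[0] + ")"

def run_episode(plan: str) -> str:
    # Stub: replace with execution of the plan in the interactive task environment.
    return "(trajectory collected while following the plan)"

def llm_po(initial_plan: str, n_rounds: int = 5) -> str:
    plan = initial_plan
    for _ in range(n_rounds):
        experience = run_episode(plan)            # collect experience with the current plan
        reflection = call_llm(
            f"Current plan:\n{plan}\n\nExperience:\n{experience}\n\n"
            "List the pros and cons of this plan based on the experience."
        )
        plan = call_llm(                          # no gradients, no demonstrations needed
            f"Current plan:\n{plan}\n\nReflection:\n{reflection}\n\nWrite an improved plan."
        )
    return plan

print(llm_po("Search for the first entity, then look up its related article."))
```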
Explainable question answering (XQA) aims to answer a given question and provide an explanation why the answer is selected. Existing XQA methods focus on reasoning on a single knowledge source, e.g., structured knowledge bases, unstructured corpora, etc. However, integrating information from heterogeneous knowledge sources is essential to answer complex questions. In this paper, we propose to leverage question decomposing for heterogeneous knowledge integration, by breaking down a complex question into simpler ones, and selecting the appropriate knowledge source for each sub-question. To facilitate reasoning, we propose a novel two-stage XQA framework, Reasoning over Hierarchical Question Decomposition Tree (RoHT). First, we build the Hierarchical Question Decomposition Tree (HQDT) to understand the semantics of a complex question; then, we conduct probabilistic reasoning over HQDT from root to leaves recursively, to aggregate heterogeneous knowledge at different tree levels and search for a best solution considering the decomposing and answering probabilities. The experiments on complex QA datasets KQA Pro and Musique show that our framework outperforms SOTA methods significantly, demonstrating the effectiveness of leveraging question decomposing for knowledge integration and our RoHT framework.
https://arxiv.org/abs/2305.15056
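A heavily simplified sketch of recursive reasoning over a hierarchical question decomposition tree in the spirit of RoHT: each node can be answered directly from a knowledge source or by recursively solving its children, and the highest-probability solution is kept. The node structure, the stubbed knowledge-source call, and the scoring rule are illustrative assumptions.

```python
from dataclasses import dataclass, field

@dataclass
class Node:
    question: str
    children: list = field(default_factory=list)
    decompose_prob: float = 1.0     # confidence that this decomposition is valid

def answer_atomic(question: str):
    # Stub for querying heterogeneous sources (KB, corpus, ...); returns (answer, probability).
    return f"answer({question})", 0.8

def solve(node: Node):
    candidates = [answer_atomic(node.question)]            # try answering the node directly
    if node.children:
        child_answers = [solve(c) for c in node.children]
        prob = node.decompose_prob                         # decomposition confidence ...
        for _, p in child_answers:
            prob *= p                                      # ... times child probabilities
        candidates.append((child_answers[-1][0], prob))    # last child yields the composed answer
    return max(candidates, key=lambda x: x[1])             # keep the best-scoring solution

tree = Node("Who directed the film that won Best Picture in 1998?",
            children=[Node("Which film won Best Picture in 1998?"),
                      Node("Who directed that film?")],
            decompose_prob=0.9)
print(solve(tree))
```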
Text to image generation methods (T2I) are widely popular in generating art and other creative artifacts. While visual hallucinations can be a positive factor in scenarios where creativity is appreciated, such artifacts are poorly suited for cases where the generated image needs to be grounded in complex natural language without explicit visual elements. In this paper, we propose to strengthen the consistency property of T2I methods in the presence of natural complex language, which often breaks the limits of T2I methods by including non-visual information, and textual elements that require knowledge for accurate generation. To address these phenomena, we propose a Natural Language to Verified Image generation approach (NL2VI) that converts a natural prompt into a visual prompt, which is more suitable for image generation. A T2I model then generates an image for the visual prompt, which is then verified with VQA algorithms. Experimentally, aligning natural prompts with image generation can improve the consistency of the generated images by up to 11% over the state of the art. Moreover, improvements can generalize to challenging domains like cooking and DIY tasks, where the correctness of the generated image is crucial to illustrate actions.
https://arxiv.org/abs/2305.15026
Recently, growing interest has been aroused in extending the multimodal capability of large language models (LLMs), e.g., vision-language (VL) learning, which is regarded as the next milestone of artificial general intelligence. However, existing solutions are prohibitively expensive, as they not only need to optimize excessive parameters but also require another large-scale pre-training before VL instruction tuning. In this paper, we propose a novel and affordable solution for the effective VL adaption of LLMs, called Mixture-of-Modality Adaptation (MMA). Instead of using large neural networks to connect the image encoder and LLM, MMA adopts lightweight modules, i.e., adapters, to bridge the gap between LLMs and VL tasks, which also enables the joint optimization of the image and language models. Meanwhile, MMA is also equipped with a routing algorithm to help LLMs achieve an automatic shift between single- and multi-modal instructions without compromising their ability of natural language understanding. To validate MMA, we apply it to a recent LLM called LLaMA and term the resulting large vision-language instructed model LaVIN. To validate MMA and LaVIN, we conduct extensive experiments under two setups, namely multimodal science question answering and multimodal dialogue. The experimental results not only demonstrate the competitive performance and superior training efficiency of LaVIN compared to existing multimodal LLMs, but also confirm its great potential as a general-purpose chatbot. More importantly, the actual training cost of LaVIN is extremely low, e.g., only 1.4 training hours with 3.8M trainable parameters, greatly confirming the effectiveness of MMA. Our project is released at this https URL.
https://arxiv.org/abs/2305.15023
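A minimal sketch of the kind of lightweight adapter MMA inserts so that only a few million parameters are trained on top of frozen backbones. The bottleneck layout and the learned gate are common adapter choices used here as assumptions; the paper's exact adapter design and modality-routing algorithm may differ.

```python
import torch
import torch.nn as nn

class BottleneckAdapter(nn.Module):
    def __init__(self, dim: int, bottleneck: int = 8):
        super().__init__()
        self.down = nn.Linear(dim, bottleneck)
        self.up = nn.Linear(bottleneck, dim)
        self.act = nn.GELU()
        self.gate = nn.Parameter(torch.zeros(1))   # starts as identity; learns how much to adapt

    def forward(self, hidden_states: torch.Tensor) -> torch.Tensor:
        return hidden_states + self.gate * self.up(self.act(self.down(hidden_states)))

adapter = BottleneckAdapter(dim=4096)
x = torch.randn(2, 16, 4096)                       # (batch, tokens, hidden) from a frozen layer
print(adapter(x).shape)
print(sum(p.numel() for p in adapter.parameters()), "trainable parameters")
```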
Embodied AI is a crucial frontier in robotics, capable of planning and executing action sequences for robots to accomplish long-horizon tasks in physical environments. In this work, we introduce EmbodiedGPT, an end-to-end multi-modal foundation model for embodied AI, empowering embodied agents with multi-modal understanding and execution capabilities. To achieve this, we have made the following efforts: (i) We craft a large-scale embodied planning dataset, termed EgoCOT. The dataset consists of carefully selected videos from the Ego4D dataset, along with corresponding high-quality language instructions. Specifically, we generate a sequence of sub-goals with the "Chain of Thoughts" mode for effective embodied planning. (ii) We introduce an efficient training approach to EmbodiedGPT for high-quality plan generation, by adapting a 7B large language model (LLM) to the EgoCOT dataset via prefix tuning. (iii) We introduce a paradigm for extracting task-related features from LLM-generated planning queries to form a closed loop between high-level planning and low-level control. Extensive experiments show the effectiveness of EmbodiedGPT on embodied tasks, including embodied planning, embodied control, visual captioning, and visual question answering. Notably, EmbodiedGPT significantly enhances the success rate of the embodied control task by extracting more effective features. It has achieved a remarkable 1.6 times increase in success rate on the Franka Kitchen benchmark and a 1.3 times increase on the Meta-World benchmark, compared to the BLIP-2 baseline fine-tuned with the Ego4D dataset.
https://arxiv.org/abs/2305.15021
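A minimal sketch of prefix tuning, the mechanism named above for adapting the frozen 7B LLM to EgoCOT: a small set of trainable prefix embeddings is prepended to the input while the backbone stays frozen. The tiny Transformer encoder and the dimensions are illustrative stand-ins, not the actual model.

```python
import torch
import torch.nn as nn

class PrefixTunedEncoder(nn.Module):
    def __init__(self, hidden_dim=64, prefix_len=10):
        super().__init__()
        self.prefix = nn.Parameter(torch.randn(prefix_len, hidden_dim) * 0.02)  # trainable
        layer = nn.TransformerEncoderLayer(d_model=hidden_dim, nhead=4, batch_first=True)
        self.backbone = nn.TransformerEncoder(layer, num_layers=2)
        for p in self.backbone.parameters():            # frozen pretrained backbone
            p.requires_grad = False

    def forward(self, token_embeddings):                # (batch, seq, hidden)
        batch = token_embeddings.size(0)
        prefix = self.prefix.unsqueeze(0).expand(batch, -1, -1)
        return self.backbone(torch.cat([prefix, token_embeddings], dim=1))

model = PrefixTunedEncoder()
out = model(torch.randn(2, 12, 64))
print(out.shape)                                        # (2, 22, 64): prefix tokens + input tokens
print(sum(p.numel() for p in model.parameters() if p.requires_grad), "trainable parameters")
```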
This report overviews our ongoing work in enriching chain-of-thoughts datasets requiring arithmetical reasoning with the integration of non-parametric components, such as a calculator. We conduct an analysis of prominent relevant datasets such as GSM8K, Ape210K, AQuA-RAT, and MathQA and propose a machine-processable HTML-like format specifically tailored for working with semi-structured chains. By converting the datasets into this unified format, we enable the effective integration of large language models and symbolic systems, empowering them to tackle arithmetical reasoning tasks more efficiently.
https://arxiv.org/abs/2305.15017
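An illustrative example of what a machine-processable, HTML-like chain with an inline calculator call could look like, together with a tiny resolver that re-executes the call. The tag names used here (gadget, output, result) are assumptions chosen for illustration, not necessarily the schema proposed in the report.

```python
import re

chain = (
    "Each box holds 12 pencils and there are 7 boxes. "
    '<gadget id="calculator">12 * 7</gadget> <output>84</output> '
    "So there are 84 pencils in total. <result>84</result>"
)

def resolve_gadget_calls(text: str) -> str:
    """Re-execute each calculator call and refill the <output> tag with the computed value."""
    def run(match: re.Match) -> str:
        expression = match.group(1)
        value = eval(expression, {"__builtins__": {}})   # arithmetic only in this toy example
        return f'<gadget id="calculator">{expression}</gadget> <output>{value}</output>'
    return re.sub(r'<gadget id="calculator">(.*?)</gadget>\s*<output>.*?</output>', run, text)

print(resolve_gadget_calls(chain))
```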
Metrics for Visual Grounding (VG) in Visual Question Answering (VQA) systems primarily aim to measure a system's reliance on relevant parts of the image when inferring an answer to the given question. Lack of VG has been a common problem among state-of-the-art VQA systems and can manifest in over-reliance on irrelevant image parts or a disregard for the visual modality entirely. Although the inference capabilities of VQA models are often illustrated with a few qualitative examples, most systems are not quantitatively assessed for their VG properties. We believe an easily calculated criterion for meaningfully measuring a system's VG can help remedy this shortcoming, as well as add another valuable dimension to model evaluations and analysis. To this end, we propose a new VG metric that captures whether a model a) identifies question-relevant objects in the scene, and b) actually relies on the information contained in the relevant objects when producing its answer, i.e., whether its visual grounding is both "faithful" and "plausible". Our metric, called "Faithful and Plausible Visual Grounding" (FPVG), is straightforward to determine for most VQA model designs. We give a detailed description of FPVG and evaluate several reference systems spanning various VQA architectures. Code to support the metric calculations on the GQA data set is available on GitHub.
https://arxiv.org/abs/2305.15015
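A simplified sketch of the intuition behind FPVG: a question counts as faithfully and plausibly grounded if the model answers correctly when the question-relevant objects are available and its answer degrades once those objects are removed. The exact conditions and aggregation used in the paper may differ; this is an illustrative reduction.

```python
def fpvg_score(records):
    """
    records: iterable of dicts with boolean fields
        'correct_relevant'   - answer correct when only question-relevant objects are kept
        'correct_irrelevant' - answer correct when the relevant objects are removed
    """
    grounded = [
        r for r in records if r["correct_relevant"] and not r["correct_irrelevant"]
    ]
    return len(grounded) / max(len(records), 1)

samples = [
    {"correct_relevant": True,  "correct_irrelevant": False},   # relies on the right evidence
    {"correct_relevant": True,  "correct_irrelevant": True},    # answer unchanged: not grounded
    {"correct_relevant": False, "correct_irrelevant": False},
]
print(fpvg_score(samples))
```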
LLM-powered chatbots are becoming widely adopted in applications such as healthcare, personal assistants, industry hiring decisions, etc. In many of these cases, chatbots are fed sensitive, personal information in their prompts, as samples for in-context learning, retrieved records from a database, or as part of the conversation. The information provided in the prompt could directly appear in the output, which might have privacy ramifications if there is sensitive information there. As such, in this paper, we aim to understand the input copying and regurgitation capabilities of these models during inference and how they can be directly instructed to limit this copying by complying with regulations such as HIPAA and GDPR, based on their internal knowledge of them. More specifically, we find that when ChatGPT is prompted to summarize the cover letters of 100 candidates, it retains personally identifiable information (PII) verbatim in 57.4% of cases, and we find this retention to be non-uniform between different subgroups of people, based on attributes such as gender identity. We then probe ChatGPT's perception of privacy-related policies and privatization mechanisms by directly instructing it to provide compliant outputs and observe a significant omission of PII from the output.
https://arxiv.org/abs/2305.15008
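A minimal sketch of the verbatim-retention measurement described above: given the PII strings present in each input and the model's output, count how often any of that PII reappears verbatim. The toy data and the simple substring check are assumptions; the paper's pipeline is more involved.

```python
def pii_retention_rate(examples):
    """examples: list of (pii_strings, model_output) pairs."""
    retained = sum(
        1 for pii, output in examples if any(item in output for item in pii)
    )
    return retained / max(len(examples), 1)

examples = [
    (["Jane Doe", "jane@example.com"], "Jane Doe has five years of experience in QA."),
    (["John Roe", "+1-555-0100"], "The candidate has strong Python skills."),
]
print(pii_retention_rate(examples))   # 0.5 in this toy case
```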
Chain-of-Thought prompting (CoT) enables large-scale language models to solve complex reasoning problems by decomposing the problem and tackling it step-by-step. However, Chain-of-Thought is a greedy thinking process that requires the language model to come up with a starting point and generate the next step solely based on previous steps. This thinking process is different from how humans approach a complex problem, e.g., we proactively raise sub-problems related to the original problem and recursively answer them. In this work, we propose Socratic Questioning, a divide-and-conquer algorithm that simulates the self-questioning and recursive thinking process. Socratic Questioning is driven by a Self-Questioning module that employs a large-scale language model to propose sub-problems related to the original problem as intermediate steps; Socratic Questioning then recursively backtracks and answers the sub-problems until it reaches the original problem. We apply our proposed algorithm to the visual question-answering task as a case study and, by evaluating it on three public benchmark datasets, we observe a significant performance improvement over all baselines on (almost) all datasets. In addition, the qualitative analysis clearly demonstrates that the intermediate thinking steps elicited by Socratic Questioning are similar to a human's recursive thinking process for a complex reasoning problem.
https://arxiv.org/abs/2305.14999
We train a language model (LM) to robustly answer multistep questions by generating and answering sub-questions. We propose Chain-of-Questions, a framework that trains a model to generate sub-questions and sub-answers one at a time by leveraging human annotated question decomposition meaning representation (QDMR). The key technical challenge is that QDMR only contains sub-questions but not answers to those sub-questions, so we treat sub-answers as latent variables and optimize them using a novel dynamic mixture of Hard-EM and MAPO. Chain-of-Questions greatly outperforms strong neuro-symbolic methods by 9.0 F1 on DROP contrast set, and outperforms GPT-3.5 by 24.3 F1 on HOTPOTQA adversarial set, thus demonstrating the effectiveness and robustness of our framework.
https://arxiv.org/abs/2305.14901
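A simplified sketch of the Hard-EM treatment of latent sub-answers: among sampled sub-answer sequences, the highest-likelihood one that still yields the gold final answer is selected as the training target. The candidate format and scores are toy placeholders, and the dynamic mixture with MAPO is not shown.

```python
def hard_em_select(candidates, gold_answer):
    """
    candidates: list of (sub_answers, final_answer, log_likelihood) tuples sampled from the model.
    Returns the sub-answer sequence to train on, or None if no candidate reaches the gold answer.
    """
    consistent = [c for c in candidates if c[1] == gold_answer]
    if not consistent:
        return None
    best = max(consistent, key=lambda c: c[2])   # hard assignment: keep only the best latent path
    return best[0]

candidates = [
    (["1969", "Neil Armstrong"], "Neil Armstrong", -3.2),
    (["1972", "Eugene Cernan"], "Eugene Cernan", -2.9),
    (["1969", "Buzz Aldrin"], "Neil Armstrong", -4.5),
]
print(hard_em_select(candidates, "Neil Armstrong"))
```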
Model interpretability has long been a hard problem for the AI community, especially in the multimodal setting, where vision and language need to be aligned and reasoned over at the same time. In this paper, we specifically focus on the problem of Visual Question Answering (VQA). While previous research tries to probe into the network structures of black-box multimodal models, we propose to tackle the problem from a different angle -- to treat interpretability as an explicit additional goal. Given an image and question, we argue that an interpretable VQA model should be able to tell what conclusions it can get from which part of the image, and show how each statement helps to arrive at an answer. We introduce InterVQA: Interpretable-by-design VQA, where we design an explicit intermediate dynamic reasoning structure for VQA problems and enforce symbolic reasoning that uses only this structure for final answer prediction. InterVQA produces high-quality explicit intermediate reasoning steps, while maintaining end-task performance similar to the state of the art (SOTA).
https://arxiv.org/abs/2305.14882
The task of zero-shot commonsense question answering evaluates models on their capacity to reason about general scenarios beyond those presented in specific datasets. Existing approaches for tackling this task leverage external knowledge from CommonSense Knowledge Bases (CSKBs) by pretraining the model on synthetic QA pairs constructed from CSKBs. In these approaches, negative examples (distractors) are formulated by randomly sampling from CSKBs using fairly primitive keyword constraints. However, two bottlenecks limit these approaches: the inherent incompleteness of CSKBs limits the semantic coverage of synthetic QA pairs, and the lack of human annotations makes the sampled negative examples potentially uninformative and contradictory. To tackle these limitations above, we propose Conceptualization-Augmented Reasoner (CAR), a zero-shot commonsense question-answering framework that fully leverages the power of conceptualization. Specifically, CAR abstracts a commonsense knowledge triple to many higher-level instances, which increases the coverage of CSKB and expands the ground-truth answer space, reducing the likelihood of selecting false-negative distractors. Extensive experiments demonstrate that CAR more robustly generalizes to answering questions about zero-shot commonsense scenarios than existing methods, including large language models, such as GPT3.5 and ChatGPT. Our codes, data, and model checkpoints are available at this https URL.
https://arxiv.org/abs/2305.14869
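A minimal sketch of the conceptualization step in CAR: the head of a commonsense triple is abstracted to higher-level concepts so that one triple covers many instances and the ground-truth answer space widens when sampling distractors. The tiny concept inventory below is a toy stand-in for the conceptualized CSKB used in the paper.

```python
CONCEPTS = {
    "piano": ["musical instrument", "item"],
    "guitar": ["musical instrument", "item"],
    "violin": ["musical instrument", "item"],
}

def conceptualize(triple):
    """Return the original triple plus higher-level variants of its head."""
    head, relation, tail = triple
    variants = [triple]
    for word, concepts in CONCEPTS.items():
        if word in head:
            variants += [(head.replace(word, c), relation, tail) for c in concepts]
    return variants

for t in conceptualize(("PersonX buys a piano", "xWant", "to play music")):
    print(t)
```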
Contemporary face recognition (FR) models achieve near-ideal recognition performance in constrained settings, yet do not fully translate the performance to unconstrained (real-world) scenarios. To help improve the performance and stability of FR systems in such unconstrained settings, face image quality assessment (FIQA) techniques try to infer sample-quality information from the input face images that can aid with the recognition process. While existing FIQA techniques are able to efficiently capture the differences between high- and low-quality images, they typically cannot fully distinguish between images of similar quality, leading to lower performance in many scenarios. To address this issue, we present in this paper a supervised quality-label optimization approach, aimed at improving the performance of existing FIQA techniques. The developed optimization procedure infuses additional information (computed with a selected FR model) into the initial quality scores generated with a given FIQA technique to produce better estimates of the "actual" image quality. We evaluate the proposed approach in comprehensive experiments with six state-of-the-art FIQA approaches (CR-FIQA, FaceQAN, SER-FIQ, PCNet, MagFace, SDD-FIQA) on five commonly used benchmarks (LFW, CFPFP, CPLFW, CALFW, XQLFW) using three targeted FR models (ArcFace, ElasticFace, CurricularFace) with highly encouraging results.
https://arxiv.org/abs/2305.14856
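A simplified sketch of the quality-label optimization idea above: initial FIQA scores are refined by infusing information computed with a selected FR model (here, a per-image genuine-comparison similarity), producing labels closer to the "actual" recognition utility. The min-max normalization and the mixing coefficient are illustrative assumptions, not the paper's exact procedure.

```python
import numpy as np

def optimize_quality_labels(fiqa_scores, fr_genuine_similarity, alpha=0.5):
    """Both inputs: (n_images,) arrays; output: refined quality labels in [0, 1]."""
    def min_max(x):
        x = np.asarray(x, dtype=float)
        return (x - x.min()) / (x.max() - x.min() + 1e-8)
    # Blend the FIQA estimate with the FR-model-derived signal.
    return alpha * min_max(fiqa_scores) + (1.0 - alpha) * min_max(fr_genuine_similarity)

fiqa = [0.62, 0.55, 0.91, 0.40]
fr_sim = [0.71, 0.30, 0.88, 0.52]
print(optimize_quality_labels(fiqa, fr_sim))
```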