Table Question Answering (TQA) aims at composing an answer to a question based on tabular data. While prior research has shown that TQA models lack robustness, the underlying cause and nature of this issue remain largely unclear, posing a significant obstacle to the development of robust TQA systems. In this paper, we formalize three major desiderata for a fine-grained evaluation of the robustness of TQA systems. They should (i) answer questions regardless of alterations in table structure, (ii) base their responses on the content of relevant cells rather than on biases, and (iii) demonstrate robust numerical reasoning capabilities. To investigate these aspects, we create a novel TQA evaluation benchmark in English. Our extensive experimental analysis reveals that none of the examined state-of-the-art TQA systems consistently excels in all three aspects. Our benchmark is a crucial instrument for monitoring the behavior of TQA systems and paves the way for the development of robust TQA systems. We release our benchmark publicly.
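As a minimal sketch of the first desideratum, the check below perturbs a table's structure (row and column order) and measures how often a TQA model's answer stays unchanged. `answer_fn` is a hypothetical stand-in for any TQA system; the perturbations and the invariance metric are illustrative, not the benchmark's exact construction.

```python
import random
import pandas as pd

def structure_variants(table: pd.DataFrame, seed: int = 0):
    """Yield content-preserving structural perturbations of a table."""
    rng = random.Random(seed)
    rows, cols = list(table.index), list(table.columns)
    rng.shuffle(rows)
    yield table.loc[rows]              # rows reordered
    rng.shuffle(cols)
    yield table[cols]                  # columns reordered
    yield table.loc[rows, cols]        # both reordered

def robustness_rate(answer_fn, table: pd.DataFrame, question: str) -> float:
    """Fraction of structural variants on which the answer is unchanged."""
    reference = answer_fn(table, question)
    variants = list(structure_variants(table))
    return sum(answer_fn(v, question) == reference for v in variants) / len(variants)
```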
https://arxiv.org/abs/2404.18585
Excellence in a wide variety of medical applications poses considerable challenges for AI, requiring advanced reasoning, access to up-to-date medical knowledge, and understanding of complex multimodal data. Gemini models, with strong general capabilities in multimodal and long-context reasoning, offer exciting possibilities in medicine. Building on these core strengths of Gemini, we introduce Med-Gemini, a family of highly capable multimodal models that are specialized in medicine with the ability to seamlessly use web search, and that can be efficiently tailored to novel modalities using custom encoders. We evaluate Med-Gemini on 14 medical benchmarks, establishing new state-of-the-art (SoTA) performance on 10 of them and surpassing the GPT-4 model family on every benchmark where a direct comparison is viable, often by a wide margin. On the popular MedQA (USMLE) benchmark, our best-performing Med-Gemini model achieves SoTA performance of 91.1% accuracy, using a novel uncertainty-guided search strategy. On 7 multimodal benchmarks including NEJM Image Challenges and MMMU (health & medicine), Med-Gemini improves over GPT-4V by an average relative margin of 44.5%. We demonstrate the effectiveness of Med-Gemini's long-context capabilities through SoTA performance on a needle-in-a-haystack retrieval task from long de-identified health records and on medical video question answering, surpassing prior bespoke methods using only in-context learning. Finally, Med-Gemini's performance suggests real-world utility by surpassing human experts on tasks such as medical text summarization, alongside demonstrations of promising potential for multimodal medical dialogue, medical research, and education. Taken together, our results offer compelling evidence for Med-Gemini's potential, although further rigorous evaluation will be crucial before real-world deployment in this safety-critical domain.
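The uncertainty-guided search strategy is only named in the abstract; a plausible minimal reading is sketched below: sample several answers, measure their disagreement, and fall back to web-search-augmented re-prompting only when the model is uncertain. `llm` and `web_search` are hypothetical stand-ins, and the entropy threshold is illustrative.

```python
import math
from collections import Counter

def vote_entropy(votes: Counter) -> float:
    total = sum(votes.values())
    return -sum(c / total * math.log(c / total) for c in votes.values())

def uncertainty_guided_answer(llm, web_search, question: str,
                              k: int = 5, threshold: float = 0.5) -> str:
    samples = Counter(llm(question, temperature=0.7) for _ in range(k))
    if vote_entropy(samples) <= threshold:
        return samples.most_common(1)[0][0]        # confident: keep majority vote
    evidence = web_search(question)                # uncertain: fetch evidence
    reprompt = f"Evidence:\n{evidence}\n\nQuestion: {question}"
    resamples = Counter(llm(reprompt, temperature=0.7) for _ in range(k))
    return resamples.most_common(1)[0][0]
```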
https://arxiv.org/abs/2404.18416
In recent years, image generation technology has rapidly advanced, resulting in the creation of a vast array of AI-generated images (AIGIs). However, the quality of these AIGIs is highly inconsistent, with low-quality AIGIs severely impairing the visual experience of users. Due to the widespread application of AIGIs, AI-generated image quality assessment (AIGIQA), which aims to evaluate the quality of AIGIs from the perspective of human perception, has garnered increasing interest among scholars. Nonetheless, current research has not yet fully explored this field. We have observed that existing databases are limited to images generated from single scenario settings. Databases such as AGIQA-1K, AGIQA-3K, and AIGCIQA2023, for example, only include images generated by text-to-image generative models. This oversight highlights a critical gap in the current research landscape, underscoring the need for dedicated databases catering to image-to-image scenarios, as well as more comprehensive databases that encompass a broader range of AI-generated image scenarios. Addressing these issues, we have established a large-scale perceptual quality assessment database for both text-to-image and image-to-image AIGIs, named PKU-AIGIQA-4K. We then conduct a well-organized subjective experiment to collect quality labels for AIGIs and perform a comprehensive analysis of the PKU-AIGIQA-4K database. Based on how image prompts are used during the training process, we propose three image quality assessment (IQA) methods built on pre-trained models: a no-reference method NR-AIGCIQA, a full-reference method FR-AIGCIQA, and a partial-reference method PR-AIGCIQA. Finally, leveraging the PKU-AIGIQA-4K database, we conduct extensive benchmark experiments and compare the performance of the proposed methods and current IQA methods.
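A schematic way to see how the three methods differ is in what they condition on: NR-AIGCIQA scores the generated image alone, while FR-AIGCIQA and PR-AIGCIQA additionally fuse features of the (full or partial) image prompt. The sketch below, with assumed feature dimensions and simple concatenation fusion, is illustrative rather than the paper's exact architecture.

```python
import torch
import torch.nn as nn

class AIGIQAHead(nn.Module):
    """Quality regressor over backbone features; mode in {"NR", "FR", "PR"}."""
    def __init__(self, feat_dim: int = 768, mode: str = "NR"):
        super().__init__()
        self.mode = mode
        in_dim = feat_dim if mode == "NR" else 2 * feat_dim
        self.head = nn.Sequential(nn.Linear(in_dim, 256), nn.ReLU(), nn.Linear(256, 1))

    def forward(self, gen_feat, ref_feat=None):
        if self.mode == "NR":
            x = gen_feat                                 # no reference available
        else:
            x = torch.cat([gen_feat, ref_feat], dim=-1)  # fuse (partial) reference
        return self.head(x).squeeze(-1)                  # scalar quality score
```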
https://arxiv.org/abs/2404.18409
Optical Character Recognition - Visual Question Answering (OCR-VQA) is the task of answering questions about text information contained in images; it has seen significant development for English in recent years. However, there are limited studies of this task in low-resource languages such as Vietnamese. To this end, we introduce a novel dataset, ViOCRVQA (Vietnamese Optical Character Recognition - Visual Question Answering dataset), consisting of 28,000+ images and 120,000+ question-answer pairs. In this dataset, all images contain text, and the questions concern information relevant to the text in the images. We adapt ideas from state-of-the-art methods proposed for English to conduct experiments on our dataset, revealing the challenges and difficulties inherent in a Vietnamese dataset. Furthermore, we introduce a novel approach, called VisionReader, which achieves 0.4116 EM and 0.6990 F1-score on the test set. Through the results, we found that the OCR system plays a very important role in VQA models on the ViOCRVQA dataset, and that the objects in the image also help improve model performance. We open access to our dataset at link (this https URL) for further research on the OCR-VQA task in Vietnamese.
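For reference, the reported EM and F1 numbers are the standard extractive-QA metrics; a minimal implementation is sketched below. The exact answer normalization used for Vietnamese (punctuation handling, diacritics) is an assumption here.

```python
import re
from collections import Counter

def normalize(text: str) -> str:
    return re.sub(r"\s+", " ", text.lower().strip())

def exact_match(pred: str, gold: str) -> float:
    return float(normalize(pred) == normalize(gold))

def f1(pred: str, gold: str) -> float:
    p, g = normalize(pred).split(), normalize(gold).split()
    overlap = sum((Counter(p) & Counter(g)).values())  # shared token count
    if overlap == 0:
        return 0.0
    precision, recall = overlap / len(p), overlap / len(g)
    return 2 * precision * recall / (precision + recall)
```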
https://arxiv.org/abs/2404.18397
The proliferation of social media has led to information overload and increased interest in opinion mining. We propose "Question-Answering Network Analysis" (QANA), a novel opinion mining framework that utilizes Large Language Models (LLMs) to generate questions from users' comments, constructs a bipartite graph based on the comments' answerability to the questions, and applies centrality measures to examine the importance of opinions. We investigate the impact of question generation styles, LLM selection, and the choice of embedding model on the quality of the constructed QA networks by comparing them with annotated Key Point Analysis datasets. QANA achieves performance comparable to previous state-of-the-art supervised models on the Key Point Matching task in a zero-shot manner, while also reducing the computational cost from quadratic to linear. For Key Point Generation, questions with high PageRank or degree centrality align well with manually annotated key points. Notably, QANA enables analysts to assess the importance of key points from various aspects according to their selection of centrality measure. QANA's primary contribution lies in its flexibility to extract key points from a wide range of perspectives, which enhances the quality and impartiality of opinion mining.
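The core of QANA can be sketched directly: build the bipartite comment-question graph from answerability judgments, then rank question nodes by a centrality measure. `answerability` is a hypothetical stand-in (e.g., an LLM judge or embedding similarity), and the threshold is illustrative.

```python
import networkx as nx

def build_qa_network(comments, questions, answerability, threshold=0.5):
    """Bipartite graph: an edge means the comment can answer the question."""
    g = nx.Graph()
    g.add_nodes_from(comments, side="comment")
    g.add_nodes_from(questions, side="question")
    for c in comments:
        for q in questions:
            if answerability(c, q) >= threshold:
                g.add_edge(c, q)
    return g

def key_points(g, top_k=5):
    """Rank question nodes by PageRank; swapping in degree or another
    centrality gives the different analyst perspectives the paper notes."""
    pr = nx.pagerank(g)
    qs = [n for n, d in g.nodes(data=True) if d["side"] == "question"]
    return sorted(qs, key=pr.get, reverse=True)[:top_k]
```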
https://arxiv.org/abs/2404.18371
Although large multi-modality models (LMMs) have seen extensive exploration and application in various quality assessment studies, their integration into Point Cloud Quality Assessment (PCQA) remains unexplored. Given LMMs' exceptional performance and robustness in low-level vision and quality assessment tasks, this study aims to investigate the feasibility of imparting PCQA knowledge to LMMs through text supervision. To achieve this, we transform quality labels into textual descriptions during the fine-tuning phase, enabling LMMs to derive quality rating logits from 2D projections of point clouds. To compensate for the loss of perception in the 3D domain, structural features are extracted as well. These quality logits and structural features are then combined and regressed into quality scores. Our experimental results affirm the effectiveness of our approach, showcasing a novel integration of LMMs into PCQA that enhances model understanding and assessment accuracy. We hope our contributions can inspire subsequent investigations into the fusion of LMMs with PCQA, fostering advancements in 3D visual quality analysis and beyond.
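The regression step described above can be pictured as follows: the LMM's logits over textual quality levels are combined with structural features from the 3D domain and mapped to a score. The five-level scale, feature sizes, and simple concatenation are assumptions for illustration.

```python
import torch
import torch.nn as nn

class PCQAFusion(nn.Module):
    """Regress a point-cloud quality score from LMM quality-rating logits
    plus structural features extracted in the 3D domain."""
    def __init__(self, n_levels: int = 5, struct_dim: int = 32):
        super().__init__()
        self.reg = nn.Sequential(
            nn.Linear(n_levels + struct_dim, 64), nn.ReLU(), nn.Linear(64, 1))

    def forward(self, quality_logits, struct_feat):
        probs = quality_logits.softmax(dim=-1)  # distribution over quality words
        return self.reg(torch.cat([probs, struct_feat], dim=-1)).squeeze(-1)
```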
https://arxiv.org/abs/2404.18203
Perceptual image quality assessment (IQA) is the task of predicting the visual quality of an image as perceived by a human observer. Current state-of-the-art techniques are based on deep representations trained in a discriminative manner. Such representations may ignore visually important features if they are not predictive of class labels. Recent generative models successfully learn low-dimensional representations using auto-encoding and have been argued to preserve better visual features. Here we leverage existing auto-encoders and propose VAE-QA, a simple and efficient method for predicting image quality in the full-reference setting. We evaluate our approach on four standard benchmarks and find that it significantly improves generalization across datasets, has fewer trainable parameters, a smaller memory footprint, and faster run time.
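A minimal reading of the approach, assuming a frozen pretrained auto-encoder whose encoder returns a flat latent vector: encode both images, compare latents, and regress a quality score. The absolute-difference comparison and head sizes are illustrative, not the paper's exact design.

```python
import torch
import torch.nn as nn

class VAEQASketch(nn.Module):
    def __init__(self, encoder: nn.Module, latent_dim: int = 256):
        super().__init__()
        self.encoder = encoder.eval()
        for p in self.encoder.parameters():
            p.requires_grad = False          # only the small head is trained
        self.head = nn.Sequential(nn.Linear(latent_dim, 64), nn.ReLU(), nn.Linear(64, 1))

    def forward(self, reference, distorted):
        z_ref, z_dist = self.encoder(reference), self.encoder(distorted)
        return self.head((z_ref - z_dist).abs()).squeeze(-1)  # quality score
```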
https://arxiv.org/abs/2404.18178
Accurate representation of medical information is crucial for patient safety, yet artificial intelligence (AI) systems, such as Large Language Models (LLMs), encounter challenges in error-free clinical text interpretation. This paper presents a novel approach submitted to the MEDIQA-CORR 2024 shared task (Ben Abacha et al., 2024a), focusing on the automatic correction of single-word errors in clinical notes. Unlike LLMs that rely on extensive generic data, our method emphasizes extracting contextually relevant information from available clinical text data. Leveraging an ensemble of extractive and abstractive question-answering approaches, we construct a supervised learning framework with domain-specific feature engineering. Our methodology incorporates domain expertise to enhance error-correction accuracy. By integrating this expertise and prioritizing meaningful information extraction, our approach underscores the significance of a human-centric strategy in adapting AI for healthcare.
https://arxiv.org/abs/2404.17999
Machine Reading Comprehension (MRC) poses a significant challenge in the field of Natural Language Processing (NLP). While mainstream MRC methods predominantly leverage extractive strategies using encoder-only models such as BERT, generative approaches face the issue of out-of-control generation -- a critical problem where answers generated are often incorrect, irrelevant, or unfaithful to the source text. To address these limitations in generative models for MRC, we introduce the Question-Attended Span Extraction (QASE) module. Integrated during the fine-tuning phase of pre-trained generative language models (PLMs), QASE significantly enhances their performance, allowing them to surpass the extractive capabilities of advanced Large Language Models (LLMs) such as GPT-4. Notably, these gains in performance do not come with an increase in computational demands. The efficacy of the QASE module has been rigorously tested across various datasets, consistently achieving or even surpassing state-of-the-art (SOTA) results.
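The abstract does not detail QASE's internals; one plausible minimal form is an auxiliary span head in which context states attend to question states before start/end prediction, with the span loss added to the generation loss during fine-tuning. Everything below is an assumption in that spirit, not the paper's verified architecture.

```python
import torch
import torch.nn as nn

class QASEHeadSketch(nn.Module):
    """Question-attended span extraction: context attends to the question,
    then start/end logits are predicted over context positions."""
    def __init__(self, hidden: int = 768, heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(hidden, heads, batch_first=True)
        self.span = nn.Linear(hidden, 2)                 # start / end logits

    def forward(self, context_states, question_states):
        attended, _ = self.attn(context_states, question_states, question_states)
        logits = self.span(attended)                     # (batch, ctx_len, 2)
        return logits[..., 0], logits[..., 1]
```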
https://arxiv.org/abs/2404.17991
Large-scale language models (LLMs) have achieved remarkable success across various language tasks but suffer from hallucinations and temporal misalignment. To mitigate these shortcomings, Retrieval-augmented generation (RAG) has been utilized to provide external knowledge to facilitate answer generation. However, applying such models to the medical domain faces several challenges due to the lack of domain-specific knowledge and the intricacy of real-world scenarios. In this study, we explore LLMs with the RAG framework for knowledge-intensive tasks in the medical field. To evaluate the capabilities of LLMs, we introduce MedicineQA, a multi-round dialogue benchmark that simulates the real-world medication consultation scenario and requires LLMs to answer with evidence retrieved from a medicine database. MedicineQA contains 300 multi-round question-answering pairs, each embedded within a detailed dialogue history, highlighting the challenge this knowledge-intensive task poses to current LLMs. We further propose a new \textit{Distill-Retrieve-Read} framework instead of the previous \textit{Retrieve-then-Read}. Specifically, the distillation and retrieval process utilizes a tool-calling mechanism to formulate search queries that emulate the keyword-based inquiries used by search engines. With experimental results, we show that our framework brings notable performance improvements and surpasses previous counterparts in evidence retrieval accuracy. This advancement sheds light on applying RAG to the medical domain.
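The framework's round structure can be sketched as three calls, as below. `llm` and `search` are hypothetical stand-ins, and the prompts are illustrative; the paper implements the distillation step via a tool-calling mechanism rather than a raw prompt.

```python
def distill_retrieve_read(llm, search, dialogue: str, question: str) -> str:
    # 1) Distill the dialogue history into a keyword-style search query,
    #    emulating the keyword-based inquiries used by search engines.
    query = llm(
        "Rewrite the user's need as a short keyword query for a medicine "
        f"database.\nDialogue:\n{dialogue}\nQuestion: {question}\nQuery:"
    )
    # 2) Retrieve candidate evidence from the medicine database.
    evidence = search(query)
    # 3) Read: answer grounded in the retrieved evidence.
    return llm("Evidence:\n" + "\n".join(evidence) +
               f"\n\nQuestion: {question}\nAnswer:")
```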
https://arxiv.org/abs/2404.17897
Traditional deep neural network (DNN)-based image quality assessment (IQA) models leverage convolutional neural networks (CNNs) or Transformers to learn quality-aware feature representations, achieving commendable performance on natural scene images. However, when applied to AI-generated images (AGIs), these DNN-based IQA models exhibit subpar performance. This situation is largely due to the semantic inaccuracies inherent in certain AGIs, caused by the uncontrollable nature of the generation process. Thus, the capability to discern semantic content becomes crucial for assessing the quality of AGIs. Traditional DNN-based IQA models, constrained by limited parameter complexity and training data, struggle to capture complex fine-grained semantic features, making it challenging to grasp the existence and coherence of the semantic content of the entire image. To address the shortfall in semantic content perception of current IQA models, we introduce a large Multi-modality model Assisted AI-Generated Image Quality Assessment (MA-AGIQA) model, which utilizes semantically informed guidance to sense semantic information and extract semantic vectors through carefully designed text prompts. Moreover, it employs a mixture-of-experts (MoE) structure to dynamically integrate the semantic information with the quality-aware features extracted by traditional DNN-based IQA models. Comprehensive experiments conducted on two AI-generated content datasets, AIGCQA-20k and AGIQA-3k, show that MA-AGIQA achieves state-of-the-art performance and demonstrate its superior generalization capabilities in assessing the quality of AGIs. Code is available at this https URL.
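The MoE fusion can be sketched as a small gated mixture over the concatenated semantic and quality-aware features; the two-expert setup and dimensions are assumptions for illustration, not MA-AGIQA's exact configuration.

```python
import torch
import torch.nn as nn

class MoEFusionSketch(nn.Module):
    def __init__(self, sem_dim: int = 768, qual_dim: int = 512,
                 hidden: int = 256, n_experts: int = 2):
        super().__init__()
        in_dim = sem_dim + qual_dim
        self.experts = nn.ModuleList(nn.Linear(in_dim, hidden) for _ in range(n_experts))
        self.gate = nn.Linear(in_dim, n_experts)
        self.head = nn.Linear(hidden, 1)

    def forward(self, sem, qual):
        x = torch.cat([sem, qual], dim=-1)
        w = self.gate(x).softmax(dim=-1)                 # per-expert weights
        mix = sum(w[..., i:i + 1] * torch.relu(e(x))
                  for i, e in enumerate(self.experts))
        return self.head(mix).squeeze(-1)                # quality score
```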
https://arxiv.org/abs/2404.17762
This paper presents our team's participation in the MEDIQA-ClinicalNLP2024 shared task B. We present a novel approach to diagnosing clinical dermatology cases by integrating large multimodal models, specifically leveraging the capabilities of GPT-4V under a retriever and a re-ranker framework. Our investigation reveals that GPT-4V, when used as a retrieval agent, can accurately retrieve the correct skin condition 85% of the time using dermatological images and brief patient histories. Additionally, we empirically show that Naive Chain-of-Thought (CoT) works well for retrieval while Medical Guidelines Grounded CoT is required for accurate dermatological diagnosis. Further, we introduce a Multi-Agent Conversation (MAC) framework and show its superior performance and potential over the best CoT strategy. The experiments suggest that using naive CoT for retrieval and multi-agent conversation for critique-based diagnosis, GPT-4V can lead to an early and accurate diagnosis of dermatological conditions. The implications of this work extend to improving diagnostic workflows, supporting dermatological education, and enhancing patient care by providing a scalable, accessible, and accurate diagnostic tool.
https://arxiv.org/abs/2404.17749
Recent advancements in Large Language Models (LLMs) have enhanced the efficacy of agent communication and social interactions. Despite these advancements, building LLM-based agents for reasoning in dynamic environments involving competition and collaboration remains challenging due to the limitations of informed graph-based search methods. We propose PLAYER*, a novel framework based on an anytime sampling-based planner, which utilises sensors and pruners to enable a purely question-driven searching framework for complex reasoning tasks. We also introduce a quantifiable evaluation method using multiple-choice questions and construct the WellPlay dataset with 1,482 QA pairs. Experiments demonstrate PLAYER*'s efficiency and performance enhancements compared to existing methods in complex, dynamic environments with quantifiable results.
https://arxiv.org/abs/2404.17662
Conversational tutoring systems (CTSs) offer learning experiences through interactions based on natural language. They are recognized for promoting cognitive engagement and improving learning outcomes, especially in reasoning tasks. Nonetheless, the cost associated with authoring CTS content is a major obstacle to widespread adoption and to research on effective instructional design. In this paper, we discuss and evaluate a novel type of CTS that leverages recent advances in large language models (LLMs) in two ways: First, the system enables AI-assisted content authoring by inducing an easily editable tutoring script automatically from a lesson text. Second, the system automates the script orchestration in a learning-by-teaching format via two LLM-based agents (Ruffle&Riley) acting as a student and a professor. The system allows for free-form conversations that follow the ITS-typical inner and outer loop structure. We evaluate Ruffle&Riley's ability to support biology lessons in two between-subject online user studies (N = 200) comparing the system to simpler QA chatbots and a reading activity. Analyzing system usage patterns, pre/post-test scores, and user experience surveys, we find that Ruffle&Riley users report high levels of engagement and understanding and perceive the offered support as helpful. Even though Ruffle&Riley users require more time to complete the activity, we did not find significant differences in short-term learning gains over the reading activity. Our system architecture and user study provide various insights for designers of future CTSs. We further open-source our system to support ongoing research on effective instructional design of LLM-based learning technologies.
https://arxiv.org/abs/2404.17460
The Quantum Convolutional Layer (QCL) is considered one of the core components of Quantum Convolutional Neural Networks (QCNNs) due to its efficient data feature extraction capability. However, the current principle of the QCL is not as mathematically understandable as that of the Classical Convolutional Layer (CCL) due to its black-box structure. Moreover, classical data mapping in many QCLs is inefficient. To this end, we first show theoretically that the Quantum Adjoint Convolution Operation (QACO), consisting of a quantum amplitude encoding and its inverse, is equivalent to the quantum normalization of the convolution operation based on the Frobenius inner product, while achieving an efficient characterization of the data. Subsequently, QACO is extended into a Quantum Adjoint Convolutional Layer (QACL) via Quantum Phase Estimation (QPE) to compute all Frobenius inner products in parallel. Finally, comparative simulation experiments are carried out on the PennyLane and TensorFlow platforms, mainly for the two cases of fixed and unfixed kernels in QACL. The results demonstrate that QACL, drawing on special quantum properties, provides higher training accuracy on the same images in MNIST and Fashion-MNIST classification experiments, but sacrifices learning performance to some extent. Our research lays the foundation for the development of efficient and interpretable quantum convolutional networks and also advances the field of quantum machine vision.
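The classical operation that QACO reproduces is easy to state: each output entry of a convolution is the Frobenius inner product between an image patch and the kernel. The sketch below shows that classical view; the quantum circuit works with amplitude-encoded, i.e. normalized, states, so its output corresponds to a normalized version of these inner products.

```python
import numpy as np

def frobenius_conv(image: np.ndarray, kernel: np.ndarray) -> np.ndarray:
    """Valid convolution written explicitly as Frobenius inner products."""
    kh, kw = kernel.shape
    H, W = image.shape
    out = np.zeros((H - kh + 1, W - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            patch = image[i:i + kh, j:j + kw]
            out[i, j] = np.sum(patch * kernel)   # <patch, K>_F
    return out
```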
https://arxiv.org/abs/2404.17378
The rapid evolution of Natural Language Processing (NLP) has favored major languages such as English, leaving a significant gap for many others due to limited resources. This is especially evident in the context of data annotation, a task whose importance cannot be overstated, but which is time-consuming and costly. Thus, any dataset for resource-poor languages is precious, in particular when it is task-specific. Here, we explore the feasibility of repurposing existing datasets for a new NLP task: we repurposed the Belebele dataset (Bandarkar et al., 2023), which was designed for multiple-choice question answering (MCQA), to enable extractive QA (EQA) in the style of machine reading comprehension. We present annotation guidelines and a parallel EQA dataset for English and Modern Standard Arabic (MSA). We also present QA evaluation results for several monolingual and cross-lingual QA pairs including English, MSA, and five Arabic dialects. Our aim is to enable others to adapt our approach for the 120+ other language variants in Belebele, many of which are deemed under-resourced. We also conduct a thorough analysis and share our insights from the process, which we hope will contribute to a deeper understanding of the challenges and the opportunities associated with task reformulation in NLP research.
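The repurposing step can be pictured as below: an MCQA item becomes an extractive item only if the correct option can be grounded as a span in the passage. This automatic filter is a simplified approximation; the paper's annotation guidelines involve human judgment rather than pure string matching.

```python
def mcqa_to_eqa(passage: str, question: str, options: list[str], gold_idx: int):
    """Convert an MCQA item to a SQuAD-style extractive item, or None."""
    answer = options[gold_idx]
    start = passage.find(answer)
    if start == -1:
        return None                      # answer not verbatim; needs manual rework
    return {"context": passage, "question": question,
            "answer_text": answer, "answer_start": start}
```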
https://arxiv.org/abs/2404.17342
The absence of explicitly tailored, accessible annotated datasets for educational purposes presents a notable obstacle for NLP tasks in languages with limited resources. This study initially explores the feasibility of using machine translation (MT) to convert an existing dataset into a Tigrinya dataset in SQuAD format. As a result, we present TIGQA, an expert-annotated educational dataset consisting of 2.68K question-answer pairs covering 122 diverse topics such as climate, water, and traffic. These pairs are from 537 context paragraphs in publicly accessible Tigrinya and Biology books. Through comprehensive analyses, we demonstrate that the TIGQA dataset requires skills beyond simple word matching, requiring both single-sentence and multiple-sentence inference abilities. We conduct experiments using state-of-the-art MRC methods, marking the first exploration of such models on TIGQA. Additionally, we estimate human performance on the dataset and juxtapose it with the results obtained from pretrained models. The notable disparities between human performance and best model performance underscore the potential for further enhancements to TIGQA through continued research. Our dataset is freely accessible via the provided link to encourage the research community to address the challenges in Tigrinya MRC.
https://arxiv.org/abs/2404.17194
No-Reference Image Quality Assessment (IQA) aims at estimating image quality in accordance with subjective human perception. However, most existing NR-IQA methods focus on exploring increasingly complex networks or components to improve the final performance. Such practice imposes great limitations and complexity on IQA methods, especially when they are applied to high-resolution (HR) images in the real world. Actually, most images have high spatial redundancy, especially HR data. To further exploit this characteristic and alleviate the issue above, we propose a new framework for Image Quality Assessment with compressive Sampling (dubbed S-IQA), which consists of three components: (1) the Flexible Sampling Module (FSM), which samples the image to obtain measurements at an arbitrary ratio; (2) a Vision Transformer with the Adaptive Embedding Module (AEM), which makes measurements of uniform size and extracts deep features; and (3) the Dual Branch (DB), which allocates a weight for every patch and predicts the final quality score. Experiments show that our proposed S-IQA achieves state-of-the-art results on various datasets with less data usage.
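To make the sampling step concrete: compressive sampling takes m = ratio * n linear measurements of an n-pixel signal. The toy below uses a random Gaussian measurement matrix; S-IQA's FSM is learned rather than random, so this only illustrates how the measurement count scales with the sampling ratio.

```python
import numpy as np

def compressive_measurements(image: np.ndarray, ratio: float, seed: int = 0):
    """Return y = Phi @ x with an arbitrary sampling ratio in (0, 1]."""
    rng = np.random.default_rng(seed)
    x = image.reshape(-1).astype(np.float64)
    m = max(1, int(ratio * x.size))              # number of measurements
    phi = rng.normal(size=(m, x.size)) / np.sqrt(m)
    return phi @ x
```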
https://arxiv.org/abs/2404.17170
While text-to-image (T2I) generative models have become ubiquitous, they do not necessarily generate images that align with a given prompt. While previous work has evaluated T2I alignment by proposing metrics, benchmarks, and templates for collecting human judgements, the quality of these components is not systematically measured. Human-rated prompt sets are generally small and the reliability of the ratings -- and thereby the prompt set used to compare models -- is not evaluated. We address this gap by performing an extensive study evaluating auto-eval metrics and human templates. We provide three main contributions: (1) We introduce a comprehensive skills-based benchmark that can discriminate models across different human templates. This skills-based benchmark categorises prompts into sub-skills, allowing a practitioner to pinpoint not only which skills are challenging, but at what level of complexity a skill becomes challenging. (2) We gather human ratings across four templates and four T2I models for a total of >100K annotations. This allows us to understand where differences arise due to inherent ambiguity in the prompt and where they arise due to differences in metric and model quality. (3) Finally, we introduce a new QA-based auto-eval metric that is better correlated with human ratings than existing metrics for our new dataset, across different human templates, and on TIFA160.
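QA-based auto-eval metrics of this kind (e.g., in the TIFA lineage) typically derive question-answer pairs from the prompt and check them against the image with a VQA model; a minimal sketch follows. `gen_qa` and `vqa` are hypothetical stand-ins, and the paper's metric may generate, filter, or aggregate differently.

```python
def qa_alignment_score(prompt: str, image, gen_qa, vqa) -> float:
    """Fraction of prompt-derived questions the image answers correctly."""
    qa_pairs = gen_qa(prompt)                # [(question, expected_answer), ...]
    correct = sum(vqa(image, q).strip().lower() == a.strip().lower()
                  for q, a in qa_pairs)
    return correct / max(1, len(qa_pairs))
```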
https://arxiv.org/abs/2404.16820
While many contemporary large language models (LLMs) can process lengthy input, they still struggle to fully utilize information within the long context, known as the lost-in-the-middle challenge. We hypothesize that it stems from insufficient explicit supervision during the long-context training, which fails to emphasize that any position in a long context can hold crucial information. Based on this intuition, our study presents information-intensive (IN2) training, a purely data-driven solution to overcome lost-in-the-middle. Specifically, IN2 training leverages a synthesized long-context question-answer dataset, where the answer requires (1) fine-grained information awareness on a short segment (~128 tokens) within a synthesized long context (4K-32K tokens), and (2) the integration and reasoning of information from two or more short segments. Through applying this information-intensive training on Mistral-7B, we present FILM-7B (FILl-in-the-Middle). To thoroughly assess the ability of FILM-7B for utilizing long contexts, we design three probing tasks that encompass various context styles (document, code, and structured-data context) and information retrieval patterns (forward, backward, and bi-directional retrieval). The probing results demonstrate that FILM-7B can robustly retrieve information from different positions in its 32K context window. Beyond these probing tasks, FILM-7B significantly improves the performance on real-world long-context tasks (e.g., 23.5->26.9 F1 score on NarrativeQA), while maintaining a comparable performance on short-context tasks (e.g., 59.3->59.2 accuracy on MMLU). Github Link: this https URL.
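The data-synthesis idea can be sketched as follows: place a short informative segment at a uniformly random position inside a long filler context, so that supervision explicitly covers every position. Whitespace token counting and the omitted QA-generation step are simplifications of the paper's pipeline.

```python
import random

def synthesize_in2_context(info_segment: str, filler_segments: list[str],
                           context_tokens: int = 8000, seg_tokens: int = 128,
                           seed: int = 0) -> str:
    rng = random.Random(seed)
    chunks, total = [], 0
    for seg in filler_segments:              # cut fillers into ~128-token chunks
        words = seg.split()[:seg_tokens]
        chunks.append(" ".join(words))
        total += len(words)
        if total >= context_tokens:
            break
    pos = rng.randint(0, len(chunks))        # the answer may sit anywhere
    chunks.insert(pos, info_segment)
    return " ".join(chunks)
```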
https://arxiv.org/abs/2404.16811