While text-to-image (T2I) generative models have become ubiquitous, they do not necessarily generate images that align with a given prompt. While previous work has evaluated T2I alignment by proposing metrics, benchmarks, and templates for collecting human judgements, the quality of these components is not systematically measured. Human-rated prompt sets are generally small and the reliability of the ratings -- and thereby the prompt set used to compare models -- is not evaluated. We address this gap by performing an extensive study evaluating auto-eval metrics and human templates. We provide three main contributions: (1) We introduce a comprehensive skills-based benchmark that can discriminate models across different human templates. This skills-based benchmark categorises prompts into sub-skills, allowing a practitioner to pinpoint not only which skills are challenging, but at what level of complexity a skill becomes challenging. (2) We gather human ratings across four templates and four T2I models for a total of >100K annotations. This allows us to understand where differences arise due to inherent ambiguity in the prompt and where they arise due to differences in metric and model quality. (3) Finally, we introduce a new QA-based auto-eval metric that is better correlated with human ratings than existing metrics for our new dataset, across different human templates, and on TIFA160.
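The abstract does not spell out the new QA-based metric, but metrics in this family typically generate binary questions from the prompt and score an image by how a VQA model answers them. A minimal sketch of that generic recipe, with hypothetical `generate_questions` and `answer_yes_prob` wrappers standing in for the LLM and VQA components, might look like:

```python
from typing import Callable, List

def qa_alignment_score(
    prompt: str,
    image_path: str,
    generate_questions: Callable[[str], List[str]],   # hypothetical LLM wrapper
    answer_yes_prob: Callable[[str, str], float],     # hypothetical VQA wrapper
) -> float:
    """Score prompt-image alignment as the mean yes-probability over binary
    questions derived from the prompt (a generic QA-metric recipe, not the
    paper's exact formulation)."""
    questions = generate_questions(prompt)             # e.g. "Is there a red cube?"
    if not questions:
        return 0.0
    probs = [answer_yes_prob(q, image_path) for q in questions]
    return sum(probs) / len(probs)
```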
https://arxiv.org/abs/2404.16820
While many contemporary large language models (LLMs) can process lengthy input, they still struggle to fully utilize information within the long context, known as the lost-in-the-middle challenge. We hypothesize that it stems from insufficient explicit supervision during the long-context training, which fails to emphasize that any position in a long context can hold crucial information. Based on this intuition, our study presents information-intensive (IN2) training, a purely data-driven solution to overcome lost-in-the-middle. Specifically, IN2 training leverages a synthesized long-context question-answer dataset, where the answer requires (1) fine-grained information awareness on a short segment (~128 tokens) within a synthesized long context (4K-32K tokens), and (2) the integration and reasoning of information from two or more short segments. Through applying this information-intensive training on Mistral-7B, we present FILM-7B (FILl-in-the-Middle). To thoroughly assess the ability of FILM-7B for utilizing long contexts, we design three probing tasks that encompass various context styles (document, code, and structured-data context) and information retrieval patterns (forward, backward, and bi-directional retrieval). The probing results demonstrate that FILM-7B can robustly retrieve information from different positions in its 32K context window. Beyond these probing tasks, FILM-7B significantly improves the performance on real-world long-context tasks (e.g., 23.5->26.9 F1 score on NarrativeQA), while maintaining a comparable performance on short-context tasks (e.g., 59.3->59.2 accuracy on MMLU). Github Link: this https URL.
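As a rough illustration of how an IN2-style training example could be synthesized (the paper's actual data pipeline is not given in the abstract), one might place a short answer-bearing segment at a random position inside a long filler context. The `build_in2_example` helper and its whitespace-based token count are assumptions for this sketch:

```python
import random
from typing import Dict, List

def build_in2_example(
    needle_segment: str,          # ~128-token passage the QA pair is written about
    question: str,
    answer: str,
    filler_segments: List[str],   # unrelated short passages used as padding
    target_len_tokens: int = 8000,
) -> Dict[str, str]:
    """Assemble one IN2-style training example: the answer-bearing segment is
    dropped at a random position inside a long synthesized context, so any
    position may hold the crucial information. Token counts are approximated
    by whitespace tokens here for simplicity."""
    fillers = [s for s in filler_segments if s.split()]   # drop empty fillers
    segments, total = [], 0
    while total < target_len_tokens and fillers:
        seg = random.choice(fillers)
        segments.append(seg)
        total += len(seg.split())
    insert_at = random.randint(0, len(segments))          # any position, incl. the end
    segments.insert(insert_at, needle_segment)
    return {
        "context": "\n\n".join(segments),
        "question": question,
        "answer": answer,
    }
```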
https://arxiv.org/abs/2404.16811
Developing generalist foundation models has recently attracted tremendous attention among researchers in the field of AI for Medicine (AI4Medicine). A pivotal insight in developing these models is their reliance on dataset scaling, which emphasizes the requirements on developing open-source medical image datasets that incorporate diverse supervision signals across various imaging modalities. In this paper, we introduce RadGenome-Chest CT, a comprehensive, large-scale, region-guided 3D chest CT interpretation dataset based on CT-RATE. Specifically, we leverage the latest powerful universal segmentation and large language models to extend the original dataset (over 25,692 non-contrast 3D chest CT volumes and reports from 20,000 patients) in the following aspects: (i) organ-level segmentation masks covering 197 categories, which provide intermediate reasoning visual clues for interpretation; (ii) 665K multi-granularity grounded reports, where each sentence of the report is linked to the corresponding anatomical region of the CT volume in the form of a segmentation mask; (iii) 1.3M grounded VQA pairs, where questions and answers are all linked with reference segmentation masks, enabling models to associate visual evidence with textual explanations. All grounded reports and VQA pairs in the validation set have gone through manual verification to ensure dataset quality. We believe that RadGenome-Chest CT can significantly advance the development of multimodal medical foundation models by training them to generate text based on given segmentation regions, which is unattainable with previous relevant datasets. We will release all segmentation masks, grounded reports, and VQA pairs to facilitate further research and development in this field.
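The abstract does not give the dataset schema; a plausible minimal record layout for a grounded report sentence and a grounded VQA pair, with purely hypothetical field names, could be:

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class GroundedSentence:
    """One report sentence linked to the anatomical region it describes."""
    text: str                    # e.g. "There is a nodule in the right upper lobe."
    region_label: str            # one of the 197 organ-level categories
    mask_path: str               # path to the corresponding segmentation mask

@dataclass
class GroundedVQAPair:
    """A question-answer pair whose evidence is a reference segmentation mask."""
    question: str
    answer: str
    reference_mask_paths: List[str] = field(default_factory=list)

@dataclass
class GroundedStudy:
    """One CT study bundling the volume, grounded sentences, and VQA pairs."""
    volume_path: str             # non-contrast 3D chest CT volume
    sentences: List[GroundedSentence] = field(default_factory=list)
    vqa_pairs: List[GroundedVQAPair] = field(default_factory=list)
```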
https://arxiv.org/abs/2404.16754
This paper reports on the NTIRE 2024 Quality Assessment of AI-Generated Content Challenge, held in conjunction with the New Trends in Image Restoration and Enhancement Workshop (NTIRE) at CVPR 2024. The challenge addresses a major problem in the field of image and video processing, namely Image Quality Assessment (IQA) and Video Quality Assessment (VQA) for AI-Generated Content (AIGC). The challenge is divided into an image track and a video track. The image track uses AIGIQA-20K, which contains 20,000 AI-Generated Images (AIGIs) generated by 15 popular generative models. The image track had a total of 318 registered participants. A total of 1,646 submissions were received in the development phase, and 221 submissions were received in the test phase. Finally, 16 participating teams submitted their models and fact sheets. The video track uses T2VQA-DB, which contains 10,000 AI-Generated Videos (AIGVs) generated by 9 popular Text-to-Video (T2V) models. A total of 196 participants registered in the video track. A total of 991 submissions were received in the development phase, and 185 submissions were received in the test phase. Finally, 12 participating teams submitted their models and fact sheets. Some methods achieved better results than the baseline methods, and the winning methods in both tracks demonstrated superior prediction performance on AIGC.
https://arxiv.org/abs/2404.16687
Charts are important for presenting and explaining complex data relationships. Recently, multimodal large language models (MLLMs) have shown remarkable capabilities in various chart understanding tasks. However, the sheer size of these models in terms of parameters and computational requirements limits their use in resource-constrained environments. In this paper, we present TinyChart, an efficient MLLM for chart understanding with only 3B parameters. TinyChart overcomes two key challenges in efficient chart understanding: (1) reducing the burden of learning numerical computations through a Program-of-Thoughts (PoT) learning strategy, which trains the model to generate Python programs for numerical calculations, and (2) reducing the lengthy vision feature sequences produced by the vision transformer for high-resolution images through a Vision Token Merging module, which gradually merges the most similar vision tokens. Extensive experiments demonstrate that our 3B TinyChart achieves SOTA performance on a variety of chart understanding benchmarks including ChartQA, Chart-to-Text, Chart-to-Table, OpenCQA, and ChartX. It outperforms several chart understanding MLLMs with up to 13B parameters, such as ChartLlama and ChartAst, as well as the closed-source general-purpose MLLM GPT-4V on ChartQA. It also demonstrates superior efficiency, with higher throughput during inference due to a smaller model scale and more efficient vision encoding. Our code and model are available at this https URL.
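The abstract describes Vision Token Merging only as gradually merging the most similar vision tokens. A simplified, non-learned stand-in that greedily averages the most cosine-similar pair until a target count is reached could look like this (the module in the paper may differ in both similarity measure and merge rule):

```python
import torch
import torch.nn.functional as F

def merge_vision_tokens(tokens: torch.Tensor, keep: int) -> torch.Tensor:
    """Greedily average the most cosine-similar pair of vision tokens until
    only `keep` tokens remain. tokens: (num_tokens, dim). A simplified
    stand-in for a learned Vision Token Merging module."""
    tokens = tokens.clone()
    while tokens.size(0) > keep:
        normed = F.normalize(tokens, dim=-1)
        sim = normed @ normed.T
        sim.fill_diagonal_(float("-inf"))               # ignore self-similarity
        flat = int(sim.argmax())
        i, j = flat // sim.size(1), flat % sim.size(1)
        if i > j:
            i, j = j, i
        tokens[i] = (tokens[i] + tokens[j]) / 2          # merge the closest pair
        tokens = torch.cat([tokens[:j], tokens[j + 1:]], dim=0)
    return tokens

# Example: shrink a 576-token feature sequence to 128 tokens.
pruned = merge_vision_tokens(torch.randn(576, 1024), keep=128)
```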
https://arxiv.org/abs/2404.16635
In the realm of Medical Visual Language Models (Med-VLMs), the quest for universal, efficient fine-tuning mechanisms remains paramount, especially given that researchers in interdisciplinary fields are often extremely short of training resources, yet this question remains largely unexplored. Given the unique challenges in the medical domain, such as limited data scope and significant domain-specific requirements, evaluating and adapting Parameter-Efficient Fine-Tuning (PEFT) methods specifically for Med-VLMs is essential. Most current PEFT methods for Med-VLMs have yet to be comprehensively investigated and mainly focus on adding components to the model's structure or input. However, fine-tuning intrinsic model components often yields better generality and consistency, and its impact on the ultimate performance of Med-VLMs has been widely overlooked and remains understudied. In this paper, we explore an alternative to traditional PEFT methods, namely the impact of fine-tuning LayerNorm layers, FFNs, and Attention layers on Med-VLMs. Our comprehensive studies span both small-scale and large-scale Med-VLMs, evaluating their performance under various fine-tuning paradigms across tasks such as Medical Visual Question Answering and Medical Imaging Report Generation. The findings reveal unique insights into the effects of intrinsic-parameter fine-tuning methods on adapting Med-VLMs to downstream tasks, and show that fine-tuning solely the LayerNorm layers not only surpasses the efficiency of traditional PEFT methods but also retains the model's accuracy and generalization capabilities across a spectrum of medical downstream tasks. The experiments show LayerNorm fine-tuning's superior adaptability and scalability, particularly in the context of large-scale Med-VLMs.
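A minimal PyTorch sketch of the LayerNorm-only setting is below: freeze everything, then unfreeze only nn.LayerNorm parameters. Models using RMSNorm or other normalization variants would need the isinstance check adjusted, and the paper's optimizer and training schedule are not specified here.

```python
import torch.nn as nn

def freeze_all_but_layernorm(model: nn.Module) -> float:
    """Freeze every parameter except those belonging to LayerNorm modules,
    and return the fraction of parameters left trainable."""
    for p in model.parameters():
        p.requires_grad = False
    for module in model.modules():
        if isinstance(module, nn.LayerNorm):
            for p in module.parameters():
                p.requires_grad = True
    trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
    total = sum(p.numel() for p in model.parameters())
    return trainable / total
```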
https://arxiv.org/abs/2404.16385
This paper reviews the AIS 2024 Video Quality Assessment (VQA) Challenge, focused on User-Generated Content (UGC). The aim of this challenge is to gather deep learning-based methods capable of estimating the perceptual quality of UGC videos. The user-generated videos from the YouTube UGC Dataset span diverse content (sports, games, lyrics, anime, etc.), qualities, and resolutions. The proposed methods must process 30 FHD frames in under 1 second. In the challenge, a total of 102 participants registered, and 15 submitted code and models. The performance of the top-5 submissions is reviewed and provided here as a survey of diverse deep models for efficient video quality assessment of user-generated content.
https://arxiv.org/abs/2404.16205
Vision-language models, while effective in general domains and showing strong performance in diverse multi-modal applications like visual question-answering (VQA), struggle to maintain the same level of effectiveness in more specialized domains, e.g., medical. We propose a medical vision-language model that integrates large vision and language models adapted for the medical domain. This model goes through three stages of parameter-efficient training using three separate biomedical and radiology multi-modal visual and text datasets. The proposed model achieves state-of-the-art performance on the SLAKE 1.0 medical VQA (MedVQA) dataset with an overall accuracy of 87.5% and demonstrates strong performance on another MedVQA dataset, VQA-RAD, achieving an overall accuracy of 73.2%.
https://arxiv.org/abs/2404.16192
Large language models (LLMs) have demonstrated impressive generalization capabilities on specific tasks with human-written instruction data. However, the limited quantity, diversity, and professional expertise of such instruction data raise concerns about the performance of LLMs in psychotherapy tasks when provided with domain-specific instructions. To address this, we first propose Domain-Specific Assistant Instructions based on AlexanderStreet therapy, and second, we use an adaptation fine-tuning method and a retrieval-augmented generation method to improve pre-trained LLMs. Through quantitative evaluation of linguistic quality using automatic and human evaluation, we observe that LLMs fine-tuned on Psychotherapy Assistant Instructions outperform state-of-the-art LLM response baselines. Our Assistant-Instruction approach offers a half-annotation method to align pre-trained LLMs with instructions and provide pre-trained LLMs with more psychotherapy knowledge.
https://arxiv.org/abs/2404.16160
Large Vision-Language Models (LVLMs) show significant strides in general-purpose multimodal applications such as visual dialogue and embodied navigation. However, existing multimodal evaluation benchmarks cover a limited number of multimodal tasks testing rudimentary capabilities, falling short in tracking LVLM development. In this study, we present MMT-Bench, a comprehensive benchmark designed to assess LVLMs across massive multimodal tasks requiring expert knowledge and deliberate visual recognition, localization, reasoning, and planning. MMT-Bench comprises 31,325 meticulously curated multi-choice visual questions from various multimodal scenarios such as vehicle driving and embodied navigation, covering 32 core meta-tasks and 162 subtasks in multimodal understanding. Due to its extensive task coverage, MMT-Bench enables the evaluation of LVLMs using a task map, facilitating the discovery of in- and out-of-domain tasks. Evaluation results involving 30 LVLMs such as the proprietary GPT-4V, GeminiProVision, and open-sourced InternVL-Chat, underscore the significant challenges posed by MMT-Bench. We anticipate that MMT-Bench will inspire the community to develop next-generation multimodal foundation models aimed at achieving general-purpose multimodal intelligence.
https://arxiv.org/abs/2404.16006
Large language models, such as GPT-4 and Med-PaLM, have shown impressive performance on clinical tasks; however, they require access to compute, are closed-source, and cannot be deployed on device. Mid-size models such as BioGPT-large, BioMedLM, LLaMA 2, and Mistral 7B avoid these drawbacks, but their capacity for clinical tasks has been understudied. To help assess their potential for clinical use and help researchers decide which model they should use, we compare their performance on two clinical question-answering (QA) tasks: MedQA and consumer query answering. We find that Mistral 7B is the best performing model, winning on all benchmarks and outperforming models trained specifically for the biomedical domain. While Mistral 7B's MedQA score of 63.0% approaches the original Med-PaLM, and it often can produce plausible responses to consumer health queries, room for improvement still exists. This study provides the first head-to-head assessment of open source mid-sized models on clinical tasks.
https://arxiv.org/abs/2404.15894
Large language models (LLMs) suffer from the hallucination problem and face significant challenges when applied to knowledge-intensive tasks. A promising approach is to leverage evidence documents as extra supporting knowledge, which can be obtained through retrieval or generation. However, existing methods directly leverage the entire contents of the evidence document, which may introduce noise information and impair the performance of large language models. To tackle this problem, we propose a novel Knowledge Selection of Large Language Models (KS-LLM) method, aiming to identify valuable information from evidence documents. The KS-LLM approach utilizes triples to effectively select knowledge snippets from evidence documents that are beneficial to answering questions. Specifically, we first generate triples based on the input question, then select the evidence sentences most similar to triples from the evidence document, and finally combine the evidence sentences and triples to assist large language models in generating answers. Experimental comparisons on several question answering datasets, such as TriviaQA, WebQ, and NQ, demonstrate that the proposed method surpasses the baselines and achieves the best results.
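A sketch of the three-step selection the abstract describes, with `generate_triples` and `similarity` as hypothetical placeholders for the LLM call and the embedding-based scorer, and naive sentence splitting for brevity:

```python
from typing import Callable, List, Tuple

def ks_llm_context(
    question: str,
    evidence_document: str,
    generate_triples: Callable[[str], List[Tuple[str, str, str]]],  # hypothetical LLM wrapper
    similarity: Callable[[str, str], float],                        # e.g. embedding cosine
    top_k: int = 3,
) -> str:
    """Build a compact supporting context in the KS-LLM spirit:
    (1) generate triples from the question, (2) keep the evidence sentences
    most similar to those triples, (3) return triples + sentences as context."""
    triples = generate_triples(question)
    sentences = [s.strip() for s in evidence_document.split(".") if s.strip()]
    scored = []
    for sent in sentences:
        score = max(similarity(" ".join(t), sent) for t in triples) if triples else 0.0
        scored.append((score, sent))
    selected = [s for _, s in sorted(scored, key=lambda x: x[0], reverse=True)[:top_k]]
    triple_text = "; ".join(" ".join(t) for t in triples)
    return f"Triples: {triple_text}\nEvidence: " + " ".join(selected)
```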
https://arxiv.org/abs/2404.15660
Recently, directly using large language models (LLMs) has been shown to be the most reliable method to evaluate QA models. However, it suffers from limited interpretability, high cost, and environmental harm. To address these, we propose to use soft EM with entity-driven answer set expansion. Our approach expands the gold answer set to include diverse surface forms, based on the observation that the surface forms often follow particular patterns depending on the entity type. The experimental results show that our method outperforms traditional evaluation methods by a large margin. Moreover, the reliability of our evaluation method is comparable to that of LLM-based ones, while offering the benefits of high interpretability and reduced environmental harm.
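A toy sketch of soft EM over an entity-driven expanded answer set follows; the expansion rules shown (surname-only for persons, comma-stripping for numbers) are illustrative stand-ins, not the paper's actual per-entity-type patterns.

```python
from typing import List, Set

def expand_answers(gold: str, entity_type: str) -> Set[str]:
    """Toy entity-driven expansion: add common alternative surface forms
    depending on the entity type (illustrative patterns only)."""
    forms = {gold}
    if entity_type == "PERSON":
        parts = gold.split()
        if len(parts) > 1:
            forms.add(parts[-1])              # surname only, e.g. "Einstein"
    elif entity_type == "NUMERIC":
        forms.add(gold.replace(",", ""))      # "1,000" -> "1000"
    return {f.lower() for f in forms}

def soft_em(prediction: str, gold_answers: List[str], entity_type: str) -> bool:
    """Soft exact match: the prediction counts as correct if any expanded
    gold surface form appears as a substring of the prediction."""
    pred = prediction.lower()
    expanded = set()
    for g in gold_answers:
        expanded |= expand_answers(g, entity_type)
    return any(form in pred for form in expanded)
```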
https://arxiv.org/abs/2404.15650
In natural language processing applied to the clinical domain, utilizing large language models has emerged as a promising avenue for error detection and correction on clinical notes, a knowledge-intensive task for which annotated data is scarce. This paper presents MedReAct'N'MedReFlex, which leverages a suite of four LLM-based medical agents. The MedReAct agent initiates the process by observing, analyzing, and taking action, generating trajectories to guide the search to target a potential error in the clinical notes. Subsequently, the MedEval agent employs five evaluators to assess the targeted error and the proposed correction. In cases where MedReAct's actions prove insufficient, the MedReFlex agent intervenes, engaging in reflective analysis and proposing alternative strategies. Finally, the MedFinalParser agent formats the final output, preserving the original style while ensuring the integrity of the error correction process. One core component of our method is our RAG pipeline based on our ClinicalCorp corpora. Among other well-known sources containing clinical guidelines and information, we preprocess and release the open-source MedWiki dataset for clinical RAG application. Our results demonstrate the central role of our RAG approach with ClinicalCorp leveraged through the MedReAct'N'MedReFlex framework. It achieved the ninth rank on the MEDIQA-CORR 2024 final leaderboard.
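A high-level orchestration sketch of the four-agent loop as described, with every agent call left as a hypothetical callable; the real prompts, the five evaluator criteria, and the ClinicalCorp retrieval are not reproduced here.

```python
from typing import Callable, Dict, List, Optional

def correct_note(
    note: str,
    med_react: Callable[[str], Optional[Dict]],          # proposes {"error": ..., "fix": ...} or None
    med_eval: List[Callable[[str, Dict], float]],         # evaluators scoring a proposal in [0, 1]
    med_reflex: Callable[[str, Dict], Optional[Dict]],    # reflective retry when scores are low
    med_final_parser: Callable[[str, Optional[Dict]], str],
    accept_threshold: float = 0.5,
    max_rounds: int = 3,
) -> str:
    """Hypothetical orchestration of the four-agent pipeline described above:
    MedReAct proposes an error/correction, MedEval's evaluators score it,
    MedReFlex retries with an alternative strategy on low scores, and
    MedFinalParser re-formats the corrected note."""
    proposal = med_react(note)
    for _ in range(max_rounds):
        if proposal is None:
            break
        mean_score = sum(e(note, proposal) for e in med_eval) / len(med_eval)
        if mean_score >= accept_threshold:
            break
        proposal = med_reflex(note, proposal)   # propose an alternative strategy
    return med_final_parser(note, proposal)
```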
https://arxiv.org/abs/2404.15488
In-context learning (ICL) approaches typically leverage prompting to condition decoder-only language model generation on reference information. Just-in-time processing of a context is inefficient due to the quadratic cost of self-attention operations, and caching is desirable. However, caching transformer states can easily require almost as much space as the model parameters. When the right context isn't known in advance, caching ICL can be challenging. This work addresses these limitations by introducing models that, inspired by the encoder-decoder architecture, use cross-attention to condition generation on reference text without the prompt. More precisely, we leverage pre-trained decoder-only models and only train a small number of added layers. We use Question-Answering (QA) as a testbed to evaluate the ability of our models to perform conditional generation and observe that they outperform ICL, are comparable to fine-tuned prompted LLMs, and drastically reduce the space footprint relative to standard KV caching by two orders of magnitude.
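One way to realize the described idea, adding a small trainable cross-attention block that lets a frozen decoder's hidden states attend to cached reference encodings, is sketched below; the placement, dimensions, and residual form are assumptions rather than the paper's exact architecture.

```python
import torch
import torch.nn as nn

class CrossAttentionAdapter(nn.Module):
    """A small trainable block added on top of a frozen decoder layer: hidden
    states attend to cached reference encodings via cross-attention, with a
    residual connection (a sketch of the general idea, not the exact model)."""
    def __init__(self, dim: int, num_heads: int = 8):
        super().__init__()
        self.norm = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, hidden: torch.Tensor, reference: torch.Tensor) -> torch.Tensor:
        # hidden: (batch, tgt_len, dim); reference: (batch, ref_len, dim), cached once
        attended, _ = self.attn(self.norm(hidden), reference, reference)
        return hidden + attended

# Example: condition 16 generated positions on a 512-token cached reference.
adapter = CrossAttentionAdapter(dim=1024)
out = adapter(torch.randn(2, 16, 1024), torch.randn(2, 512, 1024))
```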
https://arxiv.org/abs/2404.15420
Multimodal LLMs are the natural evolution of LLMs, and enlarge their capabilities so as to work beyond the pure textual modality. As research is being carried out to design novel architectures and vision-and-language adapters, in this paper we concentrate on endowing such models with the capability of answering questions that require external knowledge. Our approach, termed Wiki-LLaVA, aims at integrating an external knowledge source of multimodal documents, which is accessed through a hierarchical retrieval pipeline. Relevant passages, using this approach, are retrieved from the external knowledge source and employed as additional context for the LLM, augmenting the effectiveness and precision of generated dialogues. We conduct extensive experiments on datasets tailored for visual question answering with external data and demonstrate the appropriateness of our approach.
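A bare-bones sketch of a two-stage (document-level, then passage-level) retrieval step that returns extra context to prepend to the multimodal prompt; `retrieve_documents` and `rank_passages` are hypothetical placeholders for the pipeline's actual retrievers.

```python
from typing import Callable, Dict, List

def hierarchical_retrieve(
    query: str,
    retrieve_documents: Callable[[str, int], List[Dict]],       # coarse stage: whole documents
    rank_passages: Callable[[str, List[str], int], List[str]],  # fine stage: passages
    n_docs: int = 3,
    n_passages: int = 5,
) -> str:
    """Two-stage retrieval in the spirit of a hierarchical pipeline: first pick
    candidate documents, then rank their passages, and return the selected
    passages as additional context for the LLM prompt."""
    docs = retrieve_documents(query, n_docs)
    candidate_passages = [p for d in docs for p in d["passages"]]
    top_passages = rank_passages(query, candidate_passages, n_passages)
    return "\n\n".join(top_passages)
```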
https://arxiv.org/abs/2404.15406
With the increasing maturity of text-to-image and image-to-image generative models, AI-generated images (AGIs) have shown great application potential in advertising, entertainment, education, social media, etc. Although remarkable advancements have been achieved in generative models, little effort has been devoted to designing relevant quality assessment models. In this paper, we propose a novel blind image quality assessment (IQA) network, named AMFF-Net, for AGIs. AMFF-Net evaluates AGI quality from three dimensions, i.e., "visual quality", "authenticity", and "consistency". Specifically, inspired by the characteristics of the human visual system and motivated by the observation that "visual quality" and "authenticity" are characterized by both local and global aspects, AMFF-Net scales the image up and down and takes the scaled images and the original-sized image as inputs to obtain multi-scale features. After that, an Adaptive Feature Fusion (AFF) block is used to adaptively fuse the multi-scale features with learnable weights. In addition, considering the correlation between the image and the prompt, AMFF-Net compares the semantic features from the text encoder and the image encoder to evaluate text-to-image alignment. We carry out extensive experiments on three AGI quality assessment databases, and the results show that our AMFF-Net obtains better performance than nine state-of-the-art blind IQA methods. Ablation experiments further demonstrate the effectiveness of the proposed multi-scale input strategy and AFF block.
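A rough sketch of the multi-scale input strategy and an adaptive fusion block with learnable softmax weights; the backbone, feature dimensions, and exact fusion form are assumptions (the consistency branch, which compares text- and image-encoder embeddings, is omitted for brevity).

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AdaptiveFeatureFusion(nn.Module):
    """Fuse per-scale feature vectors with learnable softmax weights
    (one simple realization of an adaptive fusion block)."""
    def __init__(self, num_scales: int):
        super().__init__()
        self.weights = nn.Parameter(torch.zeros(num_scales))

    def forward(self, feats: torch.Tensor) -> torch.Tensor:
        # feats: (batch, num_scales, dim)
        w = F.softmax(self.weights, dim=0)            # learnable fusion weights
        return (w.view(1, -1, 1) * feats).sum(dim=1)  # (batch, dim)

def multiscale_features(image: torch.Tensor, backbone: nn.Module) -> torch.Tensor:
    """Extract features from down-scaled, original, and up-scaled versions of
    the image and stack them along a scale axis. image: (batch, 3, H, W); the
    backbone is assumed to accept variable input sizes and return (batch, dim)."""
    feats = []
    for s in (0.5, 1.0, 2.0):
        x = F.interpolate(image, scale_factor=s, mode="bilinear", align_corners=False)
        feats.append(backbone(x))                     # (batch, dim) per scale
    return torch.stack(feats, dim=1)                  # (batch, num_scales, dim)
```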
https://arxiv.org/abs/2404.15163
This paper presents a novel exploration into the regressive side effects of training Large Language Models (LLMs) to mimic student misconceptions for personalized education. We highlight the problem that as LLMs are trained to more accurately mimic student misconceptions, there is a compromise in the factual integrity and reasoning ability of the models. Our work involved training an LLM on a student-tutor dialogue dataset to predict student responses. The results demonstrated a decrease in the model's performance across multiple benchmark datasets, including the ARC reasoning challenge and TruthfulQA, which evaluates the truthfulness of model's generated responses. Furthermore, the HaluEval Dial dataset, used for hallucination detection, and MemoTrap, a memory-based task dataset, also reported a decline in the model accuracy. To combat these side effects, we introduced a "hallucination token" technique. This token, appended at the beginning of each student response during training, instructs the model to switch between mimicking student misconceptions and providing factually accurate responses. Despite the significant improvement across all datasets, the technique does not completely restore the LLM's baseline performance, indicating the need for further research in this area. This paper contributes to the ongoing discussion on the use of LLMs for student modeling, emphasizing the need for a balance between personalized education and factual accuracy.
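A small sketch of how training turns could be marked with a hallucination token so the model can later be switched between mimicking misconceptions and answering factually; the token string and dialogue format are assumptions.

```python
from typing import Dict, List

HALLUCINATION_TOKEN = "<hallu>"   # assumed marker string; the paper's actual token may differ

def prepare_training_turns(dialogue: List[Dict[str, str]]) -> List[Dict[str, str]]:
    """Prepend a 'hallucination token' to every student turn the model is
    trained to imitate, so mimicking misconceptions (token present) can be
    separated from factual answering (token absent) at inference time."""
    prepared = []
    for turn in dialogue:
        if turn["role"] == "student":
            prepared.append({"role": "student",
                             "text": f"{HALLUCINATION_TOKEN} {turn['text']}"})
        else:
            prepared.append(turn)
    return prepared

# Example usage on a two-turn exchange:
example = prepare_training_turns([
    {"role": "tutor", "text": "Why do seasons change?"},
    {"role": "student", "text": "Because the Earth gets closer to the Sun in summer."},
])
```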
https://arxiv.org/abs/2404.15156
Large Language Models (LLMs) have emerged as powerful candidates to inform clinical decision-making processes. While these models play an increasingly prominent role in shaping the digital landscape, two growing concerns emerge in healthcare applications: 1) to what extent do LLMs exhibit social bias based on patients' protected attributes (like race), and 2) how do design choices (like architecture design and prompting strategies) influence the observed biases? To answer these questions rigorously, we evaluated eight popular LLMs across three question-answering (QA) datasets using clinical vignettes (patient descriptions) standardized for bias evaluations. We employ red-teaming strategies to analyze how demographics affect LLM outputs, comparing both general-purpose and clinically trained models. Our extensive experiments reveal various disparities (some significant) across protected groups. We also observe several counter-intuitive patterns, such as larger models not necessarily being less biased and models fine-tuned on medical data not necessarily being better than general-purpose models. Furthermore, our study demonstrates the impact of prompt design on bias patterns, showing that specific phrasing can influence bias patterns and that reflection-type approaches (like Chain of Thought) can effectively reduce biased outcomes. Consistent with prior studies, we call for additional evaluation, scrutiny, and enhancement of LLMs used in clinical decision support applications.
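A minimal sketch of the kind of demographic perturbation probe used in such red-teaming setups: fill one clinical vignette template with every combination of protected attributes and record the model's answers for later disparity analysis. The attribute lists and template are illustrative, and `ask_model` is a hypothetical LLM wrapper.

```python
from itertools import product
from typing import Callable, Dict, List, Tuple

def demographic_probe(
    vignette_template: str,               # e.g. "A {race} {gender} patient presents with chest pain..."
    question: str,
    attributes: Dict[str, List[str]],     # protected attributes to vary
    ask_model: Callable[[str, str], str], # hypothetical LLM call: (vignette, question) -> answer
) -> Dict[Tuple[str, ...], str]:
    """Fill the same clinical vignette with every combination of protected
    attributes and record the model's answer for each, so per-group
    disparities can be compared afterwards."""
    results = {}
    keys = list(attributes)
    for combo in product(*(attributes[k] for k in keys)):
        vignette = vignette_template.format(**dict(zip(keys, combo)))
        results[combo] = ask_model(vignette, question)
    return results
```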
https://arxiv.org/abs/2404.15149
The rapid advancement of large-scale vision-language models has showcased remarkable capabilities across various tasks. However, the lack of extensive and high-quality image-text data in medicine has greatly hindered the development of large-scale medical vision-language models. In this work, we present a diagnosis-guided bootstrapping strategy that exploits both image and label information to construct vision-language datasets. Based on the constructed dataset, we developed MedDr, a generalist foundation model for healthcare capable of handling diverse medical data modalities, including radiology, pathology, dermatology, retinography, and endoscopy. Moreover, during inference, we propose a simple but effective retrieval-augmented medical diagnosis strategy, which enhances the model's generalization ability. Extensive experiments on visual question answering, medical report generation, and medical image diagnosis demonstrate the superiority of our method.
https://arxiv.org/abs/2404.15127