Abstract
Medical Visual Question Answering (MedVQA) is a promising field for developing clinical decision support systems, yet progress is often limited by the available datasets, which can lack clinical complexity and visual diversity. To address these gaps, we introduce Kvasir-VQA-x1, a new, large-scale dataset for gastrointestinal (GI) endoscopy. Our work significantly expands upon the original Kvasir-VQA by incorporating 159,549 new question-answer pairs that are designed to test deeper clinical reasoning. We developed a systematic method using large language models to generate these questions, which are stratified by complexity to better assess a model's inference capabilities. To ensure our dataset prepares models for real-world clinical scenarios, we have also introduced a variety of visual augmentations that mimic common imaging artifacts. The dataset is structured to support two main evaluation tracks: one for standard VQA performance and another to test model robustness against these visual perturbations. By providing a more challenging and clinically relevant benchmark, Kvasir-VQA-x1 aims to accelerate the development of more reliable and effective multimodal AI systems for use in clinical settings. The dataset is fully accessible and adheres to FAIR data principles, making it a valuable resource for the wider research community. Code and data are linked from the arXiv abstract page.
URL
https://arxiv.org/abs/2506.09958