Paper Reading AI Learner

Kvasir-VQA-x1: A Multimodal Dataset for Medical Reasoning and Robust MedVQA in Gastrointestinal Endoscopy

2025-06-11 17:31:38
Sushant Gautam, Michael A. Riegler, Pål Halvorsen

Abstract

Medical Visual Question Answering (MedVQA) is a promising field for developing clinical decision support systems, yet progress is often limited by the available datasets, which can lack clinical complexity and visual diversity. To address these gaps, we introduce Kvasir-VQA-x1, a new, large-scale dataset for gastrointestinal (GI) endoscopy. Our work significantly expands upon the original Kvasir-VQA by incorporating 159,549 new question-answer pairs that are designed to test deeper clinical reasoning. We developed a systematic method using large language models to generate these questions, which are stratified by complexity to better assess a model's inference capabilities. To ensure our dataset prepares models for real-world clinical scenarios, we have also introduced a variety of visual augmentations that mimic common imaging artifacts. The dataset is structured to support two main evaluation tracks: one for standard VQA performance and another to test model robustness against these visual perturbations. By providing a more challenging and clinically relevant benchmark, Kvasir-VQA-x1 aims to accelerate the development of more reliable and effective multimodal AI systems for use in clinical settings. The dataset is fully accessible and adheres to FAIR data principles, making it a valuable resource for the wider research community. Code and data: this https URL and this https URL
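
The abstract describes a robustness track built on visual augmentations that mimic common endoscopic imaging artifacts. The sketch below illustrates the kind of perturbations such a track might probe (blur, exposure shifts, compression artifacts); the specific augmentation choices, parameter ranges, and file names are illustrative assumptions and are not taken from the paper or its released code.

```python
# Minimal sketch of imaging-artifact-style perturbations for robustness testing.
# The augmentations and parameters are illustrative assumptions, not the
# exact pipeline used in Kvasir-VQA-x1.
import io
import random

from PIL import Image, ImageEnhance, ImageFilter


def perturb_endoscopy_image(img: Image.Image, seed: int = 0) -> Image.Image:
    """Apply one randomly chosen perturbation to a single endoscopy frame."""
    rng = random.Random(seed)
    choice = rng.choice(["blur", "brightness", "jpeg"])

    if choice == "blur":
        # Simulate motion or defocus blur.
        return img.filter(ImageFilter.GaussianBlur(radius=rng.uniform(1.0, 3.0)))

    if choice == "brightness":
        # Simulate over- or under-exposure from the endoscope light source.
        return ImageEnhance.Brightness(img).enhance(rng.uniform(0.6, 1.4))

    # Simulate compression artifacts with a lossy JPEG round-trip.
    buf = io.BytesIO()
    img.save(buf, format="JPEG", quality=rng.randint(20, 50))
    buf.seek(0)
    return Image.open(buf).convert("RGB")


if __name__ == "__main__":
    # Hypothetical usage; the file name is a placeholder, not part of the dataset.
    frame = Image.open("example_endoscopy_frame.jpg").convert("RGB")
    perturbed = perturb_endoscopy_image(frame, seed=42)
    perturbed.save("example_endoscopy_frame_perturbed.jpg")
```

In a robustness evaluation of this kind, the same question-answer pairs would typically be posed on both the original and the perturbed frames, and the drop in VQA accuracy under perturbation serves as the robustness signal.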

Abstract (translated)

Medical Visual Question Answering (MedVQA) is a promising field for developing clinical decision support systems, yet progress is often limited by the available datasets, which can lack clinical complexity and visual diversity. To address these issues, we introduce Kvasir-VQA-x1, a new large-scale dataset for gastrointestinal endoscopy. Our work substantially expands the original Kvasir-VQA with 159,549 new question-answer pairs designed to test deeper clinical reasoning. We developed a systematic method using large language models to generate these questions, stratified by complexity to better assess a model's inference capabilities. To ensure the dataset prepares models for real-world clinical scenarios, we also introduce a variety of visual augmentations that mimic common imaging artifacts. The dataset is structured to support two main evaluation tracks: one for standard VQA performance and another for testing model robustness against these visual perturbations. By providing a more challenging and clinically relevant benchmark, Kvasir-VQA-x1 aims to accelerate the development of more reliable and effective multimodal AI systems for clinical use. The dataset is fully open and adheres to FAIR data principles, making it a valuable resource for the wider research community. Code and data: [link] and [link]

URL

https://arxiv.org/abs/2506.09958

PDF

https://arxiv.org/pdf/2506.09958.pdf

