Paper Reading AI Learner

Enhancing Human-Computer Interaction in Chest X-ray Analysis using Vision and Language Model with Eye Gaze Patterns

2024-04-03 00:09:05
Yunsoo Kim, Jinge Wu, Yusuf Abdulle, Yue Gao, Honghan Wu

Abstract

Recent advancements in Computer-Assisted Diagnosis (CAD) have shown promising performance in medical imaging tasks, particularly in chest X-ray analysis. However, the interaction between these models and radiologists has been primarily limited to input images. This work proposes a novel approach to enhance human-computer interaction in chest X-ray analysis using Vision-Language Models (VLMs) augmented with radiologists' attention, incorporating eye gaze data alongside textual prompts. Our approach overlays heatmaps generated from eye gaze data onto medical images, highlighting the regions on which radiologists focus most intently during chest X-ray evaluation. We evaluate this methodology on tasks including visual question answering, chest X-ray report automation, error detection, and differential diagnosis. Our results demonstrate that including eye gaze information significantly improves the accuracy of chest X-ray analysis. The benefit of eye gaze for fine-tuning is also confirmed: the fine-tuned model outperforms other medical VLMs on all tasks except visual question answering. This work demonstrates the potential of combining a VLM's capabilities with radiologists' domain knowledge to improve AI models for medical imaging, paving a novel way toward human-centred Computer-Assisted Diagnosis.
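The gaze-heatmap overlay described in the abstract can be sketched as follows. This is a minimal illustration only, not the authors' released pipeline: the function name `gaze_heatmap_overlay`, the `(x, y, duration)` fixation format, and the `sigma`/`alpha` parameters are assumptions chosen for clarity.

```python
import numpy as np
from scipy.ndimage import gaussian_filter
from matplotlib import cm
from PIL import Image

def gaze_heatmap_overlay(xray, fixations, sigma=25, alpha=0.4):
    """Overlay a gaze heatmap onto a grayscale chest X-ray (hypothetical sketch).

    xray:      2-D numpy array (H, W), grayscale pixel values in [0, 255].
    fixations: list of (x, y, duration) tuples from an eye tracker (assumed format).
    sigma:     Gaussian spread in pixels, approximating foveal extent.
    alpha:     blending weight of the heatmap over the image.
    """
    h, w = xray.shape
    heat = np.zeros((h, w), dtype=np.float32)
    # Accumulate fixation duration at each gaze location.
    for x, y, dur in fixations:
        if 0 <= int(y) < h and 0 <= int(x) < w:
            heat[int(y), int(x)] += dur
    # Smooth point fixations into a continuous attention map.
    heat = gaussian_filter(heat, sigma=sigma)
    if heat.max() > 0:
        heat /= heat.max()
    # Colorize the normalized heatmap and alpha-blend it onto the X-ray.
    heat_rgb = cm.jet(heat)[..., :3]                     # (H, W, 3) in [0, 1]
    xray_rgb = np.repeat(xray[..., None], 3, axis=2) / 255.0
    blended = (1 - alpha) * xray_rgb + alpha * heat_rgb
    return Image.fromarray((blended * 255).astype(np.uint8))
```

The blended image would then be paired with a textual prompt (e.g., a report-generation or error-detection query) as the visual input to the VLM.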

URL

https://arxiv.org/abs/2404.02370

PDF

https://arxiv.org/pdf/2404.02370.pdf

