WangLab at MEDIQA-M3G 2024: Multimodal Medical Answer Generation using Large Language Models

2024-04-22 20:29:58
Ronald Xie, Steven Palayew, Augustin Toma, Gary Bader, Bo Wang

Abstract

This paper outlines our submission to the MEDIQA2024 Multilingual and Multimodal Medical Answer Generation (M3G) shared task. We report results for two standalone solutions under the English category of the task: the first makes two consecutive calls to the Claude 3 Opus API, and the second trains an image-disease label joint embedding in the style of CLIP for image classification. These two solutions placed 1st and 2nd respectively on the competition leaderboard, substantially outperforming the next best solution. Additionally, we discuss insights gained from post-competition experiments. While the performance of these two solutions has significant room for improvement, owing to the difficulty of the shared task and the challenging nature of medical visual question answering in general, we identify the multi-stage LLM approach and the CLIP image classification approach as promising avenues for further investigation.
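
For readers unfamiliar with the phrase "joint embedding in the style of CLIP", the following is a minimal sketch of the general technique, not the authors' implementation. It assumes image and disease-label embeddings have already been produced by some encoders (not shown); the function names, the temperature value of 0.07, and the use of NumPy are all illustrative assumptions. Training pulls matched image/label pairs together with a symmetric cross-entropy loss over cosine similarities, and classification picks the label whose embedding is most similar to the image embedding.

```python
import numpy as np


def l2_normalize(x, eps=1e-12):
    """Scale each row to unit length so dot products become cosine similarities."""
    return x / (np.linalg.norm(x, axis=-1, keepdims=True) + eps)


def clip_style_loss(image_emb, label_emb, temperature=0.07):
    """Symmetric contrastive loss; row i of each matrix is assumed to be a matched pair.

    The temperature is a hypothetical default, not a value taken from the paper.
    """
    logits = l2_normalize(image_emb) @ l2_normalize(label_emb).T / temperature
    idx = np.arange(logits.shape[0])

    def cross_entropy(lg):
        # Numerically stable log-softmax, then pick the matched (diagonal) entries.
        lg = lg - lg.max(axis=1, keepdims=True)
        log_probs = lg - np.log(np.exp(lg).sum(axis=1, keepdims=True))
        return -log_probs[idx, idx].mean()

    # Average the image-to-label and label-to-image directions.
    return 0.5 * (cross_entropy(logits) + cross_entropy(logits.T))


def classify(image_emb, class_label_emb):
    """Return, for each image, the index of the most similar class-label embedding."""
    sims = l2_normalize(image_emb) @ l2_normalize(class_label_emb).T
    return sims.argmax(axis=1)
```

One caveat of this plain formulation: if two images in a batch share the same disease label, the loss still treats the other image's label as a negative, so a real system would need to handle repeated labels in some way.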

URL

https://arxiv.org/abs/2404.14567

PDF

https://arxiv.org/pdf/2404.14567.pdf