Abstract
Excellence in a wide variety of medical applications poses considerable challenges for AI, requiring advanced reasoning, access to up-to-date medical knowledge and understanding of complex multimodal data. Gemini models, with strong general capabilities in multimodal and long-context reasoning, offer exciting possibilities in medicine. Building on these core strengths of Gemini, we introduce Med-Gemini, a family of highly capable multimodal models that are specialized in medicine with the ability to seamlessly use web search, and that can be efficiently tailored to novel modalities using custom encoders. We evaluate Med-Gemini on 14 medical benchmarks, establishing new state-of-the-art (SoTA) performance on 10 of them, and surpass the GPT-4 model family on every benchmark where a direct comparison is viable, often by a wide margin. On the popular MedQA (USMLE) benchmark, our best-performing Med-Gemini model achieves SoTA performance of 91.1% accuracy, using a novel uncertainty-guided search strategy. On 7 multimodal benchmarks including NEJM Image Challenges and MMMU (health & medicine), Med-Gemini improves over GPT-4V by an average relative margin of 44.5%. We demonstrate the effectiveness of Med-Gemini's long-context capabilities through SoTA performance on a needle-in-a-haystack retrieval task from long de-identified health records and medical video question answering, surpassing prior bespoke methods using only in-context learning. Finally, Med-Gemini's performance suggests real-world utility by surpassing human experts on tasks such as medical text summarization, alongside demonstrations of promising potential for multimodal medical dialogue, medical research and education. Taken together, our results offer compelling evidence for Med-Gemini's potential, although further rigorous evaluation will be crucial before real-world deployment in this safety-critical domain.
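The "average relative margin" reported above is easy to confuse with an absolute difference in percentage points. As a minimal sketch, assuming the figure is the mean of per-benchmark relative improvements (the abstract does not spell out the exact aggregation), the calculation could look like the following. The function names and the score pairs are hypothetical and are not results from the paper.

```python
from statistics import mean


def relative_margin(ours: float, baseline: float) -> float:
    """Relative improvement of `ours` over `baseline`, in percent."""
    if baseline <= 0:
        raise ValueError("baseline score must be positive")
    return (ours - baseline) * 100.0 / baseline


def average_relative_margin(pairs) -> float:
    """Mean of per-benchmark relative margins (assumed aggregation)."""
    return mean(relative_margin(ours, base) for ours, base in pairs)


# Hypothetical (model, baseline) accuracy pairs, NOT taken from the paper.
example_pairs = [(60.0, 50.0), (75.0, 50.0)]

if __name__ == "__main__":
    # Per-benchmark margins are 20% and 50%, so the average is 35%.
    print(average_relative_margin(example_pairs))
```

In this toy example the absolute gains are 10 and 25 percentage points, while the relative margins are 20% and 50%. The same absolute gain therefore counts for more on a benchmark where the baseline score is low.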
Abstract (translated)
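Excelling across medical applications poses considerable challenges for AI, requiring advanced reasoning, up-to-date medical knowledge, and the ability to understand complex multimodal data. Gemini models, with strong general capabilities in multimodal and long-context reasoning, offer exciting possibilities in medicine. Building on these core strengths, we introduce Med-Gemini, a family of highly capable multimodal models specialized in medicine that can seamlessly use web search and can be tailored to novel modalities with custom encoders. We evaluate Med-Gemini on 14 medical benchmarks, setting new state-of-the-art (SoTA) performance on 10 of them and surpassing the GPT-4 model family on every benchmark where a direct comparison is possible. On the popular MedQA (USMLE) benchmark, our best-performing Med-Gemini model reaches SoTA accuracy of 91.1% using a novel uncertainty-guided search strategy. On 7 multimodal benchmarks, including NEJM Image Challenges and MMMU (health & medicine), Med-Gemini improves over GPT-4V by an average relative margin of 44.5%. We demonstrate the effectiveness of Med-Gemini's long-context capabilities through SoTA performance on a needle-in-a-haystack retrieval task over long de-identified health records and on medical video question answering. Finally, Med-Gemini's performance suggests real-world utility in areas such as medical text summarization, multimodal medical dialogue, medical research and education. Although further rigorous evaluation is needed before real-world deployment, our results offer compelling evidence of Med-Gemini's potential.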
URL
https://arxiv.org/abs/2404.18416