Abstract
Vision-language models have become increasingly powerful for tasks that require an understanding of both visual and linguistic elements, bridging the gap between the two modalities. In the context of multimodal clinical AI, there is a growing need for models with domain-specific knowledge, as existing models often lack the expertise required for medical applications. In this paper, we take brain abnormalities as an example to demonstrate how to automatically collect medical image-text aligned data for pretraining from public resources such as PubMed. In particular, we present a pipeline that streamlines the pre-training process by first collecting a large brain image-text dataset from case reports and published journals, and subsequently constructing a high-performance vision-language model tailored to specific medical tasks. We also investigate the unique challenge of mapping subfigures to subcaptions in the medical domain. We evaluated the resulting model with both quantitative and qualitative intrinsic evaluations. The resulting dataset and our code are available at this https URL.
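The subfigure-to-subcaption mapping challenge mentioned above can be illustrated with a minimal sketch. The paper's actual pipeline is not described in this abstract; the snippet below only shows one common preprocessing step under an assumed convention: splitting a compound figure caption into per-panel subcaptions keyed by markers such as "(a)" and "(b)", which can then be paired with the corresponding cropped subfigures. The function name `split_subcaptions` is hypothetical.

```python
import re

def split_subcaptions(caption: str) -> dict:
    """Split a compound caption into subcaptions keyed by panel label.

    Hypothetical helper: assumes panels are introduced by single-character
    markers like (a), (B), or (1). Real captions are messier and the
    paper's method may differ.
    """
    # re.split with one capture group yields:
    # [preamble, label1, text1, label2, text2, ...]
    parts = re.split(r'\(([A-Za-z0-9])\)\s*', caption)
    subcaptions = {}
    for label, text in zip(parts[1::2], parts[2::2]):
        # Normalize the label and strip trailing punctuation from the text.
        subcaptions[label.lower()] = text.strip().rstrip('.;,')
    return subcaptions

caption = ("MRI of the brain. (a) Axial T2-weighted image showing a lesion; "
           "(b) coronal FLAIR image of the same lesion.")
print(split_subcaptions(caption))
# {'a': 'Axial T2-weighted image showing a lesion',
#  'b': 'coronal FLAIR image of the same lesion'}
```

In practice, each subcaption would then be aligned with its cropped subfigure (e.g., via detected panel labels inside the image) to produce the image-text pairs used for pretraining.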
URL
https://arxiv.org/abs/2404.17779