Abstract
AI in dermatology is evolving rapidly, but a major limitation to training trustworthy classifiers is the scarcity of data with ground-truth concept-level labels, i.e., meta-labels that are semantically meaningful to humans. Foundation models such as CLIP, which provide zero-shot capabilities, can help alleviate this challenge by leveraging the vast number of image-caption pairs available on the internet. CLIP can be fine-tuned on domain-specific image-caption pairs to improve classification performance. However, CLIP's pre-training data is not well aligned with the medical jargon that clinicians use when making diagnoses. The development of large language models (LLMs) in recent years opens up the possibility of leveraging the expressive power of these models to generate rich text. Our goal is to use these models to generate caption text that aligns well both with the clinical lexicon and with the natural human language used in CLIP's pre-training data. Starting from the captions of images in PubMed articles, we extend them by passing the raw captions through an LLM fine-tuned on several of the field's textbooks. We find that using captions generated by an expressive, fine-tuned LLM such as GPT-3.5 improves downstream zero-shot concept classification performance.
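The zero-shot concept classification mentioned above works by comparing an image embedding against embeddings of text prompts, one per clinical concept, and scoring them by cosine similarity. A minimal sketch of that scoring step, with toy vectors standing in for the outputs of CLIP's image and text encoders (the concept names and the temperature value are illustrative assumptions, not details from the paper):

```python
import math

def _normalize(v):
    # Scale a vector to unit length so the dot product equals cosine similarity.
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v]

def zero_shot_concept_scores(image_emb, text_embs, temperature=100.0):
    """Softmax over cosine similarities between one image embedding and a
    bank of concept-prompt embeddings (CLIP-style scoring)."""
    img = _normalize(image_emb)
    logits = [temperature * sum(a * b for a, b in zip(_normalize(t), img))
              for t in text_embs]
    m = max(logits)                       # subtract max for numerical stability
    exps = [math.exp(l - m) for l in logits]
    z = sum(exps)
    return [e / z for e in exps]

# Toy example: 3 hypothetical dermoscopy concepts in a 4-d embedding space.
# In practice, text_bank would come from encoding prompts such as
# "a dermoscopy image showing <concept>" with CLIP's text tower.
concepts = ["pigment network", "blue-whitish veil", "regression structures"]
text_bank = [[1.0, 0.0, 0.0, 0.0],
             [0.0, 1.0, 0.0, 0.0],
             [0.0, 0.0, 1.0, 0.0]]
image = [0.9, 0.2, 0.1, 0.0]              # closest to the first concept

probs = zero_shot_concept_scores(image, text_bank)
print(concepts[probs.index(max(probs))])  # -> pigment network
```

The paper's contribution sits upstream of this step: richer, LLM-rewritten captions move the text embeddings closer to clinical vocabulary, which sharpens these similarity scores.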
URL
https://arxiv.org/abs/2404.13043