Abstract
Multimodal Large Language Models (MLLMs) have achieved strong performance on general visual benchmarks but struggle with out-of-distribution (OOD) tasks in specialized domains such as medical imaging, where labeled data is limited and expensive. We introduce LEAML, a label-efficient adaptation framework that leverages both scarce labeled VQA samples and abundant unlabeled images. Our approach generates domain-relevant pseudo question-answer pairs for unlabeled data using a QA generator regularized by caption distillation. Importantly, we selectively update only the neurons most relevant to question answering, enabling the QA generator to efficiently acquire domain-specific knowledge during distillation. Experiments on gastrointestinal endoscopy and sports VQA demonstrate that LEAML consistently outperforms standard fine-tuning under minimal supervision, highlighting the effectiveness of the proposed framework.
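The abstract does not detail how QA-relevant neurons are identified. Below is a minimal sketch of one plausible realization in PyTorch, assuming relevance is scored by accumulated gradient magnitude on the scarce labeled QA samples and that only the top-scoring fraction of neurons receives updates during distillation; the function names, the per-neuron scoring rule, and the keep_ratio parameter are all hypothetical illustrations, not taken from the paper.

import torch
import torch.nn as nn

def estimate_neuron_relevance(model, loss_fn, labeled_batches):
    # Accumulate |gradient| per output neuron (one score per row of each
    # Linear layer's weight) over the few labeled QA examples.
    scores = {}
    for inputs, targets in labeled_batches:
        model.zero_grad()
        loss_fn(model(inputs), targets).backward()
        for name, module in model.named_modules():
            if isinstance(module, nn.Linear) and module.weight.grad is not None:
                g = module.weight.grad.abs().mean(dim=1)
                scores[name] = scores.get(name, 0) + g
    model.zero_grad()
    return scores

def freeze_irrelevant_neurons(model, scores, keep_ratio=0.1):
    # Register hooks that zero the gradients of all but the top-scoring
    # neurons, so the optimizer only updates the selected rows.
    for name, module in model.named_modules():
        if isinstance(module, nn.Linear) and name in scores:
            k = max(1, int(keep_ratio * scores[name].numel()))
            mask = torch.zeros_like(scores[name], dtype=torch.bool)
            mask[torch.topk(scores[name], k).indices] = True
            module.weight.register_hook(lambda g, m=mask: g * m.unsqueeze(1).to(g.dtype))
            if module.bias is not None:
                module.bias.register_hook(lambda g, m=mask: g * m.to(g.dtype))

# Hypothetical usage on a toy stand-in for the QA generator:
model = nn.Sequential(nn.Linear(16, 32), nn.ReLU(), nn.Linear(32, 4))
batches = [(torch.randn(8, 16), torch.randint(0, 4, (8,)))]
scores = estimate_neuron_relevance(model, nn.CrossEntropyLoss(), batches)
freeze_irrelevant_neurons(model, scores, keep_ratio=0.1)
# Subsequent distillation steps would now update only the selected neurons.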
URL
https://arxiv.org/abs/2510.03232