Abstract
Report generation models offer fine-grained textual interpretations of medical images like chest X-rays, yet they often lack interactivity (i.e., the ability to steer the generation process through user queries) and localized interpretability (i.e., visually grounding their predictions), which we deem essential for future adoption in clinical practice. While there have been efforts to tackle these issues, they are either limited in their interactivity, as they do not support textual queries, or fail to also offer localized interpretability. Therefore, we propose a novel multitask architecture and training paradigm that integrates textual prompts and bounding boxes for diverse aspects such as anatomical regions and pathologies. We call this approach the Chest X-Ray Explainer (ChEX). Evaluations across a heterogeneous set of 9 chest X-ray tasks, including localized image interpretation and report generation, showcase its competitiveness with SOTA models, while additional analyses demonstrate ChEX's interactive capabilities.
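To make the described interface concrete, the following is a minimal sketch (not the authors' implementation) of how a prompt-conditioned, grounded interpreter could be structured: textual prompts and/or bounding boxes are encoded into query embeddings, cross-attend to image features, and yield one predicted box plus a per-query feature that a text decoder would turn into a sentence. All module names, dimensions, and the (cx, cy, w, h) box format are assumptions for illustration only.

```python
import torch
from torch import nn


class PromptedRegionDecoder(nn.Module):
    """Hypothetical sketch of a query-conditioned interpreter for chest X-rays.

    Each query (from a textual prompt or a user-supplied bounding box) is decoded
    against image features; the model returns one normalized box per query and the
    decoded query features that a downstream text decoder could caption.
    """

    def __init__(self, d_model: int = 256, num_heads: int = 8, num_layers: int = 3):
        super().__init__()
        layer = nn.TransformerDecoderLayer(d_model, num_heads, batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, num_layers)
        self.box_head = nn.Linear(d_model, 4)  # predicts (cx, cy, w, h), assumed format

    def forward(self, image_tokens: torch.Tensor, query_embeddings: torch.Tensor):
        # image_tokens:     (B, N_patches, d_model) from an image encoder
        # query_embeddings: (B, N_queries, d_model) from prompt/box encoders
        decoded = self.decoder(query_embeddings, image_tokens)
        boxes = self.box_head(decoded).sigmoid()  # normalized to [0, 1]
        return boxes, decoded


# Usage sketch: two textual-prompt queries (e.g. "left lung", "cardiomegaly")
model = PromptedRegionDecoder()
image_tokens = torch.randn(1, 196, 256)
queries = torch.randn(1, 2, 256)
boxes, features = model(image_tokens, queries)  # boxes: (1, 2, 4)
```

Decoding every query in one pass keeps the model interactive: new textual prompts or user-drawn boxes only add query embeddings, without re-encoding the image.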
URL
https://arxiv.org/abs/2404.15770