Abstract
Over the past decade, generative models have demonstrated success in enhancing fundus images. However, evaluating these models remains a challenge. A benchmark for fundus image enhancement is needed for three main reasons: (1) conventional denoising metrics such as PSNR and SSIM fail to capture clinically relevant features, such as lesion preservation and vessel morphology consistency, which limits their applicability in real-world settings; (2) there is no unified evaluation protocol that covers both paired and unpaired enhancement methods, particularly those guided by clinical expertise; and (3) an evaluation framework should provide actionable insights to guide future advances in clinically aligned enhancement models. To address these needs, we introduce EyeBench-V2, a benchmark designed to bridge the gap between enhancement model performance and clinical utility. Our work offers three key contributions: (1) Multi-dimensional clinical alignment through downstream evaluations: beyond standard enhancement metrics, we assess performance across clinically meaningful tasks, including vessel segmentation, diabetic retinopathy (DR) grading, generalization to unseen noise patterns, and lesion segmentation. (2) Expert-guided evaluation design: we curate a novel dataset that enables fair comparisons between paired and unpaired enhancement methods, accompanied by a structured protocol for manual assessment by medical experts, which evaluates clinically critical aspects such as lesion structure alterations, background color shifts, and the introduction of artificial structures. (3) Actionable insights: our benchmark provides a rigorous, task-oriented analysis of existing generative models, giving clinical researchers the evidence they need to make informed decisions and identifying limitations in current methods to inform the design of next-generation enhancement models.
URL
https://arxiv.org/abs/2604.03806