Paper Reading AI Learner

LaDiC: Are Diffusion Models Really Inferior to Autoregressive Counterparts for Image-to-Text Generation?

2024-04-16 17:47:16
Yuchi Wang, Shuhuai Ren, Rundong Gao, Linli Yao, Qingyan Guo, Kaikai An, Jianhong Bai, Xu Sun

Abstract

Diffusion models have exhibited remarkable capabilities in text-to-image generation. However, their performance in image-to-text generation, specifically image captioning, has lagged behind Auto-Regressive (AR) models, casting doubt on their applicability for such tasks. In this work, we revisit diffusion models, highlighting their capacity for holistic context modeling and parallel decoding. With these benefits, diffusion models can alleviate the inherent limitations of AR methods, including their slow inference speed, error propagation, and unidirectional constraints. Furthermore, we identify that the prior underperformance of diffusion models stems from the absence of an effective latent space for image-text alignment and from the discrepancy between continuous diffusion processes and discrete textual data. In response, we introduce a novel architecture, LaDiC, which utilizes a split BERT to create a dedicated latent space for captions and integrates a regularization module to manage varying text lengths. Our framework also includes a diffuser for semantic image-to-text conversion and a Back&Refine technique to enhance token interactivity during inference. LaDiC achieves state-of-the-art performance for diffusion-based methods on the MS COCO dataset with 38.2 BLEU@4 and 126.2 CIDEr, demonstrating exceptional performance without pre-training or ancillary modules. This indicates strong competitiveness with AR models, revealing the previously untapped potential of diffusion models in image-to-text generation.
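
The abstract describes a non-autoregressive pipeline: caption tokens are represented in a continuous latent space and all token latents are denoised jointly, conditioned on the image, rather than generated left to right. The sketch below illustrates that general idea with a toy denoising loop. It is not the authors' implementation; every name (Denoiser, sample_caption_latents, the latent dimension, the noise schedule) is a hypothetical placeholder, and the mapping back to tokens and the Back&Refine step are omitted.

```python
# Minimal sketch (not the LaDiC code) of latent-space diffusion decoding for captioning.
# Assumptions: pooled visual features are already available, caption latents have a
# fixed dimension, and a small denoiser predicts the clean latents at each step.

import torch
import torch.nn as nn

LATENT_DIM, MAX_LEN, STEPS = 64, 20, 50  # toy sizes, chosen for illustration only

class Denoiser(nn.Module):
    """Predicts the clean caption latents x0 from noisy latents, the timestep,
    and image features (a stand-in for the paper's image-to-text diffuser)."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(LATENT_DIM * 2 + 1, 256), nn.GELU(),
            nn.Linear(256, LATENT_DIM),
        )

    def forward(self, x_t, t, img_feat):
        t_emb = t.expand(x_t.size(0), x_t.size(1), 1)             # broadcast timestep
        img = img_feat.unsqueeze(1).expand(-1, x_t.size(1), -1)   # condition every token
        return self.net(torch.cat([x_t, t_emb, img], dim=-1))

@torch.no_grad()
def sample_caption_latents(denoiser, img_feat, steps=STEPS):
    """Parallel decoding: all token latents are refined jointly at every step,
    instead of left to right as in autoregressive captioning."""
    x_t = torch.randn(img_feat.size(0), MAX_LEN, LATENT_DIM)      # start from pure noise
    for step in reversed(range(1, steps + 1)):
        t = torch.full((1,), step / steps)
        x0_hat = denoiser(x_t, t, img_feat)                       # predict clean latents
        noise_scale = (step - 1) / steps
        x_t = x0_hat + noise_scale * torch.randn_like(x0_hat)     # re-noise to next level
    return x_t  # a text decoder (e.g. one half of a split BERT) would map these to tokens

if __name__ == "__main__":
    img_feat = torch.randn(2, LATENT_DIM)   # stand-in for visual features of 2 images
    latents = sample_caption_latents(Denoiser(), img_feat)
    print(latents.shape)                    # torch.Size([2, 20, 64])
```

Because every token latent is updated at every step, inference cost scales with the number of denoising steps rather than the caption length, which is the source of the parallel-decoding advantage the abstract contrasts with AR models.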

Abstract (translated)

Diffusion models have shown remarkable performance in text-to-image generation. However, in image-to-text generation, and image captioning in particular, their performance has lagged behind Auto-Regressive (AR) models. In this work, we revisit diffusion models, highlighting their capacity for holistic context modeling and parallel decoding. With these advantages, diffusion models can alleviate the inherent limitations of AR methods, including slow inference, error propagation, and unidirectional constraints. Furthermore, we identify that the prior underperformance of diffusion models stems from the absence of an effective latent space for image-text alignment and from the discrepancy between continuous diffusion processes and discrete textual data. In response, we introduce a new architecture, LaDiC, which uses a split BERT to create a dedicated latent space for captions and incorporates a regularization module to handle varying text lengths. Our framework also includes a diffuser for semantic image-to-text conversion and a Back&Refine technique that enhances token interactivity during inference. LaDiC achieves state-of-the-art performance among diffusion-based methods on the MS COCO dataset, with 38.2 BLEU@4 and 126.2 CIDEr, demonstrating strong performance without pre-training or auxiliary modules. This indicates strong competitiveness with AR models and reveals the previously untapped potential of diffusion models in image-to-text generation.

URL

https://arxiv.org/abs/2404.10763

PDF

https://arxiv.org/pdf/2404.10763.pdf

