Abstract
Text-guided image inpainting (TGII) aims to restore the missing regions of a damaged image according to a given text. Existing methods rely on a strong vision encoder and a cross-modal fusion model to integrate cross-modal features. However, these methods allocate most of the computation to visual encoding while devoting little to modeling modality interactions. Moreover, they fuse cross-modal information only at the level of deep features, ignoring fine-grained alignment between text and image. Recently, vision-language pre-trained models (VLPM), which encapsulate rich cross-modal alignment knowledge, have advanced most multimodal tasks. In this work, we propose a novel model for TGII that improves cross-modal alignment (CMA). The CMA model consists of a VLPM serving as the vision-language encoder, an image generator, and global-local discriminators. To exploit cross-modal alignment knowledge for image restoration, we introduce cross-modal alignment distillation and in-sample distribution distillation. In addition, we employ adversarial training so that the model can effectively fill missing regions containing complicated structures. Experiments on two popular vision-language datasets show that our model achieves state-of-the-art performance compared with strong competitors.
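The abstract does not specify the form of the cross-modal alignment distillation loss. A minimal sketch of one plausible formulation, assuming the student (the inpainting model) is trained to match the teacher VLPM's text-token-to-image-patch alignment distributions via a KL divergence; the function and variable names here are illustrative, not from the paper:

```python
import math

def softmax(row, tau=1.0):
    """Temperature-scaled softmax over one row of similarity scores."""
    m = max(s / tau for s in row)
    exps = [math.exp(s / tau - m) for s in row]
    z = sum(exps)
    return [e / z for e in exps]

def alignment_distillation_loss(teacher_sim, student_sim, tau=1.0):
    """Mean KL(teacher || student) over rows.

    Each row holds one text token's similarity to every image patch;
    softmax turns it into an alignment distribution. Driving the KL
    to zero transfers the teacher VLPM's fine-grained text-image
    alignment to the student.
    """
    total = 0.0
    for t_row, s_row in zip(teacher_sim, student_sim):
        p, q = softmax(t_row, tau), softmax(s_row, tau)
        total += sum(pi * (math.log(pi + 1e-9) - math.log(qi + 1e-9))
                     for pi, qi in zip(p, q))
    return total / len(teacher_sim)
```

When the student's similarity matrix equals the teacher's, the loss is zero; any deviation in the alignment distributions makes it positive, which is the property a distillation objective needs.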
URL
https://arxiv.org/abs/2301.11362