Paper Reading AI Learner

Improving Cross-modal Alignment for Text-Guided Image Inpainting

2023-01-26 19:18:27
Yucheng Zhou, Guodong Long

Abstract

Text-guided image inpainting (TGII) aims to restore missing regions of a damaged image based on a given text. Existing methods rely on a strong vision encoder and a cross-modal fusion model to integrate cross-modal features. However, these methods allocate most of the computation to visual encoding while spending little on modeling modality interactions. Moreover, they perform cross-modal fusion only on deep features, which ignores fine-grained alignment between text and image. Recently, vision-language pre-trained models (VLPM), which encapsulate rich cross-modal alignment knowledge, have advanced most multimodal tasks. In this work, we propose a novel model for TGII that improves cross-modal alignment (CMA). The CMA model consists of a VLPM serving as the vision-language encoder, an image generator, and global-local discriminators. To exploit cross-modal alignment knowledge for image restoration, we introduce cross-modal alignment distillation and in-sample distribution distillation. In addition, we employ adversarial training so that the model can effectively fill missing regions with complicated structures. Experiments are conducted on two popular vision-language datasets. Results show that our model achieves state-of-the-art performance compared with other strong competitors.
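The abstract does not specify the exact form of the cross-modal alignment distillation loss. As a minimal sketch of one plausible formulation (all names and shapes below are assumptions, not the paper's implementation), a teacher's text-to-image similarity matrix can be distilled into a student's by matching their softmax-normalized alignment distributions with a KL divergence:

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax over the given axis.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def alignment_distillation_loss(teacher_sim, student_sim, tau=1.0):
    """Hypothetical cross-modal alignment distillation loss.

    teacher_sim, student_sim: (num_text_tokens, num_image_patches)
    text-to-image similarity matrices from a frozen VLPM teacher and
    the trainable student, respectively. Each row is turned into an
    alignment distribution over image patches, and the student is
    pushed toward the teacher via a mean row-wise KL divergence.
    """
    p = softmax(np.asarray(teacher_sim, dtype=float) / tau)  # teacher alignment
    q = softmax(np.asarray(student_sim, dtype=float) / tau)  # student alignment
    kl = np.sum(p * (np.log(p + 1e-9) - np.log(q + 1e-9)), axis=-1)
    return float(kl.mean())
```

For example, identical teacher and student similarities give a loss near zero, while disagreeing alignments yield a positive penalty; the temperature `tau` controls how sharply the alignment distributions are peaked.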

URL

https://arxiv.org/abs/2301.11362

PDF

https://arxiv.org/pdf/2301.11362.pdf

