Correcting Diffusion-Based Perceptual Image Compression with Privileged End-to-End Decoder

2024-04-07 10:57:54
Yiyang Ma, Wenhan Yang, Jiaying Liu

Abstract

Images produced by diffusion models can attain excellent perceptual quality. However, it is challenging for diffusion models to guarantee low distortion, so integrating diffusion models with image compression models still needs more comprehensive exploration. This paper presents a diffusion-based image compression method that employs a privileged end-to-end decoder model as a correction, which achieves better perceptual quality while keeping distortion under control to an extent. We build a diffusion model and design a novel paradigm that combines the diffusion model with an end-to-end decoder, where the latter is responsible for transmitting the privileged information extracted at the encoder side. Specifically, we theoretically analyze the reconstruction process of diffusion models at the encoder side, where the original images are visible. Based on this analysis, we introduce an end-to-end convolutional decoder to provide a better approximation of the score function $\nabla_{\mathbf{x}_t}\log p(\mathbf{x}_t)$ at the encoder side and effectively transmit the combination. Experiments demonstrate the superiority of our method in both distortion and perception compared with previous perceptual compression methods.
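
As a rough illustration of why having the original image visible at the encoder side helps, the sketch below uses the standard DDPM forward process, which is an assumption here and not taken from the abstract. When $\mathbf{x}_0$ is known, the conditional score $\nabla_{\mathbf{x}_t}\log q(\mathbf{x}_t \mid \mathbf{x}_0)$ has a closed form and equals $-\boldsymbol{\epsilon}/\sqrt{1-\bar{\alpha}_t}$. The names `forward_noise`, `conditional_score`, and the value of `alpha_bar` are hypothetical and chosen only for this example; this is not the paper's method.

```python
import math
import random

def forward_noise(x0, alpha_bar, rng):
    """Sample x_t ~ N(sqrt(alpha_bar) * x0, (1 - alpha_bar) * I) (standard DDPM, assumed)."""
    eps = [rng.gauss(0.0, 1.0) for _ in x0]
    xt = [math.sqrt(alpha_bar) * a + math.sqrt(1.0 - alpha_bar) * e
          for a, e in zip(x0, eps)]
    return xt, eps

def conditional_score(xt, x0, alpha_bar):
    """Gradient of log q(x_t | x_0) with respect to x_t, available when x_0 is known."""
    return [-(b - math.sqrt(alpha_bar) * a) / (1.0 - alpha_bar)
            for a, b in zip(x0, xt)]

rng = random.Random(0)
x0 = [0.2, -0.5, 0.9, 0.0]   # toy "image" of four pixels
alpha_bar = 0.6              # hypothetical cumulative noise-schedule value

xt, eps = forward_noise(x0, alpha_bar, rng)
score = conditional_score(xt, x0, alpha_bar)

# The same quantity expressed through the injected noise.
score_from_eps = [-e / math.sqrt(1.0 - alpha_bar) for e in eps]

# Largest elementwise difference between the two expressions (floating-point error only).
max_gap = max(abs(s - t) for s, t in zip(score, score_from_eps))
print(max_gap)
```

A decoder that only sees the compressed bitstream has no access to this exact quantity, which is presumably why the abstract describes the encoder-side information as privileged.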

URL

https://arxiv.org/abs/2404.04916

PDF

https://arxiv.org/pdf/2404.04916.pdf

