Paper Reading AI Learner

Beyond Hallucinations: A Multimodal-Guided Task-Aware Generative Image Compression for Ultra-Low Bitrate

2025-12-06 08:20:25
Kaile Wang, Lijun He, Haisheng Fu, Haixia Bi, Fan Li

Abstract

Generative image compression has recently achieved impressive perceptual quality, but at ultra-low bitrates (bpp < 0.05) it often suffers from semantic deviations caused by generative hallucinations, limiting its reliable deployment in bandwidth-constrained 6G semantic communication scenarios. In this work, we reassess the positioning and role of multimodal guidance and propose a Multimodal-Guided Task-Aware Generative Image Compression (MTGC) framework. Specifically, MTGC integrates three guidance modalities to enhance semantic consistency: a concise yet robust text caption for global semantics, a highly compressed image (HCI) retaining low-level visual information, and Semantic Pseudo-Words (SPWs) for fine-grained, task-relevant semantics. The SPWs are generated by our Task-Aware Semantic Compression Module (TASCM), which operates in a task-oriented manner, driving the multi-head self-attention mechanism to focus on and extract semantics relevant to the generation task while filtering out redundancy. To make these modalities guide generation synergistically, we design a Multimodal-Guided Diffusion Decoder (MGDD) with a dual-path cooperative guidance mechanism that combines cross-attention and ControlNet additive residuals to precisely inject the three guidance signals into the diffusion process, leveraging the diffusion model's powerful generative priors to reconstruct the image. Extensive experiments demonstrate that MTGC consistently improves semantic consistency (e.g., DISTS drops by 10.59% on the DIV2K dataset) while also achieving remarkable gains in perceptual quality and pixel-level fidelity at ultra-low bitrates.
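The two mechanisms named in the abstract — attention-based extraction of pseudo-word embeddings, and dual-path guidance that pairs cross-attention conditioning with an additive residual — can be illustrated with a minimal NumPy sketch. This is not the paper's implementation: the shapes, the learned query vectors, and the single-head attention are illustrative assumptions standing in for TASCM's multi-head self-attention and the MGDD/ControlNet branches.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention_pool(queries, tokens):
    """TASCM-style idea: learned task queries attend over image feature
    tokens and pool them into a few pseudo-word embeddings (SPWs)."""
    scores = queries @ tokens.T / np.sqrt(tokens.shape[1])  # (k, n)
    return softmax(scores) @ tokens                         # (k, d)

def guided_step(z, text_emb, spw, hci_residual):
    """One illustrative denoising update with dual-path guidance:
    path 1 injects semantics via cross-attention to [caption; SPW] tokens,
    path 2 adds a ControlNet-style residual from the HCI branch."""
    cond = np.concatenate([text_emb, spw], axis=0)          # (t + k, d)
    attn = softmax(z @ cond.T / np.sqrt(z.shape[1])) @ cond # (m, d)
    return z + attn + hci_residual

rng = np.random.default_rng(0)
d = 8
tokens = rng.standard_normal((16, d))   # image feature tokens
queries = rng.standard_normal((4, d))   # learned task queries (assumed)
spw = attention_pool(queries, tokens)   # 4 pseudo-word embeddings
z = rng.standard_normal((10, d))        # latent tokens being denoised
text = rng.standard_normal((3, d))      # caption token embeddings
res = rng.standard_normal((10, d))      # HCI-derived residual
out = guided_step(z, text, spw, res)
print(spw.shape, out.shape)             # (4, 8) (10, 8)
```

The pooling step compresses many image tokens into a handful of task-focused embeddings (the redundancy filtering role the abstract attributes to TASCM), while the update step shows why the two guidance paths are complementary: cross-attention reweights semantics per latent token, whereas the residual injects spatially aligned low-level detail unchanged.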


URL

https://arxiv.org/abs/2512.06344

PDF

https://arxiv.org/pdf/2512.06344.pdf

