Region of Interest (ROI)-based image compression optimizes bit allocation by prioritizing the ROI for higher-quality reconstruction. However, as users (including human clients and downstream machine tasks) become more diverse, ROI-based image compression needs to be customizable to support various preferences. For example, different users may define distinct ROIs or require different quality trade-offs between ROI and non-ROI. Existing ROI-based image compression schemes predefine the ROI, making it unchangeable, and lack effective mechanisms to balance reconstruction quality between ROI and non-ROI. This work proposes a paradigm for customizable ROI-based deep image compression. First, we develop a Text-controlled Mask Acquisition (TMA) module, which allows users to customize their ROI for compression simply by inputting the corresponding semantic \emph{text}, making the encoder text-controllable. Second, we design a Customizable Value Assign (CVA) mechanism, which masks the non-ROI to an extent chosen by the user rather than a constant one, in order to manage the reconstruction quality trade-off between ROI and non-ROI. Finally, we present a Latent Mask Attention (LMA) module, in which the latent spatial prior of the mask and the latent Rate-Distortion Optimization (RDO) prior of the image are extracted and fused in the latent space, and then used to optimize the latent representation of the source image. Experimental results demonstrate that the proposed customizable ROI-based deep image compression paradigm effectively meets the need for customizable ROI definition and mask acquisition, as well as for managing the reconstruction quality trade-off between ROI and non-ROI.
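To make the quality trade-off concrete, here is a minimal sketch of scaling non-ROI latent positions by a user-chosen factor; the function name, tensor shapes, and the specific scaling rule are illustrative assumptions, not the paper's CVA implementation.

```python
import numpy as np

def cva_style_mask(latent, roi_mask, non_roi_scale=0.3):
    """Scale non-ROI positions of a latent tensor by a user-chosen factor.

    latent:        (C, H, W) latent representation of the image
    roi_mask:      (H, W) binary mask, 1 inside the user-defined ROI
    non_roi_scale: 0 discards non-ROI detail entirely, 1 keeps full quality
    """
    weights = roi_mask + (1.0 - roi_mask) * non_roi_scale  # (H, W)
    return latent * weights[None, :, :]                    # broadcast over channels

# toy usage: a 192-channel latent with a centered square ROI
latent = np.random.randn(192, 16, 16).astype(np.float32)
roi = np.zeros((16, 16), dtype=np.float32)
roi[4:12, 4:12] = 1.0
masked = cva_style_mask(latent, roi, non_roi_scale=0.3)
print(masked.shape)  # (192, 16, 16)
```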
https://arxiv.org/abs/2507.00373
Diffusion-based image compression has shown remarkable potential for achieving ultra-low bitrate coding (less than 0.05 bits per pixel) with high realism, by leveraging the generative priors of large pre-trained text-to-image diffusion models. However, current approaches require a large number of denoising steps at the decoder to generate realistic results under extreme bitrate constraints, limiting their application in real-time compression scenarios. Additionally, these methods often sacrifice reconstruction fidelity, as diffusion models typically fail to guarantee pixel-level consistency. To address these challenges, we introduce StableCodec, which enables one-step diffusion for high-fidelity and high-realism extreme image compression with improved coding efficiency. To achieve ultra-low bitrates, we first develop an efficient Deep Compression Latent Codec to transmit a noisy latent representation for a single-step denoising process. We then propose a Dual-Branch Coding Structure, consisting of a pair of auxiliary encoder and decoder, to enhance reconstruction fidelity. Furthermore, we adopt end-to-end optimization with joint bitrate and pixel-level constraints. Extensive experiments on the CLIC 2020, DIV2K, and Kodak datasets demonstrate that StableCodec outperforms existing methods in terms of FID, KID, and DISTS by a significant margin, even at bitrates as low as 0.005 bits per pixel, while maintaining strong fidelity. Additionally, StableCodec achieves inference speeds comparable to mainstream transform coding schemes. All source code is available at this https URL.
https://arxiv.org/abs/2506.21977
RGB-IR (RGB-Infrared) image pairs are frequently used together in applications such as intelligent surveillance. However, as the number of modalities increases, the required data storage and transmission costs also double. Therefore, efficient RGB-IR data compression is essential. This work proposes a joint compression framework for RGB-IR image pairs. Specifically, to fully utilize cross-modality prior information for accurate context probability modeling within and between modalities, we propose a Channel-wise Cross-modality Entropy Model (CCEM). Within CCEM, a Low-frequency Context Extraction Block (LCEB) and a Low-frequency Context Fusion Block (LCFB) are designed to extract and aggregate the global low-frequency information from both modalities, which helps the model predict entropy parameters more accurately. Experimental results demonstrate that our approach outperforms existing RGB-IR image pair and single-modality compression methods on the LLVIP and KAIST datasets. For instance, the proposed framework achieves a 23.1% bit rate saving on the LLVIP dataset compared to the state-of-the-art RGB-IR image codec presented at CVPR 2022.
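To make "channel-wise, cross-modality context" more concrete, here is a toy sketch of predicting entropy parameters for one IR latent slice from the RGB latent plus previously decoded IR slices; the layer sizes, channel counts, and fusion scheme are assumptions for illustration, not the CCEM/LCEB/LCFB design.

```python
import torch
import torch.nn as nn

class ToyCrossModalEntropy(nn.Module):
    """Predict mean/scale of the current IR latent slice from the RGB latent
    and the already-decoded IR slices (channel-wise autoregression)."""
    def __init__(self, slice_ch=32, rgb_ch=192, decoded_ch=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(rgb_ch + decoded_ch, 128, 3, padding=1), nn.ReLU(),
            nn.Conv2d(128, 2 * slice_ch, 3, padding=1),  # mean and scale
        )

    def forward(self, rgb_latent, decoded_ir_slices):
        ctx = torch.cat([rgb_latent, decoded_ir_slices], dim=1)
        mean, scale = self.net(ctx).chunk(2, dim=1)
        return mean, torch.nn.functional.softplus(scale)  # scale must be positive

model = ToyCrossModalEntropy()
mean, scale = model(torch.randn(1, 192, 16, 16), torch.randn(1, 64, 16, 16))
print(mean.shape, scale.shape)  # torch.Size([1, 32, 16, 16]) twice
```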
https://arxiv.org/abs/2506.21851
Scalable image compression is a technique that progressively reconstructs multiple versions of an image for different requirements. In recent years, images have increasingly been consumed not only by humans but also by image recognition models. This shift has drawn growing attention to scalable image compression methods that serve both machine and human vision (ICMH). Many existing models employ neural network-based codecs, known as learned image compression, and have made significant strides in this field by carefully designing the loss functions. In some cases, however, models are overly reliant on their learning capacity, and their architectural design is not sufficiently considered. In this paper, we enhance the coding efficiency and interpretability of the ICMH framework by integrating an explicit residual compression mechanism, which is commonly employed in resolution-scalable coding methods such as JPEG2000. Specifically, we propose two complementary methods: Feature Residual-based Scalable Coding (FR-ICMH) and Pixel Residual-based Scalable Coding (PR-ICMH). The proposed methods are applicable to various machine vision tasks. Moreover, they provide the flexibility to choose between encoder complexity and compression performance, making the framework adaptable to diverse application requirements. Experimental results demonstrate the effectiveness of our proposed methods, with PR-ICMH achieving up to 29.57% BD-rate savings over the previous work.
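The explicit pixel-residual layering can be paraphrased in a few lines; the codec objects and their encode/decode methods below are placeholders, not the FR-ICMH/PR-ICMH interfaces.

```python
import numpy as np

def encode_scalable(image, machine_codec, residual_codec):
    """Two-layer scalable coding: a base layer for the machine task and an
    enhancement layer that codes the explicit pixel residual for human viewing."""
    base_bits = machine_codec.encode(image)            # placeholder codec object
    base_rec = machine_codec.decode(base_bits)
    residual = image.astype(np.float32) - base_rec     # explicit pixel residual
    enh_bits = residual_codec.encode(residual)
    return base_bits, enh_bits

def decode_for_human(base_bits, enh_bits, machine_codec, residual_codec):
    base_rec = machine_codec.decode(base_bits)
    residual = residual_codec.decode(enh_bits)
    return np.clip(base_rec + residual, 0.0, 255.0)    # human-oriented reconstruction
```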
https://arxiv.org/abs/2506.19297
Adversarial robustness of neural networks is an increasingly important area of research, combining studies on computer vision models, large language models (LLMs), and others. With the release of JPEG AI -- the first standard for end-to-end neural image compression (NIC) methods -- the question of evaluating NIC robustness has become critically significant. However, previous research has been limited to a narrow range of codecs and attacks. To address this, we present \textbf{NIC-RobustBench}, the first open-source framework for evaluating NIC robustness and the efficiency of adversarial defenses, in addition to comparing Rate-Distortion (RD) performance. The framework includes the largest number of codecs among all known NIC libraries and is easily scalable. The paper presents a comprehensive overview of the NIC-RobustBench framework and employs it to analyze NIC robustness. Our code is available online at this https URL.
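For intuition on what evaluating NIC robustness involves, a basic FGSM-style probe against a differentiable codec might look like the sketch below; the codec callable and the PSNR-based readout are assumptions, not NIC-RobustBench's API.

```python
import torch

def fgsm_probe(codec, image, eps=2.0 / 255.0):
    """Perturb the input to maximize reconstruction MSE of a differentiable codec."""
    x = image.clone().requires_grad_(True)
    x_hat = codec(x)                      # assumption: codec(x) returns the reconstruction
    loss = torch.nn.functional.mse_loss(x_hat, image)
    loss.backward()
    x_adv = (x + eps * x.grad.sign()).clamp(0.0, 1.0).detach()
    with torch.no_grad():
        adv_mse = torch.nn.functional.mse_loss(codec(x_adv), image)
    return x_adv, (-10 * torch.log10(adv_mse)).item()  # adversarial input, its PSNR in dB
```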
https://arxiv.org/abs/2506.19051
Autoregressive Initial Bits is a framework that integrates sub-image autoregression and latent variable modeling, demonstrating its advantages in lossless medical image compression. However, in existing methods, the image segmentation process leads to an even distribution of latent variable information across each sub-image, which in turn causes posterior collapse and inefficient utilization of latent variables. To deal with these issues, we propose a prediction-based end-to-end lossless medical image compression method named LVPNet, leveraging global latent variables to predict pixel values and encoding predicted probabilities for lossless compression. Specifically, we introduce the Global Multi-scale Sensing Module (GMSM), which extracts compact and informative latent representations from the entire image, effectively capturing spatial dependencies within the latent space. Furthermore, to mitigate the information loss introduced during quantization, we propose the Quantization Compensation Module (QCM), which learns the distribution of quantization errors and refines the quantized features to compensate for quantization loss. Extensive experiments on challenging benchmarks demonstrate that our method achieves superior compression efficiency compared to state-of-the-art lossless image compression approaches, while maintaining competitive inference speed. The code is at this https URL.
https://arxiv.org/abs/2506.17983
Although image compression is fundamental to visual data processing and has inspired numerous standard and learned codecs, these methods still suffer severe quality degradation at extremely low bits per pixel. While recent diffusion-based models provide enhanced generative performance at low bitrates, they still yield limited perceptual quality and prohibitive decoding latency due to the multiple denoising steps. In this paper, we propose the first single-step diffusion model for image compression (DiffO) that delivers high perceptual quality and fast decoding at ultra-low bitrates. DiffO achieves these goals by coupling two key innovations: (i) VQ Residual training, which factorizes a structural base code and a learned residual in latent space, capturing both global geometry and high-frequency details; and (ii) rate-adaptive noise modulation, which tunes denoising strength on the fly to match the desired bitrate. Extensive experiments show that DiffO surpasses state-of-the-art compression performance while improving decoding speed by about 50x compared to prior diffusion-based methods, greatly improving the practicality of generative codecs. The code will be available at this https URL.
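As a hedged illustration of rate-adaptive noise modulation: lower bitrates imply noisier transmitted latents, so the single denoising step should be run with a correspondingly larger noise level. The linear schedule below is a made-up placeholder, not DiffO's learned or tuned mapping.

```python
import numpy as np

def bpp_to_noise_level(bpp, bpp_min=0.005, bpp_max=0.05, sigma_min=0.1, sigma_max=1.0):
    """Map a target bitrate to a denoising strength: the lower the bitrate,
    the stronger the assumed corruption of the transmitted latent."""
    t = np.clip((bpp - bpp_min) / (bpp_max - bpp_min), 0.0, 1.0)
    return sigma_max + t * (sigma_min - sigma_max)   # linear, monotone decreasing

for bpp in (0.005, 0.02, 0.05):
    print(bpp, round(bpp_to_noise_level(bpp), 3))    # 1.0, 0.7, 0.1
```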
https://arxiv.org/abs/2506.16572
Training-free perceptual image codecs adopt a pre-trained unconditional generative model during decoding to avoid training a new conditional generative model. However, they rely heavily on diffusion inversion or sample communication, which takes from one minute to an intractable amount of time to decode a single image. In this paper, we propose a training-free algorithm that improves the perceptual quality of any existing codec with a theoretical guarantee. We further propose different implementations for optimal perceptual quality when the decoding time budget is $\approx 0.1$s, $0.1-10$s, and $\ge 10$s. Our approach: 1) improves the decoding time of training-free codecs from 1 min to $0.1-10$s with comparable perceptual quality; 2) can be applied to non-differentiable codecs such as VTM; 3) can be used to improve previous perceptual codecs, such as MS-ILLM; and 4) can easily achieve a perception-distortion trade-off. Empirically, we show that our approach successfully improves the perceptual quality of ELIC, VTM, and MS-ILLM with fast decoding. It achieves FID comparable to previous training-free codecs with significantly less decoding time, and still outperforms previous codecs based on conditional generative models, such as HiFiC and MS-ILLM, in terms of FID. The source code is provided in the supplementary material.
https://arxiv.org/abs/2506.16102
The improved semantic understanding of vision-language pretrained (VLP) models has made it increasingly difficult to protect publicly posted images from being exploited by search engines and other similar tools. In this context, this paper seeks to protect users' privacy by implementing defenses at the image compression stage to prevent exploitation. Specifically, we propose a flexible coding method, termed Privacy-Shielded Image Compression (PSIC), that can produce bitstreams with multiple decoding options. By default, the bitstream is decoded to preserve satisfactory perceptual quality while preventing interpretation by VLP models. Our method also retains the original image compression functionality. With a customizable input condition, the proposed scheme can reconstruct an image that preserves its full semantic information. A Conditional Latent Trigger Generation (CLTG) module is proposed to produce bias information based on customizable conditions to guide the decoding process toward different reconstructed versions, and an Uncertainty-Aware Encryption-Oriented (UAEO) optimization function is designed to leverage the soft labels inferred from the target VLP model's uncertainty on the training data. This paper further incorporates an adaptive multi-objective optimization strategy to obtain improved encrypting performance and perceptual quality simultaneously within a unified training process. The proposed scheme is plug-and-play and can be seamlessly integrated into most existing Learned Image Compression (LIC) models. Extensive experiments across multiple downstream tasks have demonstrated the effectiveness of our design.
https://arxiv.org/abs/2506.15201
The continuous improvements on image compression with variational autoencoders have led to learned codecs that are competitive with conventional approaches in terms of rate-distortion efficiency. Nonetheless, taking the quantization into account during the training process remains a problem, since it produces zero derivatives almost everywhere and needs to be replaced with a differentiable approximation which allows end-to-end optimization. Though there are different methods for approximating the quantization, none of them model the quantization noise correctly and thus result in suboptimal networks. Hence, we propose an additional finetuning training step: After conventional end-to-end training, parts of the network are retrained on quantized latents obtained at the inference stage. For entropy-constrained quantizers like Trellis-Coded Quantization, the impact of the quantizer is particularly difficult to approximate by rounding or adding noise, as the quantized latents are interdependently chosen through a trellis search based on both the entropy model and a distortion measure. We show that retraining on correctly quantized data consistently yields additional coding gain for both uniform scalar and especially for entropy-constrained quantization, without increasing inference complexity. For the Kodak test set, we obtain average savings between 1% and 2%, and for the TecNick test set up to 2.2% in terms of Bjøntegaard-Delta bitrate.
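Schematically, the proposed finetuning step freezes the analysis path, runs the true inference-time quantizer (e.g., hard rounding or a trellis search), and retrains the synthesis part on those latents. The loop below is a simplified sketch with assumed module names, not the authors' training code.

```python
import torch

def finetune_decoder(encoder, quantizer, decoder, loader, steps=1000, lr=1e-5):
    """Retrain only the decoder on latents produced by the true inference-time
    quantizer, rather than by its differentiable training proxy."""
    opt = torch.optim.Adam(decoder.parameters(), lr=lr)
    encoder.eval()
    for step, x in zip(range(steps), loader):
        with torch.no_grad():
            y_hat = quantizer(encoder(x))        # correctly quantized latents
        x_hat = decoder(y_hat)
        loss = torch.nn.functional.mse_loss(x_hat, x)
        opt.zero_grad()
        loss.backward()
        opt.step()
    return decoder
```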
https://arxiv.org/abs/2506.08662
Multispectral satellite images play a vital role in agriculture, fisheries, and environmental monitoring. However, their high dimensionality, large data volumes, and diverse spatial resolutions across multiple channels pose significant challenges for data compression and analysis. This paper presents ImpliSat, a unified framework specifically designed to address these challenges through efficient compression and reconstruction of multispectral satellite data. ImpliSat leverages Implicit Neural Representations (INR) to model satellite images as continuous functions over coordinate space, capturing fine spatial details across varying spatial resolutions. Furthermore, we introduce a Fourier modulation algorithm that dynamically adjusts to the spectral and spatial characteristics of each band, ensuring optimal compression while preserving critical image details.
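A compact sketch of coordinate-based fitting with Fourier features, which is the general mechanism ImpliSat builds on; the band count, frequency schedule, and network width here are arbitrary choices, not the paper's Fourier modulation algorithm.

```python
import torch
import torch.nn as nn

class FourierINR(nn.Module):
    """Map (x, y) coordinates to per-band pixel values via Fourier features."""
    def __init__(self, n_bands=3, n_freqs=16, hidden=128):
        super().__init__()
        self.register_buffer("freqs", 2.0 ** torch.arange(n_freqs, dtype=torch.float32) * torch.pi)
        self.mlp = nn.Sequential(
            nn.Linear(4 * n_freqs, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, n_bands),
        )

    def forward(self, coords):                     # coords: (N, 2) in [0, 1]
        proj = coords[..., None] * self.freqs      # (N, 2, n_freqs)
        feats = torch.cat([proj.sin(), proj.cos()], dim=-1).flatten(1)  # (N, 4*n_freqs)
        return self.mlp(feats)

model = FourierINR()
values = model(torch.rand(1024, 2))
print(values.shape)  # torch.Size([1024, 3])
```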
https://arxiv.org/abs/2506.01234
Recent contributions of semantic information theory reveal the set-element relationship between semantic and syntactic information, represented as synonymous relationships. In this paper, we propose a synonymous variational inference (SVI) method based on this synonymity viewpoint to re-analyze the perceptual image compression problem. It takes perceptual similarity as a typical synonymous criterion to build an ideal synonymous set (Synset), and approximates the posterior of its latent synonymous representation with a parametric density by minimizing a partial semantic KL divergence. This analysis theoretically proves that the optimization direction of perceptual image compression follows a triple tradeoff that can cover the existing rate-distortion-perception schemes. Additionally, we introduce synonymous image compression (SIC), a new image compression scheme that corresponds to the analytical process of SVI, and implement a progressive SIC codec to fully leverage the model's capabilities. Experimental results demonstrate comparable rate-distortion-perception performance using a single progressive SIC codec, thus verifying the effectiveness of our proposed analysis method.
https://arxiv.org/abs/2505.22438
While learned image compression (LIC) focuses on efficient data transmission, generative image compression (GIC) extends this framework by integrating generative modeling to produce photo-realistic reconstructed images. In this paper, we propose a novel diffusion-based generative modeling framework tailored for generative image compression. Unlike prior diffusion-based approaches that indirectly exploit diffusion modeling, we reinterpret the compression process itself as a forward diffusion path governed by stochastic differential equations (SDEs). A reverse neural network is trained to reconstruct images by reversing the compression process directly, without requiring Gaussian noise initialization. This approach achieves smooth rate adjustment and photo-realistic reconstructions with only a minimal number of sampling steps. Extensive experiments on benchmark datasets demonstrate that our method outperforms existing generative image compression approaches across a range of metrics, including perceptual distortion, statistical fidelity, and no-reference quality assessments.
https://arxiv.org/abs/2505.20984
As learned image compression (LIC) methods become increasingly computationally demanding, enhancing their training efficiency is crucial. This paper takes a step forward in accelerating the training of LIC methods by modeling the neural training dynamics. We first propose a Sensitivity-aware True and Dummy Embedding Training mechanism (STDET) that clusters LIC model parameters into a few separate modes where parameters are expressed as affine transformations of reference parameters within the same mode. By further utilizing the stable intra-mode correlations throughout training and parameter sensitivities, we gradually embed non-reference parameters, reducing the number of trainable parameters. Additionally, we incorporate a Sampling-then-Moving Average (SMA) technique, interpolating sampled weights from stochastic gradient descent (SGD) training to obtain the moving average weights, ensuring smooth temporal behavior and minimizing training state variances. Overall, our method significantly reduces training space dimensions and the number of trainable parameters without sacrificing model performance, thus accelerating model convergence. We also provide a theoretical analysis on the noisy quadratic model, showing that the proposed method achieves a lower training variance than standard SGD. Our approach offers valuable insights for further developing efficient training methods for LICs.
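The Sampling-then-Moving-Average idea can be caricatured in a few lines: sample weights from the SGD trajectory at fixed intervals and keep an exponential moving average for evaluation. The sampling interval and decay below are illustrative, not the paper's settings.

```python
import copy
import torch

def train_with_sma(model, loss_fn, loader, sample_every=10, decay=0.99, lr=1e-4):
    """Keep a moving average of weights sampled every few SGD steps."""
    opt = torch.optim.SGD(model.parameters(), lr=lr)
    avg_model = copy.deepcopy(model)
    for step, (x, y) in enumerate(loader):
        loss = loss_fn(model(x), y)
        opt.zero_grad()
        loss.backward()
        opt.step()
        if step % sample_every == 0:                     # sample, then update the average
            with torch.no_grad():
                for p_avg, p in zip(avg_model.parameters(), model.parameters()):
                    p_avg.mul_(decay).add_(p, alpha=1.0 - decay)
    return avg_model                                     # smoothed weights for evaluation
```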
https://arxiv.org/abs/2505.18107
While recent diffusion-based generative image codecs have shown impressive performance, their iterative sampling process introduces undesirable latency. In this work, we revisit the design of a diffusion-based codec and argue that multi-step sampling is not necessary for generative compression. Based on this insight, we propose OneDC, a One-step Diffusion-based generative image Codec that integrates a latent compression module with a one-step diffusion generator. Recognizing the critical role of semantic guidance in one-step diffusion, we propose using the hyperprior as a semantic signal, overcoming the limitations of text prompts in representing complex visual content. To further enhance the semantic capability of the hyperprior, we introduce a semantic distillation mechanism that transfers knowledge from a pretrained generative tokenizer to the hyperprior codec. Additionally, we adopt a hybrid pixel- and latent-domain optimization to jointly enhance both reconstruction fidelity and perceptual realism. Extensive experiments demonstrate that OneDC achieves SOTA perceptual quality even with one-step generation, offering over 40% bitrate reduction and 20x faster decoding compared to prior multi-step diffusion-based codecs. Code will be released later.
https://arxiv.org/abs/2505.16687
Most existing approaches for image and video compression perform transform coding in the pixel space to reduce redundancy. However, due to the misalignment between pixel-space distortion and human perception, such schemes often face difficulties in achieving both high realism and high fidelity at ultra-low bitrate. To solve this problem, we propose \textbf{G}enerative \textbf{L}atent \textbf{C}oding (\textbf{GLC}) models for image and video compression, termed GLC-image and GLC-video. The transform coding of GLC is conducted in the latent space of a generative vector-quantized variational auto-encoder (VQ-VAE). Compared to the pixel space, such a latent space offers greater sparsity, richer semantics, and better alignment with human perception, and shows advantages in achieving high-realism and high-fidelity compression. To further enhance performance, we improve the hyper prior by introducing a spatial categorical hyper module in GLC-image and a spatio-temporal categorical hyper module in GLC-video. Additionally, a code-prediction-based loss function is proposed to enhance semantic consistency. Experiments demonstrate that our scheme achieves high visual quality at ultra-low bitrate for both image and video compression. For image compression, GLC-image achieves an impressive bitrate of less than $0.04$ bpp, matching the FID of the previous SOTA model MS-ILLM while using $45\%$ less bitrate on the CLIC 2020 test set. For video compression, GLC-video achieves 65.3\% bitrate saving over PLVC in terms of DISTS.
https://arxiv.org/abs/2505.16177
Pretrained latent diffusion models have shown strong potential for lossy image compression, owing to their powerful generative priors. Most existing diffusion-based methods reconstruct images by iteratively denoising from random noise, guided by compressed latent representations. While these approaches have achieved high reconstruction quality, their multi-step sampling process incurs substantial computational overhead. Moreover, they typically require training separate models for different compression bit-rates, leading to significant training and storage costs. To address these challenges, we propose a one-step diffusion codec across multiple bit-rates, termed OSCAR. Specifically, our method views compressed latents as noisy variants of the original latents, where the level of distortion depends on the bit-rate. This perspective allows them to be modeled as intermediate states along a diffusion trajectory. By establishing a mapping from the compression bit-rate to a pseudo diffusion timestep, we condition a single generative model to support reconstructions at multiple bit-rates. Meanwhile, we argue that the compressed latents retain rich structural information, thereby making one-step denoising feasible. Thus, OSCAR replaces iterative sampling with a single denoising pass, significantly improving inference efficiency. Extensive experiments demonstrate that OSCAR achieves superior performance in both quantitative and visual quality metrics. The code and models will be released at this https URL.
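The bit-rate-to-timestep conditioning can be pictured roughly as follows; the lookup table and the denoiser call are hypothetical placeholders rather than OSCAR's trained mapping.

```python
# assumed: lower bit-rates distort the latent more, so they map to later
# (noisier) pseudo diffusion timesteps on a 1000-step schedule
BITRATE_TO_TIMESTEP = {0.50: 100, 0.25: 250, 0.10: 500, 0.05: 750}

def reconstruct_one_step(denoiser, compressed_latent, bpp):
    """Single denoising pass conditioned on the pseudo timestep for this bit-rate."""
    nearest_bpp = min(BITRATE_TO_TIMESTEP, key=lambda k: abs(k - bpp))
    t = BITRATE_TO_TIMESTEP[nearest_bpp]
    return denoiser(compressed_latent, timestep=t)   # one pass, no iterative sampling
```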
https://arxiv.org/abs/2505.16091
Denoising diffusion models have achieved impressive results on several image generation tasks, often outperforming GAN-based models. Recently, the generative capabilities of diffusion models have been employed for perceptual image compression, such as in CDC. A major drawback of these diffusion-based methods is that, while they produce images of impressive perceptual quality, they lose fidelity (i.e., increase distortion with respect to the original uncompressed images) when compared with traditional or learned image compression schemes that target fidelity. In this paper, we propose a hybrid compression scheme optimized for perceptual quality, extending the approach of the CDC model with a decoder network in order to reduce the impact on distortion metrics such as PSNR. After using the decoder network to generate an initial image, optimized for distortion, the latent-conditioned diffusion model refines the reconstruction for perceptual quality by predicting the residual. On standard benchmarks, we achieve up to +2dB PSNR fidelity improvements while maintaining comparable LPIPS and FID perceptual scores when compared with CDC. Additionally, the approach is easily extensible to video compression, where we achieve similar results.
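The hybrid pipeline reduces to a fidelity-first decode followed by a perceptual residual refinement; the two model objects below are placeholders for the distortion-optimized decoder and the latent-conditioned diffusion refiner.

```python
import numpy as np

def hybrid_decode(latent, decoder, diffusion_refiner):
    """Fidelity-first decode, then perceptual refinement via a predicted residual."""
    initial = decoder(latent)                       # optimized for PSNR-style distortion
    residual = diffusion_refiner(latent, initial)   # predicts what the decoder missed
    return np.clip(initial + residual, 0.0, 1.0)    # perceptually refined reconstruction
```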
https://arxiv.org/abs/2505.13152
The exponential growth of visual data in digital communications has intensified the need for efficient compression techniques that balance rate-distortion performance with computational feasibility. While recent neural compression approaches have shown promise, they still struggle with fundamental challenges: preserving perceptual quality at high compression ratios, computational efficiency, and adaptability to diverse visual content. This paper introduces GANCompress, a novel neural compression framework that synergistically combines Binary Spherical Quantization (BSQ) with Generative Adversarial Networks (GANs) to address these challenges. Our approach employs a transformer-based autoencoder with an enhanced BSQ bottleneck that projects latent representations onto a hypersphere, enabling efficient discretization with bounded quantization error. This is followed by a specialized GAN architecture incorporating frequency-domain attention and color consistency optimization. Experimental results demonstrate that GANCompress achieves substantial improvement in compression efficiency -- reducing file sizes by up to 100x with minimal visual distortion. Our method outperforms traditional codecs like H.264 by 12-15% in perceptual metrics while maintaining comparable PSNR/SSIM values, with 2.4x faster encoding and decoding speeds. On standard benchmarks including ImageNet-1k and COCO2017, GANCompress sets a new state-of-the-art, reducing FID from 0.72 to 0.41 (43% improvement) compared to previous methods while maintaining higher throughput. This work presents a significant advancement in neural compression technology with promising applications for real-time visual communication systems.
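The BSQ bottleneck projects latents onto a hypersphere before discretization; a bare-bones version of that operation (with a straight-through gradient, which is an assumption here) could look like the following sketch.

```python
import torch

def binary_spherical_quantize(z):
    """Project each latent vector onto the unit hypersphere, then snap every
    coordinate to +/- 1/sqrt(d), giving a bounded quantization error."""
    d = z.shape[-1]
    u = torch.nn.functional.normalize(z, dim=-1)   # onto the hypersphere
    q = torch.sign(u) / d ** 0.5                   # binary codeword with unit norm
    return u + (q - u).detach()                    # straight-through estimator

codes = binary_spherical_quantize(torch.randn(8, 256))
print(codes.shape, torch.allclose(codes.norm(dim=-1), torch.ones(8)))  # torch.Size([8, 256]) True
```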
https://arxiv.org/abs/2505.13542
With the increasing exploration and exploitation of the underwater world, underwater images have become a critical medium for human interaction with marine environments, driving extensive research into their efficient transmission and storage. However, contemporary underwater image compression algorithms fail to fully leverage the unique characteristics distinguishing underwater scenes from terrestrial images, resulting in suboptimal performance. To address this limitation, we introduce HQUIC, designed to exploit underwater-image-specific features for enhanced compression efficiency. HQUIC employs an ALTC module to adaptively predict the attenuation coefficients and global light information of the images, which effectively mitigates the issues caused by the differences in lighting and tone existing in underwater images. Subsequently, HQUIC employs a codebook as an auxiliary branch to extract the common objects within underwater images and enhances the performance of the main branch. Furthermore, HQUIC dynamically weights multi-scale frequency components, prioritizing information critical for distortion quality while discarding redundant details. Extensive evaluations on diverse underwater datasets demonstrate that HQUIC outperforms state-of-the-art compression methods.
https://arxiv.org/abs/2505.09986