Efficiently transferring a Learned Image Compression (LIC) model from human perception to machine perception is an emerging challenge in vision-centric representation learning. Existing approaches typically adapt LIC to downstream tasks in a single-task manner, which is inefficient, lacks task interaction, and results in multiple task-specific bitstreams. To address these limitations, we propose an asymmetric adaptor framework that supports multi-task adaptation within a single model. Our method introduces a shared adaptor to learn general semantic features and task-specific adaptors to preserve task-level distinctions. With only lightweight plug-in modules and a frozen base codec, our method achieves strong performance across multiple tasks while maintaining compression efficiency. Experiments on the PASCAL-Context benchmark demonstrate that our method outperforms both fully fine-tuned and other parameter-efficient fine-tuning (PEFT) baselines, validating the effectiveness of multi-task vision transfer.
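A minimal sketch of the asymmetric adaptor idea described above: one shared bottleneck adaptor plus one adaptor per task, added residually on top of a feature from the frozen base codec. The module names, dimensions, and bottleneck design are illustrative assumptions, not details from the paper.

```python
import torch
import torch.nn as nn

class BottleneckAdaptor(nn.Module):
    """Lightweight plug-in adaptor: down-project, nonlinearity, up-project."""
    def __init__(self, dim: int, hidden: int = 32):
        super().__init__()
        self.down = nn.Linear(dim, hidden)
        self.up = nn.Linear(hidden, dim)

    def forward(self, x):
        return self.up(torch.relu(self.down(x)))

class AsymmetricAdaptorBlock(nn.Module):
    """A shared adaptor for general semantics plus one adaptor per task,
    applied residually to a feature from the frozen base codec."""
    def __init__(self, dim: int, tasks: list):
        super().__init__()
        self.shared = BottleneckAdaptor(dim)
        self.task_specific = nn.ModuleDict({t: BottleneckAdaptor(dim) for t in tasks})

    def forward(self, feat: torch.Tensor, task: str) -> torch.Tensor:
        # Only the adaptors are trained; the codec producing `feat` stays frozen.
        return feat + self.shared(feat) + self.task_specific[task](feat)

block = AsymmetricAdaptorBlock(dim=192, tasks=["seg", "parts", "saliency"])
y = block(torch.randn(1, 196, 192), task="seg")   # (batch, tokens, channels)
```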
https://arxiv.org/abs/2504.12997
In recent years, image compression for high-level vision tasks has attracted considerable attention from researchers. Given that object information in images plays a far more crucial role in downstream tasks than background information, some studies have proposed semantically structuring the bitstream to selectively transmit and reconstruct only the information required by these tasks. However, such methods structure the bitstream after encoding, meaning that the coding process still relies on the entire image, even though much of the encoded information will not be transmitted. This leads to redundant computations. Traditional image compression methods require a two-dimensional image as input, and even if the unimportant regions of the image are set to zero by applying a semantic mask, these regions still participate in subsequent computations as part of the image. To address such limitations, we propose an image compression method based on a position-indexed self-attention mechanism that encodes and decodes only the visible parts of the masked image. Compared to existing semantic-structured compression methods, our approach can significantly reduce computational costs.
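The key mechanism, attending only over the visible tokens whose positions index a positional table, can be sketched as follows. The token dimension, patch count, and module layout are assumptions for illustration.

```python
import torch
import torch.nn as nn

class VisibleTokenEncoder(nn.Module):
    """Encode only the visible patches of a masked image; the kept patches'
    positions index a positional table so attention still sees their layout."""
    def __init__(self, dim: int = 192, heads: int = 4, num_patches: int = 256):
        super().__init__()
        self.pos_table = nn.Parameter(torch.zeros(num_patches, dim))
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, patch_tokens, visible_idx):
        # patch_tokens: (B, N_visible, dim); visible_idx: (N_visible,) patch positions
        x = patch_tokens + self.pos_table[visible_idx]   # position-indexed lookup
        out, _ = self.attn(x, x, x)                      # attention over visible set only
        return out

tokens = torch.randn(1, 96, 192)          # 96 of 256 patches survive the semantic mask
idx = torch.randperm(256)[:96]            # their positions in the full patch grid
z = VisibleTokenEncoder()(tokens, idx)
```

Because masked patches never enter the computation, cost scales with the visible area rather than the full image.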
https://arxiv.org/abs/2504.12923
Learning-based image compression methods have recently emerged as promising alternatives to traditional codecs, offering improved rate-distortion performance and perceptual quality. JPEG AI represents the latest standardized framework in this domain, leveraging deep neural networks for high-fidelity image reconstruction. In this study, we present a comprehensive subjective visual quality assessment of JPEG AI-compressed images using the JPEG AIC-3 methodology, which quantifies perceptual differences in terms of Just Noticeable Difference (JND) units. We generated a dataset of 50 compressed images with fine-grained distortion levels from five diverse sources. A large-scale crowdsourced experiment collected 96,200 triplet responses from 459 participants. We reconstructed JND-based quality scales using a unified model based on boosted and plain triplet comparisons. Additionally, we evaluated the alignment of multiple objective image quality metrics with human perception in the high-fidelity range. The CVVDP metric achieved the overall highest performance; however, most metrics, including CVVDP, were overly optimistic in predicting the quality of JPEG AI-compressed images. These findings emphasize the necessity of rigorous subjective evaluations in the development and benchmarking of modern image codecs, particularly in the high-fidelity range. Another technical contribution is the introduction of the well-known Meng-Rosenthal-Rubin statistical test to the field of Quality of Experience research. This test can reliably assess the significance of differences in the performance of quality metrics, measured as the correlation between metric scores and ground truth. The complete dataset, including all subjective scores, is publicly available at this https URL.
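Since the abstract names the Meng-Rosenthal-Rubin test, here is a minimal implementation of that test for two dependent correlations sharing one variable, i.e., two quality metrics correlated against the same subjective ground truth; the example numbers are invented.

```python
import math
from scipy.stats import norm

def meng_rosenthal_rubin(r_ay: float, r_by: float, r_ab: float, n: int):
    """Meng-Rosenthal-Rubin (1992) z-test for two dependent correlations:
    metric A vs ground truth (r_ay) and metric B vs ground truth (r_by),
    with r_ab the correlation between the two metrics, over n samples."""
    z_ay, z_by = math.atanh(r_ay), math.atanh(r_by)      # Fisher z-transform
    rbar2 = (r_ay ** 2 + r_by ** 2) / 2.0
    f = min((1.0 - r_ab) / (2.0 * (1.0 - rbar2)), 1.0)
    h = (1.0 - f * rbar2) / (1.0 - rbar2)
    z = (z_ay - z_by) * math.sqrt((n - 3) / (2.0 * (1.0 - r_ab) * h))
    p = 2.0 * norm.sf(abs(z))                            # two-sided p-value
    return z, p

# Do two metrics differ significantly in how well they track subjective scores?
z, p = meng_rosenthal_rubin(r_ay=0.92, r_by=0.88, r_ab=0.85, n=50)
```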
https://arxiv.org/abs/2504.06301
Learned image compression (LIC) has recently made significant progress, surpassing traditional methods. However, most LIC approaches operate mainly in the spatial domain and lack mechanisms for reducing frequency-domain correlations. To address this, we propose a novel framework that integrates low-complexity 3D multi-level Discrete Wavelet Transform (DWT) into convolutional layers and entropy coding, reducing both spatial and channel correlations to improve frequency selectivity and rate-distortion (R-D) performance. Our proposed 3D multi-level wavelet-domain convolution (3DM-WeConv) layer first applies 3D multi-level DWT (e.g., 5/3 and 9/7 wavelets from JPEG 2000) to transform data into the wavelet domain. Then, different-sized convolutions are applied to different frequency subbands, followed by inverse 3D DWT to restore the spatial domain. The 3DM-WeConv layer can be flexibly used within existing CNN-based LIC models. We also introduce a 3D wavelet-domain channel-wise autoregressive entropy model (3DWeChARM), which performs slice-based entropy coding in the 3D DWT domain. Low-frequency (LF) slices are encoded first to provide priors for high-frequency (HF) slices. A two-step training strategy is adopted: first balancing LF and HF rates, then fine-tuning with separate weights. Extensive experiments demonstrate that our framework consistently outperforms state-of-the-art CNN-based LIC methods in R-D performance and computational complexity, with larger gains for high-resolution images. On the Kodak, Tecnick 100, and CLIC test sets, our method achieves BD-Rate reductions of -12.24%, -15.51%, and -12.97%, respectively, compared to H.266/VVC.
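A toy version of the wavelet-domain processing can be sketched with PyWavelets, where bior2.2 approximates the JPEG 2000 5/3 wavelet; the scalar per-subband "filters" below merely stand in for the paper's per-subband convolutions.

```python
import numpy as np
import pywt

def wavelet_domain_filter(feat: np.ndarray, level: int = 2, wavelet: str = "bior2.2"):
    """Apply a 3D multi-level DWT over (channel, height, width), process each
    frequency subband separately, then invert back to the spatial domain."""
    coeffs = pywt.wavedecn(feat, wavelet=wavelet, level=level, axes=(0, 1, 2))
    coeffs[0] = coeffs[0] * 1.0        # low-frequency approximation band
    for detail in coeffs[1:]:          # per-level dict of high-frequency subbands
        for key in detail:
            detail[key] = detail[key] * 0.5   # placeholder per-subband operation
    return pywt.waverecn(coeffs, wavelet=wavelet, axes=(0, 1, 2))

out = wavelet_domain_filter(np.random.randn(32, 64, 64).astype(np.float32))
```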
https://arxiv.org/abs/2504.04658
Learned image compression methods have attracted great research interest and exhibit rate-distortion performance superior to the best current classical image compression standards. The entropy model plays a key role in learned image compression: it estimates the probability distribution of the latent representation for subsequent entropy coding. Most existing methods employ hyper-prior and auto-regressive architectures to form their entropy models. However, they only explore the internal dependencies of the latent representation while neglecting the importance of extracting priors from the training data. In this work, we propose a novel entropy model named the Dictionary-based Cross Attention Entropy model, which introduces a learnable dictionary to summarize the typical structures occurring in the training dataset and thereby enhance the entropy model. Extensive experimental results demonstrate that the proposed model strikes a better balance between performance and latency, achieving state-of-the-art results on various benchmark datasets.
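A minimal sketch of the dictionary-based cross-attention idea: latent elements act as queries against a learnable dictionary, and the attended context predicts Gaussian entropy parameters. The dimensions and the (mu, sigma) parameterization are assumptions, not details from the paper.

```python
import torch
import torch.nn as nn

class DictCrossAttentionPrior(nn.Module):
    """Cross attention from latent tokens (queries) to a learnable dictionary
    (keys/values) summarizing typical training-set structures; the attended
    context conditions the entropy parameters used for coding."""
    def __init__(self, dim: int = 192, dict_size: int = 64):
        super().__init__()
        self.dictionary = nn.Parameter(torch.randn(dict_size, dim))
        self.attn = nn.MultiheadAttention(dim, num_heads=4, batch_first=True)
        self.to_params = nn.Linear(dim, 2 * dim)     # predicts mean and scale

    def forward(self, latent_tokens):                # (B, N, dim)
        d = self.dictionary.unsqueeze(0).expand(latent_tokens.size(0), -1, -1)
        ctx, _ = self.attn(latent_tokens, d, d)      # queries: latent, K/V: dictionary
        mu, log_sigma = self.to_params(ctx).chunk(2, dim=-1)
        return mu, log_sigma.exp()

mu, sigma = DictCrossAttentionPrior()(torch.randn(2, 256, 192))
```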
https://arxiv.org/abs/2504.00496
Digital pathology images play a crucial role in medical diagnostics, but their ultra-high resolution and large file sizes pose significant challenges for storage, transmission, and real-time visualization. To address these issues, we propose CLERIC, a novel deep learning-based image compression framework designed specifically for whole slide images (WSIs). CLERIC integrates a learnable lifting scheme and advanced convolutional techniques to enhance compression efficiency while preserving critical pathological details. Our framework employs a lifting-scheme transform in the analysis stage to decompose images into low- and high-frequency components, enabling more structured latent representations. These components are processed through parallel encoders incorporating Deformable Residual Blocks (DRB) and Recurrent Residual Blocks (R2B) to improve feature extraction and spatial adaptability. The synthesis stage applies an inverse lifting transform for effective image reconstruction, ensuring high-fidelity restoration of fine-grained tissue structures. We evaluate CLERIC on a digital pathology image dataset and compare its performance against state-of-the-art learned image compression (LIC) models. Experimental results demonstrate that CLERIC achieves superior rate-distortion (RD) performance, significantly reducing storage requirements while maintaining high diagnostic image quality. Our study highlights the potential of deep learning-based compression in digital pathology, facilitating efficient data management and long-term storage while ensuring seamless integration into clinical workflows and AI-assisted diagnostic systems. Code and models are available at: this https URL.
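The lifting scheme at the heart of the analysis stage is easy to illustrate. Below is the classic fixed LeGall 5/3 lifting step with periodic boundaries; CLERIC's lifting is learnable, which this sketch does not attempt.

```python
import numpy as np

def lifting_53_forward(x: np.ndarray):
    """One LeGall 5/3 lifting level along the last axis: split into even/odd
    samples, predict the odds from the evens, then update the evens."""
    even, odd = x[..., 0::2], x[..., 1::2]
    # Predict: high-pass = odd minus the average of its two even neighbors.
    d = odd - 0.5 * (even + np.roll(even, -1, axis=-1))
    # Update: low-pass = even plus a quarter of the neighboring details.
    s = even + 0.25 * (d + np.roll(d, 1, axis=-1))
    return s, d                         # low- and high-frequency components

def lifting_53_inverse(s: np.ndarray, d: np.ndarray):
    even = s - 0.25 * (d + np.roll(d, 1, axis=-1))
    odd = d + 0.5 * (even + np.roll(even, -1, axis=-1))
    x = np.empty(even.shape[:-1] + (even.shape[-1] * 2,), dtype=even.dtype)
    x[..., 0::2], x[..., 1::2] = even, odd
    return x

x = np.random.randn(1, 8, 64)
s, d = lifting_53_forward(x)
assert np.allclose(lifting_53_inverse(s, d), x)   # lifting is exactly invertible
```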
https://arxiv.org/abs/2503.23862
Autoencoder-based structures have dominated recent learned image compression methods. However, the inherent information loss associated with autoencoders limits their rate-distortion performance at high bit rates and restricts their flexibility of rate adaptation. In this paper, we present a variable-rate image compression model based on invertible transform to overcome these limitations. Specifically, we design a lightweight multi-scale invertible neural network, which bijectively maps the input image into multi-scale latent representations. To improve the compression efficiency, a multi-scale spatial-channel context model with extended gain units is devised to estimate the entropy of the latent representation from high to low levels. Experimental results demonstrate that the proposed method achieves state-of-the-art performance compared to existing variable-rate methods, and remains competitive with recent multi-model approaches. Notably, our method is the first learned image compression solution that outperforms VVC across a very wide range of bit rates using a single model, especially at high bit rates. The source code is available at \href{this https URL}{this https URL}.
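The defining property, a bijective analysis transform, can be illustrated with a single additive coupling layer; the convolutional sub-network and the channel split are generic choices rather than the paper's architecture.

```python
import torch
import torch.nn as nn

class AdditiveCoupling(nn.Module):
    """Invertible additive coupling: split channels and shift one half by a
    function of the other. The transform itself loses no information; in a
    codec, loss comes only from quantizing the latents."""
    def __init__(self, channels: int):
        super().__init__()
        half = channels // 2
        self.net = nn.Sequential(nn.Conv2d(half, half, 3, padding=1),
                                 nn.ReLU(),
                                 nn.Conv2d(half, half, 3, padding=1))

    def forward(self, x):
        a, b = x.chunk(2, dim=1)
        return torch.cat([a, b + self.net(a)], dim=1)

    def inverse(self, y):
        a, b = y.chunk(2, dim=1)
        return torch.cat([a, b - self.net(a)], dim=1)

layer = AdditiveCoupling(64)
x = torch.randn(1, 64, 32, 32)
assert torch.allclose(layer.inverse(layer(x)), x, atol=1e-6)
```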
https://arxiv.org/abs/2503.21284
Neural image compression (NIC) has emerged as a promising alternative to classical compression techniques, offering improved compression ratios. Despite its progress towards standardization and practical deployment, there has been minimal exploration into its robustness and security. This study reveals an unexpected vulnerability in NIC - bitstream collisions - where semantically different images produce identical compressed bitstreams. Utilizing a novel whitebox adversarial attack algorithm, this paper demonstrates that adding carefully crafted perturbations to semantically different images can cause their compressed bitstreams to collide exactly. This collision vulnerability poses a threat to the practical usability of NIC, particularly in security-critical applications. The cause of the collision is analyzed, and a simple yet effective mitigation method is presented.
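A generic sketch of the attack objective, not the paper's algorithm: optimize a bounded perturbation so the source image's quantized latents, and hence its bitstream under a deterministic entropy coder, match those of a target image. The `encoder` here is assumed to be any differentiable analysis transform, and convergence to an exact collision is not guaranteed.

```python
import torch

def collide_bitstreams(encoder, src, target, steps=500, lr=1e-2, eps=8 / 255):
    """Perturb `src` until its rounded latents equal those of `target`;
    identical quantized latents imply identical bitstreams."""
    with torch.no_grad():
        y_target = torch.round(encoder(target))        # target quantized latent
    delta = torch.zeros_like(src, requires_grad=True)
    opt = torch.optim.Adam([delta], lr=lr)
    for _ in range(steps):
        y = encoder((src + delta).clamp(0, 1))
        loss = (y - y_target).pow(2).mean()            # pull the latents together
        opt.zero_grad()
        loss.backward()
        opt.step()
        with torch.no_grad():
            delta.clamp_(-eps, eps)                    # keep the perturbation small
    y_adv = torch.round(encoder((src + delta).clamp(0, 1)))
    return (src + delta).clamp(0, 1).detach(), bool(y_adv.eq(y_target).all())
```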
https://arxiv.org/abs/2503.19817
Traditional image compression methods aim to faithfully reconstruct images for human perception. In contrast, Coding for Machines focuses on compressing images to preserve information relevant to a specific machine task. In this paper, we present an image compression system designed to retain text-specific features for subsequent Optical Character Recognition (OCR). Our encoding process requires half the time needed by the OCR module, making it especially suitable for devices with limited computational capacity. In scenarios where on-device OCR is computationally prohibitive, images are compressed and later processed to recover the text content. Experimental results demonstrate that our method achieves significant improvements in text extraction accuracy at low bitrates, even improving over the accuracy of OCR performed on uncompressed images, thus acting as a local pre-processing step.
https://arxiv.org/abs/2503.19495
Feature Coding for Machines (FCM) aims to compress intermediate features effectively for remote intelligent analytics, which is crucial for future intelligent visual applications. In this paper, we propose Multiscale Feature Importance-based Bit Allocation (MFIBA) for end-to-end FCM. First, we find that the importance of features for machine vision tasks varies with scale, object size, and image instance. Based on this finding, we propose a Multiscale Feature Importance Prediction (MFIP) module to predict the importance weight for each scale of features. Second, we propose a task loss-rate model to establish the relationship between the task accuracy loss of using compressed features and the bitrate of encoding these features. Finally, we develop MFIBA for end-to-end FCM, which assigns coding bits to multiscale features more reasonably based on their importance. Experimental results demonstrate that, when combined with a retained Efficient Learned Image Compression (ELIC), the proposed MFIBA achieves an average of 38.202% bitrate savings in object detection compared to the anchor ELIC. Moreover, the proposed MFIBA achieves an average of 17.212% and 36.492% feature bitrate savings for instance segmentation and keypoint detection, respectively. When applied to LIC-TCM, MFIBA achieves an average of 18.103%, 19.866%, and 19.597% bitrate savings on the three machine vision tasks, respectively, which validates that the proposed MFIBA generalizes and adapts well to different machine vision tasks and FCM base codecs.
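In its simplest form, the allocation step splits a bit budget across feature scales in proportion to predicted importance weights; this stand-in omits the paper's task loss-rate model.

```python
import numpy as np

def allocate_bits(importance, total_bits: int):
    """Split a bit budget across scales in proportion to importance weights
    (a simplified stand-in for MFIBA's allocation)."""
    w = np.asarray(importance, dtype=np.float64)
    w = w / w.sum()
    bits = np.floor(w * total_bits).astype(int)
    bits[np.argmax(w)] += total_bits - bits.sum()   # rounding slack goes to the top scale
    return bits

# Four feature scales with detection-oriented importance from an MFIP-like module.
print(allocate_bits([0.45, 0.30, 0.15, 0.10], total_bits=20000))
```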
https://arxiv.org/abs/2503.19278
Image compression technology eliminates redundant information to enable efficient transmission and storage of images, serving both machine vision and human visual perception. For years, image coding focused on human perception has been well studied, leading to the development of various image compression standards. On the other hand, with the rapid advancement of image recognition models, image compression for AI tasks, known as Image Coding for Machines (ICM), has gained significant importance. Therefore, scalable image coding techniques that address the needs of both machines and humans have become a key area of interest. Additionally, there is increasing demand for research that applies the diffusion model, which can generate human-viewable images from a small amount of data, to image compression for human vision. Image compression methods that use diffusion models can partially reconstruct the target image by guiding the generation process with a small amount of conditioning information. Inspired by the diffusion model's potential, we propose a method for extending machine vision to human visual perception using guided diffusion. Utilizing a diffusion model guided by the output of the ICM method, we generate images for human perception from random noise. Guided diffusion acts as a bridge between machine vision and human vision, enabling transitions between them without any additional bitrate overhead. The generated images are then evaluated based on bitrate and image quality, and we compare their compression performance with other scalable image coding methods for humans and machines.
https://arxiv.org/abs/2503.17907
With the rapid development of large multimodal models (LMMs), multimodal understanding applications are emerging. As most LMM inference requests originate from edge devices with limited computational capabilities, the predominant inference pipeline involves directly forwarding the input data to an edge server which handles all computations. However, this approach introduces high transmission latency due to limited uplink bandwidth of edge devices and significant computation latency caused by the prohibitive number of visual tokens, thus hindering delay-sensitive tasks and degrading user experience. To address this challenge, we propose a task-oriented feature compression (TOFC) method for multimodal understanding in a device-edge co-inference framework, where visual features are merged by clustering and encoded by a learnable and selective entropy model before feature projection. Specifically, we employ density peaks clustering based on K nearest neighbors to reduce the number of visual features, thereby minimizing both data transmission and computational complexity. Subsequently, a learnable entropy model with hyperprior is utilized to encode and decode merged features, further reducing transmission overhead. To enhance compression efficiency, multiple entropy models are adaptively selected based on the characteristics of the visual features, enabling a more accurate estimation of the probability distribution. Comprehensive experiments on seven visual question answering benchmarks validate the effectiveness of the proposed TOFC method. Results show that TOFC achieves up to 60% reduction in data transmission overhead and 50% reduction in system latency while maintaining identical task performance, compared with traditional image compression methods.
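A compact sketch of the token-merging step, density peaks clustering with kNN-based densities: pick dense, well-separated tokens as centers, assign the rest to the nearest center, and average each cluster into one merged token. The cluster count, k, and the mean-pooling merge are illustrative choices.

```python
import torch

def merge_tokens_dpc_knn(tokens: torch.Tensor, k: int = 5, num_clusters: int = 64):
    """Merge N visual tokens into num_clusters tokens via density peaks
    clustering with k-nearest-neighbor densities."""
    d = torch.cdist(tokens, tokens)                   # (N, N) pairwise distances
    knn_d, _ = d.topk(k + 1, largest=False)           # includes the zero self-distance
    rho = (-knn_d[:, 1:].pow(2).mean(dim=1)).exp()    # kNN-based local density
    # delta: distance to the nearest token of strictly higher density
    higher = rho.unsqueeze(0) > rho.unsqueeze(1)      # higher[i, j]: rho_j > rho_i
    delta = d.masked_fill(~higher, float("inf")).min(dim=1).values
    delta[rho.argmax()] = d.max()                     # densest token gets the max
    centers = (rho * delta).topk(num_clusters).indices
    assign = d[:, centers].argmin(dim=1)              # nearest-center assignment
    merged = torch.stack([tokens[assign == c].mean(dim=0) for c in range(num_clusters)])
    return merged

out = merge_tokens_dpc_knn(torch.randn(576, 1024))    # 576 visual tokens -> 64
```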
https://arxiv.org/abs/2503.12926
A high-performance image compression algorithm is crucial for real-time information transmission across numerous fields. Despite rapid progress in image compression, computational inefficiency and poor redundancy modeling still pose significant bottlenecks, limiting practical applications. Inspired by the effectiveness of state space models (SSMs) in capturing long-range dependencies, we leverage SSMs to address computational inefficiency in existing methods and improve image compression from multiple perspectives. In this paper, we integrate the advantages of SSMs for better efficiency-performance trade-off and propose an enhanced image compression approach through refined context modeling, which we term MambaIC. Specifically, we explore context modeling to adaptively refine the representation of hidden states. Additionally, we introduce window-based local attention into channel-spatial entropy modeling to reduce potential spatial redundancy during compression, thereby increasing efficiency. Comprehensive qualitative and quantitative results validate the effectiveness and efficiency of our approach, particularly for high-resolution image compression. Code is released at this https URL.
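The window-based local attention used in the entropy model can be sketched as a partition-attend-unpartition step; the latent shape, window size, and attention module below are illustrative, and the SSM backbone itself is beyond a short sketch.

```python
import torch
import torch.nn as nn

def window_local_attention(latent: torch.Tensor, attn: nn.MultiheadAttention, win: int = 8):
    """Partition a (B, C, H, W) latent into non-overlapping win x win windows
    and run self-attention inside each window only, so cost grows with the
    number of windows rather than quadratically in H*W."""
    B, C, H, W = latent.shape
    x = latent.view(B, C, H // win, win, W // win, win)
    x = x.permute(0, 2, 4, 3, 5, 1).reshape(-1, win * win, C)   # (B*nWin, win^2, C)
    out, _ = attn(x, x, x)                                      # local attention
    out = out.reshape(B, H // win, W // win, win, win, C)
    return out.permute(0, 5, 1, 3, 2, 4).reshape(B, C, H, W)

attn = nn.MultiheadAttention(embed_dim=192, num_heads=4, batch_first=True)
y = window_local_attention(torch.randn(1, 192, 32, 32), attn)
```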
https://arxiv.org/abs/2503.12461
The growing volume of high-resolution Whole Slide Images in digital histopathology poses significant storage, transmission, and computational efficiency challenges. Standard compression methods, such as JPEG, reduce file sizes but often fail to preserve fine-grained phenotypic details critical for downstream tasks. In this work, we repurpose autoencoders (AEs) designed for Latent Diffusion Models as an efficient learned compression framework for pathology images. We systematically benchmark three AE models with varying compression levels and evaluate their reconstruction ability using pathology foundation models. We introduce a fine-tuning strategy to further enhance reconstruction fidelity that optimizes a pathology-specific learned perceptual metric. We validate our approach on downstream tasks, including segmentation, patch classification, and multiple instance learning, showing that replacing images with AE-compressed reconstructions leads to minimal performance degradation. Additionally, we propose a K-means clustering-based quantization method for AE latents, improving storage efficiency while maintaining reconstruction quality. We provide the weights of the fine-tuned autoencoders at this https URL.
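The proposed latent quantization reduces to fitting a K-means codebook over latent vectors and storing one small index per vector; the code size and latent shape below are toy values.

```python
import numpy as np
from sklearn.cluster import KMeans

def quantize_latents(latents: np.ndarray, n_codes: int = 256):
    """Quantize AE latent vectors with a K-means codebook: store one uint8
    index per vector plus the small codebook, instead of float vectors."""
    flat = latents.reshape(-1, latents.shape[-1])          # (N, C) latent vectors
    km = KMeans(n_clusters=n_codes, n_init=4, random_state=0).fit(flat)
    codes = km.predict(flat).astype(np.uint8)              # one byte per vector
    return codes, km.cluster_centers_

def dequantize_latents(codes, codebook, shape):
    return codebook[codes].reshape(shape).astype(np.float32)

z = np.random.randn(32, 32, 16).astype(np.float32)        # toy (H, W, C) latent
codes, book = quantize_latents(z)
z_hat = dequantize_latents(codes, book, z.shape)
```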
https://arxiv.org/abs/2503.11591
By optimizing the rate-distortion-realism trade-off, generative image compression approaches produce detailed, realistic images instead of the merely sharp-looking reconstructions produced by rate-distortion-optimized models. In this paper, we propose a novel deep learning-based generative image compression method injected with diffusion knowledge, obtaining the capacity to recover more realistic textures in practical scenarios. Efforts are made from three perspectives to navigate the rate-distortion-realism trade-off in the generative image compression task. First, recognizing the strong connection between image texture and frequency-domain characteristics, we design a Fractal Frequency-Aware Band Image Compression (FFAB-IC) network to effectively capture the directional frequency components inherent in natural images. This network integrates commonly used fractal band feature operations within a neural non-linear mapping design, enhancing its ability to retain essential information and filter out unnecessary details. Then, to improve the visual quality of image reconstruction under limited bandwidth, we integrate diffusion knowledge into the encoder and implement diffusion iterations in the decoder process, thus effectively recovering lost texture details. Finally, to fully leverage spatial and frequency intensity information, we incorporate frequency- and content-aware regularization terms to regularize the training of the generative image compression network. Extensive quantitative and qualitative experiments demonstrate the superiority of the proposed method, advancing the boundaries of achievable distortion-realism pairs, i.e., our method achieves better distortion at high realism and better realism at low distortion than ever before.
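As one plausible reading of the frequency-aware regularization (the abstract does not give the exact terms), the sketch below compares reconstructions in the FFT domain and up-weights high-frequency error, where texture detail lives.

```python
import torch

def frequency_aware_loss(x_hat: torch.Tensor, x: torch.Tensor, hf_weight: float = 2.0):
    """Illustrative frequency-domain regularizer: weight spectral error by
    its distance from DC, so high-frequency (texture) mismatches cost more."""
    err = (torch.fft.fft2(x_hat) - torch.fft.fft2(x)).abs()
    h, w = x.shape[-2:]
    fy = torch.fft.fftfreq(h, device=x.device).abs().view(-1, 1)
    fx = torch.fft.fftfreq(w, device=x.device).abs().view(1, -1)
    weight = 1.0 + hf_weight * (fy + fx)     # grows with spatial frequency
    return (weight * err).mean()

loss = frequency_aware_loss(torch.rand(1, 3, 64, 64), torch.rand(1, 3, 64, 64))
```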
https://arxiv.org/abs/2503.11321
Learning-based lossless image compression employs pixel-based or subimage-based auto-regression for probability estimation, which achieves desirable performance. However, existing works only consider context dependencies in one direction, namely, the symbols that appear before the current symbol in raster order. We believe that the dependencies between the current and future symbols should also be considered. In this work, we propose a deep lossless image compression method based on masked sampling and coarse-to-fine auto-regression. It combines lossy reconstruction with progressive residual compression, which fuses contexts from various directions and is more consistent with human perception. Specifically, the residuals are decomposed via $T$ iterations of masked sampling, where each iteration consists of three steps: 1) probability estimation, 2) mask computation, and 3) arithmetic coding. The iterative process progressively refines the prediction and gradually reveals the real image. Extensive experimental results show that, compared with existing traditional and learned lossless compression methods, our method achieves comparable compression performance on extensive datasets with competitive coding speed and greater flexibility.
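The three-step iteration can be sketched as below, with a stand-in for the arithmetic coder and a toy probability model; the confidence-based mask schedule is an assumption.

```python
import torch

def progressive_residual_coding(residual: torch.Tensor, prob_model, T: int = 4):
    """Coarse-to-fine masked sampling over T iterations: estimate symbol
    probabilities, mask the most confident uncoded symbols, and mark them as
    coded (a real codec would arithmetic-code them against the estimates)."""
    coded = torch.zeros_like(residual, dtype=torch.bool)
    for t in range(T):
        probs = prob_model(residual, coded)            # 1) probability estimation
        conf = probs.max(dim=-1).values
        budget = int(residual.numel() * (t + 1) / T)   # 2) mask: most confident first
        kth = max(conf.numel() - budget, 1)
        thresh = conf.flatten().kthvalue(kth).values
        mask = (conf >= thresh) & ~coded
        coded |= mask                                  # 3) stand-in for arithmetic coding
    return coded

# Toy probability model: random per-symbol distributions over 8 values.
probs_fn = lambda r, c: torch.rand(r.shape + (8,)).softmax(dim=-1)
done = progressive_residual_coding(torch.randint(0, 8, (16, 16)), probs_fn)
```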
https://arxiv.org/abs/2503.11231
Since the advent of popular visual generation frameworks like VQGAN and latent diffusion models, state-of-the-art image generation systems have generally been two-stage systems that first tokenize or compress visual data into a lower-dimensional latent space before learning a generative model. Tokenizer training typically follows a standard recipe in which images are compressed and reconstructed subject to a combination of MSE, perceptual, and adversarial losses. Diffusion autoencoders have been proposed in prior work as a way to learn end-to-end perceptually-oriented image compression, but have not yet shown state-of-the-art performance on the competitive task of ImageNet-1K reconstruction. We propose FlowMo, a transformer-based diffusion autoencoder that achieves a new state-of-the-art for image tokenization at multiple compression rates without using convolutions, adversarial losses, spatially-aligned two-dimensional latent codes, or distilling from other tokenizers. Our key insight is that FlowMo training should be broken into a mode-matching pre-training stage and a mode-seeking post-training stage. In addition, we conduct extensive analyses and explore the training of generative models atop the FlowMo tokenizer. Our code and models will be available at this http URL.
https://arxiv.org/abs/2503.11056
Deep Neural Networks (DNNs) have become an integral part of our daily lives, especially in vision-related applications. However, conventional lossy image compression algorithms are primarily designed for the Human Vision System (HVS), which can non-trivially compromise a DNN's validation accuracy after compression, as noted in \cite{liu2018deepn}. Developing an image compression algorithm that serves both humans and machines (DNNs) is therefore on the horizon. To address this challenge, in this paper we first formulate image compression as a multi-objective optimization problem that takes both the human and the machine perspective into account, solve it via linear combination, and propose a novel distortion measure for both human and machine, dubbed Human and Machine-Oriented Error (HMOE). We then develop Human and Machine Oriented Soft Decision Quantization (HMOSDQ) based on HMOE, a lossy image compression algorithm for both humans and machines (DNNs) that is fully compliant with the JPEG format. Finally, to evaluate the performance of HMOSDQ, we conduct experiments with two well-known pre-trained DNN-based image classifiers, AlexNet \cite{Alexnet} and VGG-16 \cite{simonyan2014VGG}, on two subsets of the ImageNet \cite{deng2009imagenet} validation set: one subset contains images whose shorter side is in the range of 496 to 512, and the other contains images whose shorter side is in the range of 376 to 384. Our results demonstrate that HMOSDQ outperforms the default JPEG algorithm in terms of rate-accuracy and rate-distortion performance. For AlexNet, compared with the default JPEG algorithm, HMOSDQ improves the validation accuracy by more than $0.81\%$ at $0.61$ BPP, or equivalently reduces the compression rate of default JPEG by $9.6\times$ while maintaining the same validation accuracy.
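The linear-combination idea behind HMOE can be sketched directly; MSE for the human-side term and cross-entropy for the machine-side term are illustrative stand-ins, since the abstract does not specify the exact components.

```python
import torch
import torch.nn.functional as F

def hmoe_distortion(x, x_hat, classifier, labels, lam: float = 0.5):
    """Human-and-machine-oriented error as a linear combination of a
    human-side distortion (MSE) and a machine-side distortion (the
    classifier's cross-entropy on the compressed image)."""
    human = F.mse_loss(x_hat, x)
    machine = F.cross_entropy(classifier(x_hat), labels)
    return lam * human + (1.0 - lam) * machine   # lam trades HVS fidelity vs DNN accuracy

clf = torch.nn.Sequential(torch.nn.Flatten(), torch.nn.Linear(3 * 32 * 32, 10))
x = torch.rand(4, 3, 32, 32)
loss = hmoe_distortion(x, x.clone(), clf, labels=torch.randint(0, 10, (4,)))
```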
https://arxiv.org/abs/2503.10912
We introduce PerCoV2, a novel and open ultra-low bit-rate perceptual image compression system designed for bandwidth- and storage-constrained applications. Building upon prior work by Careil et al., PerCoV2 extends the original formulation to the Stable Diffusion 3 ecosystem and enhances entropy coding efficiency by explicitly modeling the discrete hyper-latent image distribution. To this end, we conduct a comprehensive comparison of recent autoregressive methods (VAR and MaskGIT) for entropy modeling and evaluate our approach on the large-scale MSCOCO-30k benchmark. Compared to previous work, PerCoV2 (i) achieves higher image fidelity at even lower bit-rates while maintaining competitive perceptual quality, (ii) features a hybrid generation mode for further bit-rate savings, and (iii) is built solely on public components. Code and trained models will be released at this https URL.
https://arxiv.org/abs/2503.09368
With transformer-based models and the pretrain-finetune paradigm becoming mainstream, the high storage and deployment costs of individual finetuned models on multiple tasks pose critical challenges. Delta compression attempts to lower the costs by reducing the redundancy of delta parameters (i.e., the difference between the finetuned and pre-trained model weights). However, existing methods usually face problems including data accessibility and training requirements. To tackle this issue, we introduce Delta-DCT, the first data-free delta compression method inspired by classic JPEG image compression, leveraging the Discrete Cosine Transform (DCT). We first (a) group delta parameters within a layer into patches. Then we (b) assess the importance of each patch and allocate them with different quantization bit-widths. Afterwards, we (c) convert these patches to the DCT domain and conduct quantization to each patch based on the allocated bit-width. The proposed Delta-DCT does not require any training or data calibration, while achieving performance comparable to or even surpassing original finetuned models under 1-bit equivalent delta compression ratios on different kinds of models including: (1) recently-released LLMs of different sizes from 7B to 13B, (2) relatively smaller language models including RoBERTa and T5 models, (3) variants of vision transformer models, and (4) multi-modal BEiT-3 models.
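Steps (a)-(c) can be sketched end-to-end on a toy delta vector; the energy-based importance proxy and the two bit-widths are illustrative simplifications.

```python
import numpy as np
from scipy.fft import dctn, idctn

def delta_dct_quantize(delta: np.ndarray, patch: int = 64,
                       high_bits: int = 4, low_bits: int = 2):
    """Data-free Delta-DCT sketch: (a) group delta parameters into patches,
    (b) allocate bit-widths by patch importance (energy here), (c) DCT each
    patch and quantize it at its allocated bit-width."""
    n = len(delta) // patch * patch
    patches = delta[:n].reshape(-1, patch)             # (a) group into patches
    energy = (patches ** 2).sum(axis=1)
    important = energy >= np.median(energy)            # (b) top half gets more bits
    out = np.empty_like(patches)
    for i, p in enumerate(patches):
        bits = high_bits if important[i] else low_bits
        c = dctn(p)                                    # (c) transform to DCT domain
        scale = np.abs(c).max() / (2 ** (bits - 1) - 1) + 1e-12
        out[i] = idctn(np.round(c / scale) * scale)    # quantize, dequantize, invert
    return out.reshape(-1)

delta = 0.01 * np.random.randn(4096)                   # toy finetuned-minus-base delta
recon = delta_dct_quantize(delta)
```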
https://arxiv.org/abs/2503.06676