With the rapid development of large multimodal models (LMMs), multimodal understanding applications are emerging. As most LMM inference requests originate from edge devices with limited computational capabilities, the predominant inference pipeline involves directly forwarding the input data to an edge server which handles all computations. However, this approach introduces high transmission latency due to limited uplink bandwidth of edge devices and significant computation latency caused by the prohibitive number of visual tokens, thus hindering delay-sensitive tasks and degrading user experience. To address this challenge, we propose a task-oriented feature compression (TOFC) method for multimodal understanding in a device-edge co-inference framework, where visual features are merged by clustering and encoded by a learnable and selective entropy model before feature projection. Specifically, we employ density peaks clustering based on K nearest neighbors to reduce the number of visual features, thereby minimizing both data transmission and computational complexity. Subsequently, a learnable entropy model with hyperprior is utilized to encode and decode merged features, further reducing transmission overhead. To enhance compression efficiency, multiple entropy models are adaptively selected based on the characteristics of the visual features, enabling a more accurate estimation of the probability distribution. Comprehensive experiments on seven visual question answering benchmarks validate the effectiveness of the proposed TOFC method. Results show that TOFC achieves up to 60% reduction in data transmission overhead and 50% reduction in system latency while maintaining identical task performance, compared with traditional image compression methods.
https://arxiv.org/abs/2503.12926
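For reference, a minimal sketch of the DPC-KNN token-merging step named in the TOFC abstract above (density peaks clustering over K nearest neighbors). The Euclidean distances, the density/separation scoring, and mean-pooling within clusters are standard DPC-KNN choices used here for illustration only; the function name and shapes are assumptions, not the authors' implementation.

```python
import torch

def dpc_knn_merge(tokens: torch.Tensor, num_clusters: int, k: int = 5) -> torch.Tensor:
    """Merge N visual tokens (N, D) into `num_clusters` tokens via DPC-KNN.

    Density of a token = exp(-mean squared distance to its k nearest neighbours);
    separation = distance to the nearest token of higher density; cluster centres
    are the tokens with the largest density * separation score.
    """
    n, d = tokens.shape
    dist = torch.cdist(tokens, tokens)                      # (N, N) pairwise distances
    knn_dist, _ = dist.topk(k + 1, largest=False)           # column 0 is the self-distance
    density = (-knn_dist[:, 1:].pow(2).mean(dim=1)).exp()   # local density per token

    # Separation: distance to the closest point with strictly higher density.
    higher = density[None, :] > density[:, None]
    masked = dist.masked_fill(~higher, float("inf"))
    sep, _ = masked.min(dim=1)
    sep[torch.isinf(sep)] = dist.max()                      # densest token gets the max distance

    centers = (density * sep).topk(num_clusters).indices    # pick cluster centres
    assign = dist[:, centers].argmin(dim=1)                 # assign every token to a centre

    # Merge by averaging the tokens that fall into each cluster.
    merged = torch.zeros(num_clusters, d)
    counts = torch.zeros(num_clusters)
    merged.index_add_(0, assign, tokens)
    counts.index_add_(0, assign, torch.ones(n))
    return merged / counts.clamp(min=1).unsqueeze(1)

# Example: merge 576 ViT patch features of width 1024 down to 64 tokens.
merged = dpc_knn_merge(torch.randn(576, 1024), num_clusters=64)
print(merged.shape)  # torch.Size([64, 1024])
```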
A high-performance image compression algorithm is crucial for real-time information transmission across numerous fields. Despite rapid progress in image compression, computational inefficiency and poor redundancy modeling still pose significant bottlenecks, limiting practical applications. Inspired by the effectiveness of state space models (SSMs) in capturing long-range dependencies, we leverage SSMs to address computational inefficiency in existing methods and improve image compression from multiple perspectives. In this paper, we integrate the advantages of SSMs for better efficiency-performance trade-off and propose an enhanced image compression approach through refined context modeling, which we term MambaIC. Specifically, we explore context modeling to adaptively refine the representation of hidden states. Additionally, we introduce window-based local attention into channel-spatial entropy modeling to reduce potential spatial redundancy during compression, thereby increasing efficiency. Comprehensive qualitative and quantitative results validate the effectiveness and efficiency of our approach, particularly for high-resolution image compression. Code is released at this https URL.
https://arxiv.org/abs/2503.12461
The growing volume of high-resolution Whole Slide Images in digital histopathology poses significant storage, transmission, and computational efficiency challenges. Standard compression methods, such as JPEG, reduce file sizes but often fail to preserve fine-grained phenotypic details critical for downstream tasks. In this work, we repurpose autoencoders (AEs) designed for Latent Diffusion Models as an efficient learned compression framework for pathology images. We systematically benchmark three AE models with varying compression levels and evaluate their reconstruction ability using pathology foundation models. We introduce a fine-tuning strategy to further enhance reconstruction fidelity that optimizes a pathology-specific learned perceptual metric. We validate our approach on downstream tasks, including segmentation, patch classification, and multiple instance learning, showing that replacing images with AE-compressed reconstructions leads to minimal performance degradation. Additionally, we propose a K-means clustering-based quantization method for AE latents, improving storage efficiency while maintaining reconstruction quality. We provide the weights of the fine-tuned autoencoders at this https URL.
https://arxiv.org/abs/2503.11591
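A hedged sketch of the kind of K-means quantization of AE latents the abstract describes: each latent value is replaced by the index of its nearest centroid, so float32 latents can be stored as uint8 indices plus a small codebook. The scalar (per-value) codebook, the sampling step, and the helper names are illustrative assumptions, not the paper's exact scheme.

```python
import numpy as np
from sklearn.cluster import KMeans

def kmeans_quantize(latent: np.ndarray, n_codes: int = 256, sample: int = 100_000):
    """Quantize autoencoder latents with a K-means codebook of `n_codes` scalars."""
    flat = latent.reshape(-1, 1).astype(np.float32)
    rng = np.random.default_rng(0)
    train = flat[rng.choice(len(flat), size=min(sample, len(flat)), replace=False)]
    km = KMeans(n_clusters=n_codes, n_init=4, random_state=0).fit(train)
    codebook = km.cluster_centers_.ravel()
    indices = km.predict(flat).astype(np.uint8).reshape(latent.shape)
    return indices, codebook

def kmeans_dequantize(indices: np.ndarray, codebook: np.ndarray) -> np.ndarray:
    return codebook[indices]

# Example on a toy latent of shape (4, 64, 64): uint8 indices are 4x smaller than
# the float32 values, before any further entropy coding of the index stream.
latent = np.random.randn(4, 64, 64).astype(np.float32)
idx, cb = kmeans_quantize(latent, n_codes=256)
recon = kmeans_dequantize(idx, cb)
print("mean abs reconstruction error:", np.abs(recon - latent).mean())
```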
By optimizing the rate-distortion-realism trade-off, generative image compression approaches produce detailed, realistic images instead of the merely sharp-looking reconstructions produced by rate-distortion-optimized models. In this paper, we propose a novel deep learning-based generative image compression method injected with diffusion knowledge, obtaining the capacity to recover more realistic textures in practical scenarios. Efforts are made from three perspectives to navigate the rate-distortion-realism trade-off in the generative image compression task. First, recognizing the strong connection between image texture and frequency-domain characteristics, we design a Fractal Frequency-Aware Band Image Compression (FFAB-IC) network to effectively capture the directional frequency components inherent in natural images. This network integrates commonly used fractal band feature operations within a neural non-linear mapping design, enhancing its ability to retain essential information and filter out unnecessary details. Then, to improve the visual quality of image reconstruction under limited bandwidth, we integrate diffusion knowledge into the encoder and implement diffusion iterations in the decoder, thus effectively recovering lost texture details. Finally, to fully leverage the spatial and frequency intensity information, we incorporate frequency- and content-aware regularization terms to regularize the training of the generative image compression network. Extensive quantitative and qualitative experiments demonstrate the superiority of the proposed method, advancing the boundaries of achievable distortion-realism pairs, i.e., our method achieves better distortion at high realism and better realism at low distortion than ever before.
https://arxiv.org/abs/2503.11321
Learning-based lossless image compression employs pixel-based or subimage-based auto-regression for probability estimation, which achieves desirable performance. However, existing works only consider context dependencies in one direction, namely, those symbols that appear before the current symbol in raster order. We believe that the dependencies between the current and future symbols should be further considered. In this work, we propose a deep lossless image compression method based on masked sampling and coarse-to-fine auto-regression. It combines lossy reconstruction and progressive residual compression, fusing contexts from various directions in a manner more consistent with human perception. Specifically, the residuals are decomposed via $T$ iterations of masked sampling, and each sampling consists of three steps: 1) probability estimation, 2) mask computation, and 3) arithmetic coding. The iterative process progressively refines the prediction and gradually reveals the real image. Extensive experimental results show that, compared with existing traditional and learned lossless compression methods, our method achieves comparable compression performance on extensive datasets with competitive coding speed and greater flexibility.
https://arxiv.org/abs/2503.11231
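A toy sketch of the three-step iterative structure described in the abstract (probability estimation, mask computation, arithmetic coding). The confidence-based mask schedule and the dummy probability model are assumptions for illustration, and the arithmetic coder is replaced by an ideal -log2 p bit tally.

```python
import torch
import torch.nn.functional as F

def progressive_masked_coding(residual: torch.Tensor, prob_net, T: int = 4):
    """Toy structure of coarse-to-fine masked coding of a residual plane.

    residual: (H, W) integer symbols in [0, 255].
    prob_net: maps the partially decoded plane + mask to per-symbol logits of
              shape (H, W, 256). A real codec would drive an arithmetic coder
              with these probabilities; here we only tally the ideal bit cost.
    """
    h, w = residual.shape
    known = torch.zeros(h, w, dtype=torch.bool)
    total_bits = 0.0
    for t in range(T):
        logits = prob_net(residual * known, known)            # 1) probability estimation
        probs = F.softmax(logits, dim=-1)
        conf, _ = probs.max(dim=-1)
        # 2) mask computation: code the most confident share of still-unknown
        #    symbols; the last step flushes everything that remains.
        remaining = (~known).sum().item()
        n_code = remaining if t == T - 1 else max(1, remaining // (T - t))
        cand = conf.masked_fill(known, -1.0).flatten()
        pick = cand.topk(n_code).indices
        mask = torch.zeros(h * w, dtype=torch.bool)
        mask[pick] = True
        mask = mask.view(h, w)
        # 3) "arithmetic coding": accumulate the ideal code length of picked symbols.
        p = probs.view(h * w, -1)[pick, residual.flatten()[pick]]
        total_bits += (-torch.log2(p.clamp(min=1e-9))).sum().item()
        known |= mask
    return total_bits

# Example with a uniform dummy probability model.
dummy = lambda x, m: torch.zeros(*x.shape, 256)
res = torch.randint(0, 256, (16, 16))
print(progressive_masked_coding(res, dummy, T=4), "bits (= 16*16*8 under a uniform model)")
```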
Since the advent of popular visual generation frameworks like VQGAN and latent diffusion models, state-of-the-art image generation systems have generally been two-stage systems that first tokenize or compress visual data into a lower-dimensional latent space before learning a generative model. Tokenizer training typically follows a standard recipe in which images are compressed and reconstructed subject to a combination of MSE, perceptual, and adversarial losses. Diffusion autoencoders have been proposed in prior work as a way to learn end-to-end perceptually-oriented image compression, but have not yet shown state-of-the-art performance on the competitive task of ImageNet-1K reconstruction. We propose FlowMo, a transformer-based diffusion autoencoder that achieves a new state-of-the-art for image tokenization at multiple compression rates without using convolutions, adversarial losses, spatially-aligned two-dimensional latent codes, or distilling from other tokenizers. Our key insight is that FlowMo training should be broken into a mode-matching pre-training stage and a mode-seeking post-training stage. In addition, we conduct extensive analyses and explore the training of generative models atop the FlowMo tokenizer. Our code and models will be available at this http URL .
https://arxiv.org/abs/2503.11056
Deep Neural Networks (DNNs) have become an integral part of our daily lives, especially in vision-related applications. However, conventional lossy image compression algorithms are primarily designed for the Human Vision System (HVS) and can non-trivially compromise DNNs' validation accuracy after compression, as noted in \cite{liu2018deepn}. Thus, developing an image compression algorithm for both humans and machines (DNNs) is on the horizon. To address this challenge, in this paper we first formulate image compression as a multi-objective optimization problem that takes both human and machine perspectives into account, solve it by linear combination, and propose a novel distortion measure for both human and machine, dubbed Human and Machine-Oriented Error (HMOE). Based on HMOE, we then develop Human And Machine Oriented Soft Decision Quantization (HMOSDQ), a lossy image compression algorithm for both human and machine (DNNs) that is fully compliant with the JPEG format. To evaluate the performance of HMOSDQ, we conduct experiments with two well-known pre-trained DNN-based image classifiers, Alexnet \cite{Alexnet} and VGG-16 \cite{simonyan2014VGG}, on two subsets of the ImageNet \cite{deng2009imagenet} validation set: one subset contains images whose shorter side lies in the range of 496 to 512, while the other contains images whose shorter side lies in the range of 376 to 384. Our results demonstrate that HMOSDQ outperforms the default JPEG algorithm in terms of rate-accuracy and rate-distortion performance. For Alexnet, compared with the default JPEG algorithm, HMOSDQ improves the validation accuracy by more than $0.81\%$ at $0.61$ BPP, or equivalently reduces the compression rate of default JPEG by $9.6\times$ while maintaining the same validation accuracy.
https://arxiv.org/abs/2503.10912
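A hedged sketch of a linearly combined human/machine distortion in the spirit of HMOE: pixel-domain MSE as the human term and feature-space MSE under a classifier backbone as the machine term. The choice of terms, the weighting, and the `hmoe_distortion` helper are illustrative assumptions; the paper's exact definition of HMOE may differ.

```python
import torch
import torch.nn.functional as F

def hmoe_distortion(x, x_hat, feature_extractor, lam: float = 0.5):
    """Combined human/machine distortion (illustrative, not the paper's exact HMOE).

    Human term: pixel-domain MSE (a proxy for HVS distortion).
    Machine term: MSE between DNN features of the original and the reconstruction,
    so distortion is measured where the classifier "looks".
    """
    human = F.mse_loss(x_hat, x)
    with torch.no_grad():
        f_ref = feature_extractor(x)
    f_hat = feature_extractor(x_hat)
    machine = F.mse_loss(f_hat, f_ref)
    return lam * human + (1.0 - lam) * machine

# Example with a tiny stand-in feature extractor; a pretrained backbone such as
# Alexnet or VGG-16 would play this role in practice.
net = torch.nn.Sequential(torch.nn.Conv2d(3, 8, 3, padding=1), torch.nn.ReLU())
x = torch.rand(1, 3, 64, 64)
x_hat = (x + 0.05 * torch.randn_like(x)).clamp(0, 1)
print(hmoe_distortion(x, x_hat, net).item())
```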
We introduce PerCoV2, a novel and open ultra-low bit-rate perceptual image compression system designed for bandwidth- and storage-constrained applications. Building upon prior work by Careil et al., PerCoV2 extends the original formulation to the Stable Diffusion 3 ecosystem and enhances entropy coding efficiency by explicitly modeling the discrete hyper-latent image distribution. To this end, we conduct a comprehensive comparison of recent autoregressive methods (VAR and MaskGIT) for entropy modeling and evaluate our approach on the large-scale MSCOCO-30k benchmark. Compared to previous work, PerCoV2 (i) achieves higher image fidelity at even lower bit-rates while maintaining competitive perceptual quality, (ii) features a hybrid generation mode for further bit-rate savings, and (iii) is built solely on public components. Code and trained models will be released at this https URL.
https://arxiv.org/abs/2503.09368
With transformer-based models and the pretrain-finetune paradigm becoming mainstream, the high storage and deployment costs of individual finetuned models on multiple tasks pose critical challenges. Delta compression attempts to lower the costs by reducing the redundancy of delta parameters (i.e., the difference between the finetuned and pre-trained model weights). However, existing methods usually face problems including data accessibility and training requirements. To tackle this issue, we introduce Delta-DCT, the first data-free delta compression method inspired by classic JPEG image compression, leveraging the Discrete Cosine Transform (DCT). We first (a) group delta parameters within a layer into patches. Then we (b) assess the importance of each patch and allocate them with different quantization bit-widths. Afterwards, we (c) convert these patches to the DCT domain and conduct quantization to each patch based on the allocated bit-width. The proposed Delta-DCT does not require any training or data calibration, while achieving performance comparable to or even surpassing original finetuned models under 1-bit equivalent delta compression ratios on different kinds of models including: (1) recently-released LLMs of different sizes from 7B to 13B, (2) relatively smaller language models including RoBERTa and T5 models, (3) variants of vision transformer models, and (4) multi-modal BEiT-3 models.
https://arxiv.org/abs/2503.06676
Learned image compression (LIC) methods have recently outperformed traditional codecs such as VVC in rate-distortion performance. However, their large models and high computational costs have limited their practical adoption. In this paper, we first construct a high-capacity teacher model by integrating Swin-Transformer V2-based attention modules, additional residual blocks, and expanded latent channels, thus achieving enhanced compression performance. Building on this foundation, we propose a \underline{F}eature and \underline{E}ntropy-based \underline{D}istillation \underline{S}trategy (\textbf{FEDS}) that transfers key knowledge from the teacher to a lightweight student model. Specifically, we align intermediate feature representations and emphasize the most informative latent channels through an entropy-based loss. A staged training scheme refines this transfer in three phases: feature alignment, channel-level distillation, and final fine-tuning. Our student model nearly matches the teacher across Kodak (1.24\% BD-Rate increase), Tecnick (1.17\%), and CLIC (0.55\%) while cutting parameters by about 63\% and accelerating encoding/decoding by around 73\%. Moreover, ablation studies indicate that FEDS generalizes effectively to transformer-based networks. The experimental results demonstrate our approach strikes a compelling balance among compression performance, speed, and model parameters, making it well-suited for real-time or resource-limited scenarios.
https://arxiv.org/abs/2503.06399
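A speculative sketch of an entropy-based channel weighting for feature distillation, assuming per-element likelihoods from the teacher's entropy model are available. The abstract describes the FEDS loss only at a high level, so this illustrates the idea (emphasize the channels on which the teacher spends the most bits) rather than the method itself.

```python
import torch

def entropy_weighted_distill_loss(student_feat, teacher_feat, teacher_likelihoods):
    """Entropy-weighted feature alignment (illustrative assumption, not the exact FEDS loss).

    Channels on which the teacher's entropy model spends more bits are treated
    as more informative and receive larger weight when aligning student and
    teacher features.
    """
    bits = -torch.log2(teacher_likelihoods.clamp(min=1e-9))    # (B, C, H, W) ideal code length
    channel_bits = bits.mean(dim=(0, 2, 3))                    # average bits per channel
    weights = torch.softmax(channel_bits, dim=0)               # emphasize costly channels
    per_channel_mse = (student_feat - teacher_feat).pow(2).mean(dim=(0, 2, 3))
    return (weights * per_channel_mse).sum()

# Toy example with random tensors standing in for real features / likelihoods.
s = torch.randn(4, 192, 16, 16, requires_grad=True)
t = torch.randn(4, 192, 16, 16)
lik = torch.rand(4, 192, 16, 16).clamp(min=1e-3)
loss = entropy_weighted_distill_loss(s, t, lik)
loss.backward()
print(loss.item())
```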
Learnable Image Compression (LIC) has shown the potential to outperform standardized video codecs in RD efficiency, prompting research into hardware-friendly implementations. Most existing LIC hardware implementations prioritize latency over RD efficiency and rely on an extensive exploration of the hardware design space. We present a novel design paradigm in which the burden of tuning the design for a specific hardware platform is shifted towards model dimensioning, without compromising RD efficiency. First, we design a framework for distilling a leaner student LIC model from a reference teacher: by tuning a single model hyperparameter, we can meet the constraints of different hardware platforms without a complex hardware design exploration. Second, we propose a hardware-friendly implementation of the Generalized Divisive Normalization (GDN) activation that preserves RD efficiency even after parameter quantization. Third, we design a pipelined FPGA configuration that takes full advantage of available FPGA resources by leveraging parallel processing and optimizing resource allocation. Our experiments with a state-of-the-art LIC model show that we outperform all existing FPGA implementations while performing very close to the original model in terms of RD efficiency.
https://arxiv.org/abs/2503.04832
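For context, the Generalized Divisive Normalization activation the abstract refers to, in its standard form; the paper's contribution is a hardware-friendly, quantization-robust variant of this op, which is not reproduced here, and the non-negativity parameterization below is one common choice among several.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GDN(nn.Module):
    """Standard Generalized Divisive Normalization, for reference:
        y_i = x_i / sqrt(beta_i + sum_j gamma_{ij} * x_j^2)
    implemented as a 1x1 convolution over the squared channels.
    """
    def __init__(self, channels: int, eps: float = 1e-6):
        super().__init__()
        self.beta = nn.Parameter(torch.ones(channels))
        self.gamma = nn.Parameter(0.1 * torch.eye(channels))
        self.eps = eps

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Non-negativity of beta/gamma is enforced here by clamping; reference
        # implementations use reparameterizations instead.
        gamma = self.gamma.clamp(min=0).view(*self.gamma.shape, 1, 1)
        beta = self.beta.clamp(min=self.eps).view(1, -1, 1, 1)
        norm = F.conv2d(x * x, gamma, bias=None) + beta
        return x * torch.rsqrt(norm)

x = torch.randn(1, 192, 32, 32)
print(GDN(192)(x).shape)  # torch.Size([1, 192, 32, 32])
```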
Recent studies in extreme image compression have achieved remarkable performance by compressing the tokens from generative tokenizers. However, these methods often prioritize clustering common semantics within the dataset, while overlooking the diverse details of individual objects. Consequently, this results in suboptimal reconstruction fidelity, especially at low bitrates. To address this issue, we introduce a Dual-generative Latent Fusion (DLF) paradigm. DLF decomposes the latent into semantic and detail elements, compressing them through two distinct branches. The semantic branch clusters high-level information into compact tokens, while the detail branch encodes perceptually critical details to enhance the overall fidelity. Additionally, we propose a cross-branch interactive design to reduce redundancy between the two branches, thereby minimizing the overall bit cost. Experimental results demonstrate the impressive reconstruction quality of DLF even below 0.01 bits per pixel (bpp). On the CLIC2020 test set, our method achieves bitrate savings of up to 27.93% on LPIPS and 53.55% on DISTS compared to MS-ILLM. Furthermore, DLF surpasses recent diffusion-based codecs in visual fidelity while maintaining a comparable level of generative realism. Code will be available later.
https://arxiv.org/abs/2503.01428
It remains a significant challenge to compress images at ultra-low bitrates while achieving both semantic consistency and high perceptual quality. In this paper, we propose a novel image compression framework, Semantically Disentangled Image Compression (SEDIC). Our proposed SEDIC leverages large multimodal models (LMMs) to disentangle the image into several essential pieces of semantic information, including an extremely compressed reference image, overall and object-level text descriptions, and semantic masks. A multi-stage semantic decoder is designed to progressively restore the transmitted reference image object by object, ultimately producing high-quality and perceptually consistent reconstructions. In each decoding stage, a pre-trained controllable diffusion model is utilized to restore the object details on the reference image, conditioned on the text descriptions and semantic masks. Experimental results demonstrate that SEDIC significantly outperforms state-of-the-art approaches, achieving superior perceptual quality and semantic consistency at ultra-low bitrates ($\le$ 0.05 bpp). Our code is available at this https URL.
https://arxiv.org/abs/2503.00399
3D Gaussian Splatting (3DGS) has recently emerged as a promising 3D representation. Much research has focused on reducing its storage requirements and memory footprint. However, the need to compress and transmit the 3DGS representation to a remote side has been overlooked. This new application calls for rate-distortion-optimized 3DGS compression. How to quantize and entropy encode sparse Gaussian primitives in 3D space remains largely unexplored. A few early attempts resort to the hyperprior framework from learned image compression, but they fail to fully utilize the inter- and intra-correlation inherent in Gaussian primitives. Built on ScaffoldGS, this work, termed CAT-3DGS, introduces a context-adaptive triplane approach to their rate-distortion-optimized coding. It features multi-scale triplanes, oriented according to the principal axes of Gaussian primitives in 3D space, to capture their inter-correlation (i.e., spatial correlation) for spatial autoregressive coding in the projected 2D planes. With these triplanes serving as the hyperprior, we further perform channel-wise autoregressive coding to leverage the intra-correlation within each individual Gaussian primitive. Our CAT-3DGS incorporates a view frequency-aware masking mechanism, which actively skips coding those Gaussian primitives that potentially have little impact on rendering quality. When trained end-to-end to strike a good rate-distortion trade-off, our CAT-3DGS achieves state-of-the-art compression performance on commonly used real-world datasets.
https://arxiv.org/abs/2503.00357
We quantify the upper bound on the size of an implicit neural representation (INR) model from a digital perspective. The upper bound of the model size increases exponentially as the required bit-precision increases. To this end, we present a bit-plane decomposition method that makes the INR predict bit-planes, producing the same effect as reducing the upper bound of the model size. We validate our hypothesis that reducing the upper bound leads to faster convergence at constant model size. Our method achieves lossless representation in 2D image and audio fitting, even for high bit-depth signals such as 16-bit, which was previously unachievable. We are the first to identify a bit bias, whereby the INR prioritizes the most significant bit (MSB). We expand the application of INR to bit-depth expansion, lossless image compression, and extreme network quantization. Our source code is available at this https URL
https://arxiv.org/abs/2502.21001
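A small sketch of the bit-plane decomposition the abstract builds on: a 16-bit signal is split into 16 binary planes (MSB first) and reassembled losslessly; the INR is then asked to predict binary planes rather than full-precision values. Function names and the MSB-first ordering are illustrative.

```python
import numpy as np

def to_bit_planes(x: np.ndarray, bit_depth: int = 16) -> np.ndarray:
    """Split an integer signal into its bit-planes (MSB first)."""
    planes = [(x >> b) & 1 for b in range(bit_depth - 1, -1, -1)]
    return np.stack(planes).astype(np.uint8)               # (bit_depth, *x.shape)

def from_bit_planes(planes: np.ndarray) -> np.ndarray:
    bit_depth = planes.shape[0]
    weights = (1 << np.arange(bit_depth - 1, -1, -1)).reshape(-1, *([1] * (planes.ndim - 1)))
    return (planes.astype(np.int64) * weights).sum(axis=0)

img = np.random.randint(0, 2 ** 16, size=(64, 64), dtype=np.uint16)
planes = to_bit_planes(img, 16)
assert np.array_equal(from_bit_planes(planes), img)        # lossless round-trip
print(planes.shape)                                        # (16, 64, 64)
```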
Learned image compression (LIC) using deep learning architectures has seen significant advancements, yet standard rate-distortion (R-D) optimization often encounters imbalanced updates due to diverse gradients of the rate and distortion objectives. This imbalance can lead to suboptimal optimization, where one objective dominates, thereby reducing overall compression efficiency. To address this challenge, we reformulate R-D optimization as a multi-objective optimization (MOO) problem and introduce two balanced R-D optimization strategies that adaptively adjust gradient updates to achieve more equitable improvements in both rate and distortion. The first proposed strategy utilizes a coarse-to-fine gradient descent approach along standard R-D optimization trajectories, making it particularly suitable for training LIC models from scratch. The second proposed strategy analytically addresses the reformulated optimization as a quadratic programming problem with an equality constraint, which is ideal for fine-tuning existing models. Experimental results demonstrate that both proposed methods enhance the R-D performance of LIC models, achieving around a 2\% BD-Rate reduction with acceptable additional training cost, leading to a more balanced and efficient optimization process. The code will be made publicly available.
https://arxiv.org/abs/2502.20161
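A hedged sketch of a balanced rate-distortion step using the classic closed-form solution of the two-objective min-norm problem (minimize ||a*g_rate + (1-a)*g_dist||^2 with a in [0,1]); the paper's constrained quadratic program may be formulated differently, so this only illustrates the gradient-balancing idea, and all names below are illustrative.

```python
import torch

def balanced_rd_step(model, rate_loss, dist_loss, lr: float = 1e-4):
    """Apply one gradient step using a convex combination of the rate and
    distortion gradients, chosen so that neither objective dominates."""
    params = [p for p in model.parameters() if p.requires_grad]
    g_r = torch.autograd.grad(rate_loss, params, retain_graph=True)
    g_d = torch.autograd.grad(dist_loss, params, retain_graph=False)
    flat_r = torch.cat([g.flatten() for g in g_r])
    flat_d = torch.cat([g.flatten() for g in g_d])

    # Closed-form minimizer of ||a*g_r + (1-a)*g_d||^2 over a in [0, 1].
    diff = flat_r - flat_d
    alpha = (torch.dot(flat_d - flat_r, flat_d) / diff.dot(diff).clamp(min=1e-12)).clamp(0, 1)
    with torch.no_grad():
        for p, gr, gd in zip(params, g_r, g_d):
            p -= lr * (alpha * gr + (1 - alpha) * gd)
    return alpha.item()

# Toy usage: a linear "codec" with stand-in rate and distortion terms.
model = torch.nn.Linear(8, 8)
x = torch.randn(32, 8)
y = model(x)
rate = y.abs().mean()            # stand-in for the rate term
dist = (y - x).pow(2).mean()     # stand-in for the distortion term
print("alpha =", balanced_rd_step(model, rate, dist))
```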
Most recently, learned image compression methods have outpaced traditional hand-crafted standard codecs. However, their inference typically requires inputting the whole image at the cost of heavy computing resources, especially for high-resolution image compression; otherwise, block artefacts can appear when images are compressed block by block within existing learned image compression methods. To address this issue, we propose a novel continuous patch stitching (CPS) framework for block-wise image compression that achieves seamless patch stitching and mathematically eliminates block artefacts, thus significantly reducing the computing resources required for image compression. More specifically, the proposed CPS framework is built on padding-free operations throughout, with a newly established parallel overlapping stitching strategy that provides a general upper bound for ensuring continuity. On top of this, we further propose functional residual blocks with even-sized kernels for down-sampling and up-sampling, together with bottleneck residual blocks that retain the feature size to increase network depth. Experimental results demonstrate that our CPS framework achieves state-of-the-art performance against existing baselines while requiring less than half the computing resources of existing models. Our code shall be released upon acceptance.
https://arxiv.org/abs/2502.16795
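To illustrate why overlapping patches matter for block-wise compression, a generic overlap-and-blend stitching baseline is sketched below. Note this is plain linearly weighted blending, not the padding-free construction with which CPS mathematically eliminates block artefacts; all helper names and parameters are illustrative.

```python
import torch

def stitch_overlapping(patches, coords, out_hw, patch: int, overlap: int):
    """Blend processed overlapping patches back into a full image.

    Each patch is weighted by a 2D linear ramp that decays toward its borders,
    so overlapping regions are averaged smoothly instead of producing seams.
    """
    h, w = out_hw
    out = torch.zeros(1, patches[0].shape[1], h, w)
    weight = torch.zeros(1, 1, h, w)
    ramp = torch.ones(patch)
    if overlap > 0:                              # linear ramps on the overlapping borders
        ramp[:overlap] = torch.linspace(0, 1, overlap + 2)[1:-1]
        ramp[-overlap:] = torch.linspace(1, 0, overlap + 2)[1:-1]
    w2d = ramp[:, None] * ramp[None, :]
    for p, (y, x) in zip(patches, coords):
        out[..., y:y + patch, x:x + patch] += p * w2d
        weight[..., y:y + patch, x:x + patch] += w2d
    return out / weight.clamp(min=1e-8)

# Example: split a 1x3x64x64 image into 32x32 patches with 8px overlap,
# "process" each patch with an identity op, and stitch them back.
img = torch.rand(1, 3, 64, 64)
patch, overlap = 32, 8
stride = patch - overlap
def grid(size):                                  # patch origins covering the full axis
    return list(range(0, size - patch, stride)) + [size - patch]
coords = [(y, x) for y in grid(64) for x in grid(64)]
patches = [img[..., y:y + patch, x:x + patch] for y, x in coords]
recon = stitch_overlapping(patches, coords, (64, 64), patch, overlap)
print((recon - img).abs().max())                 # ~0 for an identity "codec"
```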
Recent advancements in deep learning have driven significant progress in lossless image compression. With the emergence of Large Language Models (LLMs), preliminary attempts have been made to leverage the extensive prior knowledge embedded in these pretrained models to enhance lossless image compression, particularly by improving the entropy model. However, a significant challenge remains in bridging the gap between the textual prior knowledge within LLMs and lossless image compression. To tackle this challenge and unlock the potential of LLMs, this paper introduces a novel paradigm for lossless image compression that incorporates LLMs with visual prompts. Specifically, we first generate a lossy reconstruction of the input image as visual prompts, from which we extract features to serve as visual embeddings for the LLM. The residual between the original image and the lossy reconstruction is then fed into the LLM along with these visual embeddings, enabling the LLM to function as an entropy model to predict the probability distribution of the residual. Extensive experiments on multiple benchmark datasets demonstrate our method achieves state-of-the-art compression performance, surpassing both traditional and learning-based lossless image codecs. Furthermore, our approach can be easily extended to images from other domains, such as medical and screen content images, achieving impressive performance. These results highlight the potential of LLMs for lossless image compression and may inspire further research in related directions.
https://arxiv.org/abs/2502.16163
In recent years, learned image compression (LIC) methods have achieved significant performance improvements. However, obtaining a more compact latent representation and reducing the impact of quantization errors remain key challenges in the field of LIC. To address these challenges, we propose a feature extraction module, a feature refinement module, and a feature enhancement module. Our feature extraction module shuffles the pixels in the image, splits the resulting image into sub-images, and extracts coarse features from the sub-images. Our feature refinement module stacks the coarse features and uses an attention refinement block composed of concatenated three-dimensional convolution residual blocks to learn more compact latent features by exploiting correlations across channels, within sub-images (intra-sub-image correlations), and across sub-images (inter-sub-image correlations). Our feature enhancement module reduces information loss in the decoded features following quantization. We also propose a quantization error compensation module that mitigates the quantization mismatch between training and testing. Our four modules can be readily integrated into state-of-the-art LIC methods. Experiments show that combining our modules with Tiny-LIC outperforms existing LIC methods and image compression standards in terms of peak signal-to-noise ratio (PSNR) and multi-scale structural similarity (MS-SSIM) on the Kodak dataset and the CLIC dataset.
https://arxiv.org/abs/2502.15188
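A hedged sketch of one way to realize the pixel-shuffling/sub-image split described in the abstract, using `pixel_unshuffle` to form r*r spatially subsampled sub-images and a small shared convolution for coarse features; the paper's exact shuffling and feature extractor may differ, and the module name is an assumption.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SubImageFeatures(nn.Module):
    """Coarse-feature extraction over pixel-shuffled sub-images (illustrative).

    pixel_unshuffle(x, r) rearranges a (B, C, H, W) image into
    (B, C*r*r, H/r, W/r); regrouping the channel dimension yields r*r
    subsampled sub-images, each processed by a shared conv.
    """
    def __init__(self, r: int = 2, feat: int = 32):
        super().__init__()
        self.r = r
        self.coarse = nn.Conv2d(3, feat, kernel_size=3, padding=1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, c, h, w = x.shape
        r = self.r
        sub = F.pixel_unshuffle(x, r)                          # (B, C*r*r, H/r, W/r)
        sub = sub.view(b, c, r * r, h // r, w // r)            # channel layout is (C, r*r)
        sub = sub.permute(0, 2, 1, 3, 4).reshape(b * r * r, c, h // r, w // r)
        feats = self.coarse(sub)                               # shared conv per sub-image
        return feats.view(b, r * r, -1, h // r, w // r)

x = torch.rand(2, 3, 64, 64)
print(SubImageFeatures()(x).shape)   # torch.Size([2, 4, 32, 32, 32])
```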
The learned image compression (LIC) methods have already surpassed traditional techniques in compressing natural scene (NS) images. However, directly applying these methods to screen content (SC) images, which possess distinct characteristics such as sharp edges, repetitive patterns, embedded text and graphics, yields suboptimal results. This paper addresses three key challenges in SC image compression: learning compact latent features, adapting quantization step sizes, and the lack of large SC datasets. To overcome these challenges, we propose a novel compression method that employs a multi-frequency two-stage octave residual block (MToRB) for feature extraction, a cascaded triple-scale feature fusion residual block (CTSFRB) for multi-scale feature integration and a multi-frequency context interaction module (MFCIM) to reduce inter-frequency correlations. Additionally, we introduce an adaptive quantization module that learns scaled uniform noise for each frequency component, enabling flexible control over quantization granularity. Furthermore, we construct a large SC image compression dataset (SDU-SCICD10K), which includes over 10,000 images spanning basic SC images, computer-rendered images, and mixed NS and SC images from both PC and mobile platforms. Experimental results demonstrate that our approach significantly improves SC image compression performance, outperforming traditional standards and state-of-the-art learning-based methods in terms of peak signal-to-noise ratio (PSNR) and multi-scale structural similarity (MS-SSIM).
https://arxiv.org/abs/2502.15174
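A minimal sketch of a quantization module with a learnable step size per channel, trained with scaled uniform noise as the usual differentiable proxy for rounding. The abstract's adaptive quantization module learns one such scale per frequency component; its exact parameterization may differ, and the class name here is an assumption.

```python
import torch
import torch.nn as nn

class LearnedStepQuantizer(nn.Module):
    """Quantization proxy with a learnable step size per channel.

    Training: add scaled uniform noise U(-step/2, step/2) as a differentiable
    stand-in for rounding.  Inference: actual rounding to the learned step.
    """
    def __init__(self, channels: int):
        super().__init__()
        self.log_step = nn.Parameter(torch.zeros(channels))    # step = exp(log_step)

    def forward(self, y: torch.Tensor) -> torch.Tensor:
        step = self.log_step.exp().view(1, -1, 1, 1)
        if self.training:
            noise = (torch.rand_like(y) - 0.5) * step           # U(-step/2, step/2)
            return y + noise
        return torch.round(y / step) * step

q = LearnedStepQuantizer(192)
y = torch.randn(1, 192, 16, 16)
q.train()
y_soft = q(y)        # noisy proxy used during training
q.eval()
y_hard = q(y)        # hard-quantized latents for actual coding
print((y_soft - y).abs().max().item(), (y_hard - y).abs().max().item())
```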