2D Gaussian Splatting (2DGS) is an emerging explicit scene representation with significant potential for image compression, owing to its high fidelity and high compression ratios. However, existing low-light enhancement algorithms operate predominantly in the pixel domain, so processing 2DGS-compressed images requires a cumbersome decompression-enhancement-recompression pipeline, which compromises efficiency and introduces secondary degradation. To address these limitations, we propose LL-GaussianImage, the first zero-shot unsupervised framework for low-light enhancement performed directly in the 2DGS compressed representation domain. The framework offers three primary advantages. First, we design a semantic-guided Mixture-of-Experts enhancement framework that applies dynamic adaptive transformations to the sparse attribute space of 2DGS, using rendered images as guidance, enabling compression-as-enhancement without full decompression to a pixel grid. Second, we establish a multi-objective collaborative loss that strictly constrains smoothness and fidelity during enhancement, suppressing artifacts while improving visual quality. Third, we adopt a two-stage optimization process to achieve reconstruction-as-enhancement: single-scale reconstruction ensures the accuracy of the base representation and strengthens network robustness. The result is high-quality enhancement of low-light images while high compression ratios are maintained. Experimental results validate the feasibility and superiority of this paradigm of direct processing in the compressed representation domain.
https://arxiv.org/abs/2601.15772
Neural image codecs achieve higher compression ratios than traditional hand-crafted methods such as PNG or JPEG-XL, but often incur substantial computational overhead, limiting their deployment on energy-constrained devices such as smartphones, cameras, and drones. We propose Grayscale Image Compression with Differentiable Logic Circuits (GIC-DLC), a hardware-aware codec where we train lookup tables to combine the flexibility of neural networks with the efficiency of Boolean operations. Experiments on grayscale benchmark datasets show that GIC-DLC outperforms traditional codecs in compression efficiency while allowing substantial reductions in energy consumption and latency. These results demonstrate that learned compression can be hardware-friendly, offering a promising direction for low-power image compression on edge devices.
https://arxiv.org/abs/2601.14130
Recent advancements in diffusion-based generative priors have enabled visually plausible image compression at extremely low bit rates. However, existing approaches suffer from slow sampling and suboptimal bit allocation due to fragmented training paradigms. In this work, we propose DiffCR, a novel compression framework that accelerates \textbf{Diff}usion-based image compression via \textbf{C}onsistency Prior \textbf{R}efinement for efficient and high-fidelity image reconstruction. At the heart of DiffCR is a Frequency-aware Skip Estimation (FaSE) module that refines the $\epsilon$-prediction prior from a pre-trained latent diffusion model and aligns it with compressed latents at different timesteps via Frequency Decoupling Attention (FDA). Furthermore, a lightweight consistency estimator enables fast \textbf{two-step decoding} by preserving the semantic trajectory of diffusion sampling. Without updating the backbone diffusion model, DiffCR achieves substantial bitrate savings (27.2\% BD-rate (LPIPS) and 65.1\% BD-rate (PSNR)) and over $10\times$ speed-up compared to SOTA diffusion-based compression baselines.
https://arxiv.org/abs/2601.10373
Block-based image compression relies on transform coding to concentrate signal energy into a small number of coefficients. While classical codecs use fixed transforms such as the Discrete Cosine Transform (DCT), data-driven methods such as Principal Component Analysis (PCA) are theoretically optimal for decorrelation. This paper presents an experimental comparison of the DCT, the Hadamard transform, and PCA across multiple block sizes and compression rates. Using rate-distortion and energy-compaction analysis, we show that PCA outperforms fixed transforms only when block dimensionality is sufficiently large, while the DCT remains near optimal for standard block sizes such as $8\times8$ and at low bit rates. These results explain the robustness of the DCT in practical codecs and highlight the limitations of block-wise learned transforms.
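The energy-compaction comparison at the heart of this study is easy to reproduce in a few lines. The sketch below is illustrative only, not the paper's experimental setup: it uses synthetic 1-D AR(1) blocks (a standard stand-in for smooth image rows) and contrasts an orthonormal DCT-II basis with the PCA/KLT basis fit to the same data.

```python
import numpy as np

rng = np.random.default_rng(0)
N = 8                      # block dimension (1-D blocks for simplicity)

def dct_matrix(n):
    """Orthonormal DCT-II basis; rows are basis vectors."""
    k = np.arange(n)[:, None]
    i = np.arange(n)[None, :]
    C = np.cos(np.pi * (2 * i + 1) * k / (2 * n))
    C[0] /= np.sqrt(2)
    return C * np.sqrt(2.0 / n)

# AR(1) covariance with high correlation, a classic model of image rows
rho = 0.95
cov = rho ** np.abs(np.subtract.outer(np.arange(N), np.arange(N)))
blocks = rng.multivariate_normal(np.zeros(N), cov, size=4096)

# PCA basis = eigenvectors of the sample covariance (the KLT for this data)
evals, evecs = np.linalg.eigh(np.cov(blocks.T))
pca = evecs[:, ::-1].T     # rows sorted by decreasing variance

def compaction(basis, data, k):
    """Fraction of total signal energy captured by the k largest-energy coefficients."""
    coeff = data @ basis.T
    energy = np.sort((coeff ** 2).sum(axis=0))[::-1]
    return energy[:k].sum() / energy.sum()

k = 2
print("DCT compaction:", compaction(dct_matrix(N), blocks, k))
print("PCA compaction:", compaction(pca, blocks, k))
```

PCA always captures at least as much energy as any fixed basis on the data it was fit to; for highly correlated AR(1) data the DCT comes very close, which is the asymptotic argument behind the paper's observation that the DCT stays near optimal at standard block sizes.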
https://arxiv.org/abs/2601.06273
Deep learning models for image compression often face practical limitations in hardware-constrained applications. Although these models achieve high-quality reconstructions, they are typically complex, heavyweight, and require substantial training data and computational resources. We propose a methodology to partially compress these networks by reducing the size of their encoders. Our approach uses a simplified knowledge distillation strategy to approximate the latent space of the original models with less data and shorter training, yielding lightweight encoders from heavyweight ones. We evaluate the resulting lightweight encoders across two different architectures on the image compression task. Experiments show that our method preserves reconstruction quality and statistical fidelity better than training lightweight encoders with the original loss, making it practical for resource-limited environments.
https://arxiv.org/abs/2601.05639
While one-step diffusion models have recently excelled in perceptual image compression, their application to video remains limited. Prior efforts typically rely on pretrained 2D autoencoders that generate per-frame latent representations independently, thereby neglecting temporal dependencies. We present YODA--Yet Another One-step Diffusion-based Video Compressor--which embeds multiscale features from temporal references for both latent generation and latent coding to better exploit spatial-temporal correlations for more compact representation, and employs a linear Diffusion Transformer (DiT) for efficient one-step denoising. YODA achieves state-of-the-art perceptual performance, consistently outperforming traditional and deep-learning baselines on LPIPS, DISTS, FID, and KID. Source code will be publicly available at this https URL.
https://arxiv.org/abs/2601.01141
High-fidelity compression of multispectral solar imagery remains challenging for space missions, where limited bandwidth must be balanced against preserving fine spectral and spatial details. We present a learned image compression framework tailored to solar observations, leveraging two complementary modules: (1) the Inter-Spectral Windowed Graph Embedding (iSWGE), which explicitly models inter-band relationships by representing spectral channels as graph nodes with learned edge features; and (2) the Windowed Spatial Graph Attention and Convolutional Block Attention (WSGA-C), which combines sparse graph attention with convolutional attention to reduce spatial redundancy and emphasize fine-scale structures. Evaluations on the SDOML dataset across six extreme ultraviolet (EUV) channels show that our approach achieves a 20.15% reduction in Mean Spectral Information Divergence (MSID), up to a 1.09% PSNR improvement, and a 1.62% log-transformed MS-SSIM gain over strong learned baselines, delivering sharper and spectrally faithful reconstructions at comparable bits-per-pixel rates. The code is publicly available at this https URL .
https://arxiv.org/abs/2512.24463
Unmanned aerial vehicles (UAVs) are crucial tools for post-disaster search and rescue, facing challenges such as high information density, rapid changes in viewpoint, and dynamic structures, especially in long-horizon navigation. However, current UAV vision-and-language navigation (VLN) methods struggle to model long-horizon spatiotemporal context in complex environments, resulting in inaccurate semantic alignment and unstable path planning. To this end, we propose LongFly, a spatiotemporal context modeling framework for long-horizon UAV VLN. LongFly introduces a history-aware spatiotemporal modeling strategy that transforms fragmented and redundant historical data into structured, compact, and expressive representations. First, we propose the slot-based historical image compression module, which dynamically distills multi-view historical observations into fixed-length contextual representations. Then, the spatiotemporal trajectory encoding module is introduced to capture the temporal dynamics and spatial structure of UAV trajectories. Finally, to integrate the accumulated spatiotemporal context with current observations, we design the prompt-guided multimodal integration module to support time-based reasoning and robust waypoint prediction. Experimental results demonstrate that LongFly outperforms state-of-the-art UAV VLN baselines by 7.89\% in success rate and 6.33\% in success weighted by path length, consistently across both seen and unseen environments.
https://arxiv.org/abs/2512.22010
Recent advances in generative AI have accelerated the production of ultra-high-resolution visual content, posing significant challenges for efficient compression and real-time decoding on end-user devices. Inspired by 3D Gaussian Splatting, recent 2D Gaussian image models improve representation efficiency, yet existing methods struggle to balance compression ratio and reconstruction fidelity in ultra-high-resolution scenarios. To address this issue, we propose SmartSplat, a highly adaptive and feature-aware GS-based image compression framework that supports arbitrary image resolutions and compression ratios. SmartSplat leverages image-aware features such as gradients and color variances, introducing a Gradient-Color Guided Variational Sampling strategy together with an Exclusion-based Uniform Sampling scheme to improve the non-overlapping coverage of Gaussian primitives in pixel space. In addition, we propose a Scale-Adaptive Gaussian Color Sampling method to enhance color initialization across scales. Through joint optimization of spatial layout, scale, and color initialization, SmartSplat efficiently captures both local structures and global textures using a limited number of Gaussians, achieving high reconstruction quality under strong compression. Extensive experiments on DIV8K and a newly constructed 16K dataset demonstrate that SmartSplat consistently outperforms state-of-the-art methods at comparable compression ratios and exceeds their compression limits, showing strong scalability and practical applicability. The code is publicly available at this https URL.
https://arxiv.org/abs/2512.20377
Most existing image compression approaches perform transform coding in the pixel space to reduce its spatial redundancy. However, they encounter difficulties in achieving both high realism and high fidelity at low bitrate, as pixel-space distortion may not align with human perception. To address this issue, we introduce a Generative Latent Coding (GLC) architecture, which performs transform coding in the latent space of a generative vector-quantized variational auto-encoder (VQ-VAE), instead of in the pixel space. The generative latent space is characterized by greater sparsity, richer semantics, and better alignment with human perception, rendering it advantageous for achieving high-realism and high-fidelity compression. Additionally, we introduce a categorical hyper module to reduce the bit cost of hyper-information, and a code-prediction-based supervision to enhance semantic consistency. Experiments demonstrate that our GLC maintains high visual quality with less than 0.04 bpp on natural images and less than 0.01 bpp on facial images. On the CLIC2020 test set, we achieve the same FID as MS-ILLM with 45% fewer bits. Furthermore, the powerful generative latent space enables various applications built on our GLC pipeline, such as image restoration and style transfer. The code is available at this https URL.
https://arxiv.org/abs/2512.20194
Recent advances in learned image codecs have been extended from human perception toward machine perception. However, progressive image compression with fine granular scalability (FGS)-which enables decoding a single bitstream at multiple quality levels-remains unexplored for machine-oriented codecs. In this work, we propose a novel progressive learned image compression codec for machine perception, PICM-Net, based on trit-plane coding. By analyzing the difference between human- and machine-oriented rate-distortion priorities, we systematically examine the latent prioritization strategies in terms of machine-oriented codecs. To further enhance real-world adaptability, we design an adaptive decoding controller, which dynamically determines the necessary decoding level during inference time to maintain the desired confidence of downstream machine prediction. Extensive experiments demonstrate that our approach enables efficient and adaptive progressive transmission while maintaining high performance in the downstream classification task, establishing a new paradigm for machine-aware progressive image compression.
https://arxiv.org/abs/2512.20070
In recent years, the demand for image compression models for machine vision has increased dramatically. However, image compression training frameworks still focus on human vision and preserve excessive perceptual detail, and thus have limitations in optimally reducing the bits per pixel when performing machine vision tasks. In this paper, we propose Semantic-based Low-bitrate Image compression for Machines by leveraging diffusion, termed SLIM. This is a new, effective training framework of image compression for machine vision, using a pretrained latent diffusion model. The compressor model of our method focuses only on the Region-of-Interest (RoI) areas for machine vision in the image latent, to compress it compactly. Then the pretrained Unet model enhances the decompressed latent, utilizing a RoI-focused text caption that contains semantic information of the image. Therefore, SLIM is able to focus on RoI areas of the image without any guide mask at the inference stage, achieving a low bitrate when compressing. SLIM is also able to enhance the decompressed latent through denoising steps, so the final reconstructed image from the enhanced latent can be optimized for the machine vision task while still containing perceptual details for human vision. Experimental results show that SLIM achieves higher classification accuracy at the same bits per pixel, compared to conventional image compression models for machines.
https://arxiv.org/abs/2512.18200
Reducing computational complexity remains a critical challenge for the widespread adoption of learning-based image compression techniques. In this work, we propose TreeNet, a novel low-complexity image compression model that leverages a binary tree-structured encoder-decoder architecture to achieve efficient representation and reconstruction. We employ an attentional feature fusion mechanism to effectively integrate features from multiple branches. We evaluate TreeNet on three widely used benchmark datasets and compare its performance against competing methods including JPEG AI, a recent standard in learning-based image compression. At low bitrates, TreeNet achieves an average improvement of 4.83% in BD-rate over JPEG AI, while reducing model complexity by 87.82%. Furthermore, we conduct extensive ablation studies to investigate the influence of various latent representations within TreeNet, offering deeper insight into the factors contributing to reconstruction quality.
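BD-rate figures like the 4.83% quoted here come from the Bjøntegaard metric: fit log-rate as a cubic polynomial of quality for each codec, then compare the average log-rate over the overlapping quality range. A minimal sketch follows; the RD points in the demo are hypothetical, not TreeNet's measurements.

```python
import numpy as np

def bd_rate(rate_anchor, psnr_anchor, rate_test, psnr_test):
    """Bjontegaard delta-rate: average bitrate change (%) at equal quality.

    Negative values mean the test codec saves bits relative to the anchor.
    """
    lr_a = np.log10(rate_anchor)
    lr_t = np.log10(rate_test)
    # Cubic fit of log-rate as a function of quality (the classic BD formulation)
    fit_a = np.polyfit(psnr_anchor, lr_a, 3)
    fit_t = np.polyfit(psnr_test, lr_t, 3)
    # Integrate over the overlapping quality interval of the two curves
    lo = max(np.min(psnr_anchor), np.min(psnr_test))
    hi = min(np.max(psnr_anchor), np.max(psnr_test))
    int_a = np.polyint(fit_a)
    int_t = np.polyint(fit_t)
    avg_a = (np.polyval(int_a, hi) - np.polyval(int_a, lo)) / (hi - lo)
    avg_t = (np.polyval(int_t, hi) - np.polyval(int_t, lo)) / (hi - lo)
    return (10.0 ** (avg_t - avg_a) - 1.0) * 100.0

# Hypothetical RD points (bpp, PSNR in dB): the test codec uses 10% fewer bits everywhere
rate = np.array([0.1, 0.2, 0.4, 0.8])
psnr = np.array([30.0, 33.0, 36.0, 39.0])
print(bd_rate(rate, psnr, 0.9 * rate, psnr))   # about -10.0
```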
https://arxiv.org/abs/2512.16743
Images are a substantial portion of the internet, making efficient compression important for reducing storage and bandwidth demands. This study investigates the use of Singular Value Decomposition and low-rank matrix approximations for image compression, evaluating performance using relative Frobenius error and compression ratio. The approach is applied to both grayscale and multichannel images to assess its generality. Results show that the low-rank approximations often produce images that appear visually similar to the originals, but the compression efficiency remains consistently worse than established formats such as JPEG, JPEG2000, and WEBP at comparable error levels. At low tolerated error levels, the compressed representation produced by Singular Value Decomposition can even exceed the size of the original image, indicating that this method is not competitive with industry-standard codecs for practical image compression.
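The pipeline this study evaluates reduces to a few NumPy calls. The sketch below uses illustrative sizes and a random stand-in image (not the study's data): it forms the best rank-k approximation, checks the relative Frobenius error against its closed form from the discarded singular values, and computes the naive storage-based compression ratio.

```python
import numpy as np

rng = np.random.default_rng(1)
m, n, k = 64, 64, 8                    # image size and retained rank (illustrative)
img = rng.random((m, n))               # stand-in for a grayscale image in [0, 1]

U, s, Vt = np.linalg.svd(img, full_matrices=False)
approx = (U[:, :k] * s[:k]) @ Vt[:k]   # best rank-k approximation (Eckart-Young)

# Relative Frobenius error, also known in closed form from the tail singular values
rel_err = np.linalg.norm(img - approx, "fro") / np.linalg.norm(img, "fro")
rel_err_closed = np.sqrt((s[k:] ** 2).sum() / (s ** 2).sum())

# Naive storage cost: k singular values plus k left and right singular vectors
ratio = (m * n) / (k * (m + n + 1))
print(f"rank {k}: relative error {rel_err:.3f}, compression ratio {ratio:.2f}x")
```

Note the ratio drops below 1 once k exceeds mn/(m+n+1), which is exactly the regime behind the observation that, at low tolerated error, the SVD representation can exceed the original image size.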
https://arxiv.org/abs/2512.16226
Evaluations of image compression performance which include human preferences have generally found that naive distortion functions such as MSE are insufficiently aligned to human perception. In order to align compression models to human perception, prior work has employed differentiable perceptual losses consisting of neural networks calibrated on large-scale datasets of human psycho-visual judgments. We show that, surprisingly, state-of-the-art vision-language models (VLMs) can replicate binary human two-alternative forced choice (2AFC) judgments zero-shot when asked to reason about the differences between pairs of images. Motivated to exploit the powerful zero-shot visual reasoning capabilities of VLMs, we propose Vision-Language Models for Image Compression (VLIC), a diffusion-based image compression system designed to be post-trained with binary VLM judgments. VLIC leverages existing techniques for diffusion model post-training with preferences, rather than distilling the VLM judgments into a separate perceptual loss network. We show that calibrating this system on VLM judgments produces competitive or state-of-the-art performance on human-aligned visual compression depending on the dataset, according to perceptual metrics and large-scale user studies. We additionally conduct an extensive analysis of the VLM-based reward design and training procedure and share important insights. More visuals are available at this https URL
https://arxiv.org/abs/2512.15701
Preprocessing is a well-established technique for optimizing compression, yet existing methods are predominantly Rate-Distortion (R-D) optimized and constrained by pixel-level fidelity. This work pioneers a shift towards Rate-Perception (R-P) optimization by, for the first time, adapting a large-scale pre-trained diffusion model for compression preprocessing. We propose a two-stage framework: first, we distill the multi-step Stable Diffusion 2.1 into a compact, one-step image-to-image model using Consistent Score Identity Distillation (CiD). Second, we perform a parameter-efficient fine-tuning of the distilled model's attention modules, guided by a Rate-Perception loss and a differentiable codec surrogate. Our method seamlessly integrates with standard codecs without any modification and leverages the model's powerful generative priors to enhance texture and mitigate artifacts. Experiments show substantial R-P gains, achieving up to a 30.13% BD-rate reduction in DISTS on the Kodak dataset and delivering superior subjective visual quality.
https://arxiv.org/abs/2512.15270
Ultra-low bitrate image compression (below 0.05 bits per pixel) is increasingly critical for bandwidth-constrained and computation-limited encoding scenarios such as edge devices. Existing frameworks typically rely on large pretrained encoders (e.g., VAEs or tokenizer-based models) and perform transform coding within their generative latent space. While these approaches achieve impressive perceptual fidelity, their reliance on heavy encoder networks makes them unsuitable for deployment on weak sender devices. In this work, we explore the feasibility of applying shallow encoders to ultra-low bitrate compression and propose a novel Asymmetric Extreme Image Compression (AEIC) framework that simultaneously pursues encoding simplicity and decoding quality. Specifically, AEIC employs moderate or even shallow encoder networks, while leveraging a one-step diffusion decoder to maintain high-fidelity, high-realism reconstructions at extreme bitrates. To further enhance the efficiency of shallow encoders, we design a dual-side feature distillation scheme that transfers knowledge from AEIC with moderate encoders to its shallow-encoder variants. Experiments demonstrate that AEIC not only outperforms existing methods in rate-distortion-perception performance at ultra-low bitrates, but also delivers exceptional encoding efficiency of 35.8 FPS on 1080p input images, while maintaining competitive decoding speed compared to existing methods.
https://arxiv.org/abs/2512.12229
Image Compression for Machines (ICM) has emerged as a pivotal research direction in the field of visual data compression. However, with the rapid evolution of machine intelligence, the target of compression has shifted from task-specific virtual models to embodied agents operating in real-world environments. To address the communication constraints of Embodied AI in multi-agent systems and ensure real-time task execution, this paper introduces, for the first time, the scientific problem of Embodied Image Compression. We establish a standardized benchmark, EmbodiedComp, to facilitate systematic evaluation under ultra-low bitrate conditions in a closed-loop setting. Through extensive empirical studies in both simulated and real-world settings, we demonstrate that existing Vision-Language-Action models (VLAs) fail to reliably perform even simple manipulation tasks when compressed below the embodied bitrate threshold. We anticipate that EmbodiedComp will catalyze the development of domain-specific compression tailored for embodied agents, thereby accelerating the deployment of Embodied AI in the real world.
https://arxiv.org/abs/2512.11612
This paper introduces ROI-Packing, an efficient image compression method tailored specifically for machine vision. By prioritizing regions of interest (ROI) critical to end-task accuracy and packing them efficiently while discarding less relevant data, ROI-Packing achieves significant compression efficiency without requiring retraining or fine-tuning of end-task models. Comprehensive evaluations across five datasets and two popular tasks, object detection and instance segmentation, demonstrate up to a 44.10% reduction in bitrate without compromising end-task accuracy, along with an 8.88% improvement in accuracy at the same bitrate compared to the state-of-the-art Versatile Video Coding (VVC) codec standardized by the Moving Picture Experts Group (MPEG).
https://arxiv.org/abs/2512.09258
Generative learned image compression methods using Vector Quantization (VQ) have recently shown impressive potential in balancing distortion and perceptual quality. However, these methods typically estimate the entropy of VQ indices using a static, global probability distribution, which fails to adapt to the specific content of each image. This non-adaptive approach leaves bitrate savings untapped and makes flexible rate control difficult. To address this challenge, we introduce a Controllable Generative Image Compression framework based on a VQ hyperprior, termed HVQ-CGIC. HVQ-CGIC rigorously derives the mathematical foundation for introducing a hyperprior into the VQ-index entropy model. On this foundation, through novel loss design, this framework is, to our knowledge, the first to introduce rate-distortion (RD) balance and control into VQ-based generative image compression. Together with a lightweight hyperprior estimation network, HVQ-CGIC achieves a significant advantage in RD performance over current state-of-the-art (SOTA) generative compression methods. On the Kodak dataset, we achieve the same LPIPS as Control-GIC, CDC and HiFiC with an average of 61.3% fewer bits. We posit that HVQ-CGIC has the potential to become a foundational component for VQGAN-based image compression, analogous to the integral role of the HyperPrior framework in neural image compression.
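The motivation for a per-image (hyperprior-driven) entropy model over VQ indices can be seen with a toy calculation: by Gibbs' inequality, coding an image's indices under a static global distribution never takes fewer bits than coding them under the image's own index distribution. The histograms below are hypothetical, chosen only to make the gap visible.

```python
import numpy as np

def bits_per_index(counts, model_probs):
    """Average bits per VQ index when empirical usage `counts` is coded under `model_probs`."""
    p = counts / counts.sum()
    return float(-(p * np.log2(model_probs)).sum())

# Hypothetical index histograms for two images over a 4-entry codebook
img_a = np.array([70.0, 20.0, 5.0, 5.0])    # image A mostly uses index 0
img_b = np.array([5.0, 5.0, 20.0, 70.0])    # image B mostly uses index 3

# One static global model, fit to the pooled statistics of both images
static_model = (img_a + img_b) / (img_a + img_b).sum()

for counts in (img_a, img_b):
    adaptive = counts / counts.sum()         # per-image model (what a hyperprior approximates)
    print(f"static: {bits_per_index(counts, static_model):.3f} bits, "
          f"adaptive: {bits_per_index(counts, adaptive):.3f} bits")
```

The adaptive cost equals the image's index entropy, the static cost is the cross-entropy against the global model; the difference is the KL divergence, i.e. exactly the bitrate a content-adaptive entropy model can recover.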
https://arxiv.org/abs/2512.07192