Block-based image compression relies on transform coding to concentrate signal energy into a small number of coefficients. While classical codecs use fixed transforms such as the Discrete Cosine Transform (DCT), data-driven methods such as Principal Component Analysis (PCA) are theoretically optimal for decorrelation. This paper presents an experimental comparison of DCT, Hadamard, and PCA across multiple block sizes and compression rates. Using rate-distortion and energy-compaction analysis, we show that PCA outperforms fixed transforms only when block dimensionality is sufficiently large, while DCT remains near-optimal for standard block sizes such as $8\times8$ and at low bit rates. These results explain the robustness of the DCT in practical codecs and highlight the limitations of block-wise learned transforms.
https://arxiv.org/abs/2601.06273
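A quick way to see the DCT-versus-PCA comparison in the abstract above is on synthetic correlated blocks. The sketch below is illustrative, not the paper's setup: it assumes 1-D AR(1) rows as a stand-in for natural-image correlation, and measures energy compaction as the fraction of total variance captured by the top-m transform coefficients.

```python
import numpy as np

def dct_basis(n):
    """Orthonormal DCT-II basis matrix; rows are basis vectors."""
    j = np.arange(n)[:, None]
    k = np.arange(n)[None, :]
    B = np.cos(np.pi * (2 * k + 1) * j / (2 * n))
    B[0] /= np.sqrt(n)
    B[1:] *= np.sqrt(2.0 / n)
    return B

def compaction(basis, x, m):
    """Fraction of total energy captured by the m highest-variance coefficients."""
    var = np.sort((x @ basis.T).var(axis=0))[::-1]
    return var[:m].sum() / var.sum()

rng = np.random.default_rng(0)
n, rho = 8, 0.95
# AR(1) covariance: a standard first-order model of pixel correlation.
cov = rho ** np.abs(np.subtract.outer(np.arange(n), np.arange(n)))
blocks = rng.multivariate_normal(np.zeros(n), cov, size=20000)

# PCA basis (KLT): eigenvectors of the empirical covariance, sorted descending.
_, evecs = np.linalg.eigh(np.cov(blocks.T))
pca = evecs[:, ::-1].T

for m in (1, 2, 4):
    print(m, round(compaction(dct_basis(n), blocks, m), 4),
             round(compaction(pca, blocks, m), 4))
```

For strongly correlated sources the DCT's compaction closely tracks the PCA/KLT optimum, which is consistent with the abstract's finding that the DCT stays near-optimal at standard block sizes.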
Deep learning models for image compression often face practical limitations in hardware-constrained applications. Although these models achieve high-quality reconstructions, they are typically complex, heavyweight, and require substantial training data and computational resources. We propose a methodology to partially compress these networks by reducing the size of their encoders. Our approach uses a simplified knowledge distillation strategy to approximate the latent space of the original models with less data and shorter training, yielding lightweight encoders from heavyweight ones. We evaluate the resulting lightweight encoders across two different architectures on the image compression task. Experiments show that our method preserves reconstruction quality and statistical fidelity better than training lightweight encoders with the original loss, making it practical for resource-limited environments.
https://arxiv.org/abs/2601.05639
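The core distillation step described above (train a small encoder to reproduce the frozen teacher's latents rather than re-optimizing the original objective) can be illustrated with a linear toy model. Everything here is an illustrative assumption, not the paper's architecture: the teacher is a random linear map, the lightweight student is a low-rank factorization with fewer parameters, and training is plain gradient descent on the latent-matching MSE.

```python
import numpy as np

rng = np.random.default_rng(1)
d_in, d_lat, n, r = 64, 16, 512, 8

# Frozen "heavyweight" teacher encoder; a plain linear map in this toy setting.
W_teacher = rng.normal(size=(d_lat, d_in)) / np.sqrt(d_in)
X = rng.normal(size=(n, d_in))
Z_teacher = X @ W_teacher.T            # target latents for distillation

# "Lightweight" student: a rank-r factorization with fewer parameters.
A = rng.normal(size=(d_lat, r)) * 0.1
B = rng.normal(size=(r, d_in)) * 0.1

def latent_mse(A, B):
    return float(np.mean((X @ (A @ B).T - Z_teacher) ** 2))

initial = latent_mse(A, B)
lr = 0.05
for _ in range(1000):
    err = X @ (A @ B).T - Z_teacher    # residual against the teacher's latents
    G = err.T @ X / n                  # gradient w.r.t. the product W = A @ B
    gA, gB = G @ B.T, A.T @ G
    A -= lr * gA
    B -= lr * gB
final = latent_mse(A, B)
print(initial, final)                  # distillation shrinks the latent gap
```

The student cannot match the teacher exactly (it has lower rank), but matching latents directly gives it a well-posed target, which is the intuition behind approximating the latent space with less data and shorter training.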
While one-step diffusion models have recently excelled in perceptual image compression, their application to video remains limited. Prior efforts typically rely on pretrained 2D autoencoders that generate per-frame latent representations independently, thereby neglecting temporal dependencies. We present YODA (Yet Another One-step Diffusion-based Video Compressor), which embeds multiscale features from temporal references into both latent generation and latent coding, better exploiting spatiotemporal correlations for a more compact representation, and which employs a linear Diffusion Transformer (DiT) for efficient one-step denoising. YODA achieves state-of-the-art perceptual performance, consistently outperforming traditional and deep-learning baselines on LPIPS, DISTS, FID, and KID. Source code will be publicly available at this https URL.
https://arxiv.org/abs/2601.01141
High-fidelity compression of multispectral solar imagery remains challenging for space missions, where limited bandwidth must be balanced against preserving fine spectral and spatial details. We present a learned image compression framework tailored to solar observations, leveraging two complementary modules: (1) the Inter-Spectral Windowed Graph Embedding (iSWGE), which explicitly models inter-band relationships by representing spectral channels as graph nodes with learned edge features; and (2) the Windowed Spatial Graph Attention and Convolutional Block Attention (WSGA-C), which combines sparse graph attention with convolutional attention to reduce spatial redundancy and emphasize fine-scale structures. Evaluations on the SDOML dataset across six extreme ultraviolet (EUV) channels show that our approach achieves a 20.15% reduction in Mean Spectral Information Divergence (MSID), up to a 1.09% PSNR improvement, and a 1.62% log-transformed MS-SSIM gain over strong learned baselines, delivering sharper and spectrally faithful reconstructions at comparable bits-per-pixel rates. The code is publicly available at this https URL.
https://arxiv.org/abs/2512.24463
Unmanned aerial vehicles (UAVs) are crucial tools for post-disaster search and rescue, facing challenges such as high information density, rapid changes in viewpoint, and dynamic structures, especially in long-horizon navigation. However, current UAV vision-and-language navigation (VLN) methods struggle to model long-horizon spatiotemporal context in complex environments, resulting in inaccurate semantic alignment and unstable path planning. To this end, we propose LongFly, a spatiotemporal context modeling framework for long-horizon UAV VLN. LongFly adopts a history-aware spatiotemporal modeling strategy that transforms fragmented and redundant historical data into structured, compact, and expressive representations. First, we propose the slot-based historical image compression module, which dynamically distills multi-view historical observations into fixed-length contextual representations. Then, the spatiotemporal trajectory encoding module is introduced to capture the temporal dynamics and spatial structure of UAV trajectories. Finally, to integrate existing spatiotemporal context with current observations, we design the prompt-guided multimodal integration module to support time-based reasoning and robust waypoint prediction. Experimental results demonstrate that LongFly outperforms state-of-the-art UAV VLN baselines by 7.89% in success rate and 6.33% in success weighted by path length, consistently across both seen and unseen environments.
https://arxiv.org/abs/2512.22010
Recent advances in generative AI have accelerated the production of ultra-high-resolution visual content, posing significant challenges for efficient compression and real-time decoding on end-user devices. Inspired by 3D Gaussian Splatting, recent 2D Gaussian image models improve representation efficiency, yet existing methods struggle to balance compression ratio and reconstruction fidelity in ultra-high-resolution scenarios. To address this issue, we propose SmartSplat, a highly adaptive and feature-aware GS-based image compression framework that supports arbitrary image resolutions and compression ratios. SmartSplat leverages image-aware features such as gradients and color variances, introducing a Gradient-Color Guided Variational Sampling strategy together with an Exclusion-based Uniform Sampling scheme to improve the non-overlapping coverage of Gaussian primitives in pixel space. In addition, we propose a Scale-Adaptive Gaussian Color Sampling method to enhance color initialization across scales. Through joint optimization of spatial layout, scale, and color initialization, SmartSplat efficiently captures both local structures and global textures using a limited number of Gaussians, achieving high reconstruction quality under strong compression. Extensive experiments on DIV8K and a newly constructed 16K dataset demonstrate that SmartSplat consistently outperforms state-of-the-art methods at comparable compression ratios and exceeds their compression limits, showing strong scalability and practical applicability. The code is publicly available at this https URL.
https://arxiv.org/abs/2512.20377
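The gradient-guided part of the sampling strategy above can be caricatured in a few lines: place primitive centers preferentially where the image has structure. This is only a crude stand-in for the paper's Gradient-Color Guided Variational Sampling; the weighting, the synthetic image, and the omission of color variance and exclusion-based uniform sampling are all simplifications.

```python
import numpy as np

def sample_centers(img, n, w_grad=1.0, eps=1e-3):
    """Sample n pixel positions with probability proportional to local
    gradient magnitude, so primitives concentrate on structured regions."""
    gy, gx = np.gradient(img)
    score = w_grad * np.hypot(gx, gy) + eps   # eps keeps flat regions reachable
    p = (score / score.sum()).ravel()
    rng = np.random.default_rng(0)
    flat = rng.choice(img.size, size=n, replace=False, p=p)
    return np.stack(np.unravel_index(flat, img.shape), axis=1)

# Synthetic image: flat background with one textured quadrant.
rng = np.random.default_rng(6)
img = np.zeros((64, 64))
img[:32, :32] = rng.normal(size=(32, 32))

centers = sample_centers(img, 200)
in_textured = np.mean((centers[:, 0] < 32) & (centers[:, 1] < 32))
print(in_textured)   # most samples land in the textured quadrant
```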
Most existing image compression approaches perform transform coding in the pixel space to reduce its spatial redundancy. However, they encounter difficulties in achieving both high realism and high fidelity at low bitrates, as pixel-space distortion may not align with human perception. To address this issue, we introduce a Generative Latent Coding (GLC) architecture, which performs transform coding in the latent space of a generative vector-quantized variational auto-encoder (VQ-VAE) instead of in the pixel space. The generative latent space is characterized by greater sparsity, richer semantics, and better alignment with human perception, rendering it advantageous for high-realism, high-fidelity compression. Additionally, we introduce a categorical hyper module to reduce the bit cost of hyper-information, and code-prediction-based supervision to enhance semantic consistency. Experiments demonstrate that our GLC maintains high visual quality with less than 0.04 bpp on natural images and less than 0.01 bpp on facial images. On the CLIC2020 test set, we achieve the same FID as MS-ILLM with 45% fewer bits. Furthermore, the powerful generative latent space enables various applications built on our GLC pipeline, such as image restoration and style transfer. The code is available at this https URL.
https://arxiv.org/abs/2512.20194
Recent advances in learned image codecs have been extended from human perception toward machine perception. However, progressive image compression with fine granular scalability (FGS), which enables decoding a single bitstream at multiple quality levels, remains unexplored for machine-oriented codecs. In this work, we propose PICM-Net, a novel progressive learned image compression codec for machine perception based on trit-plane coding. By analyzing the difference between human- and machine-oriented rate-distortion priorities, we systematically examine latent prioritization strategies for machine-oriented codecs. To further enhance real-world adaptability, we design an adaptive decoding controller, which dynamically determines the necessary decoding level at inference time to maintain the desired confidence of the downstream machine prediction. Extensive experiments demonstrate that our approach enables efficient and adaptive progressive transmission while maintaining high performance on the downstream classification task, establishing a new paradigm for machine-aware progressive image compression.
https://arxiv.org/abs/2512.20070
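Trit-plane coding itself can be sketched independently of the network: quantized latent values are decomposed into base-3 digit planes and decoded most-significant-first, so truncating the stream after any plane still yields a usable reconstruction. A minimal sketch, with an illustrative value range and plane count:

```python
import numpy as np

def to_trit_planes(q, num_planes):
    """Decompose non-negative integers into base-3 digit planes, most significant first."""
    return [(q // 3 ** p) % 3 for p in range(num_planes - 1, -1, -1)]

def from_trit_planes(planes, num_planes):
    """Reconstruct from a prefix of planes; missing planes are replaced by the
    midpoint of the remaining uncertainty interval."""
    q = np.zeros_like(planes[0])
    for i, pl in enumerate(planes):
        q = q + pl * 3 ** (num_planes - 1 - i)
    remaining = num_planes - len(planes)
    if remaining > 0:
        q = q + (3 ** remaining - 1) // 2
    return q

rng = np.random.default_rng(2)
latents = rng.integers(0, 27, size=1000)   # toy quantized magnitudes in [0, 3^3)
planes = to_trit_planes(latents, 3)

for k in (1, 2, 3):
    rec = from_trit_planes(planes[:k], 3)
    print(k, float(np.mean((rec - latents) ** 2)))   # MSE falls as planes arrive
```

Real trit-plane codecs handle signed latents (sign plus magnitude) and entropy-code each plane; the point here is only that every additional plane refines the previous reconstruction, which is what makes the adaptive decoding controller's "stop when confident" strategy possible.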
In recent years, demand for image compression models for machine vision has increased dramatically. However, image compression training frameworks still focus on human vision and preserve excessive perceptual detail, and thus fall short of optimally reducing the bits per pixel when the target is a machine vision task. In this paper, we propose Semantic-based Low-bitrate Image compression for Machines, termed SLIM, which leverages diffusion. This is a new and effective training framework of image compression for machine vision that uses a pretrained latent diffusion model. The compressor model of our method focuses only on the Region-of-Interest (RoI) areas for machine vision in the image latent, to compress it compactly. The pretrained U-Net model then enhances the decompressed latent, utilizing a RoI-focused text caption that contains semantic information of the image. SLIM is therefore able to focus on RoI areas of the image without any guide mask at the inference stage, achieving a low bitrate when compressing. SLIM can also enhance a decompressed latent through denoising steps, so the final image reconstructed from the enhanced latent can be optimized for the machine vision task while still containing perceptual details for human vision. Experimental results show that SLIM achieves higher classification accuracy at the same bits per pixel, compared to conventional image compression models for machines.
https://arxiv.org/abs/2512.18200
Reducing computational complexity remains a critical challenge for the widespread adoption of learning-based image compression techniques. In this work, we propose TreeNet, a novel low-complexity image compression model that leverages a binary tree-structured encoder-decoder architecture to achieve efficient representation and reconstruction. We employ an attentional feature fusion mechanism to effectively integrate features from multiple branches. We evaluate TreeNet on three widely used benchmark datasets and compare its performance against competing methods, including JPEG AI, a recent standard in learning-based image compression. At low bitrates, TreeNet achieves an average improvement of 4.83% in BD-rate over JPEG AI while reducing model complexity by 87.82%. Furthermore, we conduct extensive ablation studies to investigate the influence of various latent representations within TreeNet, offering deeper insights into the factors contributing to reconstruction quality.
https://arxiv.org/abs/2512.16743
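BD-rate, the metric behind the 4.83% figure above, has a compact standard computation: fit log-rate as a cubic in PSNR for each codec, then integrate the gap between the two fits over the shared quality range. A self-contained sketch with made-up rate-quality curves (the curves are illustrative, not TreeNet's numbers):

```python
import numpy as np

def bd_rate(rate_ref, psnr_ref, rate_test, psnr_test):
    """Bjontegaard delta rate: average bitrate change (%) of 'test' vs 'ref'
    at equal quality, via cubic fits of log10(rate) as a function of PSNR."""
    lr_ref, lr_test = np.log10(rate_ref), np.log10(rate_test)
    p_ref = np.polyfit(psnr_ref, lr_ref, 3)
    p_test = np.polyfit(psnr_test, lr_test, 3)
    lo = max(min(psnr_ref), min(psnr_test))   # overlapping PSNR interval
    hi = min(max(psnr_ref), max(psnr_test))
    P_ref, P_test = np.polyint(p_ref), np.polyint(p_test)
    avg_diff = (np.polyval(P_test, hi) - np.polyval(P_test, lo)
                - np.polyval(P_ref, hi) + np.polyval(P_ref, lo)) / (hi - lo)
    return (10 ** avg_diff - 1) * 100

# Toy curves: 'test' spends half the bits of 'ref' at every quality level.
rates = np.array([0.1, 0.2, 0.4, 0.8])
psnrs = np.array([30.0, 33.0, 36.0, 39.0])
print(bd_rate(rates, psnrs, rates / 2, psnrs))   # about -50 (% bitrate saved)
```

A negative BD-rate means the test codec needs fewer bits for the same quality, so "4.83% improvement in BD-rate" corresponds to an average bitrate saving at matched quality.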
Images make up a substantial portion of internet traffic, making efficient compression important for reducing storage and bandwidth demands. This study investigates the use of Singular Value Decomposition (SVD) and low-rank matrix approximations for image compression, evaluating performance using relative Frobenius error and compression ratio. The approach is applied to both grayscale and multichannel images to assess its generality. Results show that the low-rank approximations often produce images that appear visually similar to the originals, but the compression efficiency remains consistently worse than established formats such as JPEG, JPEG2000, and WEBP at comparable error levels. At low tolerated error levels, the compressed representation produced by SVD can even exceed the size of the original image, indicating that this method is not competitive with industry-standard codecs for practical image compression.
https://arxiv.org/abs/2512.16226
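The storage accounting behind the conclusions above is simple: a rank-k approximation of an m-by-n image stores k(m + n + 1) values (k left vectors, k singular values, k right vectors) instead of mn pixels. A minimal sketch on a synthetic low-rank-plus-noise matrix (sizes and noise level are arbitrary stand-ins for a real image):

```python
import numpy as np

def svd_compress(img, k):
    """Rank-k approximation of a grayscale image via truncated SVD."""
    U, s, Vt = np.linalg.svd(img, full_matrices=False)
    approx = (U[:, :k] * s[:k]) @ Vt[:k]
    rel_err = np.linalg.norm(img - approx) / np.linalg.norm(img)
    m, n = img.shape
    # Values stored: k left vectors, k singular values, k right vectors.
    ratio = m * n / (k * (m + n + 1))
    return approx, rel_err, ratio

rng = np.random.default_rng(3)
# Toy low-rank-plus-noise "image"; real images have slower spectral decay.
base = rng.normal(size=(256, 8)) @ rng.normal(size=(8, 256))
img = base + 0.05 * rng.normal(size=(256, 256))

for k in (4, 8, 16):
    _, err, ratio = svd_compress(img, k)
    print(k, round(err, 4), round(ratio, 2))
```

Note that the ratio is independent of content: once k(m + n + 1) exceeds mn, the "compressed" representation is larger than the original, which is exactly the failure mode the abstract reports at low tolerated error levels.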
Evaluations of image compression performance that include human preferences have generally found that naive distortion functions such as MSE are insufficiently aligned with human perception. To align compression models with human perception, prior work has employed differentiable perceptual losses consisting of neural networks calibrated on large-scale datasets of human psycho-visual judgments. We show that, surprisingly, state-of-the-art vision-language models (VLMs) can replicate binary human two-alternative forced choice (2AFC) judgments zero-shot when asked to reason about the differences between pairs of images. Motivated to exploit the powerful zero-shot visual reasoning capabilities of VLMs, we propose Vision-Language Models for Image Compression (VLIC), a diffusion-based image compression system designed to be post-trained with binary VLM judgments. VLIC leverages existing techniques for diffusion model post-training with preferences, rather than distilling the VLM judgments into a separate perceptual loss network. We show that calibrating this system on VLM judgments produces competitive or state-of-the-art performance on human-aligned visual compression depending on the dataset, according to perceptual metrics and large-scale user studies. We additionally conduct an extensive analysis of the VLM-based reward design and training procedure and share important insights. More visuals are available at this https URL.
https://arxiv.org/abs/2512.15701
Preprocessing is a well-established technique for optimizing compression, yet existing methods are predominantly Rate-Distortion (R-D) optimized and constrained by pixel-level fidelity. This work pioneers a shift towards Rate-Perception (R-P) optimization by, for the first time, adapting a large-scale pre-trained diffusion model for compression preprocessing. We propose a two-stage framework: first, we distill the multi-step Stable Diffusion 2.1 into a compact, one-step image-to-image model using Consistent Score Identity Distillation (CiD). Second, we perform a parameter-efficient fine-tuning of the distilled model's attention modules, guided by a Rate-Perception loss and a differentiable codec surrogate. Our method seamlessly integrates with standard codecs without any modification and leverages the model's powerful generative priors to enhance texture and mitigate artifacts. Experiments show substantial R-P gains, achieving up to a 30.13% BD-rate reduction in DISTS on the Kodak dataset and delivering superior subjective visual quality.
https://arxiv.org/abs/2512.15270
Ultra-low bitrate image compression (below 0.05 bits per pixel) is increasingly critical for bandwidth-constrained and computation-limited encoding scenarios such as edge devices. Existing frameworks typically rely on large pretrained encoders (e.g., VAEs or tokenizer-based models) and perform transform coding within their generative latent space. While these approaches achieve impressive perceptual fidelity, their reliance on heavy encoder networks makes them unsuitable for deployment on weak sender devices. In this work, we explore the feasibility of applying shallow encoders to ultra-low bitrate compression and propose a novel Asymmetric Extreme Image Compression (AEIC) framework that pursues encoding simplicity and decoding quality simultaneously. Specifically, AEIC employs moderate or even shallow encoder networks, while leveraging a one-step diffusion decoder to maintain high-fidelity and high-realism reconstructions under extreme bitrates. To further enhance the efficiency of shallow encoders, we design a dual-side feature distillation scheme that transfers knowledge from AEIC with moderate encoders to its shallow-encoder variants. Experiments demonstrate that AEIC not only outperforms existing methods in rate-distortion-perception performance at ultra-low bitrates, but also delivers exceptional encoding efficiency, reaching 35.8 FPS on 1080p input images while maintaining competitive decoding speed compared to existing methods.
https://arxiv.org/abs/2512.12229
Image Compression for Machines (ICM) has emerged as a pivotal research direction in the field of visual data compression. However, with the rapid evolution of machine intelligence, the target of compression has shifted from task-specific virtual models to embodied agents operating in real-world environments. To address the communication constraints of embodied AI in multi-agent systems and ensure real-time task execution, this paper introduces, for the first time, the scientific problem of Embodied Image Compression. We establish a standardized benchmark, EmbodiedComp, to facilitate systematic evaluation under ultra-low bitrate conditions in a closed-loop setting. Through extensive empirical studies in both simulated and real-world settings, we demonstrate that existing Vision-Language-Action models (VLAs) fail to reliably perform even simple manipulation tasks when compressed below the embodied bitrate threshold. We anticipate that EmbodiedComp will catalyze the development of domain-specific compression tailored for embodied agents, thereby accelerating the deployment of embodied AI in the real world.
https://arxiv.org/abs/2512.11612
This paper introduces ROI-Packing, an efficient image compression method tailored specifically for machine vision. By prioritizing regions of interest (ROI) critical to end-task accuracy and packing them efficiently while discarding less relevant data, ROI-Packing achieves significant compression efficiency without requiring retraining or fine-tuning of end-task models. Comprehensive evaluations across five datasets and two popular tasks, object detection and instance segmentation, demonstrate up to a 44.10% reduction in bitrate without compromising end-task accuracy, along with an 8.88% improvement in accuracy at the same bitrate compared to the state-of-the-art Versatile Video Coding (VVC) codec standardized by the Moving Picture Experts Group (MPEG).
https://arxiv.org/abs/2512.09258
Generative learned image compression methods using Vector Quantization (VQ) have recently shown impressive potential in balancing distortion and perceptual quality. However, these methods typically estimate the entropy of VQ indices using a static, global probability distribution, which fails to adapt to the specific content of each image. This non-adaptive approach leaves bitrate potential untapped and makes flexible rate control difficult. To address this challenge, we introduce a Controllable Generative Image Compression framework based on a VQ hyperprior, termed HVQ-CGIC. HVQ-CGIC rigorously derives the mathematical foundation for introducing a hyperprior into the entropy model of the VQ indices. On this foundation, through novel loss design, this framework is, to our knowledge, the first to introduce rate-distortion (RD) balance and control into vector quantization-based generative image compression. Cooperating with a lightweight hyperprior estimation network, HVQ-CGIC achieves a significant advantage in RD performance over current state-of-the-art (SOTA) generative compression methods. On the Kodak dataset, we achieve the same LPIPS as Control-GIC, CDC, and HiFiC with, on average, 61.3% fewer bits. We posit that HVQ-CGIC has the potential to become a foundational component for VQGAN-based image compression, analogous to the integral role of the hyperprior framework in neural image compression.
https://arxiv.org/abs/2512.07192
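The gap the abstract above targets (a static global index distribution versus a content-adaptive one) is easy to quantify with a toy codebook. This sketch uses a uniform static model and an oracle per-image histogram as a stand-in for what a hyperprior provides; the codebook size, distributions, and smoothing are all illustrative.

```python
import numpy as np

def bits_per_index(indices, probs):
    """Average ideal code length (bits) for 'indices' under model 'probs'."""
    return float(-np.mean(np.log2(probs[indices])))

rng = np.random.default_rng(4)
K = 64                                  # codebook size
global_p = np.full(K, 1.0 / K)          # static model: uniform over the codebook

# A particular "image" uses a skewed subset of the codebook.
img_p = rng.dirichlet(np.full(K, 0.1))
idx = rng.choice(K, size=5000, p=img_p)

# Content-adaptive model: empirical frequencies of this image's indices
# (a hyperprior plays this role without sending the histogram explicitly).
counts = np.bincount(idx, minlength=K) + 1e-6
adaptive_p = counts / counts.sum()

print(bits_per_index(idx, global_p), bits_per_index(idx, adaptive_p))
```

The static model pays log2(K) = 6 bits per index regardless of content, while the adaptive model pays roughly the empirical entropy of this image's index usage, which is the bitrate headroom a per-image entropy model can recover.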
Feature extraction in noisy image datasets presents many challenges for model reliability. In this paper, we use the discrete Fourier transform in conjunction with persistent homology analysis to extract the specific frequencies that correspond to certain topological features of an image. This method allows the image to be compressed and re-formed while ensuring that meaningful data can be differentiated. Our experimental results show a level of compression comparable to JPEG across six different metrics. Persistent homology-guided frequency filtration ultimately aims to improve performance in binary classification tasks (when augmenting a Convolutional Neural Network) compared to traditional feature extraction and compression methods. These findings highlight a useful end result: enhanced reliability of image compression under noisy conditions.
https://arxiv.org/abs/2512.07065
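The persistent-homology selection itself is beyond a short sketch, but the frequency-filtration half of the pipeline (keep a chosen set of DFT coefficients, invert, compare) is compact. Below, a simple low-pass mask stands in for the topologically selected frequency set, on synthetic data whose structure is low-frequency and whose noise is broadband; the image size, radius, and noise level are illustrative.

```python
import numpy as np

def keep_frequencies(img, radius):
    """Reconstruct an image from only the DFT coefficients within 'radius'
    of the zero frequency (a simple low-pass frequency filtration)."""
    F = np.fft.fftshift(np.fft.fft2(img))
    h, w = img.shape
    yy, xx = np.mgrid[:h, :w]
    mask = (yy - h // 2) ** 2 + (xx - w // 2) ** 2 <= radius ** 2
    return np.real(np.fft.ifft2(np.fft.ifftshift(F * mask))), mask.mean()

rng = np.random.default_rng(5)
x = np.linspace(0, 2 * np.pi, 64, endpoint=False)
smooth = np.sin(x)[None, :] + np.cos(2 * x)[:, None]   # low-frequency structure
noisy = smooth + 0.3 * rng.normal(size=(64, 64))       # broadband noise

rec, kept = keep_frequencies(noisy, radius=6)
err_noisy = np.mean((noisy - smooth) ** 2)
err_rec = np.mean((rec - smooth) ** 2)
print(round(kept, 4), round(err_noisy, 4), round(err_rec, 4))
```

Keeping a few percent of the coefficients both compresses the representation and suppresses most of the noise energy, which is the mechanism behind the abstract's "reliability under noisy conditions" claim; the paper's contribution is choosing *which* frequencies to keep via persistent homology rather than a fixed radius.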
Generative image compression has recently shown impressive perceptual quality, but it often suffers from semantic deviations caused by generative hallucinations at ultra-low bitrate (bpp < 0.05), limiting its reliable deployment in bandwidth-constrained 6G semantic communication scenarios. In this work, we reassess the positioning and role of multimodal guidance, and propose a Multimodal-Guided Task-Aware Generative Image Compression (MTGC) framework. Specifically, MTGC integrates three guidance modalities to enhance semantic consistency: a concise but robust text caption for global semantics, a highly compressed image (HCI) retaining low-level visual information, and Semantic Pseudo-Words (SPWs) for fine-grained task-relevant semantics. The SPWs are generated by our Task-Aware Semantic Compression Module (TASCM), which operates in a task-oriented manner to drive the multi-head self-attention mechanism to focus on and extract semantics relevant to the generation task while filtering out redundancy. Subsequently, to facilitate the synergistic guidance of these modalities, we design a Multimodal-Guided Diffusion Decoder (MGDD) employing a dual-path cooperative guidance mechanism that synergizes cross-attention and ControlNet additive residuals to precisely inject the three guidance signals into the diffusion process, and leverages the diffusion model's powerful generative priors to reconstruct the image. Extensive experiments demonstrate that MTGC consistently improves semantic consistency (e.g., DISTS drops by 10.59% on the DIV2K dataset) while also achieving remarkable gains in perceptual quality and pixel-level fidelity at ultra-low bitrate.
https://arxiv.org/abs/2512.06344
This work presents an independent reproducibility study of a lossy image compression technique that integrates singular value decomposition (SVD) and wavelet difference reduction (WDR). The original paper claims that combining SVD and WDR yields better visual quality and higher compression ratios than JPEG2000 and standalone WDR. I re-implemented the proposed method, carefully examined missing implementation details, and replicated the original experiments as closely as possible. I then conducted additional experiments on new images and evaluated performance using PSNR and SSIM. In contrast to the original claims, my results indicate that the SVD+WDR technique generally does not surpass JPEG2000 or WDR in terms of PSNR, and only partially improves SSIM relative to JPEG2000. The study highlights ambiguities in the original description (e.g., quantization and threshold initialization) and illustrates how such gaps can significantly impact reproducibility and reported performance.
https://arxiv.org/abs/2512.06221