Recent advanced diffusion methods typically derive strong generative priors by scaling diffusion transformers. However, scaling fails to generalize when adapted to real-time compression scenarios that demand lightweight models. In this paper, we explore the design of real-time, lightweight diffusion codecs by addressing two pivotal questions. First, does diffusion pre-training benefit lightweight diffusion codecs? Through systematic analysis, we find that generation-oriented pre-training is less effective at small model scales, whereas compression-oriented pre-training yields consistently better performance. Second, are transformers essential? We find that while global attention is crucial for standard generation, lightweight convolutions suffice for compression-oriented diffusion when paired with distillation. Guided by these findings, we establish a one-step lightweight convolutional diffusion codec that achieves real-time $60$~FPS encoding and $42$~FPS decoding at 1080p. Further enhanced by distillation and adversarial learning, the proposed codec reduces bitrate by 85\% at a comparable FID to MS-ILLM, bridging the gap between generative compression and practical real-time deployment. Code is released at this https URL
https://arxiv.org/abs/2604.12525
The rapid growth of visual data under stringent storage and bandwidth constraints makes extremely low-bitrate image compression increasingly important. While Vector Quantization (VQ) offers strong structural fidelity, existing methods lack a principled mechanism for joint rate-distortion (RD) optimization due to the disconnect between representation learning and entropy modeling. We propose RDVQ, a unified framework that enables end-to-end RD optimization for VQ-based compression via a differentiable relaxation of the codebook distribution, allowing the entropy loss to directly shape the latent prior. We further develop an autoregressive entropy model that supports accurate entropy modeling and test-time rate control. Extensive experiments demonstrate that RDVQ achieves strong performance at extremely low bitrates with a lightweight architecture, attaining competitive or superior perceptual quality with significantly fewer parameters. Compared with RDEIC, RDVQ reduces bitrate by up to 75.71% on DISTS and 37.63% on LPIPS on DIV2K-val. Beyond empirical gains, RDVQ introduces an entropy-constrained formulation of VQ, highlighting the potential for a more unified view of image tokenization and compression. The code will be available at this https URL.
https://arxiv.org/abs/2604.10546
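RDVQ's differentiable relaxation of the codebook distribution is not detailed in the abstract. As a hedged sketch of the general idea, a hard nearest-codeword assignment can be softened into a temperature-controlled softmax over negative squared distances, making an entropy (rate) term differentiable with respect to the assignment. The function names, distance metric, and softmax form below are illustrative assumptions, not RDVQ's actual formulation.

```python
import math

def soft_assign(z, codebook, tau=1.0):
    """Relax the hard nearest-codeword choice into a softmax over
    negative squared distances; tau -> 0 recovers the hard argmin."""
    d = [sum((zi - ci) ** 2 for zi, ci in zip(z, c)) for c in codebook]
    m = max(-di / tau for di in d)  # subtract max for numerical stability
    w = [math.exp(-di / tau - m) for di in d]
    s = sum(w)
    return [wi / s for wi in w]

def rate_loss(probs, eps=1e-12):
    """Entropy of a codeword-usage distribution in bits: a simple
    differentiable proxy for the expected code length."""
    return -sum(p * math.log2(p + eps) for p in probs)
```

With a low temperature, the soft assignment concentrates on the nearest codeword while the entropy term still carries gradient into the usage distribution, which is the mechanism that lets a rate penalty shape the latent prior.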
Parameter-efficient fine-tuning of pre-trained codecs is a promising direction in image compression for human and machine vision. While most existing works have primarily focused on tuning the feature structure within the encoder-decoder backbones, the adaptation of the statistical semantics within the entropy model has received limited attention despite its function of predicting the probability distribution of latent features. Our analysis reveals that naive adapter insertion into the entropy model can lead to suboptimal outcomes, underscoring that the effectiveness of adapter-based tuning depends critically on the coordination between adapter type and placement across the compression pipeline. Therefore, we introduce Structure-Semantics Co-Tuning (S2-CoT), a novel framework that achieves this coordination via two specialized, synergistic adapters: the Structural Fidelity Adapter (SFA) and the Semantic Context Adapter (SCA). SFA is integrated into the encoder-decoder to preserve high-fidelity representations by dynamically fusing spatial and frequency information; meanwhile, the SCA adapts the entropy model to align with SFA-tuned features by refining the channel context for more efficient statistical coding. Through joint optimization, S2-CoT turns potential performance degradation into synergistic gains, achieving state-of-the-art results across four diverse base codecs with only a small fraction of trainable parameters, closely matching full fine-tuning performance. Code is available at this https URL.
https://arxiv.org/abs/2604.10017
Image compression is a ubiquitous component of modern visual pipelines, routinely applied by social media platforms and resource-constrained systems prior to inference. Despite its prevalence, the impact of compression on adversarial robustness remains poorly understood. We study a previously unexplored adversarial setting in which attacks are applied directly in compressed representations, and show that compression can act as an adversarial amplifier for deep image classifiers. Under identical nominal perturbation budgets, compression-aware attacks are substantially more effective than their pixel-space counterparts. We attribute this effect to decision space reduction, whereby compression induces a non-invertible, information-losing transformation that contracts classification margins and increases sensitivity to perturbations. Extensive experiments across standard benchmarks and architectures support our analysis and reveal a critical vulnerability in compression-in-the-loop deployment settings. Code will be released.
https://arxiv.org/abs/2604.06954
With the great success of diffusion models in image generation, diffusion-based image compression is attracting increasing interest. However, due to the random noise introduced during diffusion learning, these methods usually produce reconstructions that deviate from the original images, leading to suboptimal compression results. To address this problem, we propose a Noise Constrained Diffusion (NC-Diffusion) framework for high-fidelity image compression. Unlike existing diffusion-based compression methods that add random Gaussian noise and map it into the image space, the proposed NC-Diffusion formulates the quantization noise originally introduced in learned image compression as the noise in the forward process of diffusion. A noise-constrained diffusion process is then constructed from the ground-truth image to the initial compression result generated with quantization noise. NC-Diffusion overcomes the noise mismatch between compression and diffusion, significantly improving inference efficiency. In addition, an adaptive frequency-domain filtering module is developed to strengthen the skip connections in the U-Net-based diffusion architecture and recover high-frequency details. Moreover, a zero-shot sample-guided enhancement method is designed to further improve image fidelity. Experiments on multiple benchmark datasets demonstrate that our method outperforms existing methods.
https://arxiv.org/abs/2604.06568
Modern image compression methods are typically optimized for the rate--distortion--perception trade-off, whereas their robustness to bit-level corruption is rarely examined. We show that diffusion-based compressors built on the Reverse Channel Coding (RCC) paradigm are substantially more robust to bit flips than classical and learned codecs. We further introduce a more robust variant of Turbo-DDCM that significantly improves robustness while only minimally affecting the rate--distortion--perception trade-off. Our findings suggest that RCC-based compression can yield more resilient compressed representations, potentially reducing reliance on error-correcting codes in highly noisy environments.
https://arxiv.org/abs/2604.05743
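The robustness evaluation described above hinges on injecting bit flips into a compressed payload. A minimal stdlib-only corruption model (our own sketch; the paper's exact corruption protocol is not given in the abstract):

```python
import random

def flip_bits(payload: bytes, n_flips: int, seed: int = 0) -> bytes:
    """Flip n_flips distinct bit positions in a byte string,
    simulating bit-level channel corruption of a bitstream."""
    rng = random.Random(seed)
    data = bytearray(payload)
    for pos in rng.sample(range(len(payload) * 8), n_flips):
        data[pos // 8] ^= 1 << (pos % 8)  # XOR toggles exactly one bit
    return bytes(data)
```

Because XOR is self-inverse, re-applying the same flips restores the original payload, which makes the corruption model easy to sanity-check; decoding the corrupted bitstream with each codec is then what separates fragile from resilient representations.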
Traditional human vision-centric image compression methods are suboptimal for machine vision-centric compression due to differing visual properties and feature characteristics. To address this problem, we propose a Channel Importance-driven learned Image Coding for Machines (CI-ICM), aiming to maximize the performance of machine vision tasks at a given bitrate constraint. First, we propose a Channel Importance Generation (CIG) module to quantify channel importance in machine vision and develop a channel order loss to rank channels in descending order. Second, to properly allocate bitrate among feature channels, we propose a Feature Channel Grouping and Scaling (FCGS) module that non-uniformly groups the feature channels based on their importance and adjusts the dynamic range of each group. Based on FCGS, we further propose a Channel Importance-based Context (CI-CTX) module to allocate bits among feature groups and to preserve higher fidelity in critical channels. Third, to adapt to multiple machine tasks, we propose a Task-Specific Channel Adaptation (TSCA) module to adaptively enhance features for multiple downstream machine tasks. Experimental results on the COCO2017 dataset show that the proposed CI-ICM achieves BD-mAP@50:95 gains of 16.25$\%$ in object detection and 13.72$\%$ in instance segmentation over the established baseline codec. Ablation studies validate the effectiveness of each contribution, and computational complexity analysis demonstrates the practicality of CI-ICM. This work establishes feature-channel optimization for machine vision-centric compression, bridging the gap between image coding and machine perception.
https://arxiv.org/abs/2604.05347
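The channel order loss is described only as ranking channels in descending importance. One common way to express such an ordering constraint (our assumption, not necessarily CI-ICM's exact loss) is a pairwise hinge penalty that is zero precisely when the sequence is non-increasing:

```python
def channel_order_loss(importance, margin=0.0):
    """Penalize every pair where a later channel's importance exceeds
    an earlier one's; zero iff the scores are in descending order."""
    loss = 0.0
    for i in range(len(importance)):
        for j in range(i + 1, len(importance)):
            loss += max(0.0, importance[j] - importance[i] + margin)
    return loss
```

Minimizing this term pushes the importance scores toward a monotone ordering, which is what lets later modules group and truncate channels by index rather than re-sorting per image.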
While neural lossless image compression has advanced significantly with learned entropy models, lossless video compression remains largely unexplored in the neural setting. We present NeuralLVC, a neural lossless video codec that combines masked diffusion with an I/P-frame architecture to exploit temporal redundancy. Our I-frame model compresses individual frames using bijective linear tokenization that guarantees exact pixel reconstruction. The P-frame model compresses temporal differences between consecutive frames, conditioned on the previous decoded frame via a lightweight reference embedding that adds only 1.3% trainable parameters. Group-wise decoding enables controllable speed-compression trade-offs. Our codec is lossless in the input domain: for video, it reconstructs YUV420 planes exactly; for image evaluation, RGB channels are reconstructed exactly. Experiments on 9 Xiph CIF sequences show that NeuralLVC outperforms H.264 and H.265 lossless modes by a significant margin. We verify exact reconstruction through end-to-end encode-decode testing with arithmetic coding. These results suggest that masked diffusion with temporal conditioning is a promising direction for neural lossless video compression.
https://arxiv.org/abs/2604.03353
The rapid progress of large Vision-Language Models (VLMs) has enabled a wide range of applications, such as image understanding and Visual Question Answering (VQA). Query images are often uploaded to the cloud, where VLMs are typically hosted, hence efficient image compression becomes crucial. However, traditional human-centric codecs are suboptimal in this setting because they preserve many task-irrelevant details. Existing Image Coding for Machines (ICM) methods also fall short, as they assume a fixed set of downstream tasks and cannot adapt to prompt-driven VLMs with an open-ended variety of objectives. We propose a lightweight, plug-and-play, prompt-guided prefiltering module to identify image regions most relevant to the text prompt, and consequently to the downstream task. The module preserves important details while smoothing out less relevant areas to improve compression efficiency. It is codec-agnostic and can be applied before conventional and learned encoders. Experiments on several VQA benchmarks show that our approach achieves a 25-50% average bitrate reduction while maintaining the same task accuracy. Our source code is available at this https URL.
https://arxiv.org/abs/2604.00314
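The prefiltering idea, keep prompt-relevant regions sharp and smooth everything else before any codec sees the image, can be sketched with a plain box blur applied outside a relevance mask. The mask source (in the paper, derived from the text prompt) and the filter choice here are illustrative assumptions:

```python
def prefilter(img, mask, k=3):
    """Box-blur pixels outside the relevance mask; keep masked (ROI)
    pixels untouched. img and mask are H x W nested lists."""
    H, W = len(img), len(img[0])
    out = [row[:] for row in img]
    r = k // 2
    for i in range(H):
        for j in range(W):
            if not mask[i][j]:  # background: replace with local mean
                vals = [img[a][b]
                        for a in range(max(0, i - r), min(H, i + r + 1))
                        for b in range(max(0, j - r), min(W, j + r + 1))]
                out[i][j] = sum(vals) / len(vals)
    return out
```

Smoothing the background removes high-frequency content the downstream codec would otherwise spend bits on, which is why the approach is codec-agnostic: any encoder placed after it simply sees an easier signal.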
Transferring large volumes of high-resolution images during wind turbine inspections introduces a bottleneck in assessing and detecting severe defects. Efficient coding must preserve high fidelity in blade regions while aggressively compressing the background. In this work, we propose an end-to-end deep learning framework that jointly performs segmentation and dual-mode (lossy and lossless) compression. The segmentation module accurately identifies the blade region, after which our region-of-interest (ROI) compressor encodes it at superior quality compared to the rest of the image. Unlike conventional ROI schemes that merely allocate more bits to salient areas, our framework integrates: (i) a robust segmentation network (BU-Netv2+P) with a CRF-regularized loss for precise blade localization, (ii) a hyperprior-based autoencoder optimized for lossy compression, and (iii) an extended bits-back coder with hierarchical models for fully lossless blade reconstruction. Furthermore, our ROI framework removes the sequential dependency in bits-back coding by reusing background-coded bits, enabling parallelized and efficient dual-mode compression. To the best of our knowledge, this is the first fully integrated learning-based ROI codec combining segmentation, lossy, and lossless compression, ensuring that subsequent defect detection is not compromised. Experiments on a large-scale wind turbine dataset demonstrate superior compression performance and efficiency, offering a practical solution for automated inspections.
https://arxiv.org/abs/2603.29927
Vision-language models (VLMs) exhibit a systematic bias when confronted with classic optical illusions: they overwhelmingly predict the illusion as "real" regardless of whether the image has been counterfactually modified. We present a tool-guided inference framework for the DataCV 2026 Challenge (Tasks I and II) that addresses this failure mode without any model training. An off-the-shelf vision-language model is given access to a small set of generic image manipulation tools: line drawing, region cropping, side-by-side comparison, and channel isolation, together with an illusion-type-routing system prompt that prescribes which tools to invoke for each perceptual question category. Critically, every tool call produces a new, immutable image resource appended to a persistent registry, so the model can reference and compose any prior annotated view throughout its reasoning chain. Rather than hard-coding illusion-specific modules, this generic-tool-plus-routing design yields strong cross-structural generalization: performance remained consistent from the validation set to a test set containing structurally unfamiliar illusion variants (e.g., Mach Bands rotated from vertical to horizontal stacking). We further report three empirical observations that we believe warrant additional investigation: (i) a strong positive-detection bias likely rooted in imbalanced illusion training data, (ii) a striking dissociation between pixel-accurate spatial reasoning and logical inference over self-generated annotations, and (iii) pronounced sensitivity to image compression artifacts that compounds false positives.
https://arxiv.org/abs/2603.29428
Raw images preserve linear sensor measurements and high bit-depth information crucial for advanced vision tasks and photography applications, yet their storage remains challenging due to large file sizes, varying bit depths, and sensor-dependent characteristics. Existing learned lossless compression methods mainly target 8-bit sRGB images, while raw reconstruction approaches are inherently lossy and rely on camera-specific assumptions. To address these challenges, we introduce RAWIC, a bit-depth-adaptive learned lossless compression framework for Bayer-pattern raw images. We first convert single-channel Bayer data into a four-channel RGGB format and partition it into patches. For each patch, we compute its bit depth and use it as auxiliary input to guide compression. A bit-depth-adaptive entropy model is then designed to estimate patch distributions conditioned on their bit depths. This architecture enables a single model to handle raw images from diverse cameras and bit depths. Experiments show that RAWIC consistently surpasses traditional lossless codecs, achieving an average 7.7% bitrate reduction over JPEG-XL. Our code is available at this https URL.
https://arxiv.org/abs/2603.28105
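The two preprocessing steps RAWIC's abstract describes, packing the single-channel Bayer mosaic into four RGGB planes and computing a per-patch bit depth, are mechanical enough to sketch directly. Function names are ours, and this assumes an RGGB phase with even dimensions:

```python
def pack_rggb(bayer):
    """Split a single-channel RGGB Bayer mosaic (H x W, both even)
    into four half-resolution planes: R, G1, G2, B."""
    H, W = len(bayer), len(bayer[0])
    r  = [[bayer[i][j]         for j in range(0, W, 2)] for i in range(0, H, 2)]
    g1 = [[bayer[i][j + 1]     for j in range(0, W, 2)] for i in range(0, H, 2)]
    g2 = [[bayer[i + 1][j]     for j in range(0, W, 2)] for i in range(0, H, 2)]
    b  = [[bayer[i + 1][j + 1] for j in range(0, W, 2)] for i in range(0, H, 2)]
    return r, g1, g2, b

def bit_depth(values):
    """Smallest bit depth that represents the patch's maximum value;
    used as the auxiliary conditioning input."""
    return max(values).bit_length()
```

Conditioning the entropy model on this per-patch bit depth is what lets one model serve cameras with 10-, 12-, or 14-bit sensors without retraining.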
High-resolution sensors are critical for robust autonomous perception but impose a severe memory wall on battery-constrained electric vehicles. In these systems, data movement energy often outweighs computation. Traditional image compression is ill-suited as it is semantically blind and optimizes for storage rather than bus switching activity. We propose MotiMem, a hardware-software co-designed interface. Exploiting temporal coherence, MotiMem uses lightweight 2D Motion Propagation to dynamically identify Regions of Interest (RoI). Complementing this, a Hybrid Sparsity-Aware Coding scheme leverages adaptive inversion and truncation to induce bit-level sparsity. Extensive experiments across nuScenes, Waymo, and KITTI with 16 detection models demonstrate that MotiMem reduces memory-interface dynamic energy by approximately 43 percent while retaining approximately 93 percent of the object detection accuracy, establishing a new Pareto frontier significantly superior to standard codecs like JPEG and WebP.
https://arxiv.org/abs/2603.27108
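MotiMem's "adaptive inversion" for cutting bus switching activity is in the spirit of classic bus-invert coding: transmit each word either plain or bit-inverted, whichever toggles fewer bus lines relative to the previous bus state, at the cost of one flag bit per word. The sketch below is the textbook scheme, not MotiMem's exact coder:

```python
def bus_invert(words, width=8):
    """Bus-invert coding: for each word, choose the polarity that
    minimizes Hamming distance to the previous bus state. Returns
    (transmitted_word, invert_flag) pairs."""
    mask = (1 << width) - 1
    prev = 0  # assume the bus starts at all zeros
    encoded = []
    for w in words:
        plain = bin(prev ^ w).count("1")            # toggles if sent as-is
        inverted = bin(prev ^ (w ^ mask)).count("1")  # toggles if inverted
        if inverted < plain:
            w ^= mask
            encoded.append((w, 1))
        else:
            encoded.append((w, 0))
        prev = w
    return encoded
```

For an 8-bit bus this caps the per-word toggle count at 4 (plus the flag line), which is exactly the kind of bit-level switching reduction that translates into interface energy savings.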
The rapid growth of hyperspectral data archives in remote sensing (RS) necessitates effective compression methods for storage and transmission. Recent advances in learning-based hyperspectral image (HSI) compression have significantly enhanced both reconstruction fidelity and compression efficiency. However, existing methods typically adapt variational image compression models designed for natural images, without adequately accounting for the distinct spatio-spectral redundancies inherent in HSIs. In particular, they lack explicit architectural designs to balance spatial and spectral feature learning, limiting their ability to effectively leverage the unique characteristics of hyperspectral data. To address this issue, we introduce spatio-spectral variational hyperspectral image compression architecture (HyVIC). The proposed model comprises four main components: 1) adjustable spatio-spectral encoder; 2) spatio-spectral hyperencoder; 3) spatio-spectral hyperdecoder; and 4) adjustable spatio-spectral decoder. We demonstrate that the trade-off between spatial and spectral feature learning is crucial for the reconstruction fidelity, and therefore present a metric-driven strategy to systematically select the hyperparameters of the proposed model. Extensive experiments on two benchmark datasets demonstrate the effectiveness of the proposed model, achieving high spatial and spectral reconstruction fidelity across a wide range of compression ratios (CRs) and improving the state of the art by up to 4.66dB in terms of BD-PSNR. Based on our results, we offer insights and derive practical guidelines to guide future research directions in learning-based variational HSI compression. Our code and pre-trained model weights are publicly available at this https URL .
https://arxiv.org/abs/2603.26468
Efficient image compression relies on modeling both local and global redundancy. Most state-of-the-art (SOTA) learned image compression (LIC) methods are based on CNNs or Transformers, which are inherently rigid. Standard CNN kernels and window-based attention mechanisms impose fixed receptive fields and static connectivity patterns, which potentially couple non-redundant pixels simply due to their proximity in Euclidean space. This rigidity limits the model's ability to adaptively capture spatially varying redundancy across the image, particularly at the global level. To overcome these limitations, we propose a content-adaptive image compression framework based on Graph Neural Networks (GNNs). Specifically, our approach constructs dual-scale graphs that enable flexible, data-driven receptive fields. Furthermore, we introduce adaptive connectivity by dynamically adjusting the number of neighbors for each node based on local content complexity. These innovations empower our Graph-based Learned Image Compression (GLIC) model to effectively model diverse redundancy patterns across images, leading to more efficient and adaptive compression. Experiments demonstrate that GLIC achieves state-of-the-art performance, with BD-rate reductions of 19.29%, 21.69%, and 18.71% relative to VTM-9.1 on Kodak, Tecnick, and CLIC, respectively. Code will be released at this https URL.
https://arxiv.org/abs/2603.25316
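GLIC's adaptive connectivity, varying each node's neighbor count with local content complexity, can be illustrated with a toy k-NN graph builder. The complexity proxy below (each node's mean feature distance to the others) and the linear k schedule are purely illustrative stand-ins for whatever the model actually learns:

```python
def adaptive_knn(feats, k_min=1, k_max=3):
    """Build a directed k-NN graph where 'complex' nodes (here: those
    with large mean feature distance to the rest) get more neighbors."""
    n = len(feats)

    def dist(a, b):  # squared Euclidean distance between feature vectors
        return sum((x - y) ** 2 for x, y in zip(a, b))

    comp = [sum(dist(feats[i], feats[j]) for j in range(n) if j != i) / (n - 1)
            for i in range(n)]
    lo, hi = min(comp), max(comp)
    edges = {}
    for i in range(n):
        t = 0.0 if hi == lo else (comp[i] - lo) / (hi - lo)
        k = k_min + round(t * (k_max - k_min))  # interpolate neighbor budget
        order = sorted((j for j in range(n) if j != i),
                       key=lambda j: dist(feats[i], feats[j]))
        edges[i] = order[:k]
    return edges
```

The point of the sketch is the mechanism, not the proxy: once k is a per-node function of content, the receptive field stops being a fixed window and adapts to spatially varying redundancy.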
Given the popularity of 360° images on social media platforms, 360° image compression becomes a critical technology for media storage and transmission. Conventional 360° image compression pipelines project the spherical image into a single 2D plane, leading to issues of oversampling and distortion. In this paper, we propose a novel viewport-based neural compression pipeline for 360° images. By replacing the image projection in conventional 360° image compression pipelines with viewport extraction and efficiently compressing multiple viewports, the proposed pipeline minimizes the inherent oversampling and distortion issues. However, viewport extraction impedes information sharing between multiple viewports during compression, causing the loss of global information about the spherical image. To tackle this global information loss, we design a neural viewport codec to capture global prior information across multiple viewports and maximally compress the viewport data. The viewport codec is empowered by a transformer-based ViewPort ConText (VPCT) module that can be integrated with canonical learning-based 2D image compression structures. We compare the proposed pipeline with existing 360° image compression models and conventional 360° image compression pipelines built on learning-based 2D image codecs and standard hand-crafted codecs. Results show that our pipeline saves an average of $14.01\%$ in bit consumption compared to the best-performing 360° image compression methods without compromising quality. The proposed VPCT-based codec also outperforms existing 2D image codecs in the viewport-based neural compression pipeline. Our code can be found at: this https URL.
https://arxiv.org/abs/2603.22776
Whole slide imaging (WSI) has transformed digital pathology by enabling computational analysis of gigapixel histopathology images. Recent foundation model advances have accelerated progress in computational pathology, facilitating joint reasoning across pathology images, clinical reports, and structured data. Despite this progress, challenges remain: the extreme resolution of WSIs creates computational hurdles for visual learning; limited expert annotations constrain supervised approaches; integrating multimodal information while preserving biological interpretability remains difficult; and the opacity of modeling ultra-long visual sequences hinders clinical transparency. This review comprehensively surveys recent advances in multimodal computational pathology. We systematically analyze four research directions: (1) self-supervised representation learning and structure-aware token compression for WSIs; (2) multimodal data generation and augmentation; (3) parameter-efficient adaptation and reasoning-enhanced few-shot learning; and (4) multi-agent collaborative reasoning for trustworthy diagnosis. We specifically examine how token compression enables cross-scale modeling and how multi-agent mechanisms simulate a pathologist's "Chain of Thought" across magnifications to achieve uncertainty-aware evidence fusion. Finally, we discuss open challenges and argue that future progress depends on unified multimodal frameworks integrating high-resolution visual data with clinical and biomedical knowledge to support interpretable and safe AI-assisted diagnosis.
https://arxiv.org/abs/2603.18660
Recent diffusion-based extreme image compression methods have demonstrated remarkable performance at ultra-low bitrates. However, most approaches require training separate diffusion models for each target bitrate, resulting in substantial computational overhead and hindering practical deployment. Meanwhile, recent studies have shown that joint super-resolution can serve as an effective approach for enhancing low-bitrate reconstruction. However, when moving toward ultra-low bitrate regimes, these methods struggle due to severe information loss, and their reliance on fixed super-resolution scales prevents flexible adaptation across diverse bitrates. To address these limitations, we propose ASSR-EIC, a novel image compression framework that leverages arbitrary-scale super-resolution (ASSR) to support variable-rate extreme image compression (EIC). An arbitrary-scale downsampling module is introduced at the encoder side to provide controllable rate reduction, while a diffusion-based, joint degradation-aware ASSR decoder enables rate-adaptive reconstruction within a single model. We exploit the compression- and rescaling-aware diffusion prior to guide the reconstruction, yielding high-fidelity and highly realistic restoration across diverse compression and rescaling settings. Specifically, we design a global compression-rescaling adaptor that offers holistic guidance for rate adaptation, and a local compression-rescaling modulator that dynamically balances generative and fidelity-oriented behaviors to achieve fine-grained, bitrate-adaptive detail restoration. To further enhance reconstruction quality, we introduce a dual semantic-enhanced design. Extensive experiments demonstrate that ASSR-EIC delivers state-of-the-art performance in extreme image compression while simultaneously supporting flexible bitrate control and adaptive rate-dependent reconstruction.
https://arxiv.org/abs/2603.17408
Existing remote sensing image compression methods still struggle to balance high compression efficiency with the preservation of fine details and task-relevant information. Meanwhile, high-resolution drone imagery offers valuable structural details for urban monitoring and disaster assessment, but large-area datasets can easily reach hundreds of gigabytes, creating significant challenges for storage and long-term management. In this paper, we propose a PPO-based bitrate allocation Conditional Diffusion Compression (PCDC) framework. PCDC integrates a conditional diffusion decoder with a PPO-based block-wise bitrate allocation strategy to achieve high compression ratios while maintaining strong perceptual performance. We also release a high-resolution drone image dataset with rich structural details, captured at a consistent low altitude over residential neighborhoods in coastal urban areas. Experimental results show compression ratios of 19.3x on DIV2K and 21.2x on the drone image dataset. Moreover, downstream object detection experiments demonstrate that the reconstructed images preserve task-relevant information with negligible performance loss.
https://arxiv.org/abs/2603.15365
We present a novel paradigm for ultra-low-bitrate image compression (ULB-IC) that exploits the ``temporal'' evolution in generative image compression. Specifically, we define an explicit intermediate state during decoding: a compact anchor frame, which preserves the scene geometry and semantic layout while discarding high-frequency details. We then reinterpret generative decoding as a virtual temporal transition from this anchor to the final reconstructed image. To model this progression, we leverage a pretrained video diffusion model (VDM) as a temporal prior: the anchor frame serves as the initial frame and the original image as the target frame, transforming the decoding process into a next-frame prediction task. In contrast to image diffusion-based ULB-IC models, our decoding proceeds from a visible, semantically faithful anchor, which improves both fidelity and realism for perceptual image compression. Extensive experiments demonstrate that our method achieves superior objective and subjective performance. On the CLIC2020 test set, our method achieves over \textbf{50\% bitrate savings} across LPIPS, DISTS, FID, and KID compared to DiffC, while also delivering a significant decoding speedup of up to $\times$5. Code will be released later.
https://arxiv.org/abs/2603.15129