From Paleolithic cave paintings to Impressionism, human painting has evolved to depict increasingly complex and detailed scenes, conveying more nuanced messages. This paper attempts to reproduce this artistic capability by simulating the evolutionary pressures that enhance visual communication efficiency. Specifically, we present a model with a stroke branch and a palette branch that together simulate human-like painting. The palette branch learns a limited colour palette, while the stroke branch parameterises each stroke using Bézier curves to render an image, which is subsequently evaluated by a high-level recognition module. We quantify the efficiency of visual communication by measuring the recognition accuracy achieved with machine vision. The model then optimises the control points and colour choices for each stroke to maximise recognition accuracy with minimal strokes and colours. Experimental results show that our model achieves superior performance in high-level recognition tasks, delivering artistic expression and aesthetic appeal, especially in abstract sketches. Additionally, our approach shows promise as an efficient bit-level image compression technique, outperforming traditional methods.
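The stroke branch's Bézier parameterisation can be illustrated with a minimal sketch (not the paper's renderer; the control points and sample count below are arbitrary):

```python
import numpy as np

def cubic_bezier(p0, p1, p2, p3, n=64):
    """Sample a cubic Bezier stroke at n evenly spaced parameter values.

    Each control point is a 2-element (x, y) sequence; the returned array
    has shape (n, 2). Uses the Bernstein-polynomial form of the curve.
    """
    t = np.linspace(0.0, 1.0, n)[:, None]          # (n, 1) parameter column
    p0, p1, p2, p3 = map(np.asarray, (p0, p1, p2, p3))
    return ((1 - t) ** 3 * p0
            + 3 * (1 - t) ** 2 * t * p1
            + 3 * (1 - t) * t ** 2 * p2
            + t ** 3 * p3)

# A stroke from (0, 0) to (1, 0) bulging upward toward its control points.
pts = cubic_bezier([0, 0], [0.25, 1], [0.75, 1], [1, 0])
```

In the paper's setting, the four control points per stroke are the quantities being optimised against the recognition objective.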
https://arxiv.org/abs/2501.04966
Image compression models have long struggled with adaptability and generalization, as the decoded bitstream typically serves only human or machine needs and fails to preserve information for unseen visual tasks. Therefore, this paper innovatively introduces supervision obtained from multimodal pre-training models and incorporates adaptive multi-objective optimization tailored to support both human visual perception and machine vision simultaneously with a single bitstream, denoted as Unified and Generalized Image Coding for Machine (UG-ICM). Specifically, to free compression models from reliance on downstream task supervision, we introduce Contrastive Language-Image Pre-training (CLIP) models into the training constraint for improved generalization. Global-to-instance-wise CLIP supervision is applied to help obtain hierarchical semantics that make models more generalizable for tasks relying on information of different granularity. Furthermore, to support both human and machine vision with a single unifying bitstream, we incorporate a conditional decoding strategy that takes human or machine preferences as conditions, enabling the bitstream to be decoded into different versions for the corresponding preferences. As such, our proposed UG-ICM is fully trained in a self-supervised manner, i.e., without awareness of any specific downstream models or tasks. Extensive experiments show that the proposed UG-ICM achieves remarkable improvements in various unseen machine analytics tasks, while simultaneously providing perceptually satisfying images.
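A CLIP-style training constraint ultimately maximises cosine similarity between normalised image and text embeddings; a minimal sketch, with random vectors standing in for real CLIP encoder outputs:

```python
import numpy as np

def clip_similarity(image_emb, text_emb):
    """Cosine similarity between L2-normalised embeddings, the quantity
    CLIP-style supervision pushes up for matching pairs."""
    image_emb = image_emb / np.linalg.norm(image_emb)
    text_emb = text_emb / np.linalg.norm(text_emb)
    return float(image_emb @ text_emb)

# Hypothetical 512-d embeddings standing in for CLIP encoder outputs.
rng = np.random.default_rng(0)
img_emb, txt_emb = rng.standard_normal(512), rng.standard_normal(512)
loss = 1.0 - clip_similarity(img_emb, txt_emb)  # smaller when decoded image matches the caption
```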
https://arxiv.org/abs/2501.04579
While most existing neural image compression (NIC) and neural video compression (NVC) methodologies have achieved remarkable success, their optimization is primarily focused on human visual perception. However, with the rapid development of artificial intelligence, many images and videos will be used for various machine vision tasks. Consequently, such existing compression methodologies cannot achieve competitive performance in machine vision. In this work, we introduce an efficient adaptive compression (EAC) method tailored for both human perception and multiple machine vision tasks. Our method involves two key modules: 1) an adaptive compression mechanism that adaptively selects several subsets from latent features to balance the optimizations for multiple machine vision tasks (e.g., segmentation and detection) and human vision; 2) a task-specific adapter that uses a parameter-efficient delta-tuning strategy to stimulate the comprehensive downstream analytical networks for specific machine vision tasks. By using the above two modules, we can optimize the bit-rate costs and improve machine vision performance. In general, our proposed EAC can seamlessly integrate with existing NIC (i.e., Ballé2018 and Cheng2020) and NVC (i.e., DVC and FVC) methods. Extensive evaluation on various benchmark datasets (i.e., VOC2007, ILSVRC2012, VOC2012, COCO, UCF101, and DAVIS) shows that our method enhances performance for multiple machine vision tasks while maintaining the quality of human vision.
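The idea of selecting latent-feature subsets per task can be sketched with boolean channel masks; the shapes and masks below are illustrative, not EAC's actual selection mechanism:

```python
import numpy as np

def select_latent_subset(latent, mask):
    """Zero out latent channels outside a task's subset.

    latent: (C, H, W) feature tensor; mask: boolean vector of length C.
    Only the selected channels contribute bits for (and are decoded by)
    the corresponding task branch.
    """
    return latent * mask[:, None, None]

latent = np.random.default_rng(1).standard_normal((8, 4, 4))
human_mask = np.array([1, 1, 1, 1, 1, 1, 0, 0], dtype=bool)  # full-quality subset
seg_mask = np.array([1, 1, 0, 0, 0, 0, 1, 0], dtype=bool)    # hypothetical segmentation subset
seg_latent = select_latent_subset(latent, seg_mask)
```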
https://arxiv.org/abs/2501.04329
The vast volume of medical image data necessitates efficient compression techniques to support remote healthcare services. This paper explores Region of Interest (ROI) coding to address the balance between compression rate and image quality. By leveraging UNET segmentation on the Brats 2020 dataset, we accurately identify tumor regions, which are critical for diagnosis. These regions are then subjected to High Efficiency Video Coding (HEVC) for compression, enhancing compression rates while preserving essential diagnostic information. This approach ensures that critical image regions maintain their quality, while non-essential areas are compressed more. Our method optimizes storage space and transmission bandwidth, meeting the demands of telemedicine and large-scale medical imaging. Through this technique, we provide a robust solution that maintains the integrity of vital data and improves the efficiency of medical image handling.
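The ROI principle, fine quantisation inside the diagnostic region and coarse quantisation elsewhere, can be sketched independently of HEVC; the step sizes below are arbitrary:

```python
import numpy as np

def roi_quantize(image, roi_mask, q_roi=2, q_bg=32):
    """Quantise the diagnostically important region finely and the
    background coarsely. image: uint8 array; roi_mask: boolean array."""
    q = np.where(roi_mask, q_roi, q_bg).astype(np.int32)
    return ((image.astype(np.int32) // q) * q).astype(np.uint8)

img = np.arange(64, dtype=np.uint8).reshape(8, 8)
mask = np.zeros((8, 8), dtype=bool)
mask[2:6, 2:6] = True                      # hypothetical tumour region
out = roi_quantize(img, mask)
```

Coarser steps in the background shrink the entropy of the coded residual there, which is where the bitrate saving comes from.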
https://arxiv.org/abs/2501.02895
The growing field of remote sensing faces a challenge: the ever-increasing size and volume of imagery data are exceeding the storage and transmission capabilities of satellite platforms. Efficient compression of remote sensing imagery is a critical solution to alleviate these burdens on satellites. However, existing compression methods are often too computationally expensive for satellites. With the continued advancement of compressed sensing theory, single-pixel imaging emerges as a powerful tool that brings new possibilities for on-orbit image compression. Yet it still suffers from prolonged imaging times and the inability to perform high-resolution imaging, hindering its practical application. This paper advances the study of compressed sensing in remote sensing image compression, proposing Block Modulated Imaging (BMI). By requiring only a single exposure, BMI significantly enhances imaging acquisition speeds. Additionally, BMI obviates the need for digital micromirror devices and surpasses limitations in image resolution. Furthermore, we propose a novel decoding network specifically designed to reconstruct images compressed under the BMI framework. Leveraging gated 3D convolutions and promoting efficient information flow across stages through a Two-Way Cross-Attention module, our decoding network exhibits demonstrably superior reconstruction performance. Extensive experiments conducted on multiple renowned remote sensing datasets unequivocally demonstrate the efficacy of our proposed method. To further validate its practical applicability, we developed and tested a prototype of the BMI-based camera, which has shown promising potential for on-orbit image compression. The code is available at this https URL.
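One reading of block modulation, a hypothetical sketch rather than the paper's optical model, is to multiply each image block by a random pattern and superimpose the modulated blocks into a single block-sized measurement captured in one exposure:

```python
import numpy as np

def block_modulated_measure(image, block=4, seed=0):
    """Single-exposure compressed measurement: split the image into
    non-overlapping blocks, multiply each by a random +/-1 modulation
    pattern, and sum the modulated blocks into one block-sized frame."""
    h, w = image.shape
    rng = np.random.default_rng(seed)
    out = np.zeros((block, block))
    for i in range(0, h, block):
        for j in range(0, w, block):
            pattern = rng.choice([-1.0, 1.0], size=(block, block))
            out += image[i:i + block, j:j + block] * pattern
    return out

meas = block_modulated_measure(np.ones((8, 8)))   # 64 pixels -> 16 measurements
```

The decoding network's job is the inverse problem: recovering the full-resolution image from this compressed measurement given the known patterns.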
https://arxiv.org/abs/2412.18417
While learned image compression methods have achieved impressive results in either human visual perception or machine vision tasks, they are often specialized for only one domain. This drawback limits their versatility and generalizability across scenarios and also requires retraining to adapt to new applications, a process that adds significant complexity and cost in real-world scenarios. In this study, we introduce an innovative semantics DISentanglement and COmposition VERsatile codec (DISCOVER) to simultaneously enhance human-eye perception and machine vision tasks. The approach derives a set of labels per task through multimodal large models, to which grounding models are then applied for precise localization, enabling a comprehensive understanding and disentanglement of image components at the encoder side. At the decoding stage, a comprehensive reconstruction of the image is achieved by leveraging these encoded components alongside priors from generative models, thereby optimizing performance for both human visual perception and machine-based analytical tasks. Extensive experimental evaluations substantiate the robustness and effectiveness of DISCOVER, demonstrating superior performance in fulfilling the dual objectives of human and machine vision requirements.
https://arxiv.org/abs/2412.18158
Learned lossless image compression has achieved significant advancements in recent years. However, existing methods often rely on training amortized generative models on massive datasets, resulting in sub-optimal probability distribution estimates for specific testing images during the encoding process. To address this challenge, we explore the connection between the Minimum Description Length (MDL) principle and Parameter-Efficient Transfer Learning (PETL), leading to the development of a novel content-adaptive approach for learned lossless image compression, dubbed CALLIC. Specifically, we first propose a content-aware autoregressive self-attention mechanism that leverages convolutional gating operations, termed Masked Gated ConvFormer (MGCF), and pretrain MGCF on the training dataset. Cache then Crop Inference (CCI) is proposed to accelerate the coding process. During encoding, we decompose pre-trained layers, including depth-wise convolutions, using low-rank matrices and then adapt the incremental weights to the testing image by Rate-guided Progressive Fine-Tuning (RPFT). RPFT fine-tunes on a gradually increasing set of patches, sorted in descending order of estimated entropy, optimizing the learning process and reducing adaptation time. Extensive experiments across diverse datasets demonstrate that CALLIC sets a new state of the art (SOTA) for learned lossless image compression.
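RPFT's entropy-descending patch schedule can be sketched as follows; the histogram-based entropy estimate here is a simple stand-in for the model's learned rate estimate:

```python
import numpy as np

def patches_by_entropy(image, patch=8):
    """Split an image into non-overlapping patches and return them sorted
    by a histogram entropy estimate, highest entropy first, mimicking a
    rate-guided fine-tuning schedule."""
    h, w = image.shape
    scored = []
    for i in range(0, h, patch):
        for j in range(0, w, patch):
            p = image[i:i + patch, j:j + patch]
            hist, _ = np.histogram(p, bins=16, range=(0, 256))
            prob = hist / hist.sum()
            prob = prob[prob > 0]
            ent = -float(np.sum(prob * np.log2(prob)))   # bits per symbol estimate
            scored.append((ent, p))
    scored.sort(key=lambda t: t[0], reverse=True)        # highest entropy first
    return [p for _, p in scored]

rng = np.random.default_rng(0)
img = np.zeros((16, 16))
img[:8] = rng.integers(0, 256, size=(8, 16))   # noisy top half, flat bottom half
ordered = patches_by_entropy(img)
```

Fine-tuning on the highest-entropy patches first concentrates adaptation on the regions that dominate the coding cost.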
https://arxiv.org/abs/2412.17464
In the field of autonomous driving, a variety of sensor data types exist, each representing different modalities of the same scene. Therefore, it is feasible to utilize data from other sensors to facilitate image compression. However, few techniques have explored the potential benefits of utilizing inter-modality correlations to enhance the image compression performance. In this paper, motivated by the recent success of learned image compression, we propose a new framework that uses sparse point clouds to assist in learned image compression in the autonomous driving scenario. We first project the 3D sparse point cloud onto a 2D plane, resulting in a sparse depth map. Utilizing this depth map, we proceed to predict camera images. Subsequently, we use these predicted images to extract multi-scale structural features. These features are then incorporated into the learned image compression pipeline as additional information to improve compression performance. Our proposed framework is compatible with various mainstream learned image compression models, and we validate our approach using different existing image compression methods. The experimental results show that incorporating point cloud assistance into the compression pipeline consistently enhances the performance.
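The first step, projecting the sparse point cloud into a sparse depth map, can be sketched with a pinhole camera model; the intrinsics below are hypothetical, and the points are assumed to be already expressed in the camera frame:

```python
import numpy as np

def project_points(points, K, img_size):
    """Project 3-D points (N, 3) in the camera frame onto the image plane
    with intrinsics K, producing a sparse depth map (zeros where no
    point lands)."""
    h, w = img_size
    depth = np.zeros((h, w))
    z = points[:, 2]
    keep = z > 0                                # only points in front of the camera
    uvw = (K @ points[keep].T).T
    u = (uvw[:, 0] / uvw[:, 2]).astype(int)
    v = (uvw[:, 1] / uvw[:, 2]).astype(int)
    ok = (u >= 0) & (u < w) & (v >= 0) & (v < h)
    depth[v[ok], u[ok]] = z[keep][ok]
    return depth

K = np.array([[100., 0., 32.], [0., 100., 24.], [0., 0., 1.]])  # illustrative intrinsics
pts3d = np.array([[0., 0., 2.]])                                # one LiDAR return, 2 m ahead
depth = project_points(pts3d, K, (48, 64))
```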
https://arxiv.org/abs/2412.15752
Signal compression based on implicit neural representation (INR) is an emerging technique for representing multimedia signals with a small number of bits. While INR-based signal compression achieves high-quality reconstruction for relatively low-resolution signals, the accuracy of high-frequency details is significantly degraded with a small model. To improve the compression efficiency of INR, we introduce quantum INR (quINR), which leverages the exponentially rich expressivity of quantum neural networks for data compression. Evaluations on several benchmark datasets show that the proposed quINR-based compression can improve rate-distortion performance in image compression compared with traditional codecs and classic INR-based coding methods, with gains of up to 1.2 dB.
https://arxiv.org/abs/2412.19828
Recent advances in Artificial Intelligence Generated Content (AIGC) have garnered significant interest, accompanied by an increasing need to transmit and compress the vast number of AI-generated images (AIGIs). However, there is a noticeable deficiency in research focused on compression methods for AIGIs. To address this critical gap, we introduce a scalable cross-modal compression framework that incorporates multiple human-comprehensible modalities, designed to efficiently capture and relay essential visual information for AIGIs. In particular, our framework encodes images into a layered bitstream consisting of a semantic layer that delivers high-level semantic information through text prompts; a structural layer that captures spatial details using edge or skeleton maps; and a texture layer that preserves local textures via a colormap. Utilizing Stable Diffusion as the backend, the framework leverages these multimodal priors for image generation, effectively functioning as a decoder when these priors are encoded. Qualitative and quantitative results show that our method proficiently restores both semantic and visual details, competing against baseline approaches at extremely low bitrates (<0.02 bpp). Additionally, our framework facilitates downstream editing applications without requiring full decoding, thereby paving a new direction for future research in AIGI compression.
https://arxiv.org/abs/2412.12982
Highly efficient image compression is a critical requirement. In scenarios where multiple modalities of data are captured by different sensors, the auxiliary information from other modalities is not fully leveraged by existing image-only codecs, leading to suboptimal compression efficiency. In this paper, we improve image compression performance with the assistance of point clouds, which are widely adopted in the area of autonomous driving. We first unify the data representation for both modalities to facilitate data processing. Then, we propose the point cloud-assisted neural image codec (PCA-NIC) to enhance the preservation of image texture and structure by utilizing the high-dimensional point cloud information. We further introduce a multi-modal feature fusion transform module (MMFFT) to capture more representative image features and to remove redundant information across channels and modalities that is not relevant to the image content. Our work is the first to improve image compression performance using point clouds and achieves state-of-the-art performance.
https://arxiv.org/abs/2412.11771
Neural image compression often faces a challenging trade-off among rate, distortion and perception. While most existing methods typically focus on either achieving high pixel-level fidelity or optimizing for perceptual metrics, we propose a novel approach that simultaneously addresses both aspects for a fixed neural image codec. Specifically, we introduce a plug-and-play module at the decoder side that leverages a latent diffusion process to transform the decoded features, enhancing either low distortion or high perceptual quality without altering the original image compression codec. Our approach facilitates fusion of original and transformed features without additional training, enabling users to flexibly adjust the balance between distortion and perception during inference. Extensive experimental results demonstrate that our method significantly enhances the pretrained codecs with a wide, adjustable distortion-perception range while maintaining their original compression capabilities. For instance, we can achieve more than 150% improvement in LPIPS-BDRate without sacrificing more than 1 dB in PSNR.
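The training-free fusion of original and transformed features reduces, in its simplest form, to a user-weighted interpolation at inference time; a minimal sketch, with `alpha` standing in for the user's distortion-perception knob:

```python
import numpy as np

def blend_features(f_distortion, f_perception, alpha):
    """Linearly fuse decoded features oriented toward low distortion with
    diffusion-transformed features oriented toward high perceptual
    quality; alpha=1 favours fidelity, alpha=0 favours realism."""
    return alpha * f_distortion + (1.0 - alpha) * f_perception

f_d = np.array([1.0, 0.0])   # placeholder decoded features
f_p = np.array([0.0, 1.0])   # placeholder diffusion-transformed features
midpoint = blend_features(f_d, f_p, 0.5)
```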
https://arxiv.org/abs/2412.11379
Diffusion probabilistic models have achieved mainstream success in many generative modeling tasks, from image generation to inverse problem solving. A distinct feature of these models is that they correspond to deep hierarchical latent variable models optimizing a variational evidence lower bound (ELBO) on the data likelihood. Drawing on a basic connection between likelihood modeling and compression, we explore the potential of diffusion models for progressive coding, resulting in a sequence of bits that can be incrementally transmitted and decoded with progressively improving reconstruction quality. Unlike prior work based on Gaussian diffusion or conditional diffusion models, we propose a new form of diffusion model with uniform noise in the forward process, whose negative ELBO corresponds to the end-to-end compression cost using universal quantization. We obtain promising first results on image compression, achieving competitive rate-distortion and rate-realism results on a wide range of bit-rates with a single model, bringing neural codecs a step closer to practical deployment.
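Universal quantization, the mechanism that lets the uniform-noise ELBO match an end-to-end compression cost, can be sketched as rounding with a shared dither (encoder and decoder are assumed to draw `u` from a synchronised PRNG):

```python
import numpy as np

def universal_quantize(x, rng):
    """Universal quantisation: subtract a shared uniform dither, round,
    then add the dither back. The reconstruction error is uniform on
    [-0.5, 0.5] independently of x, matching additive uniform noise."""
    u = rng.uniform(-0.5, 0.5, size=np.shape(x))
    return np.round(x + u) - u

x = np.linspace(-3, 3, 1001)
xhat = universal_quantize(x, np.random.default_rng(0))
```

Because the error statistics are identical to the uniform noise used in the forward process, the negative ELBO of such a model directly bounds the coding cost.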
https://arxiv.org/abs/2412.10935
Prevalent lossy image compression schemes can be divided into: 1) explicit image compression (EIC), including traditional standards and neural end-to-end algorithms; 2) implicit image compression (IIC) based on implicit neural representations (INR). The former is approaching an impasse, with bitrate reductions leveling off at the cost of tremendous complexity, while the latter suffers from over-smoothed reconstruction quality as well as lengthy decoder models. In this paper, we propose an innovative paradigm, which we dub \textbf{Unicorn} (\textbf{U}nified \textbf{N}eural \textbf{I}mage \textbf{C}ompression with \textbf{O}ne \textbf{N}umber \textbf{R}econstruction). By conceptualizing images as index-image pairs and learning the inherent distribution of such pairs in a subtle neural network model, Unicorn can reconstruct a visually pleasing image from randomly generated noise using only one index number. The neural model serves as the unified decoder of images, while the noise and indexes correspond to explicit representations. As a proof of concept, we propose an effective and efficient prototype of Unicorn based on latent diffusion models with tailored model designs. Quantitative and qualitative experimental results demonstrate that our prototype achieves significant bitrate reductions compared with EIC and IIC algorithms. More impressively, benefiting from the unified decoder, our compression ratio escalates as the quantity of images increases. We envision that more advanced model designs will endow Unicorn with greater potential in image compression. We will release our code in \url{this https URL}.
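The "one number" idea rests on the fact that pseudo-random noise is fully determined by its seed; a minimal sketch (shape and distribution chosen arbitrarily):

```python
import numpy as np

def noise_from_index(index, shape=(4, 4)):
    """Regenerate the explicit representation from a single integer: the
    index seeds a PRNG, so transmitting one number suffices to reproduce
    the noise the shared decoder starts from."""
    return np.random.default_rng(index).standard_normal(shape)

a = noise_from_index(42)
b = noise_from_index(42)     # identical: nothing but the index was stored
```

Since the heavy neural decoder is shared across all images, the per-image cost amortises toward the index alone as the collection grows, which is why the reported compression ratio improves with the number of images.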
https://arxiv.org/abs/2412.08210
Semantic communications provide significant performance gains over traditional communications by transmitting task-relevant semantic features through wireless channels. However, most existing studies rely on end-to-end (E2E) training of neural-type encoders and decoders to ensure effective transmission of these semantic features. To enable semantic communications without relying on E2E training, this paper presents a vision transformer (ViT)-based semantic communication system with importance-aware quantization (IAQ) for wireless image transmission. The core idea of the presented system is to leverage the attention scores of a pretrained ViT model to quantify the importance levels of image patches. Based on this idea, our IAQ framework assigns different quantization bits to image patches based on their importance levels. This is achieved by formulating a weighted quantization error minimization problem, where the weight is set to be an increasing function of the attention score. Then, an optimal incremental allocation method and a low-complexity water-filling method are devised to solve the formulated problem. Our framework is further extended for realistic digital communication systems by modifying the bit allocation problem and the corresponding allocation methods based on an equivalent binary symmetric channel (BSC) model. Simulations on single-view and multi-view image classification tasks show that our IAQ framework outperforms conventional image compression methods in both error-free and realistic communication scenarios.
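The optimal incremental allocation can be sketched greedily: under the standard weighted error model sum_i w_i * 2**(-2*b_i), hand out bits one at a time to the patch with the largest marginal error reduction. The weights below are illustrative attention-derived values, not the paper's exact formulation:

```python
import numpy as np

def incremental_bit_allocation(weights, total_bits, max_bits=8):
    """Greedy allocation for the weighted quantisation error
    sum_i w_i * 2**(-2*b_i): each of the total_bits goes to the patch
    with the largest marginal error reduction."""
    bits = np.zeros(len(weights), dtype=int)
    for _ in range(total_bits):
        # adding one bit to patch i reduces its error by w_i * 2^(-2 b_i) * 3/4
        gain = weights * (2.0 ** (-2 * bits)) * 0.75
        gain[bits >= max_bits] = -np.inf
        bits[np.argmax(gain)] += 1
    return bits

alloc = incremental_bit_allocation(np.array([4.0, 1.0, 1.0]), total_bits=4)
```

Because the marginal gains are decreasing in b_i, this greedy procedure is optimal for this error model, which is the usual justification for incremental (and water-filling) allocation schemes.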
https://arxiv.org/abs/2412.06038
We present UniMIC, a universal multi-modality image compression framework, intended to unify rate-distortion-perception (RDP) optimization for multiple image codecs simultaneously by mining cross-modality generative priors. Unlike most existing works that need to design and optimize image codecs from scratch, our UniMIC introduces a visual codec repository, which incorporates a number of representative image codecs and directly uses them as the basic codecs for various practical applications. Moreover, we propose multi-grained textual coding, where variable-length content prompts and compression prompts are designed and encoded to assist perceptual reconstruction through multi-modality conditional generation. In particular, a universal perception compensator is proposed to improve the perceptual quality of decoded images from all basic codecs at the decoder side by reusing text-assisted diffusion priors from Stable Diffusion. With the cooperation of the above three strategies, our UniMIC achieves a significant improvement in RDP optimization for different compression codecs, e.g., traditional and learnable codecs, and different compression costs, e.g., ultra-low bitrates. The code will be available at this https URL.
https://arxiv.org/abs/2412.04912
Image Compression for Machines (ICM) aims to compress images for machine vision tasks rather than human viewing. Current works predominantly concentrate on high-level tasks like object detection and semantic segmentation. However, the quality of original images is usually not guaranteed in the real world, leading to even worse perceptual quality or downstream task performance after compression. Low-level (LL) machine vision models, like image restoration models, can help improve such quality, and thereby their compression requirements should also be considered. In this paper, we propose a pioneering ICM framework for LL machine vision tasks, namely LL-ICM. By jointly optimizing compression and LL tasks, the proposed LL-ICM not only enriches its encoding ability in generalizing to versatile LL tasks but also optimizes the processing ability of downstream LL task models, achieving mutual adaptation between image codecs and LL task models. Furthermore, we integrate large-scale vision-language models into the LL-ICM framework to generate more universal and distortion-robust feature embeddings for LL vision tasks. Therefore, one LL-ICM codec can generalize to multiple tasks. We establish a solid benchmark to evaluate LL-ICM, which includes extensive objective experiments using both full- and no-reference image quality assessments. Experimental results show that LL-ICM can achieve 22.65% BD-rate reductions over state-of-the-art methods.
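BD-rate, the metric behind the reported 22.65% reduction, averages the log-rate gap between two fitted rate-distortion curves over their overlapping quality range; a standard sketch using a cubic fit and illustrative RD points:

```python
import numpy as np

def bd_rate(rate_anchor, psnr_anchor, rate_test, psnr_test):
    """Bjontegaard delta rate: average percentage bitrate difference
    between two RD curves, from a cubic fit of log-rate as a function
    of PSNR, integrated over the overlapping PSNR interval."""
    p1 = np.polyfit(psnr_anchor, np.log(rate_anchor), 3)
    p2 = np.polyfit(psnr_test, np.log(rate_test), 3)
    lo = max(min(psnr_anchor), min(psnr_test))
    hi = min(max(psnr_anchor), max(psnr_test))
    int1 = np.polyval(np.polyint(p1), hi) - np.polyval(np.polyint(p1), lo)
    int2 = np.polyval(np.polyint(p2), hi) - np.polyval(np.polyint(p2), lo)
    return float((np.exp((int2 - int1) / (hi - lo)) - 1) * 100)

r = np.array([100., 200., 400., 800.])   # illustrative bitrates (kbps)
q = np.array([30., 33., 36., 39.])       # illustrative PSNR values (dB)
```

A negative BD-rate means the test codec needs fewer bits than the anchor for the same quality.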
https://arxiv.org/abs/2412.03841
In this paper, we investigate the counter-forensic effects of the forthcoming JPEG AI standard based on neural image compression, focusing on two critical areas: deepfake image detection and image splicing localization. Neural image compression leverages advanced neural network algorithms to achieve higher compression rates while maintaining image quality. However, it introduces artifacts that closely resemble those generated by image synthesis techniques and image splicing pipelines, complicating the work of researchers when discriminating pristine from manipulated content. We comprehensively analyze JPEG AI's counter-forensic effects through extensive experiments on several state-of-the-art detectors and datasets. Our results demonstrate that an increase in false alarms impairs the performance of leading forensic detectors when analyzing genuine content processed through JPEG AI. By exposing the vulnerabilities of the available forensic tools we aim to raise the urgent need for multimedia forensics researchers to include JPEG AI images in their experimental setups and develop robust forensic techniques to distinguish between neural compression artifacts and actual manipulations.
https://arxiv.org/abs/2412.03261
Recent advancements in deep learning-based compression techniques have surpassed traditional methods. However, deep neural networks remain vulnerable to backdoor attacks, where pre-defined triggers induce malicious behaviors. This paper introduces a novel frequency-based trigger injection model for launching backdoor attacks with multiple triggers on learned image compression models. Inspired by the widely used DCT in compression codecs, triggers are embedded in the DCT domain. We design attack objectives tailored to diverse scenarios, including: 1) degrading compression quality in terms of bit-rate and reconstruction accuracy; 2) targeting task-driven measures like face recognition and semantic segmentation. To improve training efficiency, we propose a dynamic loss function that balances loss terms with fewer hyper-parameters, optimizing attack objectives effectively. For advanced scenarios, we evaluate the attack's resistance to defensive preprocessing and propose a two-stage training schedule with robust frequency selection to enhance resilience. To improve cross-model and cross-domain transferability for downstream tasks, we adjust the classification boundary in the attack loss during training. Experiments show that our trigger injection models, combined with minor modifications to encoder parameters, successfully inject multiple backdoors and their triggers into a single compression model, demonstrating strong performance and versatility.
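Embedding a trigger in the DCT domain can be sketched as perturbing one mid-frequency coefficient of a block and transforming back; the frequency and amplitude here are arbitrary, not the paper's learned trigger:

```python
import numpy as np

def dct_matrix(n):
    """Orthonormal DCT-II basis matrix (rows are basis vectors)."""
    k = np.arange(n)[:, None]
    x = np.arange(n)[None, :]
    m = np.cos(np.pi * (2 * x + 1) * k / (2 * n)) * np.sqrt(2.0 / n)
    m[0] /= np.sqrt(2.0)
    return m

def embed_dct_trigger(block, freq=(3, 3), amp=25.0):
    """Add a trigger at one mid-frequency DCT coefficient of a square
    block, then transform back to the pixel domain."""
    d = dct_matrix(block.shape[0])
    coeffs = d @ block @ d.T          # forward 2-D DCT
    coeffs[freq] += amp               # inject the trigger
    return d.T @ coeffs @ d           # inverse 2-D DCT

poisoned = embed_dct_trigger(np.zeros((8, 8)))
```

Spreading the trigger energy over a cosine basis function keeps the pixel-domain perturbation low-amplitude and visually inconspicuous.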
https://arxiv.org/abs/2412.01646
Inspired by the success of generative image models, recent work on learned image compression increasingly focuses on better probabilistic models of the natural image distribution, leading to excellent image quality. This, however, comes at the expense of a computational complexity that is several orders of magnitude higher than today's commercial codecs, and thus prohibitive for most practical applications. With this paper, we demonstrate that by focusing on modeling visual perception rather than the data distribution, we can achieve a very good trade-off between visual quality and bit rate similar to "generative" compression models such as HiFiC, while requiring less than 1% of the multiply-accumulate operations (MACs) for decompression. We do this by optimizing C3, an overfitted image codec, for Wasserstein Distortion (WD), and evaluating the image reconstructions with a human rater study. The study also reveals that WD outperforms other perceptual quality metrics such as LPIPS, DISTS, and MS-SSIM, both as an optimization objective and as a predictor of human ratings, achieving over 94% Pearson correlation with Elo scores.
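Wasserstein Distortion builds on optimal-transport distances between feature statistics; as background (not the WD metric itself), the one-dimensional p=1 Wasserstein distance between equal-size samples reduces to comparing sorted values:

```python
import numpy as np

def wasserstein_1d(a, b):
    """W1 distance between two equal-size 1-D samples: under the
    quantile coupling it is the mean absolute difference of the
    sorted values."""
    return float(np.mean(np.abs(np.sort(a) - np.sort(b))))

d = wasserstein_1d(np.array([0., 1., 2.]), np.array([1., 2., 3.]))
```

Unlike pointwise error, this distance compares distributions, which is what makes transport-based objectives tolerant of texture realignment that humans do not perceive as distortion.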
https://arxiv.org/abs/2412.00505