Storing and transmitting LiDAR point cloud data is essential for many AV applications, such as training data collection, remote control, cloud services, or SLAM. However, due to the sparsity and unordered structure of the data, it is difficult to compress point clouds to a low volume. Transforming the raw point cloud data into a dense 2D matrix structure is a promising way to apply compression algorithms. We propose a new lossless and calibrated 3D-to-2D transformation which allows compression algorithms to efficiently exploit spatial correlations within the 2D representation. To compress the structured representation, we use common image compression methods and also a self-supervised deep compression approach using a recurrent neural network. We also rearrange the LiDAR's intensity measurements into a dense 2D representation and propose a new metric to evaluate the compression performance on the intensity. Compared to approaches based on generic octree point cloud compression or on raw point cloud data compression, our approach achieves the best quantitative and visual performance. Source code and dataset are available at this https URL.
https://arxiv.org/abs/2402.11680
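As background for the transform above, here is a minimal sketch of the generic spherical projection commonly used to map a LiDAR sweep onto a dense 2D range image, assuming a NumPy point cloud of shape (N, 3). The paper's contribution is a calibrated, lossless variant that accounts for the sensor's exact beam layout, which this generic version does not; resolution and field-of-view values are illustrative.

```python
import numpy as np

def to_range_image(points, h=64, w=2048, fov_up=3.0, fov_down=-25.0):
    """Project an (N, 3) LiDAR point cloud onto a dense 2D range image
    via a generic spherical projection (not the paper's calibrated,
    lossless transform)."""
    x, y, z = points[:, 0], points[:, 1], points[:, 2]
    r = np.linalg.norm(points, axis=1)                 # range per point
    yaw = np.arctan2(y, x)                             # azimuth in [-pi, pi]
    pitch = np.arcsin(z / np.clip(r, 1e-8, None))      # elevation angle
    fov_up_r, fov_down_r = np.radians(fov_up), np.radians(fov_down)
    u = ((1.0 - (yaw + np.pi) / (2.0 * np.pi)) * w).astype(int) % w
    v = (fov_up_r - pitch) / (fov_up_r - fov_down_r) * h
    v = np.clip(v, 0, h - 1).astype(int)
    img = np.zeros((h, w), dtype=np.float32)
    img[v, u] = r                                      # keep last hit per cell
    return img
```

Once the sweep sits in a dense grid, neighboring beams and azimuth steps are strongly correlated, which is exactly what 2D image codecs exploit.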
Convolutional neural networks (CNNs) for image processing tend to focus on localized texture patterns, commonly referred to as texture bias. While most previous works in the literature focus on the task of image classification, we go beyond this and study the texture bias of CNNs in semantic segmentation. In this work, we propose to train CNNs on pre-processed images with less texture to reduce the texture bias. Therein, the challenge is to suppress image texture while preserving shape information. To this end, we utilize edge enhancing diffusion (EED), an anisotropic image diffusion method initially introduced for image compression, to create texture-reduced duplicates of existing datasets. Extensive numerical studies are performed with both CNNs and vision transformer models trained on original data and EED-processed data from the Cityscapes dataset and the CARLA driving simulator. We observe strong texture-dependence of CNNs and moderate texture-dependence of transformers. Training CNNs on EED-processed images enables the models to become completely ignorant with respect to texture, demonstrating resilience to texture re-introduction at any degree. Additionally, we analyze the performance reduction in depth at the level of connected components in the semantic segmentation and study the influence of EED pre-processing on domain generalization as well as adversarial robustness.
https://arxiv.org/abs/2402.09530
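EED itself steers a full diffusion tensor along image edges via the structure tensor; the sketch below is only its simpler isotropic cousin (Perona-Malik), shown to illustrate how nonlinear diffusion smooths texture while strong edges survive. Step size and contrast parameter are illustrative assumptions, and this is not the EED used in the paper.

```python
import numpy as np

def perona_malik_step(u, dt=0.2, kappa=0.1):
    """One Perona-Malik nonlinear diffusion step on a grayscale image u:
    the conductivity g shrinks near strong gradients, so edges diffuse
    little while fine texture is smoothed away."""
    gx = np.gradient(u, axis=1)
    gy = np.gradient(u, axis=0)
    g = 1.0 / (1.0 + (gx**2 + gy**2) / kappa**2)   # small near strong edges
    div = np.gradient(g * gx, axis=1) + np.gradient(g * gy, axis=0)
    return u + dt * div
```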
Diffusion models have achieved remarkable success in generating high-quality image and video data. More recently, they have also been used for image compression with high perceptual quality. In this paper, we present a novel approach to extreme video compression leveraging the predictive power of diffusion-based generative models at the decoder. The conditional diffusion model takes several neurally compressed frames and generates subsequent frames. When the reconstruction quality drops below the desired level, new frames are encoded to restart prediction. The entire video is sequentially encoded to achieve a visually pleasing reconstruction, considering perceptual quality metrics such as the learned perceptual image patch similarity (LPIPS) and the Fréchet video distance (FVD), at bit rates as low as 0.02 bits per pixel (bpp). Experimental results demonstrate the effectiveness of the proposed scheme compared to standard codecs such as H.264 and H.265 in the low-bpp regime. The results showcase the potential of exploiting the temporal relations in video data using generative models. Code is available at: this https URL
https://arxiv.org/abs/2402.08934
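A sketch of the encode-until-quality-drops control loop described above. `encode`, `decode`, and `predict_next` are assumed stand-ins for the neural image codec and the conditional diffusion model, not the authors' actual APIs; only the LPIPS gauge is a real library call (the `lpips` package).

```python
import torch
import lpips  # pip install lpips

loss_fn = lpips.LPIPS(net="alex")  # perceptual quality gauge

def compress_video(frames, encode, decode, predict_next, max_lpips=0.25, ctx=2):
    """Sketch of the scheme: spend bits on a few anchor frames, then let
    the diffusion model predict frames for free until LPIPS against the
    source exceeds a threshold, then re-anchor.
    frames: list of (1, 3, H, W) tensors in [-1, 1]."""
    bitstream, recon = [], []
    i = 0
    while i < len(frames):
        # anchor: send `ctx` neurally compressed frames
        for j in range(i, min(i + ctx, len(frames))):
            code = encode(frames[j])
            bitstream.append(code)
            recon.append(decode(code))
        i = min(i + ctx, len(frames))
        # free-wheel: diffusion model predicts subsequent frames
        while i < len(frames):
            pred = predict_next(recon[-ctx:])
            if loss_fn(pred, frames[i]).item() > max_lpips:
                break  # quality dropped; restart prediction with coded frames
            recon.append(pred)  # no bits spent on this frame
            i += 1
    return bitstream, recon
```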
Learned image compression has gained widespread popularity for its efficiency in achieving ultra-low bit-rates. Yet, images containing substantial textual content, particularly screen-content images (SCI), often suffer from text distortion at such compression levels. To address this, we propose to minimize a novel text logit loss designed to quantify the disparity in text between the original and reconstructed images, thereby improving the perceptual quality of the reconstructed text. Through rigorous experimentation across diverse datasets and employing state-of-the-art algorithms, our findings reveal significant enhancements in the quality of reconstructed text upon integration of the proposed loss function with appropriate weighting. Notably, we achieve a Bjontegaard delta (BD) rate of -32.64% for Character Error Rate (CER) and -28.03% for Word Error Rate (WER) on average by applying the text logit loss on two screenshot datasets. Additionally, we present quantitative metrics tailored for evaluating text quality in image compression tasks. Our findings underscore the efficacy and potential applicability of the proposed text logit loss function across various text-aware image compression contexts.
https://arxiv.org/abs/2402.08643
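One plausible reading of a text logit loss, sketched with an assumed frozen OCR recognizer that returns per-position character logits; the paper's exact formulation may differ.

```python
import torch
import torch.nn.functional as F

def text_logit_loss(recognizer, original, reconstructed):
    """Penalize divergence between the text logits a frozen recognizer
    emits for the original vs. the reconstructed image.
    `recognizer` is an assumed frozen OCR backbone returning logits of
    shape (B, T, num_chars)."""
    with torch.no_grad():
        target_logits = recognizer(original)     # reference text evidence
    pred_logits = recognizer(reconstructed)      # gradients flow to the codec
    # soft cross-entropy between the two logit distributions
    return F.kl_div(
        F.log_softmax(pred_logits, dim=-1),
        F.softmax(target_logits, dim=-1),
        reduction="batchmean",
    )
```

In training, this term would be added to the usual rate-distortion objective with an appropriate weight, e.g. `loss = rate + lam_d * distortion + lam_t * text_logit_loss(...)`.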
Noisy images are a challenge to image compression algorithms due to the inherent difficulty of compressing noise. As noise cannot easily be discerned from image details, such as high-frequency signals, its presence leads to extra bits needed for compression. Since the emerging learned image compression paradigm enables end-to-end optimization of codecs, recent efforts were made to integrate denoising into the compression model, relying on clean image features to guide denoising. However, these methods exhibit suboptimal performance under high noise levels and lack the capability to generalize across diverse noise types. In this paper, we propose a novel method integrating a multi-scale denoiser comprising Self-Organizing Operational Neural Networks, for joint image compression and denoising. We employ contrastive learning to boost the network's ability to differentiate noise from high-frequency signal components, by emphasizing the correlation between noisy and clean counterparts. Experimental results demonstrate the effectiveness of the proposed method in both rate-distortion performance and codec speed, outperforming the current state-of-the-art.
https://arxiv.org/abs/2402.05582
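The contrastive idea can be sketched as a standard InfoNCE objective over paired noisy/clean features; the Self-ONN denoiser that produces these features, and the paper's exact contrastive formulation, are not reproduced here.

```python
import torch
import torch.nn.functional as F

def contrastive_noise_loss(feat_noisy, feat_clean, temperature=0.1):
    """InfoNCE-style sketch: pull each noisy patch's features toward its
    clean counterpart and away from other patches in the batch, so the
    encoder learns to separate noise from genuine high-frequency content.
    feat_noisy, feat_clean: (B, D) feature vectors."""
    a = F.normalize(feat_noisy, dim=1)
    b = F.normalize(feat_clean, dim=1)
    logits = a @ b.t() / temperature            # (B, B) similarity matrix
    labels = torch.arange(a.size(0), device=a.device)
    return F.cross_entropy(logits, labels)      # diagonal = positive pairs
```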
We introduce RAGE, an image compression framework that achieves four generally conflicting objectives: 1) good compression for a wide variety of color images, 2) computationally efficient, fast decompression, 3) fast random access to images with pixel-level granularity without the need to decompress the entire image, and 4) support for both lossless and lossy compression. To achieve these, we rely on the recent concept of generalized deduplication (GD), which is known to provide efficient lossless (de)compression and fast random access in time-series data, and deliver key expansions suitable for image compression, both lossless and lossy. Using nine different datasets, including graphics, logos, and natural images, we show that RAGE has similar or better compression ratios than state-of-the-art lossless image compressors, while delivering pixel-level random access capabilities. Tests on an ARM Cortex-M33 platform show seek times between 9.9 and 40.6 ns and average decoding times per pixel between 274 and 1226 ns. Our measurements also show that RAGE's lossy variant, RAGE-Q, outperforms JPEG by several fold in terms of distortion in embedded graphics and has reasonable compression and distortion for natural images.
https://arxiv.org/abs/2402.05974
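The core GD idea, sketched on a single integer sample: split it into a base that many similar samples share (and that can therefore be deduplicated) and a small deviation stored separately. RAGE's actual base/deviation design for pixels is more elaborate; this only shows the principle.

```python
def gd_split(sample, dev_bits=4):
    """Generalized-deduplication sketch: mask off `dev_bits` low bits as
    the 'deviation' and keep the high bits as the 'base'. Similar samples
    share a base, so bases deduplicate well, while deviations stay small
    and fixed-width, which is what enables cheap random access."""
    base = sample >> dev_bits                 # shared by many similar samples
    deviation = sample & ((1 << dev_bits) - 1)
    return base, deviation

def gd_join(base, deviation, dev_bits=4):
    """Inverse mapping: exact (lossless) reconstruction of the sample."""
    return (base << dev_bits) | deviation
```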
Emerging learned image compression (LC) achieves significant improvements in coding efficiency by end-to-end training of neural networks for compression. An important benefit of this approach over traditional codecs is that any optimization criterion can be directly applied to the encoder-decoder networks during training. Perceptual optimization of LC to comply with the Human Visual System (HVS) is one such criterion, which has not been fully explored yet. This paper addresses this gap by proposing a novel framework to integrate Just Noticeable Distortion (JND) principles into LC. Leveraging existing JND datasets, three perceptual optimization methods are proposed to integrate JND into the LC training process: (1) Pixel-Wise JND Loss (PWL) prioritizes pixel-by-pixel fidelity in reproducing JND characteristics, (2) Image-Wise JND Loss (IWL) emphasizes overall imperceptible degradation levels, and (3) Feature-Wise JND Loss (FWL) aligns the reconstructed image features with perceptually significant features. Experimental evaluations demonstrate the effectiveness of JND integration, highlighting improvements in rate-distortion performance and visual quality compared to baseline methods. The proposed methods add no extra complexity after training.
https://arxiv.org/abs/2402.02836
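A speculative sketch of how the three losses could look, inferred only from their names; the paper defines them precisely over JND datasets, so treat every formula below as an assumption. `jnd_ref` denotes a JND-level distorted reference image and `jnd_level` an image-wise imperceptibility threshold, both hypothetical names.

```python
import torch
import torch.nn.functional as F

def pwl(recon, jnd_ref):
    """Pixel-Wise sketch: reproduce JND characteristics pixel by pixel
    against a JND-level distorted reference from a JND dataset."""
    return F.l1_loss(recon, jnd_ref)

def iwl(recon, original, jnd_level):
    """Image-Wise sketch: keep overall distortion near the image's
    imperceptible-degradation level (a scalar threshold)."""
    return (F.mse_loss(recon, original) - jnd_level).abs()

def fwl(recon_feats, jnd_feats):
    """Feature-Wise sketch: align reconstructed-image features with
    perceptually significant features from a fixed backbone."""
    return sum(F.mse_loss(r, j) for r, j in zip(recon_feats, jnd_feats))
```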
This paper presents a learned video compression method in response to the video compression track of the 6th Challenge on Learned Image Compression (CLIC), at DCC 2024. Specifically, we propose a unified contextual video compression framework (UCVC) for joint P-frame and B-frame coding. Each non-intra frame refers to two neighboring decoded frames, which can be either both from the past for P-frame compression, or one from the past and one from the future for B-frame compression. In the training stage, the model parameters are jointly optimized with both P-frames and B-frames. Benefiting from these designs, the framework can support both P-frame and B-frame coding and achieve compression efficiency comparable to that of codecs specifically designed for P-frame or B-frame coding. For the challenge submission, we report the optimal compression efficiency by selecting appropriate frame types for each test sequence. Our team name is PKUSZ-LVC.
https://arxiv.org/abs/2402.01289
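A toy sketch of the reference-selection rule stated above, with `decoded` as an assumed map from time index to decoded frame; UCVC's conditional coding on top of these references is not shown.

```python
def pick_references(decoded, t, b_frame):
    """Each non-intra frame at time t refers to two neighboring decoded
    frames: both from the past for P-frame coding, or one past and one
    future for B-frame coding (assumed interface)."""
    if b_frame:
        return decoded[t - 1], decoded[t + 1]
    return decoded[t - 2], decoded[t - 1]
```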
It is well known that there is no universal metric for image quality evaluation. In this case, distortion-specific metrics can be more reliable. The artifact imposed by image compression can be considered a combination of various distortions, and depending on the image context, this combination can differ. As a result, generalization can be regarded as the major challenge in compressed image quality assessment. In this work, stacking is employed to provide a reliable method. Both semantic and low-level information are employed in the presented IQA to model the human visual system. Moreover, the results of Full-Reference (FR) and No-Reference (NR) models are aggregated to improve the proposed full-reference method for compressed image quality evaluation. On the quality benchmark of the CLIC 2024 perceptual image challenge, the proposed method achieved an accuracy of 79.6%, which illustrates the effectiveness of the fusion-based approach.
https://arxiv.org/abs/2402.00993
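A minimal sketch of the stacking step, assuming per-image FR/NR scores plus semantic and low-level features as inputs and a Ridge meta-regressor fit to mean opinion scores; the paper does not specify its meta-learner, so that choice is an assumption.

```python
import numpy as np
from sklearn.linear_model import Ridge

def stack_iqa(fr_scores, nr_scores, sem_feats, low_feats, mos):
    """Stacking sketch: concatenate Full-Reference and No-Reference model
    outputs (plus semantic and low-level features) into one design matrix
    and fit a meta-regressor against human opinion scores (mos)."""
    X = np.column_stack([fr_scores, nr_scores, sem_feats, low_feats])
    meta = Ridge(alpha=1.0)   # assumed meta-learner
    meta.fit(X, mos)
    return meta               # meta.predict(X_new) yields fused quality scores
```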
Neural image compression has made a great deal of progress. State-of-the-art models are based on variational autoencoders and outperform classical models. Neural compression models learn to encode an image into a quantized latent representation that can be efficiently sent to the decoder, which decodes the quantized latent into a reconstructed image. While these models have proven successful in practice, they lead to sub-optimal results due to imperfect optimization and limitations in the encoder and decoder capacity. Recent work shows how to use stochastic Gumbel annealing (SGA) to refine the latents of pre-trained neural image compression models. We extend this idea by introducing SGA+, which contains three different methods that build upon SGA. Further, we give a detailed analysis of our proposed methods, show how they improve performance, and show that they are less sensitive to hyperparameter choices. In addition, we show how each method can be extended to three- instead of two-class rounding. Finally, we show how refinement of the latents with our best-performing method improves the compression performance on the Tecnick dataset and how it can be deployed to partly move along the rate-distortion curve.
https://arxiv.org/abs/2401.17789
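A simplified two-class stochastic-rounding sketch of the SGA idea: sample between floor and ceil via Gumbel-softmax, annealing toward hard rounding while optimizing the latents. The exact probability parameterization of SGA differs, and SGA+'s three-class variants extend the candidate set; neither detail is reproduced here.

```python
import torch
import torch.nn.functional as F

def sga_round(y, tau):
    """Sample a soft rounding of latents y between floor(y) and ceil(y),
    favoring the nearer integer; annealing tau -> 0 during refinement
    approaches hard rounding while keeping gradients useful."""
    lo, hi = torch.floor(y), torch.ceil(y)
    logits = torch.stack([-(y - lo), -(hi - y)], dim=-1)  # nearer => larger
    w = F.gumbel_softmax(logits, tau=tau, hard=False, dim=-1)
    return w[..., 0] * lo + w[..., 1] * hi

# Illustrative refinement loop over the latents themselves:
# y = y_init.clone().requires_grad_(True)
# opt = torch.optim.Adam([y], lr=5e-3)
# loss = rate(sga_round(y, tau)) + lam * distortion(decode(sga_round(y, tau)), x)
```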
Recent advancements in neural compression have surpassed traditional codecs in PSNR and MS-SSIM measurements. However, at low bit-rates, these methods can introduce visually displeasing artifacts, such as blurring, color shifting, and texture loss, thereby compromising the perceptual quality of images. To address these issues, this study presents an enhanced neural compression method designed for optimal visual fidelity. We have trained our model with a sophisticated semantic ensemble loss, integrating Charbonnier loss, perceptual loss, style loss, and a non-binary adversarial loss, to enhance the perceptual quality of image reconstructions. Additionally, we have implemented a latent refinement process to generate content-aware latent codes. These codes adhere to bit-rate constraints, balance the trade-off between distortion and fidelity, and prioritize bit allocation to regions of greater importance. Our empirical findings demonstrate that this approach significantly improves the statistical fidelity of neural image compression. On the CLIC2024 validation set, our approach achieves a 62% bitrate saving compared to MS-ILLM under the FID metric.
https://arxiv.org/abs/2401.14007
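The loss composition is concrete enough to sketch; the weights and the perceptual/style/adversarial callables (e.g. VGG features, Gram matrices, a non-binary discriminator) are assumptions, not the paper's settings.

```python
import torch

def charbonnier(x, y, eps=1e-6):
    """Charbonnier loss: a smooth, robust variant of L1."""
    return torch.sqrt((x - y) ** 2 + eps * eps).mean()

def ensemble_loss(x, x_hat, perc_fn, style_fn, adv_fn,
                  w=(1.0, 0.1, 10.0, 0.01)):
    """Semantic ensemble loss sketch: pixel fidelity (Charbonnier) plus
    perceptual, style, and adversarial terms, each an assumed callable.
    Weights w are illustrative."""
    return (w[0] * charbonnier(x, x_hat)
            + w[1] * perc_fn(x, x_hat)     # e.g. VGG feature distance
            + w[2] * style_fn(x, x_hat)    # e.g. Gram-matrix distance
            + w[3] * adv_fn(x_hat))        # e.g. non-binary discriminator score
```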
This document is an expanded version of a one-page abstract originally presented at the 2024 Data Compression Conference. It describes our proposed method for the video track of the Challenge on Learned Image Compression (CLIC) 2024. Our scheme follows the typical hybrid coding framework with some novel techniques. Firstly, we adopt the Spynet network to produce accurate motion vectors for motion estimation. Secondly, we introduce a context mining scheme with conditional frame coding to fully exploit the spatial-temporal information. To meet the low target bitrates given by CLIC, we integrate spatial-temporal super-resolution modules to improve rate-distortion performance. Our team name is IMCLVC.
https://arxiv.org/abs/2401.13959
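A sketch of the spatial half of the super-resolution trick for very low target bitrates: code a downsampled frame, then restore resolution with an SR network. The `codec` and `sr_model` interfaces (a CompressAI-style output dict) are assumptions, and the temporal half is omitted.

```python
import torch.nn.functional as F

def code_at_low_rate(frame, codec, sr_model, scale=2):
    """Spend bits on a downscaled frame and recover detail with
    super-resolution after decoding; at low rates the SR restoration
    typically beats coding the full-resolution frame directly."""
    small = F.interpolate(frame, scale_factor=1 / scale, mode="bicubic",
                          align_corners=False)
    decoded = codec(small)["x_hat"]   # assumed CompressAI-style output
    return sr_model(decoded)          # back to full resolution
```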
Recently, DNN models for lossless image coding have surpassed their traditional counterparts in compression performance, reducing the bit rate by about ten percent for natural color images. But even with these advances, mathematically lossless image compression (MLLIC) ratios for natural images still fall short of the bandwidth and cost-effectiveness requirements of most practical imaging and vision systems at present and beyond. To break the bottleneck of MLLIC in compression performance, we question the necessity of MLLIC, as almost all digital sensors inherently introduce acquisition noises, making mathematically lossless compression counterproductive. Therefore, in contrast to MLLIC, we propose a new paradigm of joint denoising and compression called functionally lossless image compression (FLLIC), which performs lossless compression of optimally denoised images (the optimality may be task-specific). Although not literally lossless with respect to the noisy input, FLLIC aims to achieve the best possible reconstruction of the latent noise-free original image. Extensive experiments show that FLLIC achieves state-of-the-art performance in joint denoising and compression of noisy images and does so at a lower computational cost.
https://arxiv.org/abs/2401.13616
Displaying high-quality images on edge devices, such as augmented reality devices, is essential for enhancing the user experience. However, these devices often face power consumption and computing resource limitations, making it challenging to apply many deep learning-based image compression algorithms in this field. Implicit Neural Representation (INR) for image compression is an emerging technology that offers two key benefits compared to cutting-edge autoencoder models: low computational complexity and parameter-free decoding. It also outperforms many traditional and early neural compression methods in terms of quality. In this study, we introduce a new Mixed Autoregressive Model (MARM) to significantly reduce the decoding time for the current INR codec, along with a new synthesis network to enhance reconstruction quality. MARM includes our proposed Autoregressive Upsampler (ARU) blocks, which are highly computationally efficient, and ARM from previous work to balance decoding time and reconstruction quality. We also propose enhancing ARU's performance using a checkerboard two-stage decoding strategy. Moreover, the ratio of different modules can be adjusted to maintain a balance between quality and speed. Comprehensive experiments demonstrate that our method significantly improves computational efficiency while preserving image quality. With different parameter settings, our method can outperform popular AE-based codecs in constrained environments in terms of both quality and decoding time, or achieve state-of-the-art reconstruction quality compared to other INR codecs.
https://arxiv.org/abs/2401.12587
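The checkerboard two-stage idea is easy to sketch: half the latent positions are decoded in parallel first, then the other half conditioned on them. How the ARU blocks actually consume each stage is simplified away here.

```python
import torch

def checkerboard_masks(h, w, device="cpu"):
    """Split an h-by-w latent grid into two interleaved stages: 'anchor'
    cells decoded in parallel first, then the complementary cells decoded
    conditioned on their already-decoded anchor neighbors."""
    yy, xx = torch.meshgrid(torch.arange(h), torch.arange(w), indexing="ij")
    stage1 = ((yy + xx) % 2 == 0).to(device)   # anchors
    stage2 = ~stage1                           # conditioned on stage 1
    return stage1, stage2
```

Two fully parallel passes replace a slow raster-scan autoregression, which is why this strategy cuts decoding time on constrained devices.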
We study the robustness of learned image compression models against adversarial attacks and present a training-free defense technique based on simple image transform functions. Recent learned image compression models are vulnerable to adversarial attacks that result in poor compression rates, low reconstruction quality, or weird artifacts. To address these limitations, we propose a simple but effective two-way compression algorithm with random input transforms, which is conveniently applicable to existing image compression models. Unlike naïve approaches, our approach preserves the original rate-distortion performance of the models on clean images. Moreover, the proposed algorithm requires no additional training or modification of existing models, making it more practical. We demonstrate the effectiveness of the proposed techniques through extensive experiments under multiple compression models, evaluation metrics, and attack scenarios.
https://arxiv.org/abs/2401.11902
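A sketch of one way such a two-way defense could be instantiated, assuming a CompressAI-style model that returns a dict with an `x_hat` reconstruction; the transform pool (here a random rotation) and the selection rule are simplified assumptions, not the paper's algorithm.

```python
import random
import torch

def defended_compress(model, x):
    """Two-way sketch: compress both the clean input and a randomly
    transformed copy, then keep whichever reconstruction is closer to
    the input. Adversarial perturbations tend to lose their effect once
    the input is transformed, while clean images are barely affected."""
    k = random.choice([1, 2, 3])
    x_t = torch.rot90(x, k, dims=(-2, -1))       # random invertible transform
    out_plain = model(x)["x_hat"]                # standard pathway
    out_trans = torch.rot90(model(x_t)["x_hat"], -k, dims=(-2, -1))  # invert
    if torch.mean((x - out_trans) ** 2) < torch.mean((x - out_plain) ** 2):
        return out_trans
    return out_plain
```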
Idempotence is the stability of an image codec under re-compression. At first glance, it is unrelated to perceptual image compression. However, we find that, theoretically: 1) a conditional generative model-based perceptual codec satisfies idempotence; 2) an unconditional generative model with an idempotence constraint is equivalent to a conditional generative codec. Based on this newfound equivalence, we propose a new paradigm of perceptual image codec: inverting an unconditional generative model with idempotence constraints. Our codec is theoretically equivalent to a conditional generative codec, and it does not require training new models. Instead, it only requires a pre-trained mean-square-error codec and an unconditional generative model. Empirically, we show that our proposed approach outperforms state-of-the-art methods such as HiFiC and ILLM in terms of Fréchet Inception Distance (FID). The source code is provided in this https URL.
https://arxiv.org/abs/2401.08920
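The inversion paradigm can be sketched as latent optimization under the idempotence constraint, with `gen` (latent-to-image, with an assumed `latent_dim` attribute) and `codec.encode` (image-to-latent) as assumed interfaces; the paper's actual inversion algorithm differs in detail.

```python
import torch

def invert_with_idempotence(gen, codec, y_obs, steps=200, lr=0.05):
    """Search the generator's latent space for a natural-looking image
    whose re-compression matches the observed MSE-codec latent y_obs:
    the decoder-side realization of 'unconditional generative model +
    idempotence constraint'."""
    z = torch.randn(1, gen.latent_dim, requires_grad=True)
    opt = torch.optim.Adam([z], lr=lr)
    for _ in range(steps):
        x = gen(z)                                          # candidate image
        loss = torch.mean((codec.encode(x) - y_obs) ** 2)   # idempotence term
        opt.zero_grad()
        loss.backward()
        opt.step()
    return gen(z).detach()
```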
Image compression constitutes a significant challenge amidst the era of information explosion. Recent studies employing deep learning methods have demonstrated the superior performance of learning-based image compression methods over traditional codecs. However, an inherent challenge associated with these methods lies in their lack of interpretability. Following an analysis of the varying degrees of compression degradation across different frequency bands, we propose an end-to-end optimized image compression model facilitated by a frequency-oriented transform. The proposed model consists of four components: spatial sampling, frequency-oriented transform, entropy estimation, and frequency-aware fusion. The frequency-oriented transform separates the original image signal into distinct frequency bands, aligning with human-interpretable concepts. Leveraging the non-overlapping hypothesis, the model enables scalable coding through the selective transmission of arbitrary frequency components. Extensive experiments demonstrate that our model outperforms all traditional codecs, including the next-generation standard H.266/VVC, on the MS-SSIM metric. Moreover, visual analysis tasks (i.e., object detection and semantic segmentation) are conducted to verify that the proposed compression method preserves semantic fidelity in addition to signal-level precision.
https://arxiv.org/abs/2401.08194
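A fixed Laplacian-pyramid sketch of the frequency-band separation idea: each level isolates one band, so scalable coding can transmit an arbitrary subset of bands. The paper learns its frequency-oriented transform end to end; this fixed pyramid only illustrates the concept.

```python
import torch
import torch.nn.functional as F

def split_bands(x, levels=3):
    """Decompose an image tensor (B, C, H, W) into frequency bands via a
    Laplacian pyramid: each residual holds one band of detail, and the
    final low-pass holds the coarsest content. Dropping bands at the
    tail degrades gracefully, mirroring scalable transmission."""
    bands, cur = [], x
    for _ in range(levels - 1):
        down = F.avg_pool2d(cur, 2)
        up = F.interpolate(down, scale_factor=2, mode="bilinear",
                           align_corners=False)
        bands.append(cur - up)   # high-frequency residual at this scale
        cur = down
    bands.append(cur)            # lowest-frequency band
    return bands
```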
This one-page paper describes our method for the image compression track. To achieve better perceptual quality, we use an adversarial loss to generate realistic textures and a region-of-interest (ROI) mask to guide bit allocation across different regions. Our team name is TLIC.
https://arxiv.org/abs/2401.08154
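ROI-guided bit allocation is commonly realized as a mask-weighted distortion term in the rate-distortion objective; a minimal sketch with illustrative weights follows (the paper does not state its exact mechanism).

```python
import torch

def roi_distortion(x, x_hat, roi_mask, w_roi=4.0, w_bg=1.0):
    """Weighted distortion sketch: errors inside the ROI mask cost more,
    so the rate-distortion trade-off steers bits toward important
    regions. roi_mask is in [0, 1]; weights are illustrative."""
    w = w_bg + (w_roi - w_bg) * roi_mask
    return (w * (x - x_hat) ** 2).mean()
```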
We propose an end-to-end learned image compression codec wherein the analysis transform is jointly trained with an object classification task. This study affirms that the compressed latent representation can predict human perceptual distance judgments with an accuracy comparable to a custom-tailored DNN-based quality metric. We further investigate various neural encoders and demonstrate the effectiveness of employing the analysis transform as a perceptual loss network for image tasks beyond quality judgments. Our experiments show that the off-the-shelf neural encoder proves proficient in perceptual modeling without needing an additional VGG network. We expect this research to serve as a valuable reference for the development of a semantic-aware and coding-efficient neural encoder.
https://arxiv.org/abs/2401.07200
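Using the analysis transform as a perceptual loss network amounts to measuring distances in its latent space rather than in VGG feature space; a minimal sketch, with the codec encoder assumed frozen:

```python
import torch

def latent_perceptual_distance(analysis, x, y):
    """Compare two images in the latent space of a learned codec's
    (frozen) analysis transform, replacing a separate VGG loss network."""
    with torch.no_grad():
        return torch.mean((analysis(x) - analysis(y)) ** 2).item()
```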
Image compression has been applied in the fields of image storage and video broadcasting. However, it is formidably difficult to distinguish the subtle quality differences between distorted images generated by different algorithms. In this paper, we propose a new image quality assessment framework to decide which image in an image group is better. To capture the subtle differences, a fine-grained network is adopted to acquire multi-scale features. Subsequently, we design a cross-subtract block for separating and gathering the information within positive and negative image pairs, enabling image comparison in feature space. After that, a progressive feature fusion block is designed, which fuses multi-scale features in a novel progressive way, so that hierarchical spatial 2D features can be processed gradually. Experimental results show that, compared with the current mainstream image quality assessment methods, the proposed network achieves more accurate image quality assessment and ranks second on the CLIC benchmark in the image perceptual model track.
https://arxiv.org/abs/2401.06992
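A speculative sketch of a cross-subtract block consistent with the description above: subtraction isolates where an image pair differs, while concatenation plus convolution gathers what it shares. Layer sizes and the exact wiring are assumptions.

```python
import torch
import torch.nn as nn

class CrossSubtractBlock(nn.Module):
    """Separate and gather information within a feature pair so the two
    candidates can be compared directly in feature space."""

    def __init__(self, dim):
        super().__init__()
        self.mix = nn.Sequential(nn.Conv2d(2 * dim, dim, 1), nn.ReLU())

    def forward(self, fa, fb):
        diff = fa - fb                        # where the pair disagrees
        both = torch.cat([fa, fb], dim=1)     # what the pair shares
        return self.mix(both) + diff          # fused comparison feature
```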