The emerging Learned Compression (LC) paradigm replaces traditional codec modules with Deep Neural Networks (DNNs) trained end-to-end for rate-distortion performance. This approach is considered the future of image/video compression, and major efforts have been dedicated to improving its compression efficiency. However, most proposed works pursue compression efficiency by employing more complex DNNs, which increases computational complexity. Alternatively, this paper proposes to improve compression by fully exploiting the existing DNN capacity. To do so, the latent features are guided to learn a richer and more diverse set of features, which leads to better reconstruction. A channel-wise feature decorrelation loss is designed and integrated into the LC optimization. Three strategies are proposed and evaluated, which optimize (1) the transformation network, (2) the context model, and (3) both networks. Experimental results on two established LC methods show that the proposed method improves compression by up to 8.06% in BD-rate, with no added complexity. The proposed solution can be applied as a plug-and-play solution to optimize any similar LC method.
https://arxiv.org/abs/2403.10936
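The abstract does not spell out the channel-wise decorrelation loss; a minimal NumPy sketch of one plausible form (my own construction, penalizing the off-diagonal entries of the channel correlation matrix) looks like this:

```python
import numpy as np

def decorrelation_loss(latent):
    """Penalize off-diagonal entries of the channel-wise correlation matrix.

    latent: array of shape (C, N) -- C channels, N spatial positions.
    Returns a scalar >= 0 that is 0 iff the channels are uncorrelated.
    """
    # Zero-center each channel, then normalize to unit variance.
    z = latent - latent.mean(axis=1, keepdims=True)
    z = z / (z.std(axis=1, keepdims=True) + 1e-8)
    corr = (z @ z.T) / z.shape[1]              # C x C correlation matrix
    off_diag = corr - np.diag(np.diag(corr))   # keep only cross-channel terms
    return float(np.mean(off_diag ** 2))

# Perfectly correlated channels yield a large penalty; independent noise, a small one.
rng = np.random.default_rng(0)
base = rng.standard_normal(1024)
correlated = np.stack([base, base * 2.0])      # two copies of one signal
independent = rng.standard_normal((2, 1024))
```

In an actual LC pipeline this term would be added, suitably weighted, to the rate-distortion objective; the shape convention and normalization here are assumptions.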
A generative adversarial network (GAN) is a type of generative model that maps high-dimensional noise to samples from a target distribution. However, the noise dimension required in a GAN is not well understood. Previous approaches view a GAN as a mapping from one continuous distribution to another. In this paper, we propose to view a GAN as a discrete sampler instead. From this perspective, we build a connection between the minimum noise dimension required and the number of bits needed to losslessly compress the images. Furthermore, to understand the behaviour of a GAN when the noise dimension is limited, we propose the divergence-entropy trade-off. This trade-off characterizes the best divergence achievable when noise is limited. Like the rate-distortion trade-off, it can be solved numerically when the source distribution is known. Finally, we verify our theory with experiments on image generation.
https://arxiv.org/abs/2403.09196
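The divergence-entropy trade-off above can be illustrated with a toy computation (my own construction, not the paper's formulation): if a generator driven by k noise bits can place mass on at most 2^k atoms, then the best achievable total-variation distance to a discrete target p is the probability mass left outside the 2^k most probable atoms.

```python
import numpy as np

def min_tv_with_support(p, k):
    """Smallest total-variation distance between distribution p and any
    distribution supported on at most 2**k atoms -- a lower bound on what a
    generator driven by k noise bits can achieve."""
    m = 2 ** k
    p = np.asarray(p, dtype=float)
    if m >= p.size:
        return 0.0
    # Optimal support is the m most probable atoms; the unavoidable TV
    # distance equals the probability mass outside that support.
    tail = np.sort(p)[:-m]
    return float(tail.sum())

# Example: Zipf-like source over 16 symbols.
p = 1.0 / np.arange(1, 17)
p /= p.sum()
```

As k grows the bound decays to zero, mirroring the abstract's claim that enough noise bits suffice to match the bits needed to compress the source.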
Existing learning-based stereo image codecs adopt sophisticated transformations but encode latent representations with simple entropy models derived from single-image codecs. However, those entropy models struggle to effectively capture the spatial-disparity characteristics inherent in stereo images, which leads to suboptimal rate-distortion results. In this paper, we propose a stereo image compression framework, named CAMSIC. CAMSIC independently transforms each image into a latent representation and employs a powerful decoder-free Transformer entropy model to capture both spatial and disparity dependencies by introducing a novel content-aware masked image modeling (MIM) technique. Our content-aware MIM facilitates efficient bidirectional interaction between prior information and estimated tokens, which naturally obviates the need for an extra Transformer decoder. Experiments show that our stereo image codec achieves state-of-the-art rate-distortion performance on two stereo image datasets, Cityscapes and InStereo2K, with fast encoding and decoding speed.
https://arxiv.org/abs/2403.08505
Image compression emerges as a pivotal tool in the efficient handling and transmission of digital images. Its ability to substantially reduce file size not only facilitates enhanced data storage capacity but also potentially brings advantages to the development of continual machine learning (ML) systems, which learn new knowledge incrementally from sequential data. Continual ML systems often rely on storing representative samples, also known as exemplars, within a limited memory constraint to maintain the performance on previously learned data. These methods are known as memory replay-based algorithms and have proven effective at mitigating the detrimental effects of catastrophic forgetting. Nonetheless, the limited memory buffer size often falls short of adequately representing the entire data distribution. In this paper, we explore the use of image compression as a strategy to enhance the buffer's capacity, thereby increasing exemplar diversity. However, directly using compressed exemplars introduces domain shift during continual ML, marked by a discrepancy between compressed training data and uncompressed testing data. Additionally, it is essential to determine the appropriate compression algorithm and select the most effective rate for continual ML systems to balance the trade-off between exemplar quality and quantity. To this end, we introduce a new framework to incorporate image compression for continual ML including a pre-processing data compression step and an efficient compression rate/algorithm selection method. We conduct extensive experiments on CIFAR-100 and ImageNet datasets and show that our method significantly improves image classification accuracy in continual ML settings.
https://arxiv.org/abs/2403.06288
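As a standard-library-only illustration of the buffer-capacity argument above (using zlib as a stand-in for an image codec such as JPEG, and pickled arrays as stand-in exemplars), compressing exemplars lets many more of them fit a fixed byte budget:

```python
import pickle
import zlib

import numpy as np

def fill_buffer(exemplars, budget_bytes, level=6):
    """Greedily store compressed exemplars until the byte budget is exhausted.
    Stronger compression means more exemplars fit, increasing diversity."""
    stored, used = [], 0
    for x in exemplars:
        blob = zlib.compress(pickle.dumps(x), level)
        if used + len(blob) > budget_bytes:
            break
        stored.append(blob)
        used += len(blob)
    return stored

def restore(blob):
    return pickle.loads(zlib.decompress(blob))

# Smooth, low-entropy "images" compress well; uint8 to mimic pixels.
imgs = [np.full((32, 32), i % 7, dtype=np.uint8) for i in range(100)]
raw_size = imgs[0].nbytes                     # 1024 bytes uncompressed
stored = fill_buffer(imgs, budget_bytes=4096)  # raw storage would fit only 4
```

Note zlib is lossless, so no train/test domain shift arises here; the paper's point is that lossy codecs trade exactly that shift against exemplar quantity.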
Image Coding for Machines (ICM) is an image compression technique for image recognition. This technique is essential due to the growing demand for image recognition AI. In this paper, we propose a method for ICM that focuses on encoding and decoding only the edge information of object parts in an image, which we call SA-ICM. This is a Learned Image Compression (LIC) model trained using edge information created by Segment Anything. Our method can be used with image recognition models for various tasks. SA-ICM is also robust to changes in input data, making it effective for a variety of use cases. Additionally, our method provides benefits from a privacy point of view, as it removes human facial information on the encoder's side, thus protecting one's privacy. Furthermore, this LIC training method can be used to train Neural Representations for Videos (NeRV), a video compression model. By training NeRV using edge information created by Segment Anything, it is possible to create a NeRV that is effective for image recognition (SA-NeRV). Experimental results confirm the advantages of SA-ICM, which presents the best image compression performance for image recognition. We also show that SA-NeRV is superior to ordinary NeRV in video compression for machines.
https://arxiv.org/abs/2403.04173
Recent advances in text-guided image compression have shown great potential to enhance the perceptual quality of reconstructed images. These methods, however, tend to have significantly degraded pixel-wise fidelity, limiting their practicality. To fill this gap, we develop a new text-guided image compression algorithm that achieves both high perceptual and pixel-wise fidelity. In particular, we propose a compression framework that leverages text information mainly by text-adaptive encoding and training with joint image-text loss. By doing so, we avoid decoding based on text-guided generative models -- known for high generative diversity -- and effectively utilize the semantic information of text at a global level. Experimental results on various datasets show that our method can achieve high pixel-level and perceptual quality, with either human- or machine-generated captions. In particular, our method outperforms all baselines in terms of LPIPS, with some room for even more improvements when we use more carefully generated captions.
https://arxiv.org/abs/2403.02944
Learned image compression codecs have recently achieved impressive compression performance, surpassing the most efficient image coding architectures. However, most approaches are trained to minimize rate and distortion, which often leads to unsatisfactory visual results at low bitrates since perceptual metrics are not taken into account. In this paper, we show that conditional diffusion models can lead to promising results in the generative compression task when used as a decoder, and that, given a compressed representation, they allow new trade-off points between distortion and perception to be created at the decoder side based on the sampling method.
https://arxiv.org/abs/2403.02887
This work proposes to augment the lifting steps of the conventional wavelet transform with additional neural-network-assisted lifting steps. These additional steps reduce residual redundancy (notably aliasing information) amongst the wavelet subbands and also improve the visual quality of reconstructed images at reduced resolutions. The proposed approach involves two steps: a high-to-low step followed by a low-to-high step. The high-to-low step suppresses aliasing in the low-pass band by using the detail bands at the same resolution, while the low-to-high step aims to further remove redundancy from the detail bands, so as to achieve higher energy compaction. The two proposed lifting steps are trained in an end-to-end fashion; we employ a backward annealing approach to overcome the non-differentiability of the quantization and cost functions during back-propagation. Importantly, the networks employed in this paper are compact and have limited non-linearities, allowing a fully scalable system; a single pair of trained networks is applied at all levels of decomposition and all bit-rates of interest. By employing the proposed approach within the JPEG 2000 image coding standard, our method can achieve up to 17.4% average BD bit-rate saving over a wide range of bit-rates, while retaining the quality and resolution scalability features of JPEG 2000.
https://arxiv.org/abs/2403.01647
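For readers unfamiliar with lifting, here is a minimal sketch of the base (non-learned) scheme the paper builds on: the reversible LeGall 5/3 lifting of JPEG 2000, shown in 1-D with periodic extension for simplicity (the standard uses symmetric extension; the neural steps proposed above would add further corrections after these fixed steps).

```python
import numpy as np

def lifting_53_forward(x):
    """One level of the reversible LeGall 5/3 lifting transform on a 1-D
    integer signal of even length: split -> predict -> update."""
    even, odd = x[0::2].astype(np.int64), x[1::2].astype(np.int64)
    # Predict: detail = odd minus the average of its even neighbours.
    d = odd - ((even + np.roll(even, -1)) >> 1)
    # Update: smooth = even plus a correction from neighbouring details.
    s = even + ((np.roll(d, 1) + d + 2) >> 2)
    return s, d

def lifting_53_inverse(s, d):
    """Undo the lifting steps in reverse order; integer ops make it exact."""
    even = s - ((np.roll(d, 1) + d + 2) >> 2)
    odd = d + ((even + np.roll(even, -1)) >> 1)
    x = np.empty(even.size + odd.size, dtype=np.int64)
    x[0::2], x[1::2] = even, odd
    return x

x = np.arange(16) ** 2 % 23  # arbitrary integer test signal
s, d = lifting_53_forward(x)
```

Because each lifting step is individually invertible, reconstruction is exact regardless of what the predict/update operators are, which is precisely why learned operators can be dropped in without breaking reversibility.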
Learned Image Compression (LIC) has shown remarkable progress in recent years. Existing works commonly employ CNN-based or self-attention-based modules as transforms for compression. However, there is no prior research on neural transforms that focus on specific regions. In response, we introduce class-agnostic segmentation masks (i.e., semantic masks without category labels) for extracting region-adaptive contextual information. Our proposed module, Region-Adaptive Transform, applies adaptive convolutions on different regions guided by the masks. Additionally, we introduce a plug-and-play module named Scale Affine Layer to incorporate rich contexts from various regions. While there have been prior image compression efforts that involve segmentation masks as additional intermediate inputs, our approach differs significantly from them. Our advantage lies in that, to avoid extra bitrate overhead, we treat these masks as privileged information, which is accessible during the model training stage but not required during the inference phase. To the best of our knowledge, we are the first to employ class-agnostic masks as privileged information and achieve superior performance in pixel-fidelity metrics, such as Peak Signal-to-Noise Ratio (PSNR). The experimental results demonstrate our improvement over previously well-performing methods, with about 8.2% bitrate saving compared to VTM-17.0. The code will be released at this https URL.
https://arxiv.org/abs/2403.00628
Achieving successful variable-bitrate compression with computationally simple algorithms from a single end-to-end learned image or video compression model remains a challenge. Many approaches have been proposed, including conditional auto-encoders, channel-adaptive gains for the latent tensor, and uniform quantization of all elements of the latent tensor. This paper follows the traditional approach of varying a single quantization step size to perform uniform quantization of all latent tensor elements. However, three modifications are proposed to improve the variable-rate compression performance. First, multi-objective optimization is used for (post-)training. Second, a quantization-reconstruction offset is introduced into the quantization operation. Third, variable-rate quantization is also used for the hyper-latent. All these modifications can be made to a pre-trained single-rate compression model by performing post-training. The algorithms are implemented in three well-known image compression models, and the achieved variable-rate compression results indicate negligible or minimal compression performance loss compared to training multiple models. (Codes will be shared at this https URL)
https://arxiv.org/abs/2402.18930
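The second modification, a quantization-reconstruction offset, can be sketched as follows; the Laplacian latent model, step size, and offset value are my own illustrative choices, not the paper's. Pulling nonzero reconstruction points toward zero matches the conditional mean of a peaked, zero-centred latent distribution and so lowers distortion at no rate cost:

```python
import numpy as np

def quantize(x, step):
    """Uniform quantization with a single step size."""
    return np.round(x / step)

def dequantize(q, step, offset=0.0):
    """Reconstruct with an offset that pulls nonzero bins toward zero;
    the zero bin is untouched because sign(0) == 0."""
    return step * (q - offset * np.sign(q))

rng = np.random.default_rng(0)
x = rng.laplace(scale=1.0, size=100_000)  # Laplacian stand-in for a latent

step = 1.5
q = quantize(x, step)
mse_plain = np.mean((x - dequantize(q, step)) ** 2)
mse_offset = np.mean((x - dequantize(q, step, offset=0.1)) ** 2)
```

For this source, reconstructing off-centre reduces mean squared error relative to plain midpoint reconstruction.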
This paper provides a comprehensive study of the features and performance of different ways to incorporate neural networks into lifting-based wavelet-like transforms, within the context of fully scalable and accessible image compression. Specifically, we explore different arrangements of lifting steps, as well as various network architectures for learned lifting operators. Moreover, we examine the impact of the number of learned lifting steps, the number of channels, the number of layers, and the support of kernels in each learned lifting operator. To facilitate the study, we investigate two generic training methodologies that are simultaneously appropriate to the wide variety of lifting structures considered. Experimental results ultimately suggest that retaining fixed lifting steps from the base wavelet transform is highly beneficial. Moreover, we demonstrate that employing more learned lifting steps and more layers in each learned lifting operator does not contribute strongly to the compression performance. However, benefits can be obtained by utilizing more channels in each learned lifting operator. Ultimately, the learned wavelet-like transform proposed in this paper achieves over 25% bit-rate savings compared to JPEG 2000, with compact spatial support.
https://arxiv.org/abs/2402.18761
Research on neural network (NN) based image compression has shown superior performance compared to classical compression frameworks. Unlike the hand-engineered transforms in the classical frameworks, NN-based models learn non-linear transforms that provide more compact bit representations, and achieve faster coding speed on parallel devices than their classical counterparts. These properties have attracted the attention of both the scientific and industrial communities, resulting in the standardization activity JPEG-AI. The verification model for the JPEG-AI standardization process is already in development and has surpassed the advanced VVC intra codec. To generate reconstructed images with the desired bits per pixel and assess the BD-rate performance of both the JPEG-AI verification model and VVC intra, bit-rate matching is employed. However, the current JPEG-AI verification model experiences significant slowdowns during bit-rate matching, resulting in suboptimal performance due to an unsuitable matching model. The proposed methodology offers a gradual algorithmic optimization for matching bit rates, resulting in a fourfold acceleration and over 1% improvement in BD-rate at the base operation point. At the high operation point, the acceleration increases up to sixfold.
https://arxiv.org/abs/2402.17487
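Bit-rate matching of the kind discussed above generically reduces to a root-finding search over a quality parameter. A hedged sketch using bisection, with a synthetic monotone rate model standing in for actually running the encoder (the model and its constants are illustrative):

```python
import math

def match_bitrate(rate_fn, target_bpp, lo, hi, tol=1e-3, max_iter=50):
    """Bisection search for the quality parameter q with rate_fn(q) ~= target_bpp.
    Assumes rate_fn is monotonically increasing in q on [lo, hi]."""
    for _ in range(max_iter):
        mid = 0.5 * (lo + hi)
        r = rate_fn(mid)
        if abs(r - target_bpp) < tol:
            return mid
        if r < target_bpp:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

# Synthetic monotone rate model (stand-in for invoking the actual codec).
def rate_model(q):
    return 0.05 * math.exp(2.0 * q)

q_star = match_bitrate(rate_model, target_bpp=0.75, lo=0.0, hi=3.0)
```

Each probe of `rate_fn` costs one full encode in practice, which is why the paper's gradual optimization of the matching procedure yields such large wall-clock savings.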
Currently, there is a high demand for neural-network-based image compression codecs. These codecs employ non-linear transforms to create compact bit representations and facilitate faster coding speeds on devices compared to the hand-crafted transforms used in classical frameworks. The scientific and industrial communities are highly interested in these properties, leading to the standardization effort of JPEG-AI. The JPEG-AI verification model has been released and is currently under development for standardization. Utilizing neural networks, it can outperform the classic codec VVC intra by over 10% BD-rate at the base operation point. Researchers attribute this success to the flexible bit distribution in the spatial domain, in contrast to VVC intra's anchor, which is generated with a constant quality point. However, our study reveals that VVC intra displays a more adaptable bit distribution structure through its use of variable block sizes. Based on our observations, we propose a spatial bit allocation method to optimize the JPEG-AI verification model's bit distribution and enhance visual quality. Furthermore, by applying the VVC bit distribution strategy, the objective performance of the JPEG-AI verification model can be further improved, resulting in a maximum gain of 0.45 dB in PSNR-Y.
https://arxiv.org/abs/2402.17470
With the evolution of storage and communication protocols, ultra-low-bitrate image compression has become a highly demanded topic. However, existing compression algorithms must sacrifice either consistency with the ground truth or perceptual quality at ultra-low bitrates. In recent years, the rapid development of Large Multimodal Models (LMMs) has made it possible to balance these two goals. To solve this problem, this paper proposes a method called Multimodal Image Semantic Compression (MISC), which consists of an LMM encoder for extracting the semantic information of the image, a map encoder to locate the region corresponding to the semantics, an image encoder that generates an extremely compressed bitstream, and a decoder that reconstructs the image based on the above information. Experimental results show that our proposed MISC is suitable for compressing both traditional Natural Sense Images (NSIs) and emerging AI-Generated Images (AIGIs). It can achieve optimal consistency and perception results while saving 50% of the bitrate, which gives it strong potential applications in the next generation of storage and communication. The code will be released at this https URL.
https://arxiv.org/abs/2402.16749
Representing the Neural Radiance Field (NeRF) with an explicit voxel grid (EVG) is a promising direction for improving NeRFs. However, the EVG representation is not efficient for storage and transmission because of its enormous memory cost. Current methods for compressing EVGs mainly inherit methods designed for neural network compression, such as pruning and quantization, which do not take full advantage of the spatial correlation of voxels. Inspired by mature digital image compression techniques, this paper proposes SPC-NeRF, a novel framework applying spatial predictive coding to EVG compression. The proposed framework can remove spatial redundancy efficiently for better compression performance. Moreover, we model the bitrate and design a novel form of loss function, with which we can jointly optimize compression ratio and distortion to achieve higher coding efficiency. Extensive experiments demonstrate that our method can achieve 32% bit saving compared to the state-of-the-art method VQRF on multiple representative test datasets, with comparable training time.
https://arxiv.org/abs/2402.16366
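The core idea of spatial predictive coding can be sketched in 2-D (a simplified stand-in for the paper's voxel-grid scheme, with my own synthetic data): predict each sample from an already-coded neighbour and code only the residual, whose empirical entropy is much lower when the data is spatially correlated.

```python
import numpy as np

def entropy_bits(values):
    """Empirical Shannon entropy (bits/symbol) of an integer array."""
    _, counts = np.unique(values, return_counts=True)
    p = counts / counts.sum()
    return float(-(p * np.log2(p)).sum())

def predict_left(grid):
    """Residuals after predicting each sample from its left neighbour
    (first column is sent verbatim); exactly invertible by cumulative sum."""
    res = grid.copy()
    res[:, 1:] = grid[:, 1:] - grid[:, :-1]
    return res

# Smooth synthetic "voxel plane": neighbouring values are highly correlated.
x = np.arange(64)
grid = ((np.sin(x / 8.0)[:, None] + np.cos(x / 11.0)[None, :]) * 40).astype(np.int64)

h_raw = entropy_bits(grid)            # bits/sample coding values directly
h_res = entropy_bits(predict_left(grid))  # bits/sample coding residuals
```

The gap `h_raw - h_res` is the redundancy a spatial predictor removes before entropy coding; the paper's framework additionally optimizes the predictor jointly with a rate model.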
Recently, many deep image compression methods have been proposed and have achieved remarkable performance. However, these methods are dedicated to optimizing compression performance and speed at medium and high bitrates, while research on ultra-low bitrates is limited. In this work, we propose an ultra-low-bitrate enhanced invertible encoding network guided by traditional transform theory; experiments show that our codec outperforms existing methods in both compression and reconstruction performance. Specifically, we introduce the Block Discrete Cosine Transform to model the sparsity of features and employ the traditional Haar transform to improve the reconstruction performance of the model without increasing the bitstream cost.
https://arxiv.org/abs/2402.15744
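A small sketch of the energy-compaction property that makes the Block Discrete Cosine Transform useful for modeling sparsity (generic orthonormal DCT-II, not the paper's exact integration; the test block is my own):

```python
import numpy as np

def dct_matrix(n=8):
    """Orthonormal DCT-II matrix: row k dotted with a signal gives coefficient k."""
    k = np.arange(n)[:, None]
    i = np.arange(n)[None, :]
    m = np.cos(np.pi * (2 * i + 1) * k / (2 * n)) * np.sqrt(2.0 / n)
    m[0] /= np.sqrt(2.0)  # DC row scaling for orthonormality
    return m

def block_dct2(block):
    """Separable 2-D DCT of a square block."""
    m = dct_matrix(block.shape[0])
    return m @ block @ m.T

# A smooth 8x8 block: most energy should land in a few low-frequency coefficients.
x = np.arange(8)
block = 10.0 * (x[:, None] + x[None, :])
coeffs = block_dct2(block)

energy = coeffs ** 2
top4 = np.sort(energy.ravel())[-4:].sum() / energy.sum()  # energy in top 4 coeffs
```

For smooth content, a handful of coefficients carry nearly all the energy, which is exactly the sparsity the abstract refers to.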
Traditional methods, such as JPEG, perform image compression by operating on structural information, such as pixel values or frequency content. These methods are effective at bitrates of around one bit per pixel (bpp) and higher at standard image sizes. In contrast, text-based semantic compression directly stores concepts and their relationships using natural language, which has evolved with humans to efficiently represent these salient concepts. These methods can operate at extremely low bitrates by disregarding structural information like location, size, and orientation. In this work, we use GPT-4V and DALL-E3 from OpenAI to explore the quality-compression frontier for image compression and identify the limitations of current technology. We push semantic compression as low as 100 $\mu$bpp (up to $10,000\times$ smaller than JPEG) by introducing an iterative reflection process to improve the decoded image. We further hypothesize that this 100 $\mu$bpp level represents a soft limit on semantic compression at standard image resolutions.
https://arxiv.org/abs/2402.13536
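A back-of-the-envelope check of the 100 µbpp figure (the payload size and image dimensions are my own illustrative assumptions, not values from the paper):

```python
# 100 microbits per pixel corresponds to a tiny text payload for a whole image.
width, height = 1024, 1024
payload_bits = 13 * 8          # assume ~13 bytes of compressed caption/seed
bpp = payload_bits / (width * height)
micro_bpp = bpp * 1e6          # on the order of 100 microbits per pixel

jpeg_bpp = 1.0                 # the ~1 bpp JPEG operating point cited above
ratio = jpeg_bpp / bpp         # on the order of 10,000x, matching the abstract
```

In other words, at this regime the entire "bitstream" is roughly the length of a short phrase, which is why only semantic (not structural) fidelity can survive.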
Storing and transmitting LiDAR point cloud data is essential for many autonomous vehicle (AV) applications, such as training data collection, remote control, cloud services, and SLAM. However, due to the sparse and unordered structure of the data, it is difficult to compress point clouds to a low volume. Transforming the raw point cloud data into a dense 2D matrix structure is a promising way to apply compression algorithms. We propose a new lossless and calibrated 3D-to-2D transformation which allows compression algorithms to efficiently exploit spatial correlations within the 2D representation. To compress the structured representation, we use common image compression methods as well as a self-supervised deep compression approach using a recurrent neural network. We also rearrange the LiDAR's intensity measurements into a dense 2D representation and propose a new metric to evaluate the compression performance of the intensity. Compared to approaches based on generic octree point cloud compression or on raw point cloud data compression, our approach achieves the best quantitative and visual performance. Source code and dataset are available at this https URL.
https://arxiv.org/abs/2402.11680
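A generic spherical-projection sketch conveys the 3D-to-2D idea, though unlike the paper's calibrated transform it is lossy and sensor-agnostic (the grid size and the random point cloud are illustrative):

```python
import numpy as np

def range_image(points, h=32, w=64):
    """Project 3-D points onto an h x w grid indexed by elevation (rows) and
    azimuth (columns), storing range. A generic, lossy stand-in for a
    sensor-calibrated projection: colliding points overwrite each other."""
    x, y, z = points[:, 0], points[:, 1], points[:, 2]
    r = np.sqrt(x**2 + y**2 + z**2)
    azimuth = np.arctan2(y, x)                        # [-pi, pi)
    elevation = np.arcsin(np.clip(z / np.maximum(r, 1e-9), -1.0, 1.0))
    col = ((azimuth + np.pi) / (2 * np.pi) * w).astype(int).clip(0, w - 1)
    row = ((elevation + np.pi / 2) / np.pi * h).astype(int).clip(0, h - 1)
    img = np.zeros((h, w))
    img[row, col] = r                                 # last point per cell wins
    return img

rng = np.random.default_rng(0)
pts = rng.normal(size=(5000, 3)) * [10.0, 10.0, 2.0]  # synthetic point cloud
img = range_image(pts)
```

The resulting dense grid exposes the spatial correlation between neighbouring beams, which standard 2-D image codecs can then exploit; the paper's contribution is making this mapping lossless via calibration.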
Convolutional neural networks (CNNs) for image processing tend to focus on localized texture patterns, commonly referred to as texture bias. While most previous works in the literature focus on the task of image classification, we go beyond this and study the texture bias of CNNs in semantic segmentation. In this work, we propose to train CNNs on pre-processed images with less texture to reduce the texture bias. Therein, the challenge is to suppress image texture while preserving shape information. To this end, we utilize edge-enhancing diffusion (EED), an anisotropic image diffusion method initially introduced for image compression, to create texture-reduced duplicates of existing datasets. Extensive numerical studies are performed with both CNNs and vision transformer models trained on original data and EED-processed data from the Cityscapes dataset and the CARLA driving simulator. We observe strong texture dependence of CNNs and moderate texture dependence of transformers. Training CNNs on EED-processed images makes the models completely insensitive to texture, demonstrating resilience to texture re-introduction to any degree. Additionally, we analyze the performance reduction in depth at the level of connected components in the semantic segmentation, and study the influence of EED pre-processing on domain generalization as well as adversarial robustness.
https://arxiv.org/abs/2402.09530
Diffusion models have achieved remarkable success in generating high-quality image and video data. More recently, they have also been used for image compression with high perceptual quality. In this paper, we present a novel approach to extreme video compression leveraging the predictive power of diffusion-based generative models at the decoder. The conditional diffusion model takes several neurally compressed frames and generates subsequent frames. When the reconstruction quality drops below the desired level, new frames are encoded to restart prediction. The entire video is sequentially encoded to achieve a visually pleasing reconstruction, considering perceptual quality metrics such as the learned perceptual image patch similarity (LPIPS) and the Fréchet video distance (FVD), at bit rates as low as 0.02 bits per pixel (bpp). Experimental results demonstrate the effectiveness of the proposed scheme compared to standard codecs such as H.264 and H.265 in the low-bpp regime. The results showcase the potential of exploiting the temporal relations in video data using generative models. Code is available at: this https URL
https://arxiv.org/abs/2402.08934
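The re-encoding policy described above (encode a fresh frame whenever predicted quality drops below a threshold) can be sketched with a toy quality-decay model; the decay rate and threshold are illustrative values of my own, not from the paper:

```python
def schedule_keyframes(n_frames, decay, threshold):
    """Toy model of the paper's policy: predicted-frame quality decays by
    `decay` per frame since the last encoded frame; whenever it would fall
    below `threshold`, encode a new frame and restart prediction from it."""
    keyframes, quality = [0], 1.0      # the first frame is always encoded
    for i in range(1, n_frames):
        quality -= decay               # generative prediction drifts over time
        if quality < threshold:
            keyframes.append(i)        # re-encode and reset prediction quality
            quality = 1.0
    return keyframes

ks = schedule_keyframes(20, decay=0.15, threshold=0.5)
```

In the paper the per-frame quality would come from a perceptual metric such as LPIPS on the generated frames rather than a fixed decay; the bit rate is then governed by how often the threshold forces a re-encode.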