This paper presents a SYCL implementation of Multi-Layer Perceptrons (MLPs), which targets and is optimized for the Intel Data Center GPU Max 1550. To increase performance, our implementation minimizes slow global memory accesses by maximizing data reuse within the general register file and the shared local memory, fusing the operations in each layer of the MLP. We show with a simple roofline model that this results in a significant increase in arithmetic intensity, leading to improved performance, especially for inference. We compare our approach to a similar CUDA implementation for MLPs and show that our implementation on the Intel Data Center GPU outperforms the CUDA implementation on Nvidia's H100 GPU by a factor of up to 2.84 in inference and 1.75 in training. The paper also showcases the efficiency of our SYCL implementation in three significant areas: Image Compression, Neural Radiance Fields, and Physics-Informed Machine Learning. In all cases, our implementation outperforms the off-the-shelf Intel Extension for PyTorch (IPEX) implementation on the same Intel GPU by up to a factor of 30 and the CUDA PyTorch version on Nvidia's H100 GPU by up to a factor of 19. The code can be found at this https URL.
https://arxiv.org/abs/2403.17607
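A back-of-the-envelope sketch of the roofline argument above: fusing the layers keeps intermediate activations on-chip, so global-memory traffic shrinks while FLOPs stay the same, raising arithmetic intensity. All sizes and the two-term traffic model below are illustrative assumptions, not figures from the paper.

```python
def arithmetic_intensity(batch, width, layers, dtype_bytes=2, fused=True):
    """Rough roofline estimate (FLOPs per byte of global-memory traffic)
    for an MLP of `layers` square weight matrices of size width x width.
    Illustrative model only: it counts matmul FLOPs (2*M*N*K per layer)
    and assumes a fused kernel keeps all intermediate activations on-chip."""
    flops = 2 * batch * width * width * layers
    weight_bytes = layers * width * width * dtype_bytes  # weights are always read
    io_bytes = 2 * batch * width * dtype_bytes           # load input, store output
    if fused:
        traffic = weight_bytes + io_bytes
    else:
        # unfused: every layer writes its activations out and reads them back
        traffic = weight_bytes + io_bytes + 2 * (layers - 1) * batch * width * dtype_bytes
    return flops / traffic
```

With e.g. a batch of 1024, width 64, and 4 layers in fp16, the fused estimate is several times higher than the unfused one, which is the qualitative effect the paper's roofline analysis relies on.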
Although replacing Gaussian decoders with a conditional diffusion model enhances the perceptual quality of reconstructions in neural image compression, the diffusion model's lack of inductive bias for image data restricts its ability to achieve state-of-the-art perceptual levels. To address this limitation, we adopt a non-isotropic diffusion model at the decoder side. This model imposes an inductive bias aimed at distinguishing between frequency contents, thereby facilitating the generation of high-quality images. Moreover, our framework is equipped with a novel entropy model that accurately models the probability distribution of the latent representation by exploiting spatio-channel correlations in latent space, while accelerating the entropy decoding step. This channel-wise entropy model leverages both local and global spatial contexts within each channel chunk. The global spatial context is built upon a Transformer specifically designed for image compression tasks. The designed Transformer employs a Laplacian-shaped positional encoding, the learnable parameters of which are adaptively adjusted for each channel cluster. Our experiments demonstrate that our proposed framework yields better perceptual quality compared to cutting-edge generative-based codecs, and the proposed entropy model contributes to notable bitrate savings.
https://arxiv.org/abs/2403.16258
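A minimal sketch of what a Laplacian-shaped positional encoding over pairwise token distances could look like: weights decay as exp(-|i-j|/scale). This is a guess at the general shape only; in the paper the scale parameters are learnable and adapted per channel cluster, whereas here the scale is fixed.

```python
import math

def laplacian_positional_bias(length, scale):
    """Laplacian-shaped positional weighting over pairwise distances:
    bias[i][j] = exp(-|i - j| / scale). Illustrative sketch; the paper
    learns `scale` adaptively for each channel cluster."""
    return [[math.exp(-abs(i - j) / scale) for j in range(length)]
            for i in range(length)]
```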
Image compression and denoising represent fundamental challenges in image processing with many real-world applications. To address practical demands, current solutions can be categorized into two main strategies: 1) sequential methods; and 2) joint methods. However, sequential methods have the disadvantage of error accumulation, as information is lost between the multiple individual models. Recently, the academic community began to tackle this problem through end-to-end joint methods, but most of them ignore that different regions of noisy images have different characteristics. To solve these problems, in this paper, our proposed signal-to-noise ratio (SNR) aware joint solution exploits local and non-local features for image compression and denoising simultaneously. We design an end-to-end trainable network, which includes the main encoder branch, the guidance branch, and the SNR-aware branch. We conducted extensive experiments on both synthetic and real-world datasets, demonstrating that our joint solution outperforms existing state-of-the-art methods.
https://arxiv.org/abs/2403.14135
The widespread adoption of face recognition has led to increasing privacy concerns, as unauthorized access to face images can expose sensitive personal information. This paper explores face image protection against viewing and recovery attacks. Inspired by image compression, we propose creating a visually uninformative face image through feature subtraction between an original face and its model-produced regeneration. Recognizable identity features within the image are encouraged by co-training a recognition model on its high-dimensional feature representation. To enhance privacy, the high-dimensional representation is crafted through random channel shuffling, resulting in randomized recognizable images devoid of attacker-leverageable texture details. We distill our methodologies into a novel privacy-preserving face recognition method, MinusFace. Experiments demonstrate its high recognition accuracy and effective privacy protection. Its code is available at this https URL.
https://arxiv.org/abs/2403.12457
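The two ideas in the abstract above, in miniature: subtract the model's regeneration from the original to leave a visually uninformative residue, then shuffle its channels with a secret seed so texture details are not attacker-leverageable. The shapes and this exact composition are illustrative assumptions only, not MinusFace's actual pipeline.

```python
import random

def protect(original, regenerated, seed):
    """Sketch: (1) feature subtraction between the original and its
    model-produced regeneration, (2) random channel shuffling keyed by
    a secret seed. `original` and `regenerated` are lists of channels
    (each a list of values); real face features are far larger."""
    residue = [[o - r for o, r in zip(oc, rc)]
               for oc, rc in zip(original, regenerated)]
    order = list(range(len(residue)))
    random.Random(seed).shuffle(order)  # reproducible for the key holder
    return [residue[i] for i in order]
```

The shuffle is deterministic given the seed, so a recognition model trained with the key sees consistent inputs while an attacker without it faces a randomized channel order.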
Learned Image Compression (LIC) has achieved dramatic progress regarding objective and subjective metrics. MSE-based models aim to improve objective metrics, while generative models are leveraged to improve visual quality measured by subjective metrics. However, they all suffer from blurring or deformation at low bit rates, especially below $0.2bpp$. Besides, deformation on human faces and text is unacceptable for visual quality assessment, and the problem becomes more prominent on small faces and text. To solve this problem, we combine the advantages of MSE-based models and generative models by utilizing regions of interest (ROI). We propose Hierarchical-ROI (H-ROI) to split images into several foreground regions and one background region to improve the reconstruction of regions containing faces, text, and complex textures. Further, we propose adaptive quantization by non-linear mapping within the channel dimension to constrain the bit rate while maintaining the visual quality. Exhaustive experiments demonstrate that our methods achieve better visual quality on small faces and text with lower bit rates, e.g., $0.7X$ bits of HiFiC and $0.5X$ bits of BPG.
https://arxiv.org/abs/2403.13030
The emerging Learned Compression (LC) replaces the traditional codec modules with Deep Neural Networks (DNNs), which are trained end-to-end for rate-distortion performance. This approach is considered the future of image/video compression, and major efforts have been dedicated to improving its compression efficiency. However, most proposed works target compression efficiency by employing more complex DNNs, which leads to higher computational complexity. Alternatively, this paper proposes to improve compression by fully exploiting the existing DNN capacity. To do so, the latent features are guided to learn a richer and more diverse set of features, which corresponds to better reconstruction. A channel-wise feature decorrelation loss is designed and is integrated into the LC optimization. Three strategies are proposed and evaluated, which optimize (1) the transformation network, (2) the context model, and (3) both networks. Experimental results on two established LC methods show that the proposed method improves the compression with a BD-Rate of up to 8.06%, with no added complexity. The proposed solution can be applied as a plug-and-play solution to optimize any similar LC method.
https://arxiv.org/abs/2403.10936
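One plausible form of a channel-wise feature decorrelation loss is the mean squared off-diagonal entry of the channel correlation matrix: it is zero when channels carry independent information and grows as they become redundant. This is a sketch of the general idea only; the paper's exact formulation may differ.

```python
def channel_decorrelation_loss(latent):
    """latent: list of C channels, each a list of N samples.
    Returns the mean squared off-diagonal channel correlation
    (a sketch of a decorrelation penalty, not the paper's exact loss)."""
    C = len(latent)
    normed = []
    for ch in latent:
        m = sum(ch) / len(ch)
        z = [v - m for v in ch]                      # center each channel
        n = sum(v * v for v in z) ** 0.5 + 1e-8      # L2-normalize
        normed.append([v / n for v in z])
    total = 0.0
    for i in range(C):
        for j in range(C):
            if i != j:
                corr = sum(a * b for a, b in zip(normed[i], normed[j]))
                total += corr ** 2
    return total / (C * (C - 1))
```

Fully redundant channels score near 1, decorrelated channels near 0; adding such a term to the rate-distortion objective pushes the transform toward a more diverse set of latent features.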
A generative adversarial network (GAN) is a type of generative model that maps high-dimensional noise to samples in a target distribution. However, the noise dimension required in a GAN is not well understood. Previous approaches view a GAN as a mapping from one continuous distribution to another continuous distribution. In this paper, we propose to view a GAN as a discrete sampler instead. From this perspective, we build a connection between the minimum noise dimension required and the number of bits needed to losslessly compress the images. Furthermore, to understand the behaviour of a GAN when the noise dimension is limited, we propose a divergence-entropy trade-off. This trade-off depicts the best divergence we can achieve when noise is limited, and, like the rate-distortion trade-off, it can be solved numerically when the source distribution is known. Finally, we verify our theory with experiments on image generation.
https://arxiv.org/abs/2403.09196
Existing learning-based stereo image codecs adopt sophisticated transformations with simple entropy models derived from single-image codecs to encode latent representations. However, those entropy models struggle to effectively capture the spatial-disparity characteristics inherent in stereo images, which leads to suboptimal rate-distortion results. In this paper, we propose a stereo image compression framework, named CAMSIC. CAMSIC independently transforms each image to a latent representation and employs a powerful decoder-free Transformer entropy model to capture both spatial and disparity dependencies, by introducing a novel content-aware masked image modeling (MIM) technique. Our content-aware MIM facilitates efficient bidirectional interaction between prior information and estimated tokens, which naturally obviates the need for an extra Transformer decoder. Experiments show that our stereo image codec achieves state-of-the-art rate-distortion performance on two stereo image datasets, Cityscapes and InStereo2K, with fast encoding and decoding speed.
https://arxiv.org/abs/2403.08505
Image compression emerges as a pivotal tool in the efficient handling and transmission of digital images. Its ability to substantially reduce file size not only facilitates enhanced data storage capacity but also potentially brings advantages to the development of continual machine learning (ML) systems, which learn new knowledge incrementally from sequential data. Continual ML systems often rely on storing representative samples, also known as exemplars, within a limited memory constraint to maintain the performance on previously learned data. These methods are known as memory replay-based algorithms and have proven effective at mitigating the detrimental effects of catastrophic forgetting. Nonetheless, the limited memory buffer size often falls short of adequately representing the entire data distribution. In this paper, we explore the use of image compression as a strategy to enhance the buffer's capacity, thereby increasing exemplar diversity. However, directly using compressed exemplars introduces domain shift during continual ML, marked by a discrepancy between compressed training data and uncompressed testing data. Additionally, it is essential to determine the appropriate compression algorithm and select the most effective rate for continual ML systems to balance the trade-off between exemplar quality and quantity. To this end, we introduce a new framework to incorporate image compression for continual ML including a pre-processing data compression step and an efficient compression rate/algorithm selection method. We conduct extensive experiments on CIFAR-100 and ImageNet datasets and show that our method significantly improves image classification accuracy in continual ML settings.
https://arxiv.org/abs/2403.06288
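The quantity side of the exemplar quality/quantity trade-off above is easy to make concrete: under a fixed replay-buffer budget, a lower compression rate (fewer bits per pixel) means more exemplars fit. This helper is an illustrative assumption of the arithmetic only; the paper additionally selects the rate and algorithm that best balance exemplar quality against this count.

```python
def exemplars_that_fit(budget_bytes, height, width, bpp):
    """Number of compressed exemplars fitting in a fixed replay buffer.
    bpp: bits per pixel after compression. Illustrative sketch; real
    compressed sizes vary per image."""
    bytes_per_image = height * width * bpp / 8
    return int(budget_bytes // bytes_per_image)
```

For a 1 MiB buffer of 32x32 exemplars, compressing from raw 24 bpp to 1 bpp grows the buffer's capacity from a few hundred exemplars to several thousand, which is the diversity gain the paper exploits.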
Image Coding for Machines (ICM) is an image compression technique for image recognition. This technique is essential due to the growing demand for image recognition AI. In this paper, we propose a method for ICM that focuses on encoding and decoding only the edge information of object parts in an image, which we call SA-ICM. This is a Learned Image Compression (LIC) model trained using edge information created by Segment Anything. Our method can be used with image recognition models for various tasks. SA-ICM is also robust to changes in input data, making it effective for a variety of use cases. Additionally, our method provides benefits from a privacy point of view, as it removes human facial information on the encoder's side, thus protecting one's privacy. Furthermore, this LIC model training method can be used to train Neural Representations for Videos (NeRV), which is a video compression model. By training NeRV using edge information created by Segment Anything, it is possible to create a NeRV that is effective for image recognition (SA-NeRV). Experimental results confirm the advantages of SA-ICM, presenting the best performance in image compression for image recognition. We also show that SA-NeRV is superior to ordinary NeRV in video compression for machines.
https://arxiv.org/abs/2403.04173
Recent advances in text-guided image compression have shown great potential to enhance the perceptual quality of reconstructed images. These methods, however, tend to have significantly degraded pixel-wise fidelity, limiting their practicality. To fill this gap, we develop a new text-guided image compression algorithm that achieves both high perceptual and pixel-wise fidelity. In particular, we propose a compression framework that leverages text information mainly by text-adaptive encoding and training with joint image-text loss. By doing so, we avoid decoding based on text-guided generative models -- known for high generative diversity -- and effectively utilize the semantic information of text at a global level. Experimental results on various datasets show that our method can achieve high pixel-level and perceptual quality, with either human- or machine-generated captions. In particular, our method outperforms all baselines in terms of LPIPS, with some room for even more improvements when we use more carefully generated captions.
https://arxiv.org/abs/2403.02944
Learned image compression codecs have recently achieved impressive compression performance, surpassing the most efficient image coding architectures. However, most approaches are trained to minimize rate and distortion, which often leads to unsatisfactory visual results at low bitrates since perceptual metrics are not taken into account. In this paper, we show that conditional diffusion models can lead to promising results in the generative compression task when used as a decoder, and that, given a compressed representation, they allow creating new tradeoff points between distortion and perception at the decoder side based on the sampling method.
https://arxiv.org/abs/2403.02887
This work proposes to augment the lifting steps of the conventional wavelet transform with additional neural-network-assisted lifting steps. These additional steps reduce residual redundancy (notably aliasing information) amongst the wavelet subbands, and also improve the visual quality of reconstructed images at reduced resolutions. The proposed approach involves two steps, a high-to-low step followed by a low-to-high step. The high-to-low step suppresses aliasing in the low-pass band by using the detail bands at the same resolution, while the low-to-high step aims to further remove redundancy from the detail bands, so as to achieve higher energy compaction. The proposed two lifting steps are trained in an end-to-end fashion; we employ a backward annealing approach to overcome the non-differentiability of the quantization and cost functions during back-propagation. Importantly, the networks employed in this paper are compact and with limited non-linearities, allowing a fully scalable system; one pair of trained network parameters is applied for all levels of decomposition and for all bit-rates of interest. By employing the proposed approach within the JPEG 2000 image coding standard, our method can achieve up to 17.4% average BD bit-rate saving over a wide range of bit-rates, while retaining the quality and resolution scalability features of JPEG 2000.
https://arxiv.org/abs/2403.01647
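For readers unfamiliar with lifting, here is the conventional structure the neural steps above would augment: a predict step forms detail coefficients from even samples, and an update step forms the approximation, with an exact inverse obtained by running the steps backwards. This minimal sketch uses the Haar wavelet for simplicity; it contains none of the paper's learned components.

```python
def haar_lift(signal):
    """One level of the Haar wavelet via lifting (predict + update).
    The paper inserts neural-network-assisted lifting steps alongside
    steps like these; only the conventional part is shown here."""
    even = signal[0::2]
    odd = signal[1::2]
    detail = [o - e for o, e in zip(odd, even)]          # predict step
    approx = [e + d / 2 for e, d in zip(even, detail)]   # update step
    return approx, detail

def haar_unlift(approx, detail):
    """Exact inverse: undo the update, then the predict, then interleave."""
    even = [a - d / 2 for a, d in zip(approx, detail)]
    odd = [d + e for d, e in zip(detail, even)]
    out = []
    for e, o in zip(even, odd):
        out.extend([e, o])
    return out
```

Because each lifting step is inverted by negating it, any additional step (neural or not) preserves perfect reconstruction by construction, which is what makes the lifting structure attractive for learned extensions.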
Learned Image Compression (LIC) has shown remarkable progress in recent years. Existing works commonly employ CNN-based or self-attention-based modules as transform methods for compression. However, there is no prior research on neural transforms that focus on specific regions. In response, we introduce class-agnostic segmentation masks (i.e. semantic masks without category labels) for extracting region-adaptive contextual information. Our proposed module, Region-Adaptive Transform, applies adaptive convolutions on different regions guided by the masks. Additionally, we introduce a plug-and-play module named Scale Affine Layer to incorporate rich contexts from various regions. While there have been prior image compression efforts that involve segmentation masks as additional intermediate inputs, our approach differs significantly from them. Our advantage lies in that, to avoid extra bitrate overhead, we treat these masks as privileged information, which is accessible during the model training stage but not required during the inference phase. To the best of our knowledge, we are the first to employ class-agnostic masks as privileged information and achieve superior performance in pixel-fidelity metrics, such as Peak Signal-to-Noise Ratio (PSNR). The experimental results demonstrate our improvement compared to previously well-performing methods, with about 8.2% bitrate saving compared to VTM-17.0. The code will be released at this https URL.
https://arxiv.org/abs/2403.00628
Achieving successful variable bitrate compression with computationally simple algorithms from a single end-to-end learned image or video compression model remains a challenge. Many approaches have been proposed, including conditional auto-encoders, channel-adaptive gains for the latent tensor, or uniformly quantizing all elements of the latent tensor. This paper follows the traditional approach of varying a single quantization step size to perform uniform quantization of all latent tensor elements. However, three modifications are proposed to improve the variable rate compression performance. First, multi-objective optimization is used for (post-)training. Second, a quantization-reconstruction offset is introduced into the quantization operation. Third, variable rate quantization is also applied to the hyper-latent. All these modifications can be made on a pre-trained single-rate compression model by performing post-training. The algorithms are implemented into three well-known image compression models, and the achieved variable rate compression results indicate negligible or minimal compression performance loss compared to training multiple models. (Codes will be shared at this https URL)
https://arxiv.org/abs/2402.18930
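The second modification above can be illustrated concretely: uniform quantization rounds x/Δ, and at reconstruction an offset pulls nonzero symbols slightly back toward zero. This is a common heuristic shown here as a sketch; the paper's offset is learned and its exact form may differ.

```python
def quantize(x, step):
    """Uniform scalar quantization with a single step size."""
    return round(x / step)

def dequantize(q, step, offset=0.0):
    """Reconstruction with an additive quantization-reconstruction offset
    that pulls nonzero symbols toward zero (illustrative heuristic; the
    paper learns its offset)."""
    if q == 0:
        return 0.0
    sign = 1.0 if q > 0 else -1.0
    return (q - sign * offset) * step
```

Varying `step` alone gives variable-rate behavior from one model; the offset then compensates for the mismatch between the rounded symbol and the true conditional mean of the latent within each quantization bin.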
This paper provides a comprehensive study on features and performance of different ways to incorporate neural networks into lifting-based wavelet-like transforms, within the context of fully scalable and accessible image compression. Specifically, we explore different arrangements of lifting steps, as well as various network architectures for learned lifting operators. Moreover, we examine the impact of the number of learned lifting steps, the number of channels, the number of layers and the support of kernels in each learned lifting operator. To facilitate the study, we investigate two generic training methodologies that are simultaneously appropriate to a wide variety of lifting structures considered. Experimental results ultimately suggest that retaining fixed lifting steps from the base wavelet transform is highly beneficial. Moreover, we demonstrate that employing more learned lifting steps and more layers in each learned lifting operator do not contribute strongly to the compression performance. However, benefits can be obtained by utilizing more channels in each learned lifting operator. Ultimately, the learned wavelet-like transform proposed in this paper achieves over 25% bit-rate savings compared to JPEG 2000 with compact spatial support.
https://arxiv.org/abs/2402.18761
Research on neural network (NN) based image compression has shown superior performance compared to classical compression frameworks. Unlike the hand-engineered transforms in the classical frameworks, NN-based models learn non-linear transforms providing more compact bit representations, and achieve faster coding speed on parallel devices than their classical counterparts. Those properties attracted the attention of both the scientific and industrial communities, resulting in the standardization activity JPEG-AI. The verification model for the standardization process of JPEG-AI is already in development and has surpassed the advanced VVC intra codec. To generate reconstructed images with the desired bits per pixel and assess the BD-rate performance of both the JPEG-AI verification model and VVC intra, bit rate matching is employed. However, the current state of the JPEG-AI verification model experiences significant slowdowns during bit rate matching, resulting in suboptimal performance due to an unsuitable model. The proposed methodology offers a gradual algorithmic optimization for matching bit rates, resulting in a fourfold acceleration and over 1% improvement in BD-rate at the base operation point. At the high operation point, the acceleration increases up to sixfold.
https://arxiv.org/abs/2402.17487
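In its simplest generic form, bit rate matching is a root-finding problem: search for the quality parameter whose encoded size hits the target bits per pixel. The bisection sketch below assumes a monotonically increasing rate function and is an illustration of the problem being optimized, not the JPEG-AI verification model's actual matching procedure.

```python
def match_bitrate(rate_fn, target_bpp, lo, hi, tol=1e-3, max_iter=50):
    """Find q in [lo, hi] with rate_fn(q) ~= target_bpp by bisection.
    Assumes rate_fn is monotonically increasing in q. Each rate_fn call
    stands in for a full encode, which is why matching is costly."""
    for _ in range(max_iter):
        mid = (lo + hi) / 2
        r = rate_fn(mid)
        if abs(r - target_bpp) < tol:
            return mid
        if r < target_bpp:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2
```

Since every probe requires running the codec, reducing the number of probes (or the cost of each) is exactly where the paper's reported four- to sixfold acceleration comes from.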
Currently, there is a high demand for neural network-based image compression codecs. These codecs employ non-linear transforms to create compact bit representations and facilitate faster coding speeds on devices compared to the hand-crafted transforms used in classical frameworks. The scientific and industrial communities are highly interested in these properties, leading to the standardization effort of JPEG-AI. The JPEG-AI verification model has been released and is currently under development for standardization. Utilizing neural networks, it can outperform the classic codec VVC intra by over 10% BD-rate at the base operation point. Researchers attribute this success to the flexible bit distribution in the spatial domain, in contrast to VVC intra's anchor, which is generated with a constant quality point. However, our study reveals that VVC intra displays a more adaptable bit distribution structure through the implementation of various block sizes. As a result of our observations, we have proposed a spatial bit allocation method to optimize the JPEG-AI verification model's bit distribution and enhance the visual quality. Furthermore, by applying the VVC bit distribution strategy, the objective performance of the JPEG-AI verification model can be further improved, resulting in a maximum gain of 0.45 dB in PSNR-Y.
https://arxiv.org/abs/2402.17470
With the evolution of storage and communication protocols, ultra-low bitrate image compression has become a topic in high demand. However, existing compression algorithms must sacrifice either consistency with the ground truth or perceptual quality at ultra-low bitrates. In recent years, the rapid development of the Large Multimodal Model (LMM) has made it possible to balance these two goals. To solve this problem, this paper proposes a method called Multimodal Image Semantic Compression (MISC), which consists of an LMM encoder for extracting the semantic information of the image, a map encoder to locate the regions corresponding to the semantics, an image encoder that generates an extremely compressed bitstream, and a decoder that reconstructs the image based on the above information. Experimental results show that our proposed MISC is suitable for compressing both traditional Natural Sense Images (NSIs) and emerging AI-Generated Images (AIGIs) content. It can achieve optimal consistency and perception results while saving 50% bitrate, which has strong potential applications in the next generation of storage and communication. The code will be released on this https URL.
https://arxiv.org/abs/2402.16749
Representing the Neural Radiance Field (NeRF) with an explicit voxel grid (EVG) is a promising direction for improving NeRFs. However, the EVG representation is not efficient for storage and transmission because of its enormous memory cost. Current methods for compressing EVGs mainly inherit the methods designed for neural network compression, such as pruning and quantization, which do not take full advantage of the spatial correlation of voxels. Inspired by mature digital image compression techniques, this paper proposes SPC-NeRF, a novel framework applying spatial predictive coding to EVG compression. The proposed framework can remove spatial redundancy efficiently for better compression performance. Moreover, we model the bitrate and design a novel form of loss function, with which we can jointly optimize compression ratio and distortion to achieve higher coding efficiency. Extensive experiments demonstrate that our method can achieve 32% bit saving compared to the state-of-the-art method VQRF on multiple representative test datasets, with comparable training time.
https://arxiv.org/abs/2402.16366
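The core of spatial predictive coding is to predict each sample from already-decoded neighbors and code only the residual, which has much lower entropy when neighbors are correlated. The sketch below uses a minimal 1-D previous-sample predictor; SPC-NeRF uses richer neighborhood predictors over the 3-D voxel grid, so this is an illustration of the principle only.

```python
def predictive_encode(values):
    """Code each sample as the residual against the previous sample
    (a minimal 1-D spatial predictor)."""
    residuals, prev = [], 0
    for v in values:
        residuals.append(v - prev)
        prev = v
    return residuals

def predictive_decode(residuals):
    """Lossless inverse: accumulate residuals back into samples."""
    out, prev = [], 0
    for r in residuals:
        prev = prev + r
        out.append(prev)
    return out
```

On spatially smooth data the residuals are small and concentrated near zero, so an entropy coder spends far fewer bits on them than on the raw values.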