This paper presents a Transformer-based image compression system that allows for a variable image quality objective according to the user's preference. Optimizing a learned codec for different quality objectives leads to reconstructed images with varying visual characteristics. Our method provides the user with the flexibility to choose a trade-off between two image quality objectives using a single, shared model. Motivated by the success of prompt-tuning techniques, we introduce prompt tokens to condition our Transformer-based autoencoder. These prompt tokens are generated adaptively based on the user's preference and input image through learning a prompt generation network. Extensive experiments on commonly used quality metrics demonstrate the effectiveness of our method in adapting the encoding and/or decoding processes to a variable quality objective. While offering the additional flexibility, our proposed method performs comparably to the single-objective methods in terms of rate-distortion performance.
https://arxiv.org/abs/2309.12717
We study neural image compression based on the Sparse Visual Representation (SVR), where images are embedded into a discrete latent space spanned by learned visual codebooks. By sharing codebooks with the decoder, the encoder transfers integer codeword indices that are efficient and robust across platforms, and the decoder retrieves the embedded latent feature using the indices for reconstruction. Previous SVR-based compression lacks an effective mechanism for rate-distortion tradeoffs: one can pursue either high reconstruction quality or low transmission bitrate, but not a balance between them. We propose a Masked Adaptive Codebook learning (M-AdaCode) method that applies masks to the latent feature subspace to balance bitrate and reconstruction quality. A set of semantic-class-dependent basis codebooks is learned, and these are combined by weighting to generate a rich latent feature for high-quality reconstruction. The combining weights are adaptively derived from each input image, providing fidelity information at additional transmission cost. By masking out unimportant weights in the encoder and recovering them in the decoder, we can trade off reconstruction quality for transmission bits, and the masking rate controls the balance between bitrate and distortion. Experiments over the standard JPEG-AI dataset demonstrate the effectiveness of our M-AdaCode approach.
https://arxiv.org/abs/2309.11661
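The masking idea in the M-AdaCode abstract above can be illustrated with a minimal sketch. It assumes each basis codebook has already produced one latent feature vector and that "unimportant" means smallest in magnitude; the function name `combine_codebooks` and the zero-fill of masked weights are hypothetical choices for illustration. The abstract states that the decoder recovers the masked weights, which this sketch does not attempt.

```python
def combine_codebooks(features, weights, mask_rate):
    """Weighted sum of per-codebook latent features with the smallest weights masked.

    features:  list of K feature vectors (one per basis codebook), all the same length.
    weights:   list of K combining weights derived from the input image.
    mask_rate: fraction in [0, 1] of weights to mask, smallest magnitude first.
    Returns (combined_feature, kept_weights).
    """
    k = len(weights)
    n_masked = int(mask_rate * k)
    # Sort indices by increasing weight magnitude; the first n_masked are masked.
    order = sorted(range(k), key=lambda i: abs(weights[i]))
    dropped = set(order[:n_masked])
    # Assumption: a masked weight is simply replaced by zero (no decoder-side recovery).
    kept = [0.0 if i in dropped else w for i, w in enumerate(weights)]
    dim = len(features[0])
    combined = [sum(kept[i] * features[i][d] for i in range(k)) for d in range(dim)]
    return combined, kept


if __name__ == "__main__":
    feats = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [2.0, 2.0]]
    w = [0.5, 0.125, 0.25, 0.125]
    print(combine_codebooks(feats, w, mask_rate=0.5))
```

A higher `mask_rate` means fewer weights to transmit and a coarser combined feature, which is the bitrate versus distortion knob the abstract describes.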
Transform and entropy models are the two core components in deep image compression neural networks. Most existing learning-based image compression methods utilize convolution-based transforms, which lack the ability to model long-range dependencies, primarily due to the limited receptive field of the convolution operation. To address this limitation, we propose a Transformer-based nonlinear transform. This transform efficiently captures both local and global information from the input image, leading to a more decorrelated latent representation. In addition, we introduce a novel entropy model that incorporates two different hyperpriors to model cross-channel and spatial dependencies of the latent representation. To further improve the entropy model, we add a global context that leverages distant relationships to predict the current latent more accurately. This global context employs a causal attention mechanism to extract long-range information in a content-dependent manner. Our experiments show that our proposed framework performs better than the state-of-the-art methods in terms of rate-distortion performance.
https://arxiv.org/abs/2309.10799
Missions studying the dynamic behaviour of the Sun are designed to capture multi-spectral images of the Sun and transmit them to the ground station on a daily basis. To make transmission efficient and feasible, image compression systems need to be exploited. Recently, successful end-to-end optimized neural network-based image compression systems have shown great potential to be used in an ad-hoc manner. In this work, we propose a transformer-based multi-spectral neural image compressor to efficiently capture both intra- and inter-wavelength redundancies. To unleash the locality of the window-based self-attention mechanism, we propose an inter-window aggregated token multi-head self-attention. Additionally, to make the neural compressor autoencoder shift invariant, a randomly shifted window attention mechanism is used, which makes the transformer blocks insensitive to translations in their input domain. We demonstrate that the proposed approach not only outperforms conventional compression algorithms but also better decorrelates images along the multiple wavelengths compared to single-spectral compression.
https://arxiv.org/abs/2309.10791
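The randomly shifted window mechanism mentioned in the multi-spectral compression abstract above can be sketched as follows. This is an illustration only: it assumes the shift is cyclic (as in Swin-style shifted windows), that the grid size is divisible by the window size, and that the offset is drawn uniformly per call. The name `shifted_windows` is hypothetical, and the attention computation itself is omitted.

```python
import random


def shifted_windows(grid, window, shift=None, rng=random):
    """Cyclically shift a 2D token grid, then split it into non-overlapping windows.

    grid:   list of rows (H x W), with H and W divisible by `window`.
    window: side length of each square window.
    shift:  optional (dy, dx) offset; drawn at random when omitted.
    Returns (list_of_windows, shift_used).
    """
    h, w = len(grid), len(grid[0])
    if shift is None:
        # Assumption: offsets are sampled uniformly in [0, window).
        shift = (rng.randrange(window), rng.randrange(window))
    dy, dx = shift
    # Cyclic roll: the token at (y + dy, x + dx) moves to position (y, x).
    rolled = [[grid[(y + dy) % h][(x + dx) % w] for x in range(w)] for y in range(h)]
    windows = []
    for y0 in range(0, h, window):
        for x0 in range(0, w, window):
            windows.append([row[x0:x0 + window] for row in rolled[y0:y0 + window]])
    return windows, shift


if __name__ == "__main__":
    tokens = [[4 * y + x for x in range(4)] for y in range(4)]
    wins, used = shifted_windows(tokens, window=2)
    print(used, wins)
```

Because the window boundaries fall in a different place on each call, no fixed pair of neighbouring tokens is always separated by a boundary, which is one way to read the shift-invariance argument in the abstract.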
DNA exhibits remarkable potential as a data storage solution due to its impressive storage density and long-term stability, stemming from its inherent biomolecular structure. However, developing this novel medium comes with its own set of challenges, particularly in addressing errors arising from storage and biological manipulations. These challenges are further shaped by the structural constraints of DNA sequences and cost considerations. In response to these limitations, we have pioneered a novel compression scheme and a cutting-edge Multiple Description Coding (MDC) technique utilizing neural networks for DNA data storage. Our MDC method introduces an innovative approach to encoding data into DNA, specifically designed to withstand errors effectively. Notably, our new compression scheme outperforms classic image compression methods for DNA data storage. Furthermore, our approach exhibits superiority over conventional MDC methods reliant on auto-encoders. Its distinctive strengths lie in its ability to bypass the need for extensive model training and its enhanced adaptability for fine-tuning redundancy levels. Experimental results demonstrate that our solution competes favorably with the latest DNA data storage methods in the field, offering superior compression rates and robust noise resilience.
https://arxiv.org/abs/2309.06956
Recently, neural network (NN)-based image compression has been actively studied and has shown impressive performance in comparison to traditional methods. However, most works have focused on non-scalable image compression (single-layer coding), while spatially scalable image compression has drawn less attention although it has many applications. In this paper, we propose a novel NN-based spatially scalable image compression method, called COMPASS, which supports arbitrary-scale spatial scalability. Our proposed COMPASS has a very flexible structure where the number of layers and their respective scale factors can be arbitrarily determined during inference. To reduce the spatial redundancy between adjacent layers for arbitrary scale factors, our COMPASS adopts an inter-layer arbitrary scale prediction method, called LIFF, based on implicit neural representation. We propose a combined RD loss function to effectively train multiple layers. Experimental results show that our COMPASS achieves BD-rate gains of up to -58.33% and -47.17% compared to SHVC and the state-of-the-art NN-based spatially scalable image compression method, respectively, for various combinations of scale factors. Our COMPASS also shows comparable or even better coding efficiency than single-layer coding for various scale factors.
https://arxiv.org/abs/2309.07926
Transformer-based methods have shown impressive performance in image restoration tasks, such as image super-resolution and denoising. However, we find that these networks can only utilize a limited spatial range of input information through attribution analysis. This implies that the potential of Transformer is still not fully exploited in existing networks. In order to activate more input pixels for better restoration, we propose a new Hybrid Attention Transformer (HAT). It combines both channel attention and window-based self-attention schemes, thus making use of their complementary advantages. Moreover, to better aggregate the cross-window information, we introduce an overlapping cross-attention module to enhance the interaction between neighboring window features. In the training stage, we additionally adopt a same-task pre-training strategy to further exploit the potential of the model for further improvement. Extensive experiments have demonstrated the effectiveness of the proposed modules. We further scale up the model to show that the performance of the SR task can be greatly improved. Besides, we extend HAT to more image restoration applications, including real-world image super-resolution, Gaussian image denoising and image compression artifacts reduction. Experiments on benchmark and real-world datasets demonstrate that our HAT achieves state-of-the-art performance both quantitatively and qualitatively. Codes and models are publicly available at this https URL.
https://arxiv.org/abs/2309.05239
With neural networks growing deeper and feature maps growing larger, limited communication bandwidth with external memory (or DRAM) and power constraints become a bottleneck in implementing network inference on mobile and edge devices. In this paper, we propose an end-to-end differentiable, bandwidth-efficient neural inference method in which activations are compressed by a neural data compression method. Specifically, we propose a transform-quantization-entropy coding pipeline for activation compression with symmetric exponential Golomb coding and a data-dependent Gaussian entropy model for arithmetic coding. Optimized with existing model quantization methods, the low-level task of image compression can achieve up to 19x bandwidth reduction with 6.21x energy saving.
https://arxiv.org/abs/2309.02855
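The activation-compression abstract above names symmetric exponential Golomb coding without defining it. As a hedged illustration, the sketch below implements ordinary order-0 exponential-Golomb coding with the usual signed-to-unsigned mapping (0, 1, -1, 2, -2 map to 0, 1, 2, 3, 4). Whether this matches the paper's "symmetric" variant is an assumption, and the function names are hypothetical.

```python
def signed_to_unsigned(v):
    """Map a signed integer to a non-negative one: 0, 1, -1, 2, -2 -> 0, 1, 2, 3, 4."""
    return 2 * v - 1 if v > 0 else -2 * v


def exp_golomb_encode(v):
    """Order-0 exponential-Golomb codeword of a signed integer, as a bit string."""
    n = signed_to_unsigned(v) + 1
    bits = bin(n)[2:]
    # Prefix of (len(bits) - 1) zeros tells the decoder how many bits follow.
    return "0" * (len(bits) - 1) + bits


def exp_golomb_decode(code):
    """Decode one codeword from the start of a bit string.

    Returns (value, number_of_bits_consumed).
    """
    zeros = 0
    while code[zeros] == "0":
        zeros += 1
    length = 2 * zeros + 1
    u = int(code[zeros:length], 2) - 1
    # Invert the signed mapping: odd codes are positive, even codes are zero or negative.
    v = (u + 1) // 2 if u % 2 else -(u // 2)
    return v, length


if __name__ == "__main__":
    for value in (0, 1, -1, 2, -2, 7):
        print(value, exp_golomb_encode(value))
```

Small magnitudes get short codewords, which suits quantized activations that cluster around zero.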
We introduce EGIC, a novel generative image compression method that allows traversing the distortion-perception curve efficiently from a single model. Specifically, we propose an implicitly encoded variant of image interpolation that predicts the residual between an MSE-optimized and a GAN-optimized decoder output. On the receiver side, the user can then control the impact of the residual on the GAN-based reconstruction. Together with improved GAN-based building blocks, EGIC outperforms a wide variety of perception-oriented and distortion-oriented baselines, including HiFiC, MRIC and DIRAC, while performing almost on par with VTM-20.0 on the distortion end. EGIC is simple to implement, very lightweight (e.g. 0.18x model parameters compared to HiFiC) and provides excellent interpolation characteristics, which makes it a promising candidate for practical applications targeting the low bit range.
https://arxiv.org/abs/2309.03244
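The receiver-side control described in the EGIC abstract above can be sketched as a scaled residual added to the GAN-based reconstruction. This assumes a linear blend with the residual defined as MSE output minus GAN output, so that a factor of 0 gives the GAN reconstruction and 1 gives the MSE one. The sign convention, the scalar `alpha` and the function name are illustrative assumptions rather than details taken from the paper.

```python
def blend_reconstruction(x_gan, residual, alpha):
    """Return x_gan + alpha * residual, element-wise.

    x_gan:    GAN-optimized decoder output, flattened to a list of floats.
    residual: predicted difference (assumed here to be x_mse - x_gan), same length.
    alpha:    user-chosen factor; 0 keeps the GAN output, 1 reaches the MSE output.
    """
    return [g + alpha * r for g, r in zip(x_gan, residual)]


if __name__ == "__main__":
    x_gan = [0.5, 1.0]
    x_mse = [0.25, 0.5]
    residual = [m - g for m, g in zip(x_mse, x_gan)]
    for alpha in (0.0, 0.5, 1.0):
        print(alpha, blend_reconstruction(x_gan, residual, alpha))
```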
The vulnerabilities to backdoor attacks have recently threatened the trustworthiness of machine learning models in practical applications. Conventional wisdom suggests that not everyone can be an attacker since the process of designing the trigger generation algorithm often involves significant effort and extensive experimentation to ensure the attack's stealthiness and effectiveness. Alternatively, this paper shows that there exists a more severe backdoor threat: anyone can exploit an easily-accessible algorithm for silent backdoor attacks. Specifically, this attacker can employ the widely-used lossy image compression from a plethora of compression tools to effortlessly inject a trigger pattern into an image without leaving any noticeable trace; i.e., the generated triggers are natural artifacts. One does not require extensive knowledge to click on the "convert" or "save as" button while using tools for lossy image compression. Via this attack, the adversary does not need to design a trigger generator as seen in prior works and only requires poisoning the data. Empirically, the proposed attack consistently achieves 100% attack success rate in several benchmark datasets such as MNIST, CIFAR-10, GTSRB and CelebA. More significantly, the proposed attack can still achieve almost 100% attack success rate with very small (approximately 10%) poisoning rates in the clean label setting. The generated trigger of the proposed attack using one lossy compression algorithm is also transferable across other related compression algorithms, exacerbating the severity of this backdoor threat. This work takes another crucial step toward understanding the extensive risks of backdoor attacks in practice, urging practitioners to investigate similar attacks and relevant backdoor mitigation methods.
https://arxiv.org/abs/2308.16684
Neural networks have dramatically increased our capacity to learn from large, high-dimensional datasets across innumerable disciplines. However, their decisions are not easily interpretable, their computational costs are high, and building and training them are uncertain processes. To add structure to these efforts, we derive new mathematical results to efficiently measure the changes in entropy as fully-connected and convolutional neural networks process data, and introduce entropy-based loss terms. Experiments in image compression and image classification on benchmark datasets demonstrate these losses guide neural networks to learn rich latent data representations in fewer dimensions, converge in fewer training epochs, and achieve better test metrics.
https://arxiv.org/abs/2308.14938
Compression technology is essential for efficient image transmission and storage. With the rapid advances in deep learning, images are beginning to be used for image recognition as well as for human vision. For this reason, research has been conducted on image coding for image recognition, and this field is called Image Coding for Machines (ICM). There are two main approaches in ICM: the ROI-based approach and the task-loss-based approach. The former approach has the problem of requiring an ROI-map as input in addition to the input image. The latter approach has the problems of difficulty in learning the task-loss, and lack of robustness because the specific image recognition model is used to compute the loss function. To solve these problems, we propose an image compression model that learns object regions. Our model does not require additional information as input, such as an ROI-map, and does not use task-loss. Therefore, it is possible to compress images for various image recognition models. In the experiments, we demonstrate the versatility of the proposed method by using three different image recognition models and three different datasets. In addition, we verify the effectiveness of our model by comparing it with previous methods.
https://arxiv.org/abs/2308.13984
Learned image compression methods have shown superior rate-distortion performance and remarkable potential compared to traditional compression methods. Most existing learned approaches use stacked convolution or window-based self-attention for transform coding, which aggregate spatial information in a fixed range. In this paper, we focus on extending spatial aggregation capability and propose a dynamic kernel-based transform coding. The proposed adaptive aggregation generates kernel offsets to capture valid information in a content-conditioned range to aid the transform. With the adaptive aggregation strategy and the weight-sharing mechanism, our method can achieve promising transform capability with acceptable model complexity. In addition, following recent progress in entropy modeling, we define a generalized coarse-to-fine entropy model that considers the coarse global context, the channel-wise context, and the spatial context. Based on it, we introduce a dynamic kernel in the hyper-prior to generate a more expressive global context. Furthermore, we propose an asymmetric spatial-channel entropy model based on an investigation of the spatial characteristics of the grouped latents. The asymmetric entropy model aims to reduce statistical redundancy while maintaining coding efficiency. Experimental results demonstrate that our method achieves superior rate-distortion performance on three benchmarks compared to the state-of-the-art learning-based methods.
https://arxiv.org/abs/2308.08723
Thriving underwater applications demand efficient extreme compression technology to realize the transmission of underwater images (UWIs) in very narrow underwater bandwidth. However, existing image compression methods achieve inferior performance on UWIs because they do not consider the characteristics of UWIs: (1) Multifarious underwater styles of color shift and distance-dependent clarity, caused by the unique underwater physical imaging; (2) Massive redundancy between different UWIs, caused by the fact that different UWIs contain several common ocean objects, which have plenty of similarities in structures and semantics. To remove redundancy among UWIs, we first construct an exhaustive underwater multi-scale feature dictionary to provide coarse-to-fine reference features for UWI compression. Subsequently, an extreme UWI compression network with reference to the feature dictionary (RFD-ECNet) is creatively proposed, which utilizes feature match and reference feature variant to significantly remove redundancy among UWIs. To align the multifarious underwater styles and improve the accuracy of feature match, an underwater style normalized block (USNB) is proposed, which utilizes underwater physical priors extracted from the underwater physical imaging model to normalize the underwater styles of dictionary features toward the input. Moreover, a reference feature variant module (RFVM) is designed to adaptively morph the reference features, improving the similarity between the reference and input features. Experimental results on four UWI datasets show that our RFD-ECNet is the first work that achieves a significant BD-rate saving of 31% over the most advanced VVC.
https://arxiv.org/abs/2308.08721
We propose conditional perceptual quality, an extension of the perceptual quality defined in \citet{blau2018perception}, by conditioning it on user-defined information. Specifically, we extend the original perceptual quality $d(p_{X},p_{\hat{X}})$ to the conditional perceptual quality $d(p_{X|Y},p_{\hat{X}|Y})$, where $X$ is the original image, $\hat{X}$ is the reconstructed image, $Y$ is side information defined by the user and $d(.,.)$ is a divergence. We show that conditional perceptual quality has theoretical properties similar to those of the rate-distortion-perception trade-off \citep{blau2019rethinking}. Based on these theoretical results, we propose an optimal framework for conditional perceptual quality preserving compression. Experimental results show that our codec successfully maintains high perceptual quality and semantic quality at all bitrates. In addition, by providing a lower bound on the common randomness required, we settle the previous arguments on whether randomness should be incorporated into the generator for (conditional) perceptual quality compression. The source code is provided in supplementary material.
https://arxiv.org/abs/2308.08154
The latest advancements in neural image compression show great potential in surpassing the rate-distortion performance of conventional standard codecs. Nevertheless, there exists an indelible domain gap between the datasets utilized for training (i.e., natural images) and those utilized for inference (e.g., artistic images). Our proposal involves a low-rank adaptation approach aimed at addressing the rate-distortion drop observed in out-of-domain datasets. Specifically, we perform low-rank matrix decomposition to update certain adaptation parameters of the client's decoder. These updated parameters, along with image latents, are encoded into a bitstream and transmitted to the decoder in practical scenarios. Due to the low-rank constraint imposed on the adaptation parameters, the resulting bit rate overhead is small. Furthermore, the bit rate allocation of low-rank adaptation is \emph{non-trivial}, since diverse inputs require varying adaptation bitstreams. We thus introduce a dynamic gating network on top of the low-rank adaptation method to decide which decoder layers should employ adaptation. The dynamic adaptation network is optimized end-to-end using a rate-distortion loss. Our proposed method exhibits universality across diverse image datasets. Extensive results demonstrate that this paradigm significantly mitigates the domain gap, surpassing non-adaptive methods with an average BD-rate improvement of approximately $19\%$ across out-of-domain images. Furthermore, it outperforms the most advanced instance adaptive methods by roughly $5\%$ BD-rate. Ablation studies confirm our method's ability to universally enhance various image compression architectures.
https://arxiv.org/abs/2308.07733
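The low-rank adaptation abstract above argues that the bit rate overhead stays small because of the low-rank constraint. A minimal sketch of that reasoning, assuming the update takes the common form W' = W + A B with A of shape m x r and B of shape r x n: only r(m + n) numbers need to be sent instead of m n. The helper names and the 192 x 192 layer size below are hypothetical.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    cols = list(zip(*b))
    return [[sum(x * y for x, y in zip(row, col)) for col in cols] for row in a]


def adapt_weight(w, a, b):
    """Return W + A @ B, where A is m x r and B is r x n (rank-r update)."""
    delta = matmul(a, b)
    return [[wij + dij for wij, dij in zip(w_row, d_row)]
            for w_row, d_row in zip(w, delta)]


def overhead_ratio(m, n, r):
    """Fraction of parameters transmitted for a rank-r update versus a full m x n update."""
    return r * (m + n) / (m * n)


if __name__ == "__main__":
    w = [[1, 0], [0, 1]]
    a = [[1], [2]]   # m x r with r = 1
    b = [[3, 4]]     # r x n
    print(adapt_weight(w, a, b))
    # Hypothetical layer size, chosen only to show the scale of the saving.
    print(overhead_ratio(192, 192, 2))
```

For the hypothetical 192 x 192 layer with rank 2, the update carries 1/48 of the parameters of a full update.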
The design of a neural image compression network is governed by how well the entropy model matches the true distribution of the latent code. Apart from the model capacity, this ability is indirectly under the effect of how close the relaxed quantization is to the actual hard quantization. Optimizing the parameters of a rate-distortion variational autoencoder (R-D VAE) is ruled by this approximated quantization scheme. In this paper, we propose a feature-level frequency disentanglement to help the relaxed scalar quantization achieve lower bit rates by guiding the high entropy latent features to include most of the low-frequency texture of the image. In addition, to strengthen the de-correlating power of the transformer-based analysis/synthesis transform, an augmented self-attention score calculation based on the Hadamard product is utilized during both encoding and decoding. Channel-wise autoregressive entropy modeling takes advantage of the proposed frequency separation as it inherently directs high-informational low-frequency channels to the first chunks and conditions the future chunks on it. The proposed network not only outperforms hand-engineered codecs, but also neural network-based codecs built on computation-heavy spatially autoregressive entropy models.
https://arxiv.org/abs/2308.02620
Recently, a multi-reference entropy model has been proposed, which captures channel-wise, local spatial, and global spatial correlations. Previous works adopt attention to capture global correlations; however, its quadratic complexity limits the potential of high-resolution image coding. In this paper, we propose capturing global correlations with linear complexity via a decomposition of the softmax operation. Based on it, we propose MLIC$^{++}$, a learned image compression method with linear complexity for multi-reference entropy modeling. Our MLIC$^{++}$ is more efficient and reduces BD-rate by 12.44% on the Kodak dataset compared to VTM-17.0 when measured in PSNR. Code will be available at this https URL.
https://arxiv.org/abs/2307.15421
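The MLIC++ abstract above attributes linear complexity to a decomposition of the softmax operation. One well-known way to do this, shown here only as a sketch, is to apply softmax separately to the queries (over features) and to the keys (over positions) and then multiply keys with values first. That this is the exact factorization used in the paper is an assumption, and all function names are hypothetical.

```python
import math


def softmax(xs):
    """Numerically stable softmax of a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def transpose(m):
    return [list(col) for col in zip(*m)]


def matmul(a, b):
    cols = transpose(b)
    return [[sum(x * y for x, y in zip(row, col)) for col in cols] for row in a]


def normalize(q, k):
    """Softmax each query over features and each key feature over positions."""
    q_n = [softmax(row) for row in q]
    k_n = transpose([softmax(col) for col in transpose(k)])
    return q_n, k_n


def linear_attention(q, k, v):
    """q, k: n x d; v: n x e. Cost grows linearly in the sequence length n."""
    q_n, k_n = normalize(q, k)
    context = matmul(transpose(k_n), v)  # d x e summary of all positions
    return matmul(q_n, context)          # n x e


def quadratic_reference(q, k, v):
    """Same result computed through the explicit n x n attention matrix."""
    q_n, k_n = normalize(q, k)
    return matmul(matmul(q_n, transpose(k_n)), v)


if __name__ == "__main__":
    q = [[0.1, 0.2], [0.3, -0.1], [0.0, 0.5]]
    k = [[0.2, 0.1], [-0.3, 0.4], [0.5, 0.0]]
    v = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
    print(linear_attention(q, k, v))
```

The two functions agree because matrix multiplication is associative; the saving comes from never forming the n x n matrix, which matters when n is the number of latent positions in a high-resolution image.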
Accurate navigation is of paramount importance to ensure flight safety and efficiency for autonomous drones. Recent research has started to use Deep Neural Networks (DNNs) to enhance drone navigation given their remarkable predictive capability for visual perception. However, existing solutions either run DNN inference tasks on drones in situ, impeded by the limited onboard resources, or offload the computation to external servers, which may incur large network latency. Few works consider jointly optimizing the offloading decisions along with image transmission configurations and adapting them on the fly. In this paper, we propose A3D, an edge server assisted drone navigation framework that can dynamically adjust task execution location, input resolution, and image compression ratio in order to achieve low inference latency, high prediction accuracy, and long flight distances. Specifically, we first augment state-of-the-art convolutional neural networks for drone navigation and define a novel metric called Quality of Navigation as our optimization objective, which can effectively capture the above goals. We then design a deep reinforcement learning based neural scheduler at the drone side, for which an information encoder is devised to reshape the state features and thus improve its learning ability. To further support simultaneous multi-drone serving, we extend the edge server design by developing a network-aware resource allocation algorithm, which allows provisioning containerized resources aligned with drones' demand. We finally implement a proof-of-concept prototype with realistic devices and validate its performance in a real-world campus scene, as well as in a simulation environment for thorough evaluation upon AirSim. Extensive experimental results show that A3D can reduce end-to-end latency by 28.06% and extend the flight distance by up to 27.28% compared with non-adaptive solutions.
https://arxiv.org/abs/2307.09880
In this paper, we present ECSIC, a novel learned method for stereo image compression. Our proposed method compresses the left and right images in a joint manner by exploiting the mutual information between the images of the stereo image pair using a novel stereo cross attention (SCA) module and two stereo context modules. The SCA module performs cross-attention restricted to the corresponding epipolar lines of the two images and processes them in parallel. The stereo context modules improve the entropy estimation of the second encoded image by using the first image as a context. We conduct an extensive ablation study demonstrating the effectiveness of the proposed modules and a comprehensive quantitative and qualitative comparison with existing methods. ECSIC achieves state-of-the-art performance among stereo image compression models on the two popular stereo image datasets Cityscapes and InStereo2k while allowing for fast encoding and decoding, making it highly practical for real-time applications.
https://arxiv.org/abs/2307.10284