In Learned Image Compression (LIC), a model is trained to encode and decode images sampled from a source domain, often outperforming traditional codecs on natural images; yet its performance may be far from optimal on images sampled from different domains. In this work, we tackle the problem of adapting a pre-trained model to multiple target domains by plugging an adapter module into the decoder for each of them, including the source domain. Each adapter improves the decoder's performance on a specific domain without the model forgetting the images seen at training time. A gate network computes the weights that optimally blend the contributions of the adapters when the bitstream is decoded. We experimentally validate our method on two state-of-the-art pre-trained models, observing improved rate-distortion efficiency on the target domains without penalties on the source domain. Furthermore, the gate's ability to find similarities with the learned target domains also improves encoding efficiency for images outside them.
https://arxiv.org/abs/2404.15591
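The adapter-and-gate mechanism described above can be sketched in a few lines. This is an illustrative NumPy toy, not the authors' implementation: the linear adapter shape, the pooled gate input, and all names (`GatedAdapterDecoder`, `gate_w`) are assumptions; real adapters would be small modules trained one per domain inside the decoder.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

class GatedAdapterDecoder:
    """Toy stand-in for a decoder stage with per-domain adapters and a gate."""

    def __init__(self, n_adapters, dim, seed=0):
        rng = np.random.default_rng(seed)
        # Each adapter is a tiny linear map here; in the paper's setting these
        # would be learned per-domain modules plugged into the decoder.
        self.adapters = [rng.standard_normal((dim, dim)) * 0.01
                         for _ in range(n_adapters)]
        self.gate_w = rng.standard_normal((dim, n_adapters)) * 0.01

    def forward(self, feat):
        # The gate sees a pooled summary of the feature map and emits
        # normalized blending weights, one per domain adapter.
        w = softmax(feat.mean(axis=0) @ self.gate_w)          # (n_adapters,)
        correction = sum(wi * (feat @ A) for wi, A in zip(w, self.adapters))
        return feat + correction, w

dec = GatedAdapterDecoder(n_adapters=3, dim=8)
out, w = dec.forward(np.ones((4, 8)))
```

The key property is that the gate output is a convex combination, so the decoder degrades gracefully on images that resemble a mix of the learned domains.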
This paper investigates the challenging problem of learned image compression (LIC) at extremely low bitrates. Previous LIC methods that transmit quantized continuous features often yield blurry and noisy reconstructions due to severe quantization loss, while LIC methods based on learned codebooks that discretize the visual space usually give poor-fidelity reconstructions, since the limited codewords have insufficient representational power to capture faithful details. We propose a novel dual-stream framework, HybridFlow, which combines a continuous-feature-based stream and a codebook-based stream to achieve both high perceptual quality and high fidelity at extremely low bitrates. The codebook-based stream benefits from high-quality learned codebook priors to provide quality and clarity in the reconstructed images, while the continuous-feature stream aims to preserve fidelity details. To reach ultra-low bitrates, a masked token-based transformer is further proposed: we transmit only a masked portion of the codeword indices and recover the missing indices through token generation guided by information from the continuous-feature stream. We also develop a bridging correction network that merges the two streams during pixel decoding for the final image reconstruction, where the continuous-stream features rectify biases of the codebook-based pixel decoder to impose faithful details on the reconstruction. Experimental results demonstrate superior performance across several datasets at extremely low bitrates, compared with existing single-stream codebook-based or continuous-feature-based LIC methods.
https://arxiv.org/abs/2404.13372
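The masked-token transmission idea can be illustrated with a toy sketch. Everything below is an assumption for illustration: the `-1` mask marker, the 25% keep rate, and the trivial fill-in predictor stand in for the paper's token-generation transformer conditioned on the continuous stream.

```python
import numpy as np

rng = np.random.default_rng(0)
codebook_size = 16
tokens = rng.integers(0, codebook_size, size=64)   # full codeword-index map
keep = rng.random(64) < 0.25                       # transmit only ~25%
transmitted = np.where(keep, tokens, -1)           # -1 marks a masked slot

def recover(partial):
    # Stand-in predictor: fill masked slots with the most frequent kept
    # token. The actual method generates missing tokens with a transformer
    # guided by the continuous feature stream.
    kept = partial[partial >= 0]
    fill = np.bincount(kept, minlength=codebook_size).argmax()
    return np.where(partial >= 0, partial, fill)

recovered = recover(transmitted)
```

The bitrate saving comes directly from transmitting only the kept fraction of indices; reconstruction quality then hinges on how well the generator predicts the masked ones.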
To reduce network traffic and support environments with limited resources, methods for transmitting images with small amounts of data are required. Machine-learning-based image compression methods, which compress the data size of images while maintaining their features, have been proposed. However, in certain situations, reconstructing only part of the semantic information of an image at the receiver may be sufficient. To realize this concept, semantic-information-based communication, called semantic communication, has been proposed, along with an image transmission method based on it. This method transmits only the semantic information of an image, and the receiver reconstructs the image using an image-generation model. The method relies on a single type of semantic information, however, and reconstructing images similar to the original from it alone is challenging. This study proposes a multi-modal image transmission method that leverages diverse semantic information for efficient semantic communication. The proposed method extracts multi-modal semantic information from an image and transmits only that information. The receiver then generates multiple images using an image-generation model and selects an output based on semantic similarity. The receiver must make this selection based only on the received features, yet evaluating semantic similarity with conventional metrics is difficult. This study therefore explores new metrics for evaluating the similarity between semantic features of images and proposes two scoring procedures. The results indicate that the proposed procedures can compare semantic similarities, such as position and composition, between the semantic features of the original and generated images. The proposed method can thus facilitate the transmission and utilization of photographs over mobile networks for various service applications.
https://arxiv.org/abs/2404.11280
The burgeoning volume of digital content across diverse modalities necessitates efficient storage and retrieval methods. Conventional approaches struggle to cope with the escalating complexity and scale of multimedia data. In this paper, our proposed framework addresses this challenge by fusing AI-native multi-modal search capabilities with neural image compression. We first analyze the intricate relationship between compressibility and searchability, recognizing the pivotal role each plays in the efficiency of storage and retrieval systems. A simple adapter then bridges the features of Learned Image Compression (LIC) and Contrastive Language-Image Pretraining (CLIP), retaining semantic fidelity and enabling retrieval of multi-modal data. Experimental evaluations on the Kodak dataset demonstrate the efficacy of our approach, showcasing significant improvements in compression efficiency and search accuracy compared to existing methodologies. Our work marks a significant advancement towards scalable and efficient multi-modal search systems in the era of big data.
https://arxiv.org/abs/2404.10234
Incorporating diffusion models in the image compression domain has the potential to produce realistic and detailed reconstructions, especially at extremely low bitrates. Previous methods focus on using diffusion models as expressive decoders robust to quantization errors in the conditioning signals, yet achieving competitive results in this manner requires costly training of the diffusion model and long inference times due to the iterative generative process. In this work we formulate the removal of quantization error as a denoising task, using diffusion to recover information lost in the transmitted image latent. Our approach allows us to run less than 10% of the full diffusion generative process and requires no architectural changes to the diffusion model, enabling the use of foundation models as a strong prior without additional fine-tuning of the backbone. Our proposed codec outperforms previous methods on quantitative realism metrics, and we verify that our reconstructions are qualitatively preferred by end users, even when other methods use twice the bitrate.
https://arxiv.org/abs/2404.08580
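The core idea, treating the quantized latent as a partially-noised sample and running only the tail of the denoising schedule, can be sketched with a toy linear schedule. The schedule, the step count, and the oracle denoiser below are illustrative assumptions only; the actual method predicts noise with a pretrained diffusion backbone.

```python
import numpy as np

T = 50                             # full schedule length
k = 4                              # steps actually run (< 10% of T)
alphas = np.linspace(0.99, 0.90, T)

def denoise_step(x, t, x_clean):
    # Oracle toy denoiser that nudges x toward the (here known) clean latent;
    # a real diffusion model would predict the noise with a network instead.
    return alphas[t] * x + (1 - alphas[t]) * x_clean

rng = np.random.default_rng(0)
x_clean = np.ones(8)                                   # "true" latent
x_noisy = x_clean + 0.3 * rng.standard_normal(8)       # quantization error
x = x_noisy.copy()
for t in range(T - k, T):                              # only the last k steps
    x = denoise_step(x, t, x_clean)
```

Running only the last few steps is what keeps inference cheap: the quantized latent is already close to the clean one, so most of the generative trajectory is unnecessary.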
Artificial intelligence (AI) and autonomous edge computing in space are emerging areas of interest to augment capabilities of nanosatellites, where modern sensors generate orders of magnitude more data than can typically be transmitted to mission control. Here, we present the hardware and software design of an onboard AI subsystem hosted on SpIRIT. The system is optimised for on-board computer vision experiments based on visible light and long wave infrared cameras. This paper highlights the key design choices made to maximise the robustness of the system in harsh space conditions, and their motivation relative to key mission requirements, such as limited compute resources, resilience to cosmic radiation, extreme temperature variations, distribution shifts, and very low transmission bandwidths. The payload, called Loris, consists of six visible light cameras, three infrared cameras, a camera control board and a Graphics Processing Unit (GPU) system-on-module. Loris enables the execution of AI models with on-orbit fine-tuning as well as a next-generation image compression algorithm, including progressive coding. This innovative approach not only enhances the data processing capabilities of nanosatellites but also lays the groundwork for broader applications to remote sensing from space.
https://arxiv.org/abs/2404.08399
Food image classification systems play a crucial role in health monitoring and diet tracking through image-based dietary assessment techniques. However, existing food recognition systems rely on static datasets characterized by a pre-defined, fixed number of food classes. This contrasts drastically with the reality of food consumption, where the data constantly changes. Food image classification systems should therefore adapt to and manage continuously evolving data, which is where continual learning plays an important role. A central challenge in continual learning is catastrophic forgetting, where ML models tend to discard old knowledge upon learning new information. While memory-replay algorithms have shown promise in mitigating this problem by storing old data as exemplars, they are hampered by the limited capacity of memory buffers, leading to an imbalance between new and previously learned data. To address this, our work explores the use of neural image compression to extend buffer size and enhance data diversity. We introduce the concept of continually learning a neural compression model to adaptively improve the quality of compressed data and optimize the bits per pixel (bpp) so as to store more exemplars. Our extensive experiments, including evaluations on the food-specific datasets Food-101 and VFN-74 as well as the general dataset ImageNet-100, demonstrate improvements in classification accuracy. This progress is pivotal in advancing more realistic food recognition systems that are capable of adapting to continually evolving data. Moreover, the principles and methodologies we have developed hold promise for broader applications, extending their benefits to other domains of continual machine learning systems.
https://arxiv.org/abs/2404.07507
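The buffer-capacity argument behind compressing exemplars is simple arithmetic: at a fixed memory budget, lower bits per pixel means more exemplars fit. The numbers below are purely illustrative, not taken from the paper.

```python
def exemplars_that_fit(budget_bytes, h, w, bpp):
    """How many h-by-w exemplars fit in a replay buffer at a given bpp."""
    bits_per_image = h * w * bpp
    return int(budget_bytes * 8 // bits_per_image)

budget = 50 * 1024 * 1024                          # hypothetical 50 MB buffer
raw = exemplars_that_fit(budget, 224, 224, 24.0)   # uncompressed 8-bit RGB
lic = exemplars_that_fit(budget, 224, 224, 0.5)    # neural codec at 0.5 bpp
```

The continual-compression idea in the abstract then amounts to trading a little per-exemplar quality (higher distortion at low bpp) for substantially more exemplar diversity in the same buffer.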
This study addresses the challenge of controlling, without training or fine-tuning, the global color aspect of images generated with a diffusion model. We rewrite the guidance equations to ensure that the outputs are closer to a known color map, without hindering the quality of the generation. Our method leads to new guidance equations. We show, in the color-guidance context, that the scaling of the guidance should not decrease but remain high throughout the diffusion process. As a second contribution, we apply our guidance in a compression framework, combining semantic and global color information about the image to decode images at low cost. We show that our method is effective at improving the fidelity and realism of compressed images at extremely low bit rates, compared to other classical or more semantically oriented approaches.
https://arxiv.org/abs/2404.06865
Image harmonization, which involves adjusting the foreground of a composite image to attain visual consistency with the background, can be conceptualized as an image-to-image translation task. Diffusion models have recently driven rapid progress in image-to-image translation. However, training diffusion models from scratch is computationally intensive, and fine-tuning pre-trained latent diffusion models entails dealing with the reconstruction error induced by the image-compression autoencoder, making them unsuitable for image-generation tasks that involve pixel-level evaluation metrics. To address these issues, we first adapt a pre-trained latent diffusion model to the image harmonization task to generate harmonious but potentially blurry initial images. We then apply two strategies, using higher-resolution images during inference and adding a refinement stage, to further enhance the clarity of the initially harmonized images. Extensive experiments on the iHarmony4 dataset demonstrate the superiority of our proposed method. The code and model will be made publicly available at this https URL.
https://arxiv.org/abs/2404.06139
The images produced by diffusion models can attain excellent perceptual quality. However, it is challenging for diffusion models to guarantee low distortion, so the integration of diffusion models and image compression models still requires more comprehensive exploration. This paper presents a diffusion-based image compression method that employs a privileged end-to-end decoder model as a correction, achieving better perceptual quality while bounding the distortion to an extent. We build a diffusion model and design a novel paradigm that combines the diffusion model with an end-to-end decoder, the latter responsible for transmitting privileged information extracted at the encoder side. Specifically, we theoretically analyze the reconstruction process of the diffusion model at the encoder side, where the original images are visible. Based on this analysis, we introduce an end-to-end convolutional decoder that provides a better approximation of the score function $\nabla_{\mathbf{x}_t}\log p(\mathbf{x}_t)$ at the encoder side and effectively transmit the combination. Experiments demonstrate the superiority of our method in both distortion and perception compared with previous perceptual compression methods.
https://arxiv.org/abs/2404.04916
In recent years, the large-scale adoption of cloud storage solutions has revolutionized the way we think about digital data storage. However, the exponential increase in data volume, especially of images, has raised environmental concerns regarding power and resource consumption, as well as the rising digital carbon footprint. This research proposes a methodology for cloud-based image storage that integrates image compression with Super-Resolution Generative Adversarial Networks (SRGAN). Rather than storing images in their original format directly on the cloud, our approach first reduces the image size through compression and downsizing before storage. Upon request, these compressed images are retrieved and processed by the SRGAN to regenerate the images. The efficacy of the proposed method is evaluated in terms of PSNR and SSIM metrics. Additionally, a mathematical analysis is given to calculate power consumption and assess the carbon footprint. The proposed data compression technique offers a significant step towards a reasonable trade-off between environmental sustainability and industrial efficiency.
https://arxiv.org/abs/2404.04642
This paper presents a SYCL implementation of Multi-Layer Perceptrons (MLPs), targeted at and optimized for the Intel Data Center GPU Max 1550. To increase performance, our implementation minimizes slow global memory accesses by maximizing data reuse within the general register file and the shared local memory, fusing the operations in each layer of the MLP. We show with a simple roofline model that this results in a significant increase in arithmetic intensity, leading to improved performance, especially for inference. We compare our approach to a similar CUDA implementation of MLPs and show that our implementation on the Intel Data Center GPU outperforms the CUDA implementation on Nvidia's H100 GPU by a factor of up to 2.84 in inference and 1.75 in training. The paper also showcases the efficiency of our SYCL implementation in three significant areas: image compression, Neural Radiance Fields, and physics-informed machine learning. In all cases, our implementation outperforms the off-the-shelf Intel Extension for PyTorch (IPEX) implementation on the same Intel GPU by up to a factor of 30, and the CUDA PyTorch version on Nvidia's H100 GPU by up to a factor of 19. The code can be found at this https URL.
https://arxiv.org/abs/2403.17607
While replacing Gaussian decoders with a conditional diffusion model enhances the perceptual quality of reconstructions in neural image compression, their lack of inductive bias for image data restricts their ability to achieve state-of-the-art perceptual levels. To address this limitation, we adopt a non-isotropic diffusion model at the decoder side. This model imposes an inductive bias aimed at distinguishing between frequency contents, thereby facilitating the generation of high-quality images. Moreover, our framework is equipped with a novel entropy model that accurately models the probability distribution of latent representation by exploiting spatio-channel correlations in latent space, while accelerating the entropy decoding step. This channel-wise entropy model leverages both local and global spatial contexts within each channel chunk. The global spatial context is built upon the Transformer, which is specifically designed for image compression tasks. The designed Transformer employs a Laplacian-shaped positional encoding, the learnable parameters of which are adaptively adjusted for each channel cluster. Our experiments demonstrate that our proposed framework yields better perceptual quality compared to cutting-edge generative-based codecs, and the proposed entropy model contributes to notable bitrate savings.
https://arxiv.org/abs/2403.16258
Image compression and denoising represent fundamental challenges in image processing with many real-world applications. To address practical demands, current solutions fall into two main strategies: 1) sequential methods and 2) joint methods. However, sequential methods suffer from error accumulation, as information is lost between the individual models. Recently, the academic community has begun to tackle this problem through end-to-end joint methods, but most of them ignore that different regions of a noisy image have different characteristics. To solve these problems, our proposed signal-to-noise ratio (SNR) aware joint solution exploits local and non-local features for image compression and denoising simultaneously. We design an end-to-end trainable network comprising a main encoder branch, a guidance branch, and an SNR-aware branch. Extensive experiments on both synthetic and real-world datasets demonstrate that our joint solution outperforms existing state-of-the-art methods.
https://arxiv.org/abs/2403.14135
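A crude version of an SNR-aware signal can be computed directly from local statistics. This sketch is an assumption about the flavor of such a branch, not the paper's design: it estimates a per-patch SNR map that a joint model could use to weight denoising against fidelity region by region.

```python
import numpy as np

def local_snr(img, patch=4, eps=1e-8):
    """Crude per-patch SNR map: mean as signal estimate, residual as noise."""
    h, w = img.shape
    snr = np.zeros((h // patch, w // patch))
    for i in range(0, h, patch):
        for j in range(0, w, patch):
            p = img[i:i + patch, j:j + patch]
            smooth = p.mean()               # very rough signal estimate
            noise = p - smooth
            snr[i // patch, j // patch] = (smooth ** 2) / (noise.var() + eps)
    return snr

rng = np.random.default_rng(0)
clean = np.full((8, 8), 2.0)                # flat, essentially noise-free region
noisy = clean + rng.standard_normal((8, 8)) # heavily noised region
```

Flat clean regions score a far higher SNR than noisy ones, which is exactly the kind of spatial cue a region-adaptive joint model can exploit.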
The widespread adoption of face recognition has led to increasing privacy concerns, as unauthorized access to face images can expose sensitive personal information. This paper explores face image protection against viewing and recovery attacks. Inspired by image compression, we propose creating a visually uninformative face image through feature subtraction between an original face and its model-produced regeneration. Recognizable identity features within the image are encouraged by co-training a recognition model on its high-dimensional feature representation. To enhance privacy, the high-dimensional representation is crafted through random channel shuffling, resulting in randomized recognizable images devoid of attacker-leverageable texture details. We distill our methodologies into a novel privacy-preserving face recognition method, MinusFace. Experiments demonstrate its high recognition accuracy and effective privacy protection. Its code is available at this https URL.
https://arxiv.org/abs/2403.12457
Learned Image Compression (LIC) has achieved dramatic progress on both objective and subjective metrics. MSE-based models aim to improve objective metrics, while generative models are leveraged to improve visual quality as measured by subjective metrics. However, both suffer from blurring or deformation at low bit rates, especially below $0.2$ bpp. Moreover, deformation of human faces and text is unacceptable for visual quality assessment, and the problem becomes more prominent for small faces and text. To solve this problem, we combine the advantages of MSE-based models and generative models by utilizing regions of interest (ROI). We propose Hierarchical-ROI (H-ROI), which splits images into several foreground regions and one background region to improve the reconstruction of regions containing faces, text, and complex textures. Further, we propose adaptive quantization by non-linear mapping within the channel dimension to constrain the bit rate while maintaining visual quality. Extensive experiments demonstrate that our methods achieve better visual quality for small faces and text at lower bit rates, e.g., with $0.7\times$ the bits of HiFiC and $0.5\times$ the bits of BPG.
https://arxiv.org/abs/2403.13030
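ROI-based coding of this kind ultimately amounts to a spatial quality-weight map. A toy sketch follows; the box-list input and the fixed foreground weight are hypothetical stand-ins for the paper's detector-driven hierarchical masks over faces, text, and complex textures.

```python
import numpy as np

def roi_weight_map(shape, rois, fg_weight=3.0, bg_weight=1.0):
    """Build a per-pixel quality weight map from foreground boxes.

    rois is a list of (y0, y1, x0, x1) boxes (hypothetical format).
    The map is normalized so the average weight is 1, i.e. extra bits
    spent on foreground are paid for by the background.
    """
    wmap = np.full(shape, bg_weight)
    for (y0, y1, x0, x1) in rois:
        wmap[y0:y1, x0:x1] = fg_weight
    return wmap / wmap.mean()

wmap = roi_weight_map((16, 16), [(2, 6, 2, 6)])
```

In training, such a map would typically scale the per-pixel distortion term, steering rate toward faces and text at a constant overall bitrate.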
The emerging Learned Compression (LC) paradigm replaces traditional codec modules with Deep Neural Networks (DNNs), trained end-to-end for rate-distortion performance. This approach is considered the future of image/video compression, and major efforts have been dedicated to improving its compression efficiency. However, most proposed works target compression efficiency by employing more complex DNNs, which leads to higher computational complexity. This paper instead proposes to improve compression by fully exploiting the existing DNN capacity: the latent features are guided to learn a richer and more diverse set of features, which corresponds to better reconstruction. A channel-wise feature decorrelation loss is designed and integrated into the LC optimization. Three strategies are proposed and evaluated, which optimize (1) the transformation network, (2) the context model, and (3) both networks. Experimental results on two established LC methods show that the proposed approach improves compression by up to 8.06% BD-Rate, with no added complexity. The proposed solution can be applied as a plug-and-play option to optimize any similar LC method.
https://arxiv.org/abs/2403.10936
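A channel-wise decorrelation loss of the kind described can be written as a penalty on the off-diagonal entries of the channel correlation matrix of the latent. This is a plausible sketch under that reading, not necessarily the paper's exact formulation.

```python
import numpy as np

def decorrelation_loss(latent):
    """Mean squared off-diagonal channel correlation of a (C, N) latent."""
    z = latent - latent.mean(axis=1, keepdims=True)
    z = z / (np.linalg.norm(z, axis=1, keepdims=True) + 1e-8)
    corr = z @ z.T                                  # (C, C) correlations
    off_diag = corr - np.diag(np.diag(corr))
    c = corr.shape[0]
    return (off_diag ** 2).sum() / (c * (c - 1))

rng = np.random.default_rng(0)
C = 4
# Redundant latent: all channels are near-copies of one signal.
redundant = (np.tile(rng.standard_normal((1, 32)), (C, 1))
             + 0.01 * rng.standard_normal((C, 32)))
# Diverse latent: independent channels.
diverse = rng.standard_normal((C, 32))
```

Adding such a term to the rate-distortion objective penalizes channels that duplicate each other, nudging the fixed-capacity network toward a more diverse feature set.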
A generative adversarial network (GAN) is a type of generative model that maps high-dimensional noise to samples in a target distribution. However, the noise dimension required in a GAN is not well understood. Previous approaches view a GAN as a mapping from one continuous distribution to another. In this paper, we propose to view a GAN as a discrete sampler instead. From this perspective, we build a connection between the minimum noise required and the bits needed to losslessly compress the images. Furthermore, to understand the behaviour of a GAN when the noise dimension is limited, we propose a divergence-entropy trade-off, which depicts the best divergence achievable when the noise is limited. Like the rate-distortion trade-off, it can be solved numerically when the source distribution is known. Finally, we verify our theory with experiments on image generation.
https://arxiv.org/abs/2403.09196
Existing learning-based stereo image codecs adopt sophisticated transforms but use simple entropy models derived from single-image codecs to encode latent representations. However, those entropy models struggle to effectively capture the spatial-disparity characteristics inherent in stereo images, which leads to suboptimal rate-distortion results. In this paper, we propose a stereo image compression framework, named CAMSIC. CAMSIC independently transforms each image into a latent representation and employs a powerful decoder-free Transformer entropy model that captures both spatial and disparity dependencies by introducing a novel content-aware masked image modeling (MIM) technique. Our content-aware MIM facilitates efficient bidirectional interaction between prior information and estimated tokens, which naturally obviates the need for an extra Transformer decoder. Experiments show that our stereo image codec achieves state-of-the-art rate-distortion performance on the two stereo image datasets Cityscapes and InStereo2K with fast encoding and decoding.
https://arxiv.org/abs/2403.08505
Image compression emerges as a pivotal tool in the efficient handling and transmission of digital images. Its ability to substantially reduce file size not only facilitates enhanced data storage capacity but also potentially brings advantages to the development of continual machine learning (ML) systems, which learn new knowledge incrementally from sequential data. Continual ML systems often rely on storing representative samples, also known as exemplars, within a limited memory constraint to maintain the performance on previously learned data. These methods are known as memory replay-based algorithms and have proven effective at mitigating the detrimental effects of catastrophic forgetting. Nonetheless, the limited memory buffer size often falls short of adequately representing the entire data distribution. In this paper, we explore the use of image compression as a strategy to enhance the buffer's capacity, thereby increasing exemplar diversity. However, directly using compressed exemplars introduces domain shift during continual ML, marked by a discrepancy between compressed training data and uncompressed testing data. Additionally, it is essential to determine the appropriate compression algorithm and select the most effective rate for continual ML systems to balance the trade-off between exemplar quality and quantity. To this end, we introduce a new framework to incorporate image compression for continual ML including a pre-processing data compression step and an efficient compression rate/algorithm selection method. We conduct extensive experiments on CIFAR-100 and ImageNet datasets and show that our method significantly improves image classification accuracy in continual ML settings.
https://arxiv.org/abs/2403.06288