A Light-Field (LF) image is an emerging 4D representation of light rays, capable of realistically presenting the spatial and angular information of a 3D scene. However, the large data volume of LF images is the most challenging issue for real-time processing, transmission, and storage. In this paper, we propose an end-to-end deep LF Image Compression method Using Disentangled Representation and Asymmetrical Strip Convolution (LFIC-DRASC) to improve coding efficiency. First, we formulate LF image compression as learning a disentangled LF representation network and an image encoding-decoding network. Second, we propose two novel feature extractors that leverage the structural prior of LF data by integrating features across different dimensions; a disentangled LF representation network is also proposed to enhance LF feature disentangling and decoupling. Third, we propose the LFIC-DRASC model for LF image compression, in which two Asymmetrical Strip Convolution (ASC) operators, horizontal and vertical, capture long-range correlations in the LF feature space. These two ASC operators can be combined with square convolutions to further decouple LF features, strengthening the model's ability to represent intricate spatial relationships. Experimental results demonstrate that the proposed LFIC-DRASC achieves an average 20.5% bit-rate reduction compared with state-of-the-art methods.
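The ASC design lends itself to a compact sketch. Below is a minimal PyTorch illustration of how horizontal and vertical strip convolutions might be fused with a square convolution; the strip length `k`, the 3x3 square kernel, and the additive fusion are assumptions, not details taken from the paper.

```python
import torch
import torch.nn as nn

class ASCBlock(nn.Module):
    """Sketch of an asymmetrical strip convolution block: horizontal (1 x k)
    and vertical (k x 1) strips combined with an ordinary square convolution."""
    def __init__(self, channels: int, k: int = 7):
        super().__init__()
        # Horizontal strip: wide receptive field along the width axis.
        self.horizontal = nn.Conv2d(channels, channels, (1, k), padding=(0, k // 2))
        # Vertical strip: wide receptive field along the height axis.
        self.vertical = nn.Conv2d(channels, channels, (k, 1), padding=(k // 2, 0))
        # Square convolution for local spatial context.
        self.square = nn.Conv2d(channels, channels, 3, padding=1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Summing the branches fuses long-range row/column correlations
        # with local structure.
        return self.horizontal(x) + self.vertical(x) + self.square(x)
```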
https://arxiv.org/abs/2409.11711
Learned image compression (LIC) has achieved state-of-the-art rate-distortion performance and is deemed promising for next-generation image compression. However, pre-trained LIC models usually suffer significant performance degradation when applied to out-of-training-domain images, implying poor generalization. To tackle this problem, we propose a few-shot domain adaptation method for LIC that integrates plug-and-play adapters into pre-trained models. Drawing inspiration from the analogy between latent channels and frequency components, we examine domain gaps in LIC and observe that out-of-training-domain images disrupt the pre-trained channel-wise decomposition. Consequently, we introduce a method for channel-wise re-allocation using convolution-based adapters and low-rank adapters, which are lightweight and compatible with mainstream LIC schemes. Extensive experiments across multiple domains and multiple representative LIC schemes demonstrate that our method significantly enhances pre-trained models, achieving performance comparable to H.266/VVC intra coding with merely 25 target-domain samples. Additionally, our method matches the performance of full-model fine-tuning while transmitting fewer than 2% of the parameters.
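As a rough illustration of the low-rank adapter idea, the sketch below adds a trainable residual bottleneck on top of a frozen convolutional feature; the 1x1 bottleneck form, the rank, and the zero-initialized up-projection are assumptions in the spirit of LoRA-style adapters, not the paper's exact design.

```python
import torch
import torch.nn as nn

class LowRankAdapter(nn.Module):
    """Low-rank channel re-allocation sketch: y = x + up(down(x)).
    Only the two small projections are trained and transmitted."""
    def __init__(self, channels: int, rank: int = 8):
        super().__init__()
        self.down = nn.Conv2d(channels, rank, kernel_size=1, bias=False)
        self.up = nn.Conv2d(rank, channels, kernel_size=1, bias=False)
        nn.init.zeros_(self.up.weight)  # adapter starts as an identity mapping

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return x + self.up(self.down(x))
```

In this reading, only the adapter weights are fine-tuned on the few target-domain samples and shipped alongside the pre-trained model, which is what keeps the transmitted update under 2% of the parameters.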
https://arxiv.org/abs/2409.11111
In recent years, deep learning-based image compression, particularly through generative models, has emerged as a pivotal area of research. Despite significant advancements, challenges persist, such as diminished sharpness and quality in reconstructed images, learning inefficiencies due to mode collapse, and data loss during transmission. To address these issues, we propose a novel compression model that incorporates a denoising step with diffusion models, significantly enhancing image reconstruction fidelity by leveraging sub-information (e.g., edges and depth) from the latent space. Empirical experiments demonstrate that our model achieves superior or comparable results in terms of image quality and compression efficiency when measured against existing models. Notably, our model excels in scenarios of partial image loss or excessive noise by introducing an edge-estimation network to preserve the integrity of reconstructed images, offering a robust solution to the current limitations of image compression.
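One way to picture the denoising step is a short reverse-diffusion refinement of the decoded latent conditioned on the transmitted side information. The sketch below is a hypothetical DDIM-style loop; the noise-prediction network `eps_net`, the concatenation-based conditioning on edge/depth maps, and the schedule handling are all assumptions rather than the paper's architecture.

```python
import torch

@torch.no_grad()
def refine_decoded(latent, edge_map, depth_map, eps_net, timesteps, alphas_cumprod):
    """Hypothetical DDIM-style refinement: treat the decoded latent as a
    noisy sample and run a few deterministic reverse steps, conditioning
    the noise predictor on edge/depth side information."""
    x = latent
    for t in reversed(timesteps):
        cond = torch.cat([x, edge_map, depth_map], dim=1)   # channel concat
        eps = eps_net(cond, t)                              # predicted noise
        a_t = alphas_cumprod[t]
        x0 = (x - (1 - a_t).sqrt() * eps) / a_t.sqrt()      # clean-latent estimate
        a_prev = alphas_cumprod[t - 1] if t > 0 else torch.tensor(1.0)
        x = a_prev.sqrt() * x0 + (1 - a_prev).sqrt() * eps  # DDIM (eta = 0) step
    return x
```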
https://arxiv.org/abs/2409.10978
We present a novel frequency-based Self-Supervised Learning (SSL) approach that significantly enhances the efficacy of pre-training. Prior work in this direction masks out pre-defined frequencies in the input image and employs a reconstruction loss to pre-train the model. While achieving promising results, such an implementation has two fundamental limitations, as identified in our paper. First, using pre-defined frequencies overlooks the variability of image frequency responses. Second, pre-trained on frequency-filtered images, the resulting model needs relatively more data to adapt to natural-looking images during fine-tuning. To address these drawbacks, we propose FOurier transform compression with seLf-Knowledge distillation (FOLK), integrating two dedicated ideas. First, inspired by image compression, we adaptively select the masked-out frequencies based on image frequency responses, creating more suitable SSL tasks for pre-training. Second, we employ a two-branch framework empowered by knowledge distillation, enabling the model to take both the filtered and original images as input, largely reducing the burden on downstream tasks. Our experimental results demonstrate the effectiveness of FOLK in achieving performance competitive with many state-of-the-art SSL methods across various downstream tasks, including image classification, few-shot learning, and semantic segmentation.
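To make the adaptive selection concrete, here is a small PyTorch sketch of frequency masking driven by each image's own spectrum; whether FOLK masks the strongest or the weakest responses, and the masking ratio, are assumptions here.

```python
import torch

def adaptive_frequency_mask(image: torch.Tensor, mask_ratio: float = 0.3):
    """Mask the `mask_ratio` strongest frequencies of each (B, C, H, W)
    image, chosen per image rather than from a pre-defined band."""
    spec = torch.fft.fft2(image)                    # per-channel 2D FFT
    mag = spec.abs().flatten(-2)                    # (B, C, H*W) magnitudes
    k = max(1, int(mask_ratio * mag.shape[-1]))
    thresh = mag.topk(k, dim=-1).values[..., -1:]   # k-th largest per map
    keep = (mag < thresh).reshape(spec.shape)       # drop the strong bins
    filtered = torch.fft.ifft2(spec * keep.float()).real
    return filtered, keep
```

The reconstruction (or distillation) target would then require the model to recover exactly the frequencies that carry the most energy for that particular image.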
https://arxiv.org/abs/2409.10362
Neural compression has the potential to revolutionize lossy image compression. Based on generative models, recent schemes achieve unprecedented compression rates at high perceptual quality but compromise semantic fidelity. Details of decompressed images may appear optically flawless but semantically different from the originals, making compression errors difficult or impossible to detect. We explore the problem space and propose a provisional taxonomy of miscompressions. It defines three types of 'what happens' and has a binary 'high impact' flag indicating miscompressions that alter symbols. We discuss how the taxonomy can facilitate risk communication and research into mitigations.
https://arxiv.org/abs/2409.05490
Multi-view image compression is vital for 3D-related applications. To effectively model correlations between views, existing methods typically predict disparity between two views on a 2D plane, which works well for small disparities, such as in stereo images, but struggles with larger disparities caused by significant view changes. To address this, we propose a novel approach: learning-based multi-view image coding with 3D Gaussian geometric priors (3D-GP-LMVIC). Our method leverages 3D Gaussian Splatting to derive geometric priors of the 3D scene, enabling more accurate disparity estimation across views within the compression model. Additionally, we introduce a depth map compression model to reduce redundancy in geometric information between views. A multi-view sequence ordering method is also proposed to enhance correlations between adjacent views. Experimental results demonstrate that 3D-GP-LMVIC surpasses both traditional and learning-based methods in performance, while maintaining fast encoding and decoding speed.
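The role of the geometric prior can be illustrated with plain pinhole geometry: given a depth map for one view (e.g., rendered from the fitted 3D Gaussians) and the camera parameters, each pixel's correspondence in another view follows by back-projection and reprojection. The NumPy sketch below shows this computation only; the actual model learns disparity estimation on top of such priors, and this interface is an assumption.

```python
import numpy as np

def reproject_depth(depth, K_src, K_dst, R, t):
    """Per-pixel offset field between two views from a source-view depth
    map: back-project, transform to the target camera, and re-project."""
    h, w = depth.shape
    u, v = np.meshgrid(np.arange(w), np.arange(h))
    pix = np.stack([u, v, np.ones_like(u)], axis=-1).reshape(-1, 3).T  # 3 x N
    rays = np.linalg.inv(K_src) @ pix            # back-project to unit-depth rays
    pts = rays * depth.reshape(1, -1)            # 3D points in source camera
    pts_dst = R @ pts + t.reshape(3, 1)          # transform to target camera
    proj = K_dst @ pts_dst
    uv_dst = (proj[:2] / proj[2:]).T.reshape(h, w, 2)
    return uv_dst - np.stack([u, v], axis=-1)    # disparity/offset per pixel
```

Unlike a 2D disparity search, this remains well defined under large view changes, which is the regime the paper targets.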
https://arxiv.org/abs/2409.04013
Remote-sensing (RS) image compression at extremely low bitrates has always been a challenging task in practical scenarios like edge-device storage and narrow-bandwidth transmission. Generative models including VAEs and GANs have been explored to compress RS images into extremely low-bitrate streams. However, these generative models struggle to reconstruct visually plausible images due to the highly ill-posed nature of extremely low-bitrate image compression. To this end, we propose an image compression framework that utilizes a pre-trained diffusion model with powerful natural-image priors to achieve high-realism reconstructions. However, diffusion models tend to hallucinate small structures and textures due to the significant information loss at limited bitrates. Thus, we introduce vector maps as semantic and structural guidance and propose a novel image compression approach named Map-Assisted Generative Compression (MAGC). MAGC employs a two-stage pipeline to compress and decompress RS images at extremely low bitrates. The first stage maps an image into a latent representation, which is then further compressed in a VAE architecture to save bitrates and serves as implicit guidance in the subsequent diffusion process. The second stage applies a conditional diffusion model to generate a visually pleasing and semantically accurate result using both the implicit guidance and explicit semantic guidance. Quantitative and qualitative comparisons show that our method outperforms standard codecs and other learning-based methods in terms of perceptual quality and semantic accuracy. The dataset and code will be publicly available at this https URL.
https://arxiv.org/abs/2409.01935
Learned image compression has attracted considerable interest in recent years. It typically comprises an analysis transform, a synthesis transform, quantization, and an entropy coding model. The analysis and synthesis transforms encode an image into a latent feature and decode the quantized feature to reconstruct the image, and can be regarded as coupled transforms. However, existing methods design the analysis and synthesis transforms independently, making them unreliable for high-quality image compression. Inspired by invertible neural networks in generative modeling, invertible modules can be used to construct the coupled analysis and synthesis transforms. Considering that the noise introduced by feature quantization invalidates the invertible process, this paper proposes an Approximately Invertible Neural Network (A-INN) framework for learned image compression. It formulates the rate-distortion optimization of lossy image compression when using an INN with quantization, which differentiates it from using INNs for generative modeling. Generally speaking, A-INN can serve as the theoretical foundation for any INN-based lossy compression method. Based on this formulation, A-INN with a progressive denoising module (PDM) is developed to effectively reduce quantization noise during decoding. Moreover, a Cascaded Feature Recovery Module (CFRM) is designed to learn high-dimensional feature recovery from low-dimensional ones, further reducing the noise in feature-channel compression. In addition, a Frequency-enhanced Decomposition and Synthesis Module (FDSM) explicitly enhances the high-frequency components of an image to address the loss of high-frequency information inherent in neural-network-based image compression. Extensive experiments demonstrate that the proposed A-INN outperforms existing learned image compression methods.
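The coupled-transform idea is easiest to see in a single invertible coupling module. The sketch below is a generic additive coupling layer, not A-INN's exact block: the analysis direction is `forward`, the synthesis direction is `inverse`, and quantizing the output breaks exact invertibility, which is what the progressive denoising module is meant to compensate.

```python
import torch
import torch.nn as nn

class AdditiveCoupling(nn.Module):
    """Generic invertible coupling layer (assumes an even channel count):
    split channels and let one half predict an additive update for the other."""
    def __init__(self, channels: int):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(channels // 2, channels, 3, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels, channels // 2, 3, padding=1),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x1, x2 = x.chunk(2, dim=1)
        y2 = x2 + self.net(x1)            # analysis direction
        return torch.cat([x1, y2], dim=1)

    def inverse(self, y: torch.Tensor) -> torch.Tensor:
        y1, y2 = y.chunk(2, dim=1)
        x2 = y2 - self.net(y1)            # synthesis direction: exact inverse
        return torch.cat([y1, x2], dim=1)
```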
https://arxiv.org/abs/2408.17073
Lossy image compression is essential for efficient transmission and storage. Traditional compression methods mainly rely on discrete cosine transform (DCT) or singular value decomposition (SVD), both of which represent image data in continuous domains and therefore necessitate carefully designed quantizers. Notably, SVD-based methods are more sensitive to quantization errors than DCT-based methods like JPEG. To address this issue, we introduce a variant of integer matrix factorization (IMF) to develop a novel quantization-free lossy image compression method. IMF provides a low-rank representation of the image data as a product of two smaller factor matrices with bounded integer elements, thereby eliminating the need for quantization. We propose an efficient, provably convergent iterative algorithm for IMF using a block coordinate descent (BCD) scheme, with subproblems having closed-form solutions. Our experiments on the Kodak and CLIC 2024 datasets demonstrate that our IMF compression method consistently outperforms JPEG at low bit rates below 0.25 bits per pixel (bpp) and remains comparable at higher bit rates. We also assessed our method's capability to preserve visual semantics by evaluating an ImageNet pre-trained classifier on compressed images. Remarkably, our method improved top-1 accuracy by over 5 percentage points compared to JPEG at bit rates under 0.25 bpp. The project is available at this https URL .
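A simplified version of the factorization can be written in a few lines. The paper's BCD algorithm solves its subproblems in closed form; the NumPy sketch below substitutes a cruder project-after-least-squares update (round and clip to the bounded integer range), so it only illustrates the overall alternating structure.

```python
import numpy as np

def imf_bcd(X, rank=16, lo=-8, hi=7, iters=20):
    """Sketch of integer matrix factorization X ~ A @ B: alternately solve
    each factor by least squares, then round and clip to bounded integers,
    so no separate quantizer is needed."""
    m, n = X.shape
    rng = np.random.default_rng(0)
    A = rng.integers(lo, hi + 1, size=(m, rank)).astype(float)
    B = rng.integers(lo, hi + 1, size=(rank, n)).astype(float)
    for _ in range(iters):
        # Update B with A fixed: unconstrained LS solution, then project.
        B = np.clip(np.round(np.linalg.lstsq(A, X, rcond=None)[0]), lo, hi)
        # Update A with B fixed (solve the transposed system).
        A = np.clip(np.round(np.linalg.lstsq(B.T, X.T, rcond=None)[0].T), lo, hi)
    return A.astype(np.int8), B.astype(np.int8)
```

Storage then consists only of the two small integer factors; reconstruction is simply `A @ B`, with no dequantization step.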
https://arxiv.org/abs/2408.12691
Image-based biometrics can aid law enforcement in various aspects, for example in iris, fingerprint, and soft-biometric recognition. A critical precondition for recognition is the availability of sufficient biometric information in images. It is visually apparent that strong JPEG compression removes such details. However, the latest AI-based image compression seemingly preserves many image details even at very strong compression factors. Yet these perceived details are not necessarily grounded in measurements, which raises the question of whether such images can still be used for biometric recognition. In this work, we investigate how AI compression impacts iris, fingerprint, and soft-biometric (fabric and tattoo) images. We also investigate the recognition performance for iris and fingerprint images after AI compression. It turns out that iris recognition can be strongly affected, while fingerprint recognition is quite robust. The loss of detail is qualitatively most apparent in fabric and tattoo images. Overall, our results show that AI compression still permits many biometric tasks, but attention to strong compression factors in sensitive tasks is advisable.
https://arxiv.org/abs/2408.10823
Matrix quantization involves encoding matrix elements in a more space-efficient manner to minimize storage requirements, with dequantization used to reconstruct the original matrix for practical use. We define the Quantization Error Minimization (QEM) problem as minimizing the difference between a matrix before and after quantization while ensuring that the quantized matrix occupies the same amount of memory. Matrix quantization is essential in various fields, including weight quantization in Large Language Models (LLMs), vector databases, KV cache quantization, graph compression, and image compression. The growing scale of LLMs, such as GPT-4 and BERT, underscores the need for matrix compression due to the large size of parameters and KV caches, which are stored as matrices. To address the QEM problem, we introduce HETA, an algorithm that leverages the local orderliness of matrix elements by iteratively swapping elements to create a locally ordered matrix. This matrix is then grouped and quantized by columns. To further improve HETA, we present two optimizations: additional quantization of residuals to reduce mean squared error (MSE) and the application of masking and batch processing to accelerate the algorithm. Our experiments show that HETA effectively reduces MSE to 12.3% of its original value at the same compression ratio, outperforming leading baseline algorithms. Our contributions include formalizing the QEM problem, developing the HETA algorithm, and proposing two optimizations to enhance both accuracy and processing speed.
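To ground the column-wise quantization and the residual-quantization optimization, here is a toy NumPy sketch; the local-ordering swap phase is omitted, and the bit widths and uniform quantizer are assumptions rather than HETA's exact scheme.

```python
import numpy as np

def quantize_columns(M, bits=4, residual_bits=2):
    """Toy two-pass quantizer: uniform per-column quantization of M,
    followed by a finer quantization of the residual to cut MSE."""
    levels = 2 ** bits - 1
    lo, hi = M.min(axis=0), M.max(axis=0)
    scale = np.where(hi > lo, (hi - lo) / levels, 1.0)
    q = np.round((M - lo) / scale)                 # per-column integer codes
    deq = q * scale + lo                           # first-pass reconstruction
    res = M - deq                                  # residual pass
    r_levels = 2 ** residual_bits - 1
    r_lo, r_hi = res.min(axis=0), res.max(axis=0)
    r_scale = np.where(r_hi > r_lo, (r_hi - r_lo) / r_levels, 1.0)
    r_q = np.round((res - r_lo) / r_scale)
    return deq + (r_q * r_scale + r_lo)            # dequantized approximation
```

In HETA's full pipeline, the preceding element-swap phase makes each column locally ordered, so the per-column ranges shrink and the same bit budget yields a lower MSE.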
https://arxiv.org/abs/2407.03637
We present a new image compression paradigm that achieves "intelligent coding for machines" by cleverly leveraging the common sense of Large Multimodal Models (LMMs). We are motivated by the evidence that large language/multimodal models are powerful general-purpose semantic predictors for understanding the real world. Different from traditional image compression, which is typically optimized for human eyes, the image coding for machines (ICM) framework we focus on requires the compressed bitstream to better comply with different downstream intelligent analysis tasks. To this end, we employ the LMM to tell the codec what to compress: 1) we first utilize the powerful semantic understanding capability of LMMs with respect to object grounding, identification, and importance ranking via prompts, to disentangle image content before compression; 2) based on these semantic priors, we then encode and transmit the objects of the image in order in a structured bitstream. In this way, diverse vision benchmarks, including image classification, object detection, instance segmentation, etc., can be well supported by such a semantically structured bitstream. We dub our method "SDComp", for "Semantically Disentangled Compression", and compare it with state-of-the-art codecs on a wide variety of vision tasks. The SDComp codec yields more flexible reconstruction results, reliable decoded visual quality, and a more generic/satisfactory ability to support intelligent tasks.
https://arxiv.org/abs/2408.08575
Just noticeable distortion (JND), representing the threshold of distortion in an image that is minimally perceptible to the human visual system (HVS), is crucial for image compression algorithms seeking a trade-off between transmission bit rate and image quality. However, traditional JND prediction methods rely only on pixel-level or sub-band-level features, lacking the ability to capture the impact of image content on JND. To bridge this gap, we propose a Semantic-Guided JND (SG-JND) network that leverages semantic information for JND prediction. In particular, SG-JND consists of three essential modules: the image preprocessing module extracts semantic-level patches from images, the feature extraction module extracts multi-layer features using cross-scale attention layers, and the JND prediction module regresses the extracted features into the final JND value. Experimental results show that SG-JND achieves state-of-the-art performance on two publicly available JND datasets, which demonstrates the effectiveness of SG-JND and highlights the significance of incorporating semantic information in JND assessment.
https://arxiv.org/abs/2408.04273
Recent advancements in learned image compression (LIC) methods have demonstrated superior performance over traditional hand-crafted codecs. These learning-based methods often employ convolutional neural networks (CNNs) or Transformer-based architectures. However, these nonlinear approaches frequently overlook the frequency characteristics of images, which limits their compression efficiency. To address this issue, we propose a novel Transformer-based image compression method that enhances the transformation stage by considering frequency components within the feature map. Our method integrates a novel Hybrid Spatial-Channel Attention Transformer Block (HSCATB), where a spatial-based branch independently handles high and low frequencies at the attention layer, and a Channel-aware Self-Attention (CaSA) module captures information across channels, significantly improving compression performance. Additionally, we introduce a Mixed Local-Global Feed Forward Network (MLGFFN) within the Transformer block to enhance the extraction of diverse and rich information, which is crucial for effective compression. These innovations collectively improve the transformation's ability to project data into a more decorrelated latent space, thereby boosting overall compression efficiency. Experimental results demonstrate that our framework surpasses state-of-the-art LIC methods in rate-distortion performance.
https://arxiv.org/abs/2408.03842
This paper presents the first-ever study of adapting compressed image latents to suit the needs of downstream vision tasks that adopt Multimodal Large Language Models (MLLMs). MLLMs have extended the success of large language models to modalities (e.g., images) beyond text, but their billion-parameter scale hinders deployment on resource-constrained end devices. While cloud-hosted MLLMs could be available, transmitting raw, uncompressed images captured by end devices to the cloud requires an efficient image compression system. To address this, we focus on emerging neural image compression and propose a novel framework with a lightweight transform-neck and a surrogate loss to adapt compressed image latents for MLLM-based vision tasks. The proposed framework is generic and applicable to multiple application scenarios, where the neural image codec can be (1) pre-trained for human perception without updating, (2) fully updated for joint human and machine perception, or (3) fully updated for machine perception only. The transform-neck trained with the surrogate loss is universal, as it can serve various downstream vision tasks enabled by a variety of MLLMs that share the same visual encoder. Our framework has the striking feature of excluding the downstream MLLMs from training of the transform-neck, and potentially the neural image codec as well. This stands out from most existing coding-for-machine approaches, which involve downstream networks in training and thus can be impractical when those networks are MLLMs. Extensive experiments on different neural image codecs and various MLLM-based vision tasks show that our method achieves great rate-accuracy performance with much less complexity, demonstrating its effectiveness.
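The framework's key pieces, a lightweight transform-neck and a surrogate loss that avoids running any MLLM during training, might look roughly like the following; the neck architecture, the token reshaping, and the MSE form of the surrogate loss are assumptions, not the paper's exact design.

```python
import torch
import torch.nn as nn

class TransformNeck(nn.Module):
    """Sketch: a small neck mapping compressed-image latents into the
    token space of the MLLM's (frozen, shared) visual encoder."""
    def __init__(self, latent_ch: int, vis_dim: int):
        super().__init__()
        self.neck = nn.Sequential(
            nn.Conv2d(latent_ch, vis_dim, 1),
            nn.GELU(),
            nn.Conv2d(vis_dim, vis_dim, 3, padding=1),
        )

    def forward(self, latent: torch.Tensor) -> torch.Tensor:
        f = self.neck(latent)               # (B, vis_dim, h, w)
        return f.flatten(2).transpose(1, 2) # (B, h*w, vis_dim) tokens

def surrogate_loss(neck_tokens, encoder_tokens):
    """Match neck outputs to the visual-encoder features of the original
    image, so no downstream MLLM participates in training."""
    return nn.functional.mse_loss(neck_tokens, encoder_tokens)
```

Because the target is the shared visual encoder's features rather than any task head, one trained neck can serve every MLLM built on that encoder.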
https://arxiv.org/abs/2407.19651
Empirically determined scaling laws have been broadly successful in predicting the evolution of large machine learning models with training data and number of parameters. As a consequence, they have been useful for optimizing the allocation of limited resources, most notably compute time. In certain applications, storage space is an important constraint, and the data format consequently needs to be chosen carefully. Computer vision is a prominent example: images are inherently analog, but are always stored in a digital format using a finite number of bits. Given a dataset of digital images, the number of bits $L$ to store each of them can be further reduced using lossy data compression. This, however, can degrade the quality of the model trained on such images, since each example has lower resolution. In order to capture this trade-off and optimize storage of training data, we propose a 'storage scaling law' that describes the joint evolution of test error with sample size and number of bits per image. We prove that this law holds within a stylized model for image compression, and verify it empirically on two computer vision tasks, extracting the relevant parameters. We then show that this law can be used to optimize the lossy compression level. At a given storage budget, models trained on optimally compressed images present a significantly smaller test error than models trained on the original data. Finally, we investigate the potential benefits of randomizing the compression level.
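The abstract does not spell out the law's functional form, but a hedged reading suggests additive terms in sample size $n$ and bits per image $L$, optimized under a storage budget $S = nL$. The form below is an illustrative assumption, not the paper's derived law:

```latex
% Illustrative parametric form (assumption):
%   n = number of training images, L = bits per image, S = n L = storage budget
\mathrm{TestError}(n, L) \;\approx\; E_\infty + a\, n^{-\alpha} + b\, L^{-\beta},
\qquad
L^{\star} = \arg\min_{L}\; \mathrm{TestError}\!\left(S / L,\; L\right).
```

At fixed $S$, lowering $L$ buys more samples at lower fidelity; the optimal compression level balances the two error terms.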
https://arxiv.org/abs/2407.17954
Representing signals using coordinate networks has recently come to dominate the area of inverse problems and is widely applied in various scientific computing tasks. Still, there exists an issue of spectral bias in coordinate networks, limiting their capacity to learn high-frequency components. This problem is caused by the pathological distribution of the eigenvalues of coordinate networks' neural tangent kernel (NTK). We find that this pathological distribution can be improved using classical normalization techniques (batch normalization and layer normalization), which are commonly used in convolutional neural networks but rarely used in coordinate networks. We prove that normalization techniques greatly reduce the maximum and variance of the NTK's eigenvalues while only slightly modifying their mean value; since the maximum eigenvalue is much larger than most others, this reduction in variance shifts the eigenvalue distribution from lower values toward higher ones, so the spectral bias can be alleviated. Furthermore, we propose two new normalization techniques by combining these two techniques in different ways. The efficacy of these normalization techniques is substantiated by the significant improvements and new state-of-the-art results achieved by applying normalization-based coordinate networks to various tasks, including image compression, computed tomography reconstruction, shape representation, magnetic resonance imaging, novel view synthesis, and multi-view stereo reconstruction.
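For concreteness, the change amounts to inserting standard normalization layers into an otherwise ordinary coordinate MLP. The sketch below uses LayerNorm with ReLU activations; the paper also studies batch normalization and two combined variants, and the widths and depth here are assumptions.

```python
import torch
import torch.nn as nn

class NormalizedCoordNet(nn.Module):
    """Coordinate MLP with layer normalization between hidden layers,
    the kind of architecture whose NTK spectrum the paper analyzes."""
    def __init__(self, in_dim=2, hidden=256, out_dim=3, depth=4):
        super().__init__()
        layers, dim = [], in_dim
        for _ in range(depth):
            layers += [nn.Linear(dim, hidden), nn.LayerNorm(hidden), nn.ReLU()]
            dim = hidden
        layers.append(nn.Linear(hidden, out_dim))
        self.net = nn.Sequential(*layers)

    def forward(self, coords: torch.Tensor) -> torch.Tensor:
        # coords: (N, in_dim) pixel/voxel coordinates, e.g. in [-1, 1]
        return self.net(coords)
```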
https://arxiv.org/abs/2407.17834
In recent years, large visual language models (LVLMs) have shown impressive performance and promising generalization capability in multi-modal tasks, thus replacing humans as receivers of visual information in various application scenarios. In this paper, we are the first to propose a variable-bitrate image compression framework consisting of a pre-editing module and an end-to-end codec that achieves promising rate-accuracy performance for different LVLMs. In particular, instead of optimizing an adaptive pre-editing network towards a particular task or several representative tasks, we propose a new optimization strategy tailored for LVLMs, designed based on their representation and discrimination capability using token-level distortion and rank. The pre-editing module and the variable-bitrate end-to-end image codec are jointly trained with losses based on the semantic tokens of the large model, which introduces enhanced generalization capability for various data and tasks. Experimental results demonstrate that the proposed framework efficiently achieves much better rate-accuracy performance compared to the state-of-the-art coding standard, Versatile Video Coding. Meanwhile, experiments with multi-modal tasks have revealed the robustness and generalization capability of the proposed framework.
https://arxiv.org/abs/2407.17060
We present FCNR, a fast compressive neural representation for tens of thousands of visualization images under varying viewpoints and timesteps. The existing NeRVI solution, albeit enjoying a high compression ratio, incurs slow encoding and decoding speeds. Built on recent advances in stereo image compression, FCNR assimilates stereo context modules and joint context transfer modules to compress image pairs. Our solution significantly improves encoding and decoding speed while maintaining high reconstruction quality and a satisfactory compression ratio. To demonstrate its effectiveness, we compare FCNR with state-of-the-art neural compression methods, including E-NeRV, HNeRV, NeRVI, and ECSIC. The source code can be found at this https URL.
https://arxiv.org/abs/2407.16369
Learned image compression (LIC) is currently the cutting-edge method. However, the inherent difference between the testing and training images of LIC degrades its performance to some extent. Especially for out-of-sample, out-of-distribution, or out-of-domain testing images, the performance of LIC degrades dramatically. Classical LIC is a serial image compression (SIC) approach that utilizes an open-loop architecture with serial encoding and decoding units. Nevertheless, according to the theory of automatic control, a closed-loop architecture holds the potential to improve the dynamic and static performance of LIC. Therefore, a circular image compression (CIC) approach with closed-loop encoding and decoding elements is proposed to minimize the gap between testing and training images and improve the capability of LIC. The proposed CIC establishes a nonlinear loop equation and proves via Taylor series expansion that the steady-state error between the reconstructed and original images is close to zero. The proposed CIC method is post-training and plug-and-play, and can be built on any existing advanced SIC method. Experimental results on five public image compression datasets demonstrate that the proposed CIC outperforms five open-source state-of-the-art competing SIC algorithms in reconstruction capacity. Experimental results further show that the proposed method is suitable for out-of-sample testing images with dark backgrounds, sharp edges, high contrast, grid shapes, or complex patterns.
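One plausible realization of the closed-loop idea is a feedback wrapper around a frozen pre-trained codec, sketched below; the `codec.encode`/`codec.decode` interface, the fixed loop gain, and the iteration count are all assumptions, and the paper's actual nonlinear loop equation may differ.

```python
import torch

@torch.no_grad()
def circular_encode(x, codec, iters=3, gain=0.5):
    """Feedback wrapper around a frozen learned codec (hypothetical
    encode/decode interface): correct the encoder input with the
    reconstruction error so the closed-loop error shrinks."""
    u = x.clone()
    for _ in range(iters):
        x_hat = codec.decode(codec.encode(u))  # one open-loop (SIC) pass
        u = u + gain * (x - x_hat)             # negative-feedback correction
    # Final bitstream of the corrected input; the receiver decodes as usual.
    return codec.encode(u)
```

This matches the abstract's post-training, plug-and-play claim: the wrapped SIC codec is never retrained, only driven in a loop at inference time.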
https://arxiv.org/abs/2407.15870