In the framework of learned image compression, the context model plays a pivotal role in capturing the dependencies among latent representations. To reduce the decoding time resulting from the serial autoregressive context model, the parallel context model has been proposed as an alternative that requires only two passes during the decoding phase, thus facilitating efficient image compression in real-world scenarios. However, performance degradation occurs due to its incomplete causal context. To tackle this issue, we conduct an in-depth analysis of the performance degradation observed in existing parallel context models, focusing on two aspects: the Quantity and Quality of the information used for context prediction and decoding. Based on this analysis, we propose the \textbf{Corner-to-Center transformer-based Context Model (C$^3$M)}, designed to enhance context and latent predictions and improve rate-distortion performance. Specifically, we leverage a logarithmic-based prediction order to progressively predict more context features from the corners to the center. In addition, to enlarge the receptive field of the analysis and synthesis transforms, we use a Long-range Crossing Attention Module (LCAM) in the encoder/decoder to capture long-range semantic information by assigning different window shapes to different channels. Extensive experimental evaluations show that the proposed method is effective and outperforms state-of-the-art parallel methods. Finally, based on a subjective analysis, we suggest that improving the detailed representation in transformer-based image compression is a promising direction to explore.
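To make the corner-to-center idea concrete, the sketch below groups spatial latent positions into decoding stages that start at the four corners and move inward, with stage boundaries spaced on a logarithmic scale. This is only a toy illustration under our own assumptions (the distance measure, the number of stages, and the function name `corner_to_center_stages` are illustrative); the exact ordering used by C$^3$M may differ.

```python
import numpy as np

def corner_to_center_stages(h, w, num_stages=4):
    """Assign each latent position a decoding stage, progressing from the
    corners (stage 0) toward the center (last stage). Positions in stage k
    may use every position from stages < k as causal context."""
    ys, xs = np.mgrid[0:h, 0:w]
    # Chebyshev distance to the nearest corner, normalized to [0, 1].
    d_corner = np.minimum.reduce([
        np.maximum(ys, xs),                      # distance from top-left
        np.maximum(ys, w - 1 - xs),              # top-right
        np.maximum(h - 1 - ys, xs),              # bottom-left
        np.maximum(h - 1 - ys, w - 1 - xs),      # bottom-right
    ]) / max(h - 1, w - 1)
    # Logarithmically spaced thresholds: early stages cover thin outer rings,
    # later stages sweep progressively larger regions toward the center.
    edges = np.geomspace(0.05, 1.0, num_stages)
    return np.digitize(d_corner, edges, right=True)

if __name__ == "__main__":
    stages = corner_to_center_stages(8, 8)
    print(stages)  # 0 at the corners, num_stages - 1 near the center
```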
https://arxiv.org/abs/2311.18103
CLIP is a widely used foundational vision-language model, employed for zero-shot image recognition and other image-text alignment tasks. We demonstrate that CLIP is vulnerable to changes in image quality under compression. This surprising result is further analysed using an attribution method, Integrated Gradients. Using this attribution method, we are able to better understand, both quantitatively and qualitatively, exactly how compression affects the zero-shot recognition accuracy of this model. We evaluate this extensively on CIFAR-10 and STL-10. Our work provides the basis for understanding this vulnerability of CLIP and can help us develop more effective methods to improve the robustness of CLIP and other vision-language models.
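A minimal sketch of the kind of experiment described above: zero-shot CLIP classification of a single image before and after JPEG re-compression. It assumes the openai/CLIP package and PIL are available; the prompt template, quality settings, and file name are our own illustrative choices, not the paper's exact protocol.

```python
import io

import clip
import torch
from PIL import Image

device = "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

labels = ["airplane", "automobile", "bird", "cat", "deer",
          "dog", "frog", "horse", "ship", "truck"]  # CIFAR-10 classes
text = clip.tokenize([f"a photo of a {c}" for c in labels]).to(device)

def zero_shot_predict(img: Image.Image) -> str:
    """Return the most probable CIFAR-10 label for a PIL image."""
    x = preprocess(img).unsqueeze(0).to(device)
    with torch.no_grad():
        logits_per_image, _ = model(x, text)
        probs = logits_per_image.softmax(dim=-1)
    return labels[probs.argmax().item()]

def jpeg_recompress(img: Image.Image, quality: int) -> Image.Image:
    """Round-trip an image through JPEG at the given quality factor."""
    buf = io.BytesIO()
    img.convert("RGB").save(buf, format="JPEG", quality=quality)
    buf.seek(0)
    return Image.open(buf)

img = Image.open("example.png")  # any test image
for q in (95, 50, 10, 5):
    print(q, zero_shot_predict(jpeg_recompress(img, q)))
```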
https://arxiv.org/abs/2311.14029
The explosion of data has resulted in more and more associated text being transmitted along with images. Inspired by distributed source coding, many works utilize image side information to enhance image compression. However, existing methods generally do not consider using text as side information to enhance perceptual image compression, even though the benefits of multimodal synergy have been widely demonstrated in research. This begs the following question: how can we effectively transfer text-level semantic dependencies, available only at the decoder, to help image compression? In this work, we propose a novel deep image compression method with text-guided side information to achieve a better rate-perception-distortion tradeoff. Specifically, we employ the CLIP text encoder and an effective Semantic-Spatial Aware block to fuse the text and image features. This is done by predicting a semantic mask that guides the learned text-adaptive affine transformation at the pixel level. Furthermore, we design a text-conditional generative adversarial network to improve the perceptual quality of reconstructed images. Extensive experiments involving four datasets and ten image quality assessment metrics demonstrate that the proposed approach achieves superior results in terms of the rate-perception trade-off and semantic distortion.
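The sketch below illustrates the general idea of fusing a text embedding with image features via a predicted spatial mask and a text-adaptive affine (scale/shift) transformation. It is a minimal, self-contained module of our own design, assuming a fixed-size text embedding (e.g., a 512-dimensional CLIP text feature); module and variable names are illustrative and not taken from the paper.

```python
import torch
import torch.nn as nn

class TextGuidedAffine(nn.Module):
    """Toy semantic-spatial fusion: a mask predicted from image features
    gates a per-pixel affine transform whose scale/shift come from text."""

    def __init__(self, img_channels: int, text_dim: int = 512):
        super().__init__()
        # Predict a spatial mask in [0, 1] from the image features.
        self.mask_head = nn.Sequential(
            nn.Conv2d(img_channels, img_channels, 3, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(img_channels, 1, 1),
            nn.Sigmoid(),
        )
        # Map the text embedding to channel-wise scale and shift.
        self.to_scale = nn.Linear(text_dim, img_channels)
        self.to_shift = nn.Linear(text_dim, img_channels)

    def forward(self, feat: torch.Tensor, text_emb: torch.Tensor) -> torch.Tensor:
        # feat: (B, C, H, W); text_emb: (B, text_dim)
        mask = self.mask_head(feat)                        # (B, 1, H, W)
        scale = self.to_scale(text_emb)[:, :, None, None]  # (B, C, 1, 1)
        shift = self.to_shift(text_emb)[:, :, None, None]
        modulated = feat * (1 + scale) + shift
        # Apply the text-adaptive affine only where the mask is active.
        return mask * modulated + (1 - mask) * feat

if __name__ == "__main__":
    block = TextGuidedAffine(img_channels=192)
    y = block(torch.randn(2, 192, 16, 16), torch.randn(2, 512))
    print(y.shape)  # torch.Size([2, 192, 16, 16])
```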
https://arxiv.org/abs/2311.13847
In this paper, we propose a progressive learning paradigm for transformer-based variable-rate image compression. Our approach covers a wide range of compression rates with the assistance of a Layer-adaptive Prompt Module (LPM). Inspired by visual prompt tuning, we use the LPM to extract prompts from the input image and from hidden features at the encoder and decoder sides, respectively; these prompts are fed as additional information into the Swin Transformer layers of a pre-trained transformer-based image compression model to affect the allocation of attention regions and bits, which in turn changes the target compression ratio of the model. To keep the network lightweight, we integrate prompt networks with fewer convolutional layers. Exhaustive experiments show that, compared to methods based on multiple models optimized separately for different target rates, the proposed method achieves the same performance with 80% savings in parameter storage and 90% savings in datasets. Meanwhile, our model outperforms all current variable-bitrate image compression methods in terms of rate-distortion performance and approaches state-of-the-art fixed-bitrate image compression methods trained from scratch.
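To illustrate what a layer-adaptive prompt might look like, the sketch below generates a small set of prompt tokens from the current feature map with a lightweight convolutional head and prepends them to the tokens entering a transformer layer. This is a generic visual-prompt-tuning-style sketch under our own assumptions (token count, head design, and names such as `PromptModule` are illustrative), not the paper's LPM.

```python
import torch
import torch.nn as nn

class PromptModule(nn.Module):
    """Produce `num_prompts` tokens from a feature map with a tiny conv head."""

    def __init__(self, dim: int, num_prompts: int = 4):
        super().__init__()
        self.num_prompts = num_prompts
        self.head = nn.Sequential(
            nn.Conv2d(dim, dim, 3, padding=1),
            nn.GELU(),
            nn.AdaptiveAvgPool2d(1),            # (B, dim, 1, 1)
            nn.Conv2d(dim, dim * num_prompts, 1),
        )

    def forward(self, feat: torch.Tensor) -> torch.Tensor:
        # feat: (B, C, H, W) -> prompts: (B, num_prompts, C)
        b, c, _, _ = feat.shape
        return self.head(feat).view(b, self.num_prompts, c)

def prepend_prompts(tokens: torch.Tensor, prompts: torch.Tensor) -> torch.Tensor:
    """Concatenate prompt tokens before the patch tokens of a transformer layer."""
    return torch.cat([prompts, tokens], dim=1)

if __name__ == "__main__":
    feat = torch.randn(2, 96, 32, 32)          # features feeding a Swin stage
    tokens = feat.flatten(2).transpose(1, 2)   # (B, H*W, C)
    prompts = PromptModule(dim=96)(feat)
    print(prepend_prompts(tokens, prompts).shape)  # torch.Size([2, 1028, 96])
```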
https://arxiv.org/abs/2311.13846
Traditional animal identification methods such as ear-tagging, ear-notching, and branding have been effective but pose risks to the animal and have scalability issues. Electrical methods offer better tracking and monitoring but require specialized equipment and are susceptible to attacks. Biometric identification using time-immutable dermatoglyphic features such as muzzle prints and iris patterns is a promising solution. This project explores cattle identification using 4923 muzzle images collected from 268 beef cattle. Two deep learning classification models are implemented, wide ResNet50 and VGG16\_BN, and image compression is applied to lower the image quality and adapt the models to the African context. From the experiments run, a maximum accuracy of 99.5\% is achieved when using the wide ResNet50 model with compression retaining 25\% of the original image. From the study, it is noted that the time required by the models to train and converge, as well as the recognition time, depend on the machine used to run the model.
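A minimal sketch of the classification setup described above: a torchvision wide ResNet-50 with its final layer replaced for 268 cattle identities, fed JPEG-compressed muzzle images. The compression step, quality value, and preprocessing are our own illustrative assumptions; the study's exact procedure for retaining 25% of the original image may differ.

```python
import io

import torch.nn as nn
from PIL import Image
from torchvision import models, transforms

NUM_CATTLE = 268

# Wide ResNet-50 with a new classification head for the 268 identities.
model = models.wide_resnet50_2(weights="IMAGENET1K_V1")
model.fc = nn.Linear(model.fc.in_features, NUM_CATTLE)

def compress(img: Image.Image, quality: int = 25) -> Image.Image:
    """Round-trip a muzzle image through JPEG to lower its quality."""
    buf = io.BytesIO()
    img.convert("RGB").save(buf, format="JPEG", quality=quality)
    buf.seek(0)
    return Image.open(buf)

preprocess = transforms.Compose([
    transforms.Lambda(compress),     # degrade quality before training/inference
    transforms.Resize((224, 224)),
    transforms.ToTensor(),
])
```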
https://arxiv.org/abs/2311.08148
With the rapid development of the Artificial Intelligence of Things (AIoT), the image data produced by AIoT devices has been increasing explosively. In this paper, a novel deep image semantic communication model is proposed for efficient image communication in AIoT. In particular, at the transmitter side, a high-precision image semantic segmentation algorithm is proposed to extract the semantic information of the image and thereby achieve significant compression of the image data. At the receiver side, a semantic image restoration algorithm based on a Generative Adversarial Network (GAN) is proposed to convert the semantic image into a real scene image with detailed information. Simulation results demonstrate that the proposed image semantic communication model can improve the image compression ratio and recovery accuracy by 71.93% and 25.07% on average in comparison with WebP and CycleGAN, respectively. More importantly, our demo experiment shows that the proposed model reduces the total delay in image communication by 95.26% when compared with original image transmission.
https://arxiv.org/abs/2311.02926
Studying the solar system, and especially the Sun, relies on the data gathered daily from space missions. These missions are data-intensive, and compressing this data so that it can be efficiently transferred to the ground station involves a twofold decision. Stronger compression methods, by distorting the data, can increase data throughput at the cost of accuracy, which could affect scientific analysis of the data. On the other hand, preserving subtle details in the compressed data requires a large amount of data to be transferred, reducing the desired gains from compression. In this work, we propose a neural network-based lossy compression method to be used in NASA's data-intensive imagery missions. We chose NASA's SDO mission, which transmits 1.4 terabytes of data each day, as a proof of concept for the proposed algorithm. Specifically, we propose an adversarially trained neural network, equipped with local and non-local attention modules to capture both the local and global structure of the image, resulting in a better rate-distortion (RD) trade-off compared to conventional hand-engineered codecs. The RD variational autoencoder used in this work is jointly trained with a channel-dependent entropy model as a shared prior between the analysis and synthesis transforms to make the entropy coding of the latent code more effective. Our neural image compression algorithm outperforms currently-in-use and state-of-the-art codecs such as JPEG and JPEG-2000 in terms of RD performance when compressing extreme-ultraviolet (EUV) data. As a proof of concept for the use of this algorithm in SDO data analysis, we have performed coronal hole (CH) detection using our compressed images and generated consistent segmentations, even at a compression rate of $\sim0.1$ bits per pixel (compared to 8 bits per pixel in the original data), using EUV data from SDO.
https://arxiv.org/abs/2311.02855
With the fast development of modern microscopes and bioimaging techniques, an unprecedentedly large amount of imaging data is being generated, stored, analyzed, and even shared through networks. The size of the data poses great challenges for current data infrastructure. One common way to reduce the data size is image compression. The present study analyzes classic and deep learning based image compression methods and their impact on deep learning based image processing models. Deep learning based label-free prediction models (i.e., predicting fluorescent images from bright-field images) are used as an example application for comparison and analysis. Effective image compression methods could help reduce the data size significantly without losing necessary information, and therefore reduce the burden on data management infrastructure and permit fast transmission through the network for data sharing or cloud computing. To compress images in such a desired way, multiple classical lossy image compression techniques are compared to several AI-based compression models provided by, and trained with, the CompressAI toolbox in Python. These compression techniques are compared in terms of compression ratio, multiple image similarity measures and, most importantly, the prediction accuracy of the label-free models on compressed images. We found that AI-based compression techniques largely outperform the classic ones and minimally affect the downstream label-free task in 2D cases. In the end, we hope the present study can shed light on the potential of deep learning based image compression and the impact of image compression on downstream deep learning based image analysis models.
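As a sketch of the kind of comparison described above, the snippet below runs a pretrained CompressAI model on one image and reports bits per pixel and PSNR, which could then be set against a classical codec such as JPEG. It assumes compressai and torchvision are installed; the chosen model, quality level, and file name are illustrative.

```python
import math

import torch
import torch.nn.functional as F
from compressai.zoo import bmshj2018_factorized
from PIL import Image
from torchvision import transforms

net = bmshj2018_factorized(quality=3, pretrained=True).eval()

img = Image.open("cell_image.png").convert("RGB")
to_tensor = transforms.Compose([
    transforms.CenterCrop(256),   # keep spatial dims divisible by the codec stride
    transforms.ToTensor(),
])
x = to_tensor(img).unsqueeze(0)   # (1, 3, H, W) in [0, 1]

with torch.no_grad():
    out = net(x)

# Rate: estimated bits per pixel from the likelihoods of the latents.
num_pixels = x.shape[2] * x.shape[3]
bpp = sum(
    torch.log(lik).sum() / (-math.log(2) * num_pixels)
    for lik in out["likelihoods"].values()
).item()

# Distortion: PSNR between the original and the reconstruction.
mse = F.mse_loss(out["x_hat"].clamp(0, 1), x).item()
psnr = 10 * math.log10(1.0 / mse)
print(f"bpp: {bpp:.3f}, PSNR: {psnr:.2f} dB")
```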
https://arxiv.org/abs/2311.01352
In the blind single image super-resolution (SISR) task, existing works have been successful in restoring image-level unknown degradations. However, when a single video frame becomes the input, these works usually fail to address degradations caused by video compression, such as mosquito noise, ringing, blockiness, and staircase noise. In this work, we present, for the first time, a video compression-based degradation model to synthesize low-resolution image data for the blind SISR task. Our proposed image synthesizing method is widely applicable to existing image datasets, so that a single degraded image can contain distortions caused by lossy video compression algorithms. This overcomes the lack of feature diversity in video data and thus retains training efficiency. By introducing video coding artifacts into SISR degradation models, neural networks can super-resolve images with the ability to restore video compression degradations, and they achieve better results on restoring generic distortions caused by image compression as well. Our proposed approach achieves superior performance on state-of-the-art no-reference Image Quality Assessment metrics and shows better visual quality on various datasets. In addition, we evaluate the SISR neural network trained with our degradation model on video super-resolution (VSR) datasets. Compared to architectures specifically designed for VSR, our method exhibits similar or better performance, evidencing that the presented strategy of infusing video-based degradation is generalizable to address more complicated compression artifacts even without temporal cues.
https://arxiv.org/abs/2311.00996
Among applications of deep learning (DL) involving low-cost sensors, remote image classification involves a physical channel that separates edge sensors and cloud classifiers. Traditional DL models must be divided between an encoder for the sensor and a decoder plus classifier at the edge server. An important challenge is to effectively train such distributed models when the connecting channels have limited rate/capacity. Our goal is to optimize DL models such that the encoder latent requires low channel bandwidth while still delivering feature information for high classification accuracy. This work proposes a three-step joint learning strategy to guide encoders to extract features that are compact, discriminative, and amenable to common augmentations/transformations. We optimize the latent dimension through an initial screening phase before end-to-end (E2E) training. To obtain an adjustable bit rate via a single pre-deployed encoder, we apply entropy-based quantization and/or manual truncation to the latent representations. Tests show that our proposed method achieves accuracy improvements of up to 1.5% on CIFAR-10 and 3% on CIFAR-100 over conventional E2E cross-entropy training.
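The snippet below sketches the two rate-control knobs mentioned above for a fixed, pre-deployed encoder: uniform quantization of the latent vector and truncation to its first k dimensions. It is a simplified illustration under our own assumptions (the latent source, bin size, and dimension ordering are placeholders), not the paper's exact entropy-based scheme.

```python
import torch

def quantize(latent: torch.Tensor, bin_size: float) -> torch.Tensor:
    """Uniformly quantize the latent; coarser bins -> fewer bits on the channel."""
    return torch.round(latent / bin_size) * bin_size

def truncate(latent: torch.Tensor, keep_dims: int) -> torch.Tensor:
    """Keep only the first `keep_dims` latent dimensions, zeroing out the rest
    (assumes dimensions are ordered by importance, e.g. after screening)."""
    out = torch.zeros_like(latent)
    out[:, :keep_dims] = latent[:, :keep_dims]
    return out

if __name__ == "__main__":
    z = torch.randn(8, 64)              # latent from a pre-deployed encoder
    for bin_size, k in [(0.1, 64), (0.5, 32), (1.0, 16)]:
        z_hat = truncate(quantize(z, bin_size), k)
        # A classifier at the server would consume z_hat; coarser settings
        # trade accuracy for a lower transmitted bit rate.
        print(bin_size, k, z_hat.abs().mean().item())
```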
https://arxiv.org/abs/2310.19675
Learned image compression (LIC) has gained traction as an effective solution for image storage and transmission in recent years. However, existing LIC methods are redundant in their latent representations due to limitations in capturing anisotropic frequency components and preserving directional details. To overcome these challenges, we propose a novel frequency-aware transformer (FAT) block that, for the first time, achieves multiscale directional analysis for LIC. The FAT block comprises frequency-decomposition window attention (FDWA) modules to capture multiscale and directional frequency components of natural images. Additionally, we introduce a frequency-modulation feed-forward network (FMFFN) to adaptively modulate different frequency components, improving rate-distortion performance. Furthermore, we present a transformer-based channel-wise autoregressive (T-CA) model that effectively exploits channel dependencies. Experiments show that our method achieves state-of-the-art rate-distortion performance compared to existing LIC methods and clearly outperforms the latest standardized codec VTM-12.1 by 14.5%, 15.1%, and 13.0% in BD-rate on the Kodak, Tecnick, and CLIC datasets, respectively.
https://arxiv.org/abs/2310.16387
Remote medical diagnosis has emerged as a critical and indispensable technique in practical medical systems, where medical data are required to be efficiently compressed and transmitted for diagnosis by either professional doctors or intelligent diagnosis devices. In this process, a large amount of redundant content irrelevant to the diagnosis is subjected to high-fidelity coding, leading to unnecessary transmission costs. To mitigate this, we propose diagnosis-oriented medical image compression, a special semantic compression task designed for medical scenarios, which aims to reduce the compression cost without compromising diagnosis accuracy. However, collecting sufficient medical data to optimize such a compression system is significantly expensive and challenging due to privacy issues and the lack of professional annotation. In this study, we propose DMIC, the first efficient transfer learning-based codec for diagnosis-oriented medical image compression, which can be effectively optimized with only a few annotated medical examples by reusing the knowledge in an existing reinforcement learning-based task-driven semantic coding framework, i.e., HRLVSC [1]. Concretely, we tune only a subset of the parameters of the policy network for bit allocation within HRLVSC, which enables it to adapt to medical images. In this work, we validate DMIC with a typical medical task, coronary artery segmentation. Extensive experiments have demonstrated that DMIC can achieve 47.594% BD-rate savings compared to the HEVC anchor by tuning only the A2C module (2.7% of the parameters) of the policy network with only one medical sample.
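A generic sketch of the few-shot tuning strategy described above: freeze an entire pretrained policy network and unfreeze only one small sub-module before fine-tuning on a handful of annotated samples. The network, the module name `a2c_head`, and the optimizer settings are placeholders of our own, not the HRLVSC/DMIC code.

```python
import torch
import torch.nn as nn

# Placeholder policy network; the real bit-allocation policy would come
# from the pretrained task-driven coding framework.
policy = nn.ModuleDict({
    "backbone": nn.Sequential(nn.Linear(256, 256), nn.ReLU(), nn.Linear(256, 256)),
    "a2c_head": nn.Linear(256, 16),   # small module to adapt (cf. the A2C module)
})

# Freeze everything, then unfreeze only the small adaptation module.
for p in policy.parameters():
    p.requires_grad = False
for p in policy["a2c_head"].parameters():
    p.requires_grad = True

trainable = [p for p in policy.parameters() if p.requires_grad]
total = sum(p.numel() for p in policy.parameters())
print(f"tuning {sum(p.numel() for p in trainable)} of {total} parameters")

optimizer = torch.optim.Adam(trainable, lr=1e-4)
# ...fine-tune with the few-shot annotated medical examples...
```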
https://arxiv.org/abs/2310.13250
It is not sufficient merely to construct computational models that can accurately classify or detect fake images versus real images taken from a camera; it is also important to ensure that these computational models are fair and do not produce biased outcomes that could eventually harm certain social groups or cause serious security threats. Exploring fairness in forensic algorithms is an initial step towards correcting these biases. Since vision transformers have recently been widely used in most image classification tasks due to their capability to produce high accuracies, this study explores bias in transformer-based image forensic algorithms that classify natural and GAN-generated images. By procuring a bias evaluation corpus, this study analyzes bias in the gender, racial, affective, and intersectional domains using a wide set of individual and pairwise bias evaluation measures. As the generalizability of the algorithms against image compression is an important factor to be considered in forensic tasks, this study also analyzes the role of image compression in model bias. Hence, to study the impact of image compression on model bias, a two-phase evaluation setting is followed, where one set of experiments is carried out in an uncompressed evaluation setting and the other in a compressed evaluation setting.
https://arxiv.org/abs/2310.12076
In recent research, Learned Image Compression has gained prominence for its capacity to outperform traditional handcrafted pipelines, especially at low bit-rates. While existing methods incorporate convolutional priors with occasional attention blocks to address long-range dependencies, recent advances in computer vision advocate for a transformative shift towards fully transformer-based architectures grounded in the attention mechanism. This paper investigates the feasibility of image compression exclusively using attention layers within our novel model, QPressFormer. We introduce the concept of learned image queries to aggregate patch information via cross-attention, followed by quantization and coding techniques. Through extensive evaluations, our work demonstrates competitive performance achieved by convolution-free architectures across the popular Kodak, DIV2K, and CLIC datasets.
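As a rough illustration of aggregating patch information into learned queries via cross-attention, the sketch below uses a standard multi-head attention layer with a set of learnable query embeddings. The dimensions, query count, and class name are illustrative assumptions, not QPressFormer's actual configuration.

```python
import torch
import torch.nn as nn

class LearnedQueryAggregator(nn.Module):
    """Cross-attention from a fixed set of learned queries to image patch tokens."""

    def __init__(self, dim: int = 256, num_queries: int = 64, num_heads: int = 8):
        super().__init__()
        self.queries = nn.Parameter(torch.randn(num_queries, dim) * 0.02)
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, patch_tokens: torch.Tensor) -> torch.Tensor:
        # patch_tokens: (B, N, dim) -> aggregated queries: (B, num_queries, dim)
        q = self.queries.unsqueeze(0).expand(patch_tokens.size(0), -1, -1)
        out, _ = self.attn(q, patch_tokens, patch_tokens)
        # The aggregated queries would then be quantized and entropy coded.
        return out

if __name__ == "__main__":
    tokens = torch.randn(2, 16 * 16, 256)   # e.g. 16x16 patch embeddings
    agg = LearnedQueryAggregator()
    print(agg(tokens).shape)                # torch.Size([2, 64, 256])
```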
https://arxiv.org/abs/2310.11265
We propose a new scheme to re-compress JPEG images losslessly. Using a JPEG image as input, the algorithm partially decodes the signal to obtain the quantized DCT coefficients and then re-compresses them in a more effective way.
https://arxiv.org/abs/2310.10517
Image codecs are typically optimized to trade off bitrate vs. distortion metrics. At low bitrates, this leads to compression artefacts which are easily perceptible, even when training with perceptual or adversarial losses. To improve image quality, and to make it less dependent on the bitrate, we propose to decode with iterative diffusion models instead of the feed-forward decoders trained with the MSE or LPIPS distortions used in most neural codecs. In addition to conditioning the model on a vector-quantized image representation, we also condition it on a global textual image description to provide additional context. We dub our model PerCo for 'perceptual compression', and compare it to state-of-the-art codecs at rates from 0.1 down to 0.003 bits per pixel. The latter rate is an order of magnitude smaller than those considered in most prior work. At this bitrate, a 512x768 Kodak image is encoded in less than 153 bytes. Despite this ultra-low bitrate, our approach maintains the ability to reconstruct realistic images. We find that our model leads to reconstructions with state-of-the-art visual quality as measured by FID and KID, and that the visual quality is less dependent on the bitrate than in previous methods.
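The byte count quoted above follows directly from the bitrate; a quick check of the arithmetic:

```python
# At 0.003 bits per pixel, a 512x768 image costs:
h, w, bpp = 512, 768, 0.003
total_bits = h * w * bpp          # 1179.648 bits
total_bytes = total_bits / 8      # ~147.5 bytes, i.e. under 153 bytes
print(total_bits, total_bytes)
```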
https://arxiv.org/abs/2310.10325
Over-fitting-based image compression requires weight compactness for compression and fast convergence for practical use, posing challenges for deep convolutional neural network (CNN) based methods. This paper presents a simple re-parameterization method to train CNNs with reduced weight storage and accelerated convergence. The convolution kernels are re-parameterized as a weighted sum of discrete cosine transform (DCT) kernels, enabling direct optimization in the frequency domain. Combined with L1 regularization, the proposed method surpasses vanilla convolutions by achieving significantly improved rate-distortion performance at low computational cost. The proposed method is verified with extensive experiments on over-fitting-based image restoration across various datasets, achieving up to -46.12% BD-rate on top of HEIF with only 200 iterations.
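The sketch below re-parameterizes a k x k convolution kernel as a weighted sum of 2D DCT basis kernels, so the learnable parameters are the DCT-domain coefficients (to which an L1 penalty can be applied). This is a minimal illustration built from the standard DCT-II basis under our own assumptions; the paper's exact construction may differ.

```python
import math

import torch
import torch.nn as nn
import torch.nn.functional as F

def dct_basis(k: int) -> torch.Tensor:
    """Return the k*k separable 2D DCT-II basis kernels, shape (k*k, k, k)."""
    n = torch.arange(k, dtype=torch.float32)
    # Orthonormal 1D DCT-II: C[u, x] = s(u) * cos(pi * (2x + 1) * u / (2k))
    c = torch.cos(math.pi * (2 * n[None, :] + 1) * n[:, None] / (2 * k))
    c[0] *= 1.0 / math.sqrt(2)
    c *= math.sqrt(2.0 / k)
    # Outer products of all row/column basis vectors give the 2D kernels.
    return torch.einsum("ux,vy->uvxy", c, c).reshape(k * k, k, k)

class DCTConv2d(nn.Module):
    """Conv layer whose kernels are weighted sums of fixed DCT basis kernels."""

    def __init__(self, in_ch: int, out_ch: int, k: int = 3):
        super().__init__()
        self.register_buffer("basis", dct_basis(k))               # (k*k, k, k)
        self.coeff = nn.Parameter(torch.zeros(out_ch, in_ch, k * k))
        nn.init.normal_(self.coeff, std=0.02)
        self.k = k

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Kernel = sum over basis elements weighted by the learned coefficients.
        weight = torch.einsum("oib,bhw->oihw", self.coeff, self.basis)
        return F.conv2d(x, weight, padding=self.k // 2)

    def l1_penalty(self) -> torch.Tensor:
        return self.coeff.abs().sum()

if __name__ == "__main__":
    conv = DCTConv2d(16, 16)
    y = conv(torch.randn(1, 16, 32, 32))
    print(y.shape, conv.l1_penalty().item())
```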
https://arxiv.org/abs/2310.08068
This research presents a novel framework for the compression and decompression of medical images utilizing the Latent Diffusion Model (LDM). The LDM represents an advancement over the denoising diffusion probabilistic model (DDPM), with the potential to yield superior image quality while requiring fewer computational resources in the image decompression process. A possible application of the LDM and Torchvision for image upscaling has been explored using medical image data, serving as an alternative to traditional image compression and decompression algorithms. The experimental outcomes demonstrate that this approach surpasses a conventional file compression algorithm, and that convolutional neural network (CNN) models trained with decompressed files perform comparably to those trained with the original image files. This approach also significantly reduces dataset size, so that datasets can be distributed at a smaller size and medical images take up much less space on medical devices. The research implications extend to noise reduction in lossy compression algorithms and a substitute for complex wavelet-based lossless algorithms.
https://arxiv.org/abs/2310.05299
Telemedicine applications have recently attracted substantial interest and shown great potential, especially after the COVID-19 pandemic. Remote expertise will help people get complex surgery done, or transfer knowledge to local surgeons, without the need to travel abroad. Even with breakthrough improvements in internet speeds, the delay in video streaming is still a hurdle for telemedicine applications. This necessitates the use of image compression and region of interest (ROI) techniques to reduce the data size and transmission needs. This paper proposes a Deep Reinforcement Learning (DRL) model that intelligently adapts the ROI size and non-ROI quality depending on the estimated throughput. Delay and structural similarity index measure (SSIM) comparisons are used to assess the DRL model. The comparison findings and the practical application reveal that DRL is capable of reducing the delay by 13% while keeping the overall quality within an acceptable range. Since the latency has been significantly reduced, these findings are a valuable enhancement to telemedicine applications.
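The snippet below sketches how such an agent's interface might look: the state is the estimated throughput, the action jointly picks an ROI size and a non-ROI quality level, and the reward trades off delay against SSIM. Everything here (state/action discretization, size and quality models, reward weights) is our own illustrative assumption, not the paper's formulation.

```python
# Discretized action space: (ROI scale of the frame, quality outside the ROI).
ROI_SCALES = [0.2, 0.4, 0.6]
NON_ROI_QUALITIES = [20, 40, 60]
ACTIONS = [(s, q) for s in ROI_SCALES for q in NON_ROI_QUALITIES]

def reward(delay_ms: float, ssim: float, alpha: float = 0.01) -> float:
    """Higher SSIM is good, higher delay is penalized."""
    return ssim - alpha * delay_ms

def step(throughput_mbps: float, action) -> float:
    """Toy environment step: estimate frame size, delay, and quality for an action."""
    roi_scale, non_roi_quality = action
    # Crude size model: the ROI is sent at high quality, the rest at low quality.
    frame_bits = 2e6 * roi_scale + 2e6 * (1 - roi_scale) * (non_roi_quality / 100)
    delay_ms = frame_bits / (throughput_mbps * 1e6) * 1000
    ssim = 0.7 + 0.2 * roi_scale + 0.001 * non_roi_quality  # toy quality model
    return reward(delay_ms, ssim)

if __name__ == "__main__":
    throughput = 5.0  # Mbps, estimated from recent transmissions
    best = max(ACTIONS, key=lambda a: step(throughput, a))
    print("greedy action:", best)  # a trained DRL policy would replace this
```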
https://arxiv.org/abs/2310.05099
Lossy image coding standards such as JPEG and MPEG have successfully achieved high compression rates for human consumption of multimedia data. However, with the increasing prevalence of IoT devices, drones, and self-driving cars, machines rather than humans are processing a greater portion of captured visual content. Consequently, it is crucial to pursue an efficient compressed representation that caters not only to human vision but also to image processing and machine vision tasks. Drawing inspiration from the efficient coding hypothesis in biological systems and the modeling of the sensory cortex in neural science, we repurpose the compressed latent representation to prioritize semantic relevance while preserving perceptual distance. Our proposed method, Compressed Perceptual Image Patch Similarity (CPIPS), can be derived at a minimal cost from a learned neural codec and computed significantly faster than DNN-based perceptual metrics such as LPIPS and DISTS.
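To convey the idea of deriving a perceptual distance from a learned codec at low cost, the sketch below compares two images by the distance between the latents of a pretrained CompressAI analysis transform. This is our own simplified analogue (using bmshj2018_factorized and a plain MSE between latents), not the actual CPIPS computation.

```python
import torch
import torch.nn.functional as F
from compressai.zoo import bmshj2018_factorized
from PIL import Image
from torchvision import transforms

net = bmshj2018_factorized(quality=3, pretrained=True).eval()

to_tensor = transforms.Compose([
    transforms.CenterCrop(256),   # keep spatial dims divisible by the codec stride
    transforms.ToTensor(),
])

def codec_latent(path: str) -> torch.Tensor:
    x = to_tensor(Image.open(path).convert("RGB")).unsqueeze(0)
    with torch.no_grad():
        return net.g_a(x)         # analysis transform -> compressed latent

def latent_distance(path_a: str, path_b: str) -> float:
    """Perceptual-style distance as MSE between codec latents (toy CPIPS analogue)."""
    return F.mse_loss(codec_latent(path_a), codec_latent(path_b)).item()

print(latent_distance("ref.png", "distorted.png"))
```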
https://arxiv.org/abs/2310.00559