Abstract
Recent advances in text-guided image compression have shown great potential to enhance the perceptual quality of reconstructed images. These methods, however, tend to suffer significantly degraded pixel-wise fidelity, limiting their practicality. To fill this gap, we develop a new text-guided image compression algorithm that achieves both high perceptual quality and high pixel-wise fidelity. In particular, we propose a compression framework that leverages text information mainly through text-adaptive encoding and training with a joint image-text loss. By doing so, we avoid decoding with text-guided generative models, which are known for high generative diversity, and effectively utilize the semantic information of text at a global level. Experimental results on various datasets show that our method achieves high pixel-level and perceptual quality with either human- or machine-generated captions. In particular, our method outperforms all baselines in terms of LPIPS, with further gains when more carefully generated captions are used.
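The abstract mentions training with a joint image-text loss. As a minimal illustrative sketch (not the paper's actual formulation), such a loss might combine a pixel-wise distortion term with a text-alignment term, e.g. one minus the cosine similarity between an image embedding of the reconstruction and a text embedding of the caption; the weighting `lam` and the embedding sources are assumptions here.

```python
import numpy as np

def joint_image_text_loss(x, x_hat, img_emb, txt_emb, lam=0.1):
    """Hypothetical joint image-text loss (illustration only):
    pixel-wise MSE between original x and reconstruction x_hat,
    plus lam * (1 - cosine similarity) between an image embedding
    of x_hat and a text embedding of the caption."""
    mse = float(np.mean((np.asarray(x) - np.asarray(x_hat)) ** 2))
    cos = float(
        np.dot(img_emb, txt_emb)
        / (np.linalg.norm(img_emb) * np.linalg.norm(txt_emb))
    )
    return mse + lam * (1.0 - cos)

# Toy usage: perfect reconstruction with perfectly aligned embeddings
# gives zero loss; mismatches in pixels or semantics raise it.
x = np.zeros((4, 4))
e = np.array([1.0, 0.0])
print(joint_image_text_loss(x, x, e, e))  # 0.0
```

In practice the embeddings would come from a pretrained vision-language model, but any differentiable alignment score fits the same template.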
URL
https://arxiv.org/abs/2403.02944