Paper Reading AI Learner

VLIC: Vision-Language Models As Perceptual Judges for Human-Aligned Image Compression

2025-12-17 18:52:55
Kyle Sargent, Ruiqi Gao, Philipp Henzler, Charles Herrmann, Aleksander Holynski, Li Fei-Fei, Jiajun Wu, Jason Zhang

Abstract

Evaluations of image compression performance that include human preferences have generally found that naive distortion functions such as MSE are insufficiently aligned with human perception. To align compression models with human perception, prior work has employed differentiable perceptual losses consisting of neural networks calibrated on large-scale datasets of human psycho-visual judgments. We show that, surprisingly, state-of-the-art vision-language models (VLMs) can replicate binary human two-alternative forced choice (2AFC) judgments zero-shot when asked to reason about the differences between pairs of images. Motivated to exploit the powerful zero-shot visual reasoning capabilities of VLMs, we propose Vision-Language Models for Image Compression (VLIC), a diffusion-based image compression system designed to be post-trained with binary VLM judgments. VLIC leverages existing techniques for diffusion model post-training with preferences, rather than distilling the VLM judgments into a separate perceptual loss network. We show that calibrating this system on VLM judgments produces competitive or state-of-the-art performance on human-aligned visual compression depending on the dataset, according to perceptual metrics and large-scale user studies. We additionally conduct an extensive analysis of the VLM-based reward design and training procedure and share important insights. More visuals are available at this https URL

URL

https://arxiv.org/abs/2512.15701

PDF

https://arxiv.org/pdf/2512.15701.pdf
