Paper Reading AI Learner

Demystifying the Visual Quality Paradox in Multimodal Large Language Models

2025-06-18 17:14:07
Shuo Xing, Lanqing Guo, Hongyuan Hua, Seoyoung Lee, Peiran Li, Yufei Wang, Zhangyang Wang, Zhengzhong Tu

Abstract

Recent Multimodal Large Language Models (MLLMs) excel on benchmark vision-language tasks, yet little is known about how input visual quality shapes their responses. Does higher perceptual quality of images already translate to better MLLM understanding? We conduct the first systematic study spanning leading MLLMs and a suite of vision-language benchmarks, applying controlled degradations and stylistic shifts to each image. Surprisingly, we uncover a visual-quality paradox: model, task, and even individual-instance performance can improve when images deviate from human-perceived fidelity. Off-the-shelf restoration pipelines fail to reconcile these idiosyncratic preferences. To close the gap, we introduce Visual-Quality Test-Time Tuning (VQ-TTT), a lightweight adaptation module that: (1) inserts a learnable, low-rank kernel before the frozen vision encoder to modulate frequency content; and (2) fine-tunes only shallow vision-encoder layers via LoRA. VQ-TTT dynamically adjusts each input image in a single forward pass, aligning it with task-specific model preferences. Across the evaluated MLLMs and all datasets, VQ-TTT significantly lifts average accuracy, with no external models, cached features, or extra training data. These findings redefine "better" visual inputs for MLLMs and highlight the need for adaptive, rather than universally "clean", imagery in the new era of AI being the main data customer.
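The two-part design described in the abstract lends itself to a compact sketch. The Python code below is a minimal, illustrative mock-up of the idea, not the authors' implementation: a learnable rank-1 separable kernel (built as the outer product of two 1-D filters, initialised to an identity kernel) filters the image before the frozen vision encoder, and a generic LoRA wrapper shows how only the linear layers of shallow encoder blocks would receive trainable low-rank updates. The kernel size, LoRA rank, scaling, and all class and parameter names are assumptions made for illustration.

# Minimal, illustrative sketch of the VQ-TTT idea (not the authors' code).
# Assumptions for illustration: rank-1 separable kernel, kernel_size=9,
# LoRA rank=4, and all class/parameter names used here.
import torch
import torch.nn as nn
import torch.nn.functional as F


class LowRankImageKernel(nn.Module):
    """Learnable low-rank (separable) filter applied to the input image
    before the frozen vision encoder, modulating its frequency content."""

    def __init__(self, kernel_size: int = 9):
        super().__init__()
        # Rank-1 kernel: outer product of two learnable 1-D filters.
        self.k_row = nn.Parameter(torch.zeros(kernel_size))
        self.k_col = nn.Parameter(torch.zeros(kernel_size))
        # Start from an identity (delta) kernel so tuning begins at the original image.
        centre = kernel_size // 2
        self.k_row.data[centre] = 1.0
        self.k_col.data[centre] = 1.0

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (B, C, H, W); the same 2-D kernel is applied depthwise to every channel.
        k2d = torch.outer(self.k_row, self.k_col)                  # (k, k)
        channels = x.shape[1]
        weight = k2d.expand(channels, 1, -1, -1).contiguous()      # (C, 1, k, k)
        return F.conv2d(x, weight, padding=k2d.shape[-1] // 2, groups=channels)


class LoRALinear(nn.Module):
    """Frozen linear layer plus a trainable low-rank (LoRA) update, as one
    would wrap the linear layers of shallow vision-encoder blocks."""

    def __init__(self, base: nn.Linear, rank: int = 4, alpha: float = 8.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False                                # keep the encoder frozen
        self.lora_a = nn.Parameter(0.01 * torch.randn(rank, base.in_features))
        self.lora_b = nn.Parameter(torch.zeros(base.out_features, rank))
        self.scale = alpha / rank

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.base(x) + self.scale * F.linear(F.linear(x, self.lora_a), self.lora_b)


if __name__ == "__main__":
    kernel = LowRankImageKernel()
    image = torch.rand(1, 3, 224, 224)
    adjusted = kernel(image)         # single forward pass over the adjusted image
    print(adjusted.shape)            # torch.Size([1, 3, 224, 224])

    wrapped = LoRALinear(nn.Linear(768, 768))
    tokens = torch.rand(1, 196, 768)
    print(wrapped(tokens).shape)     # torch.Size([1, 196, 768])

In a full MLLM pipeline, the kernel output would presumably feed the frozen vision encoder whose first few blocks have their linear layers wrapped in the LoRA module, and only these small added parameters would be tuned at test time with the task objective, consistent with the abstract's description.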

Abstract (translated)

Recent Multimodal Large Language Models (MLLMs) perform impressively on vision-language benchmark tasks, yet little is known about how input image quality affects their responses. Do higher-perceptual-quality images already translate into better MLLM understanding? We conduct the first systematic study, covering leading MLLMs and a suite of vision-language benchmarks, and apply controlled degradations and stylistic shifts to each image. Surprisingly, we uncover a visual-quality paradox: model, task, and even individual-instance performance can improve when images deviate from human-perceived fidelity. Off-the-shelf restoration pipelines cannot resolve these idiosyncratic preferences. To close this gap, we introduce Visual-Quality Test-Time Tuning (VQ-TTT), a lightweight adaptation module that: (1) inserts a learnable low-rank kernel before the frozen vision encoder to modulate frequency content; and (2) fine-tunes only shallow vision-encoder layers via LoRA. VQ-TTT dynamically adjusts each input image in a single forward pass, aligning it with task-specific model preferences. Across all evaluated MLLMs and datasets, VQ-TTT significantly improves average accuracy and requires no external models, cached features, or extra training data. These findings redefine what "better" visual input means for MLLMs and highlight the need for adaptive, rather than universally "clean", imagery in the new era in which AI is the main data customer.

URL

https://arxiv.org/abs/2506.15645

PDF

https://arxiv.org/pdf/2506.15645.pdf

