Abstract
Recent Multimodal Large Language Models (MLLMs) excel on benchmark vision-language tasks, yet little is known about how input visual quality shapes their responses. Does higher perceptual image quality translate to better MLLM understanding? We conduct the first systematic study spanning leading MLLMs and a suite of vision-language benchmarks, applying controlled degradations and stylistic shifts to each image. Surprisingly, we uncover a visual-quality paradox: model-, task-, and even instance-level performance can improve when images deviate from human-perceived fidelity. Off-the-shelf restoration pipelines fail to reconcile these idiosyncratic preferences. To close the gap, we introduce Visual-Quality Test-Time Tuning (VQ-TTT), a lightweight adaptation module that (1) inserts a learnable, low-rank kernel before the frozen vision encoder to modulate frequency content, and (2) fine-tunes only shallow vision-encoder layers via LoRA. VQ-TTT dynamically adjusts each input image in a single forward pass, aligning it with task-specific model preferences. Across the evaluated MLLMs and all datasets, VQ-TTT delivers significant average accuracy gains with no external models, cached features, or extra training data. These findings redefine what counts as a "better" visual input for MLLMs and highlight the need for adaptive, rather than universally "clean", imagery in the new era in which AI is the main data customer.
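The following is a minimal sketch (not the authors' released code) of the two components the abstract describes, assuming a PyTorch CLIP-style vision tower whose transformer blocks are exposed as `vision_encoder.blocks`; the attribute name, module structure, and hyperparameters are illustrative assumptions.

```python
# Sketch of a VQ-TTT-style adapter: a learnable low-rank image-space kernel in
# front of a frozen vision encoder, plus LoRA on its shallow blocks.
# Assumptions: PyTorch; the encoder exposes its blocks as `.blocks` (hypothetical).
import torch
import torch.nn as nn
import torch.nn.functional as F


class LowRankImageKernel(nn.Module):
    """Learnable low-rank depthwise kernel applied to the image before the
    frozen encoder, acting as a tunable frequency filter."""

    def __init__(self, channels: int = 3, kernel_size: int = 9, rank: int = 2):
        super().__init__()
        self.u = nn.Parameter(1e-2 * torch.randn(channels, rank, kernel_size))
        self.v = nn.Parameter(1e-2 * torch.randn(channels, rank, kernel_size))
        self.kernel_size = kernel_size

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Per-channel low-rank 2D kernel: K_c = sum_r u_{c,r} v_{c,r}^T.
        k = torch.einsum("crk,crl->ckl", self.u, self.v)
        # Add an identity (delta) kernel so the module starts as a no-op.
        delta = torch.zeros_like(k)
        delta[:, self.kernel_size // 2, self.kernel_size // 2] = 1.0
        w = (k + delta).unsqueeze(1)  # shape (C, 1, K, K)
        # Depthwise convolution: each channel is filtered by its own kernel.
        return F.conv2d(x, w, padding=self.kernel_size // 2, groups=x.shape[1])


class LoRALinear(nn.Module):
    """Frozen linear layer plus a trainable low-rank update (standard LoRA)."""

    def __init__(self, base: nn.Linear, r: int = 8, alpha: float = 16.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad_(False)
        self.A = nn.Parameter(1e-2 * torch.randn(r, base.in_features))
        self.B = nn.Parameter(torch.zeros(base.out_features, r))  # starts as no-op
        self.scale = alpha / r

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.base(x) + self.scale * F.linear(F.linear(x, self.A), self.B)


def _wrap_linears(module: nn.Module) -> None:
    # Recursively replace nn.Linear submodules with LoRA-wrapped versions.
    for name, child in list(module.named_children()):
        if isinstance(child, nn.Linear):
            setattr(module, name, LoRALinear(child))
        else:
            _wrap_linears(child)


def attach_vq_ttt(vision_encoder: nn.Module, num_shallow_blocks: int = 4):
    """Freeze the encoder, apply LoRA to its first few blocks, and return the
    image-space kernel to prepend to the vision pipeline."""
    for p in vision_encoder.parameters():
        p.requires_grad_(False)
    for block in list(vision_encoder.blocks)[:num_shallow_blocks]:  # assumed attr
        _wrap_linears(block)
    return LowRankImageKernel()
```

Under this reading, only the image-space kernel and the LoRA factors receive gradients, consistent with the abstract's claim that the vision encoder itself stays frozen and each input image is adapted on the fly.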
URL
https://arxiv.org/abs/2506.15645