Paper Reading AI Learner

Dual-Stage Value-Guided Inference with Margin-Based Reward Adjustment for Fast and Faithful VLM Captioning

2025-06-18 17:23:36
Ankan Deria, Adinath Madhavrao Dukre, Feilong Tang, Sara Atito, Sudipta Roy, Muhammad Awais, Muhammad Haris Khan, Imran Razzak

Abstract

Despite significant advances in inference-time search for vision-language models (VLMs), existing approaches remain both computationally expensive and prone to unpenalized, low-confidence generations that often lead to persistent hallucinations. We introduce \textbf{Value-guided Inference with Margin-based Reward (ViMaR)}, a two-stage inference framework that improves both efficiency and output fidelity by combining a temporal-difference value model with a margin-aware reward adjustment. In the first stage, we perform a single pass to identify the highest-value caption among diverse candidates. In the second stage, we selectively refine only those segments that were overlooked or exhibit weak visual grounding, thereby eliminating the need for frequent reward evaluations. A calibrated margin-based penalty discourages low-confidence continuations while preserving descriptive richness. Extensive experiments across multiple VLM architectures demonstrate that ViMaR generates captions that are significantly more reliable, factually accurate, detailed, and explanatory, while achieving over 4$\times$ speedup compared to existing value-guided methods. Specifically, we show that ViMaR, trained solely on LLaVA Mistral-7B, \textit{generalizes effectively to guide decoding in a stronger unseen model}. To further validate this, we adapt ViMaR to steer generation in LLaVA-OneVision-Qwen2-7B, leading to consistent improvements in caption quality and demonstrating robust cross-model guidance. This cross-model generalization highlights ViMaR's flexibility and modularity, positioning it as a scalable and transferable inference-time decoding strategy. Furthermore, when ViMaR-generated captions are used for self-training, the underlying models achieve substantial gains across a broad suite of visual comprehension benchmarks, underscoring the potential of fast, accurate, and self-improving VLM pipelines.
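
The abstract describes a two-stage, value-guided decoding loop with a margin-based penalty. The sketch below is a minimal illustration of that control flow only, assuming hypothetical interfaces (generate_candidates, segment_value, segment_confidence, refine_segment); it is not the authors' released implementation, and the margin/penalty formula is a plausible placeholder rather than the paper's exact reward adjustment.

```python
# Illustrative sketch of a ViMaR-style two-stage, value-guided decode.
# All callables and hyperparameters are hypothetical placeholders.
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Caption:
    segments: List[str]  # caption split into segments (e.g., sentences)

    def text(self) -> str:
        return " ".join(self.segments)


def margin_adjusted_value(value: float, confidence: float,
                          margin: float = 0.1, penalty: float = 1.0) -> float:
    """Penalize segments whose confidence falls below a calibrated margin."""
    shortfall = max(0.0, margin - confidence)
    return value - penalty * shortfall


def two_stage_decode(
    image: object,
    generate_candidates: Callable[[object, int], List[Caption]],  # VLM sampler
    segment_value: Callable[[object, str], float],                # value model score
    segment_confidence: Callable[[object, str], float],           # model confidence
    refine_segment: Callable[[object, Caption, int], str],        # re-sample one segment
    n_candidates: int = 4,
    threshold: float = 0.5,
) -> Caption:
    def caption_score(c: Caption) -> float:
        return sum(
            margin_adjusted_value(segment_value(image, s), segment_confidence(image, s))
            for s in c.segments
        )

    # Stage 1: single pass over diverse candidates, keep the highest-value caption.
    best = max(generate_candidates(image, n_candidates), key=caption_score)

    # Stage 2: selectively refine only weakly grounded or low-scoring segments.
    for i, seg in enumerate(best.segments):
        score = margin_adjusted_value(segment_value(image, seg),
                                      segment_confidence(image, seg))
        if score < threshold:
            candidate = refine_segment(image, best, i)
            new_score = margin_adjusted_value(segment_value(image, candidate),
                                              segment_confidence(image, candidate))
            if new_score > score:
                best.segments[i] = candidate
    return best
```

Scoring at the segment level is what makes the second stage cheap in this sketch: only segments that fall below the threshold are re-sampled and re-scored, instead of re-evaluating every full candidate caption.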

Abstract (translated)

Despite significant advances in inference-time search for vision-language models (VLMs), existing methods remain both computationally expensive and prone to unpenalized, low-confidence generations, which often lead to persistent hallucinations. We introduce **Value-guided Inference with Margin-based Reward (ViMaR)**, a two-stage inference framework that improves efficiency and output fidelity by combining a temporal-difference value model with a margin-aware reward adjustment. In the first stage, a single pass identifies the highest-value caption among diverse candidates. The second stage selectively refines only those segments that were overlooked or weakly grounded in the image, thereby removing the cost of frequent reward evaluations. A calibrated margin-based penalty suppresses low-confidence continuations while preserving descriptive richness. Extensive experiments across multiple VLM architectures show that ViMaR produces captions that are more reliable, factually accurate, detailed, and explanatory, while achieving more than a 4x speedup over existing value-guided methods. In particular, we show that ViMaR trained only on LLaVA Mistral-7B can **effectively guide decoding in a stronger, unseen model**. To further verify this, we adapt ViMaR to steer generation in LLaVA-OneVision-Qwen2-7B, yielding consistent improvements in caption quality and demonstrating robust cross-model guidance. This cross-model generalization highlights ViMaR's flexibility and modularity, positioning it as a scalable and transferable inference-time decoding strategy. Moreover, when ViMaR-generated captions are used for self-training, the underlying models achieve substantial gains across a broad suite of visual comprehension benchmarks, underscoring the potential of fast, accurate, and self-improving VLM pipelines. In short, ViMaR not only improves the quality and efficiency of VLM outputs, but also demonstrates strong cross-model generalization and promise as an effective self-improvement strategy.

URL

https://arxiv.org/abs/2506.15649

PDF

https://arxiv.org/pdf/2506.15649.pdf

