Abstract
Visual reasoning refers to the task of answering questions about visual information. Current visual reasoning methods typically employ pre-trained vision-language model (VLM) strategies or deep neural network approaches. However, existing efforts are constrained by limited reasoning interpretability and are hindered by underspecification in the question text. Additionally, the absence of fine-grained visual knowledge limits the precise understanding of subject behavior in visual reasoning tasks. To address these issues, we propose VIKSER (Visual Knowledge-Driven Self-Reinforcing Reasoning Framework). Specifically, VIKSER, trained using knowledge distilled from large language models, extracts fine-grained visual knowledge with the assistance of visual relationship detection techniques. Subsequently, VIKSER uses this fine-grained visual knowledge to paraphrase underspecified questions. Additionally, we design a novel prompting method called Chain-of-Evidence (CoE), which leverages the power of "evidence for reasoning" to endow VIKSER with interpretable reasoning capabilities. Meanwhile, the integration of self-reflection technology gives VIKSER the ability to learn and improve from its mistakes. Experiments conducted on widely used datasets demonstrate that VIKSER achieves new state-of-the-art (SOTA) results in relevant tasks.
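The four stages named in the abstract (knowledge extraction, question paraphrasing, evidence-backed reasoning, self-reflection) can be read as a simple control loop. The following is a minimal sketch of that reading only, not the authors' implementation: every name here (`run_pipeline`, `ReasoningResult`, the four callables, `max_attempts`, and the toy stubs) is hypothetical, and the real VIKSER components are trained models rather than plain functions.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional


@dataclass
class ReasoningResult:
    """Hypothetical container: an answer plus the evidence cited for it."""
    answer: str
    evidence: List[str] = field(default_factory=list)
    attempts: int = 1


def run_pipeline(
    image: object,
    question: str,
    extract_knowledge: Callable[[object], List[str]],
    paraphrase: Callable[[str, List[str]], str],
    reason: Callable[[str, List[str], List[str]], ReasoningResult],
    critique: Callable[[ReasoningResult], Optional[str]],
    max_attempts: int = 3,
) -> ReasoningResult:
    """Assumed control flow: extract, paraphrase, then reason with retries."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")

    # Stage 1: fine-grained visual knowledge (e.g. relationship triples as text).
    knowledge = extract_knowledge(image)
    # Stage 2: rewrite the underspecified question using that knowledge.
    specified_question = paraphrase(question, knowledge)

    reflections: List[str] = []  # feedback accumulated from failed attempts
    result = ReasoningResult(answer="")
    for attempt in range(1, max_attempts + 1):
        # Stage 3: produce an answer together with its supporting evidence.
        result = reason(specified_question, knowledge, reflections)
        result.attempts = attempt
        # Stage 4: self-reflection; None means the answer is accepted.
        feedback = critique(result)
        if feedback is None:
            break
        reflections.append(feedback)
    return result


# Toy stubs standing in for the real components (illustrative assumptions).
def toy_extract(image: object) -> List[str]:
    return ["man holds umbrella", "ground is wet"]


def toy_paraphrase(question: str, knowledge: List[str]) -> str:
    return f"{question} (context: {'; '.join(knowledge)})"


def toy_reason(
    question: str, knowledge: List[str], reflections: List[str]
) -> ReasoningResult:
    # The first attempt cites nothing; after feedback it cites the knowledge.
    if not reflections:
        return ReasoningResult(answer="unknown")
    return ReasoningResult(answer="it is raining", evidence=list(knowledge))


def toy_critique(result: ReasoningResult) -> Optional[str]:
    return None if result.evidence else "answer cites no evidence"
```

Passing the stages in as callables keeps the sketch independent of any particular model; with the toy stubs, the first attempt is rejected for citing no evidence and the second is accepted.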
Abstract (translated)
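Visual reasoning refers to answering questions about visual information. Current visual reasoning methods typically adopt pre-trained vision-language model (VLM) strategies or deep neural network approaches. However, existing efforts are limited by weak interpretability, and underspecification in the question text further hinders progress. In addition, the lack of fine-grained visual knowledge limits precise understanding of subject behavior in visual reasoning tasks. To address these problems, we propose VIKSER (Visual Knowledge-Driven Self-Reinforcing Reasoning Framework). Specifically, VIKSER is trained with knowledge distilled from large language models and extracts fine-grained visual knowledge with the help of visual relationship detection techniques. VIKSER then uses this fine-grained visual knowledge to rewrite underspecified questions. We also design a new prompting method called Chain-of-Evidence (CoE), which gives VIKSER interpretable reasoning capabilities by drawing on "evidence for reasoning". Meanwhile, the integration of self-reflection techniques enables VIKSER to learn from its mistakes and improve its performance. Experiments on widely used datasets show that VIKSER achieves new state-of-the-art (SOTA) results on the relevant tasks.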
URL
https://arxiv.org/abs/2502.00711