Paper Reading AI Learner

VIKSER: Visual Knowledge-Driven Self-Reinforcing Reasoning Framework

2025-02-02 07:54:55
Chunbai Zhang, Chao Wang, Yang Zhou, Yan Peng

Abstract

Visual reasoning refers to the task of solving questions about visual information. Current visual reasoning methods typically employ pre-trained vision-language model (VLM) strategies or deep neural network approaches. However, existing efforts are constrained by limited reasoning interpretability, while hindering by the phenomenon of underspecification in the question text. Additionally, the absence of fine-grained visual knowledge limits the precise understanding of subject behavior in visual reasoning tasks. To address these issues, we propose VIKSER (Visual Knowledge-Driven Self-Reinforcing Reasoning Framework). Specifically, VIKSER, trained using knowledge distilled from large language models, extracts fine-grained visual knowledge with the assistance of visual relationship detection techniques. Subsequently, VIKSER utilizes fine-grained visual knowledge to paraphrase the question with underspecification. Additionally, we design a novel prompting method called Chain-of-Evidence (CoE), which leverages the power of ``evidence for reasoning'' to endow VIKSER with interpretable reasoning capabilities. Meanwhile, the integration of self-reflection technology empowers VIKSER with the ability to learn and improve from its mistakes. Experiments conducted on widely used datasets demonstrate that VIKSER achieves new state-of-the-art (SOTA) results in relevant tasks.

Abstract (translated)

视觉推理指的是解决关于视觉信息的问题。目前的视觉推理方法通常采用预训练的视觉-语言模型(VLM)策略或深度神经网络方法。然而,现有努力受到解释能力有限的限制,并且在问题文本中存在欠规范化的现象,这进一步阻碍了进展。此外,缺乏细粒度的视觉知识也限制了对视觉推理任务中主体行为的精确理解。为了解决这些问题,我们提出了VIKSER(基于视觉知识的自我强化推理框架)。具体而言,VIKSER通过使用从大型语言模型中提炼的知识进行训练,并借助视觉关系检测技术提取细粒度的视觉知识。随后,VIKSER利用这些细粒度的视觉知识对欠规范化的提问进行改写。此外,我们设计了一种新的提示方法叫做证据链(CoE),该方法通过发挥“用于推理的证据”的作用赋予VIKSER可解释的推理能力。同时,自我反思技术的集成使VIKSER能够从错误中学习并提高性能。在广泛使用的数据集上进行的实验表明,VIKSER在相关任务中达到了新的最先进的(SOTA)结果。

URL

https://arxiv.org/abs/2502.00711

PDF

https://arxiv.org/pdf/2502.00711.pdf


Tags
3D Action Action_Localization Action_Recognition Activity Adversarial Agent Attention Autonomous Bert Boundary_Detection Caption Chat Classification CNN Compressive_Sensing Contour Contrastive_Learning Deep_Learning Denoising Detection Dialog Diffusion Drone Dynamic_Memory_Network Edge_Detection Embedding Embodied Emotion Enhancement Face Face_Detection Face_Recognition Facial_Landmark Few-Shot Gait_Recognition GAN Gaze_Estimation Gesture Gradient_Descent Handwriting Human_Parsing Image_Caption Image_Classification Image_Compression Image_Enhancement Image_Generation Image_Matting Image_Retrieval Inference Inpainting Intelligent_Chip Knowledge Knowledge_Graph Language_Model LLM Matching Medical Memory_Networks Multi_Modal Multi_Task NAS NMT Object_Detection Object_Tracking OCR Ontology Optical_Character Optical_Flow Optimization Person_Re-identification Point_Cloud Portrait_Generation Pose Pose_Estimation Prediction QA Quantitative Quantitative_Finance Quantization Re-identification Recognition Recommendation Reconstruction Regularization Reinforcement_Learning Relation Relation_Extraction Represenation Represenation_Learning Restoration Review RNN Robot Salient Scene_Classification Scene_Generation Scene_Parsing Scene_Text Segmentation Self-Supervised Semantic_Instance_Segmentation Semantic_Segmentation Semi_Global Semi_Supervised Sence_graph Sentiment Sentiment_Classification Sketch SLAM Sparse Speech Speech_Recognition Style_Transfer Summarization Super_Resolution Surveillance Survey Text_Classification Text_Generation Time_Series Tracking Transfer_Learning Transformer Unsupervised Video_Caption Video_Classification Video_Indexing Video_Prediction Video_Retrieval Visual_Relation VQA Weakly_Supervised Zero-Shot