Exploring Sparse Spatial Relation in Graph Inference for Text-Based VQA

2023-10-13 14:39:34
Sheng Zhou, Dan Guo, Jia Li, Xun Yang, Meng Wang

Abstract

Text-based visual question answering (TextVQA) faces the significant challenge of avoiding redundant relational inference. Specifically, the large number of detected objects and optical character recognition (OCR) tokens yields a dense set of visual relationships, and existing works take all of them into account for answer prediction. However, we make three observations: (1) a single subject in an image is easily detected as multiple objects with distinct bounding boxes (considered repetitive objects), and the associations between these repetitive objects are superfluous for answer reasoning; (2) two spatially distant OCR tokens detected in an image frequently have weak semantic dependencies for answer reasoning; and (3) the co-existence of nearby objects and tokens may indicate important visual cues for predicting answers. Rather than using all relationships for answer prediction, we aim to identify the most important connections and eliminate redundant ones. We propose a sparse spatial graph network (SSGN) that introduces a spatially aware relation pruning technique to this task. As spatial factors for relation measurement, we employ spatial distance, geometric dimension, overlap area, and DIoU. We consider three types of visual relationships for graph learning: object-object, OCR token-OCR token, and object-OCR token relationships. SSGN is a progressive graph learning architecture that verifies the pivotal relations first in the correlated object-token sparse graph, and then in the respective object-based and token-based sparse graphs. Experimental results on the TextVQA and ST-VQA datasets demonstrate that SSGN achieves promising performance. Visualization results further demonstrate the interpretability of our method.
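
To make the idea of spatially aware pruning concrete, here is a minimal sketch of how one of the named spatial factors, DIoU, might be used to sparsify a relation graph over bounding boxes. It uses the standard DIoU definition (IoU minus the squared center distance divided by the squared diagonal of the smallest enclosing box). The function names `diou` and `prune_edges`, the box format `(x1, y1, x2, y2)`, and the thresholds `low` and `high` are all assumptions made for illustration; the abstract does not specify the actual pruning rule used by SSGN.

```python
from typing import List, Tuple

Box = Tuple[float, float, float, float]  # assumed format: (x1, y1, x2, y2)


def diou(a: Box, b: Box) -> float:
    """Distance-IoU of two boxes: IoU - d^2 / c^2, in the range (-1, 1]."""
    ax1, ay1, ax2, ay2 = a
    bx1, by1, bx2, by2 = b

    # Overlap area (zero if the boxes do not intersect).
    iw = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    ih = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = iw * ih
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - inter
    iou = inter / union if union > 0 else 0.0

    # Squared distance between the two box centers.
    d2 = ((ax1 + ax2) / 2 - (bx1 + bx2) / 2) ** 2 \
        + ((ay1 + ay2) / 2 - (by1 + by2) / 2) ** 2
    # Squared diagonal of the smallest box enclosing both.
    c2 = (max(ax2, bx2) - min(ax1, bx1)) ** 2 \
        + (max(ay2, by2) - min(ay1, by1)) ** 2
    return iou - d2 / c2 if c2 > 0 else iou


def prune_edges(boxes: List[Box], low: float = -0.5,
                high: float = 0.9) -> List[Tuple[int, int]]:
    """Keep edge (i, j) only if low <= DIoU <= high.

    Hypothetical rule: DIoU above `high` suggests near-duplicate
    (repetitive) boxes; DIoU below `low` suggests spatially distant boxes.
    Both thresholds are illustrative, not taken from the paper.
    """
    edges = []
    for i in range(len(boxes)):
        for j in range(i + 1, len(boxes)):
            if low <= diou(boxes[i], boxes[j]) <= high:
                edges.append((i, j))
    return edges


if __name__ == "__main__":
    boxes = [
        (0, 0, 10, 10),         # an object
        (0, 0, 10, 10.5),       # near-duplicate detection of the same object
        (12, 0, 22, 10),        # a nearby box, e.g. an OCR token
        (100, 100, 110, 110),   # a far-away box
    ]
    print(prune_edges(boxes))
```

In this toy input, the edge between the two near-duplicate boxes and every edge to the far-away box fall outside the threshold band, so only the edges linking each of the first two boxes to the nearby third box remain. This mirrors observations (1) to (3) above only in spirit; the paper combines DIoU with spatial distance, geometric dimension, and overlap area.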

URL

https://arxiv.org/abs/2310.09147

PDF

https://arxiv.org/pdf/2310.09147.pdf
