Abstract
Text-based visual question answering (TextVQA) faces the significant challenge of avoiding redundant relational inference. Specifically, the large number of detected objects and optical character recognition (OCR) tokens gives rise to rich visual relationships. Existing works take all of these relationships into account for answer prediction. However, we make three observations: (1) a single subject in an image is often detected as multiple objects with distinct bounding boxes (referred to as repetitive objects), and the associations among these repetitive objects are superfluous for answer reasoning; (2) two spatially distant OCR tokens in an image usually have weak semantic dependencies for answer reasoning; and (3) the co-occurrence of nearby objects and tokens may provide important visual cues for predicting answers. Rather than using all relationships for answer prediction, we aim to identify the most important connections and eliminate redundant ones. We propose a sparse spatial graph network (SSGN) that introduces a spatially aware relation pruning technique to this task. As spatial factors for relation measurement, we employ spatial distance, geometric dimension, overlap area, and DIoU for spatially aware pruning. We consider three types of visual relationships for graph learning: object-object, token-token, and object-token relationships. SSGN is a progressive graph learning architecture that verifies the pivotal relations first in the correlated object-token sparse graph, and then in the object-based and token-based sparse graphs respectively. Experimental results on the TextVQA and ST-VQA datasets demonstrate that SSGN achieves promising performance, and visualization results further demonstrate the interpretability of our method.
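The abstract lists DIoU among the spatial factors used for relation pruning. As an illustration only (the paper's actual pruning procedure is not given here), the sketch below computes the standard distance-IoU between two axis-aligned boxes in `(x1, y1, x2, y2)` format; the function name and threshold usage are our own assumptions:

```python
def diou(box_a, box_b):
    """Distance-IoU between two boxes: IoU minus the squared center
    distance normalized by the squared diagonal of the enclosing box."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    # Intersection and union areas for the plain IoU term.
    iw = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    ih = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = iw * ih
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - inter
    iou = inter / union if union > 0 else 0.0
    # Squared distance between the two box centers.
    d2 = ((ax1 + ax2) / 2 - (bx1 + bx2) / 2) ** 2 \
       + ((ay1 + ay2) / 2 - (by1 + by2) / 2) ** 2
    # Squared diagonal of the smallest box enclosing both inputs.
    c2 = (max(ax2, bx2) - min(ax1, bx1)) ** 2 \
       + (max(ay2, by2) - min(ay1, by1)) ** 2
    return iou - d2 / c2 if c2 > 0 else iou
```

A pruning rule in the spirit of the abstract might keep an edge only when `diou(box_a, box_b)` exceeds some learned or tuned threshold, so that distant, non-overlapping pairs (which score negatively) are dropped.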
URL
https://arxiv.org/abs/2310.09147