Abstract
A framework for Visual Commonsense Reasoning (VCR) must choose an answer and further provide a rationale justifying that choice based on the given image and question, where the image contains all the facts needed for reasoning and must be sufficiently understood. Previous methods apply a detector to the image to obtain a set of visual objects without considering their exact positions in the scene, which is inadequate for properly understanding the spatial and semantic relationships between objects. In addition, VCR samples are quite diverse, and the framework's parameters tend to be trained suboptimally over mini-batches. To address these challenges, this paper proposes a pseudo-3D perception Transformer with multi-level confidence optimization, named PPTMCO, for VCR. Specifically, image depth is introduced to represent pseudo three-dimensional (3D) positions of objects, together with their two-dimensional (2D) coordinates in the image, and to further enhance visual features. Then, since relationships between objects are influenced by depth, a depth-aware Transformer is proposed that guides the attention mechanism with depth differences, both from answer words to objects and between objects, where each word is tagged with a pseudo depth value according to its related objects. To better optimize the framework's parameters, a model parameter estimation method is further proposed that integrates parameters optimized on mini-batches, weighted by multi-level reasoning confidence. Experiments on the benchmark VCR dataset demonstrate that the proposed framework outperforms state-of-the-art approaches.
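The depth-guided attention idea can be illustrated with a minimal sketch. The paper's exact formulation is not reproduced here; this sketch merely assumes that attention logits between query tokens (e.g. answer words tagged with pseudo depth values) and key tokens (detected objects) are penalized in proportion to their absolute depth difference. The function name, the additive bias form, and the `alpha` scaling factor are all illustrative assumptions, not the authors' definition.

```python
import numpy as np

def depth_aware_attention(Q, K, V, depths_q, depths_k, alpha=1.0):
    """Toy single-head attention biased by pairwise depth differences.

    Q: (nq, d) query features; K, V: (nk, d) key/value features.
    depths_q, depths_k: pseudo depth values per token, in [0, 1].
    alpha: assumed strength of the depth penalty (hypothetical).
    """
    d = Q.shape[-1]
    logits = Q @ K.T / np.sqrt(d)                      # standard scaled dot-product
    depth_gap = np.abs(depths_q[:, None] - depths_k[None, :])
    logits = logits - alpha * depth_gap                # closer in depth -> more attention
    # numerically stable softmax over keys
    weights = np.exp(logits - logits.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V
```

With a very large `alpha`, each query attends almost exclusively to the object whose pseudo depth is closest to its own, which is the intuition behind letting depth differences guide the attention mechanism.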
URL
https://arxiv.org/abs/2301.13335