Abstract
State-of-the-art visual relation detection methods rely on features extracted from RGB images, including objects' 2D positions. In this paper, we argue that the 3D positions of objects in space can provide additional valuable information about object relations. This information helps to detect not only spatial relations, such as "standing behind", but also non-spatial relations, such as "holding". Since 3D information about a scene is not easily accessible, we propose incorporating a pre-trained RGB-to-Depth model within visual relation detection frameworks. We discuss different strategies for extracting features from depth maps and show their critical role in relation detection. Our experiments confirm that the performance of state-of-the-art visual relation detection approaches can be significantly improved by utilizing depth map information.
URL
https://arxiv.org/abs/1905.00966