SportsHHI: A Dataset for Human-Human Interaction Detection in Sports Videos

Abstract

Video-based visual relation detection tasks, such as video scene graph generation, play an important role in fine-grained video understanding. However, current video visual relation detection datasets have two main limitations that hinder research progress in this area. First, they do not explore complex human-human interactions in multi-person scenarios. Second, the relation types in existing datasets have relatively low-level semantics and can often be recognized from appearance or simple prior information, without the need for detailed spatio-temporal context reasoning. Nevertheless, comprehending high-level interactions between humans is crucial for understanding complex multi-person videos, such as sports and surveillance videos. To address these limitations, we propose a new video visual relation detection task, video human-human interaction detection, and build a dataset named SportsHHI for it. SportsHHI contains 34 high-level interaction classes from basketball and volleyball. In total, 118,075 human bounding boxes and 50,649 interaction instances are annotated on 11,398 keyframes. To benchmark the task, we propose a two-stage baseline method and conduct extensive experiments to reveal the key factors for a successful human-human interaction detector. We hope that SportsHHI can stimulate research on human interaction understanding in videos and promote the development of spatio-temporal context modeling techniques in video visual relation detection.
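
To make the annotation statistics above concrete, the following is a minimal sketch of how a keyframe-level interaction annotation might be represented, and of why multi-person scenes are demanding. The class names, field names, and the choice to treat interactions as ordered (subject, object) pairs are assumptions for illustration only; they are not the dataset's actual schema or the authors' baseline. Only the three counts are taken from the abstract.

```python
from dataclasses import dataclass
from itertools import permutations


@dataclass(frozen=True)
class PersonBox:
    """Hypothetical record for one annotated person in a keyframe."""
    person_id: int
    x1: float
    y1: float
    x2: float
    y2: float


@dataclass(frozen=True)
class Interaction:
    """Hypothetical record for one interaction instance (assumed ordered pair)."""
    keyframe_id: int
    subject_id: int
    object_id: int
    label: str  # one of the 34 interaction classes; names are not given in the abstract


def candidate_pairs(boxes):
    """Enumerate all ordered pairs of distinct people in a single keyframe."""
    return [(a.person_id, b.person_id) for a, b in permutations(boxes, 2)]


def per_keyframe_averages(num_boxes=118_075, num_interactions=50_649,
                          num_keyframes=11_398):
    """Average boxes and interaction instances per keyframe, from the reported counts."""
    return num_boxes / num_keyframes, num_interactions / num_keyframes


if __name__ == "__main__":
    avg_boxes, avg_interactions = per_keyframe_averages()
    print(f"boxes per keyframe: {avg_boxes:.2f}")
    print(f"interactions per keyframe: {avg_interactions:.2f}")

    # Ten people in a frame give 10 * 9 ordered candidate pairs to consider.
    people = [PersonBox(i, 0.0, 0.0, 1.0, 1.0) for i in range(10)]
    print(f"candidate pairs for 10 people: {len(candidate_pairs(people))}")
```

Under these assumptions, the reported counts work out to roughly ten annotated people per keyframe, so a detector that considers every ordered pair faces on the order of ninety candidate pairs per keyframe, of which only a handful carry an annotated interaction.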

URL

https://arxiv.org/abs/2404.04565

PDF

https://arxiv.org/pdf/2404.04565.pdf
