Event-based Human Pose Tracking by Spiking Spatiotemporal Transformer

Abstract

The event camera, an emerging biologically inspired vision sensor for capturing motion dynamics, presents new potential for 3D human pose tracking, or video-based 3D human pose estimation. However, existing works in pose tracking either require additional gray-scale images to establish a solid starting pose, or ignore temporal dependencies altogether by collapsing segments of event streams into static image frames. Meanwhile, although the effectiveness of Artificial Neural Networks (ANNs, a.k.a. dense deep learning) has been showcased in many event-based tasks, the use of ANNs tends to neglect the fact that, compared to dense frame-based image sequences, the events produced by an event camera are spatiotemporally much sparser. Motivated by the above-mentioned issues, we present in this paper a dedicated end-to-end sparse deep learning approach for event-based pose tracking: 1) to our knowledge, this is the first time that 3D human pose tracking is obtained from events only, thus eliminating the need for any frame-based images as part of the input; 2) our approach is based entirely upon the framework of Spiking Neural Networks (SNNs), and consists of a Spike-Element-Wise (SEW) ResNet and our proposed spiking spatiotemporal transformer; 3) a large-scale synthetic dataset, named SynEventHPD, is constructed that features a broad and diverse set of annotated 3D human motions, as well as longer hours of event stream data. Empirical experiments demonstrate the superiority of our approach in both performance and efficiency measures. For example, with comparable performance to the state-of-the-art ANN counterparts, our approach achieves a computation reduction of 20% in FLOPS. Our implementation is made available at this https URL and the dataset will be released upon paper acceptance.
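
To make the contrast between collapsing an event stream into a static frame and keeping its temporal structure concrete, here is a minimal NumPy sketch. It is not the paper's preprocessing pipeline: the function names, the `(x, y, t)` event layout, the omission of polarity, and the toy 4x4 sensor are all assumptions chosen for illustration.

```python
import numpy as np

def events_to_frame(events, height, width):
    """Collapse all events into one count image; temporal order is discarded."""
    frame = np.zeros((height, width), dtype=np.int32)
    xs = events[:, 0].astype(int)
    ys = events[:, 1].astype(int)
    np.add.at(frame, (ys, xs), 1)  # accumulate, even for repeated pixels
    return frame

def events_to_bins(events, height, width, num_bins, t_start, t_end):
    """Keep temporal order: one binary spike map per time bin."""
    spikes = np.zeros((num_bins, height, width), dtype=np.uint8)
    xs = events[:, 0].astype(int)
    ys = events[:, 1].astype(int)
    ts = events[:, 2]
    # Map each timestamp to a bin index; clip so t == t_end lands in the last bin.
    idx = ((ts - t_start) / (t_end - t_start) * num_bins).astype(int)
    idx = np.clip(idx, 0, num_bins - 1)
    spikes[idx, ys, xs] = 1
    return spikes

# Toy streams (hypothetical data): an edge on row 2 moving right, and the
# same edge moving left. Columns are x, y, t; polarity is left out.
right = np.array([[x, 2, float(x)] for x in range(4)])
left = np.array([[3 - x, 2, float(x)] for x in range(4)])

frame_right = events_to_frame(right, 4, 4)
frame_left = events_to_frame(left, 4, 4)
spikes_right = events_to_bins(right, 4, 4, num_bins=4, t_start=0.0, t_end=4.0)
spikes_left = events_to_bins(left, 4, 4, num_bins=4, t_start=0.0, t_end=4.0)

# The collapsed frames are identical, so motion direction is lost,
# while the binned spike tensors still tell the two motions apart.
same_frames = np.array_equal(frame_right, frame_left)
same_spikes = np.array_equal(spikes_right, spikes_left)
# Fraction of active entries in the binned tensor, a rough sparsity measure.
density = float(spikes_right.mean())
```

In this toy case the two motions yield the same collapsed frame but different binned tensors, and only 4 of the 64 entries in the binned tensor are active. That is the kind of sparsity a spike-driven model can exploit and a dense model processes regardless.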

URL

https://arxiv.org/abs/2303.09681

PDF

https://arxiv.org/pdf/2303.09681.pdf