
Event-based Human Pose Tracking by Spiking Spatiotemporal Transformer

2023-03-16 22:56:12
Shihao Zou, Yuxuan Mu, Xinxin Zuo, Sen Wang, Li Cheng


Event cameras, an emerging class of biologically inspired vision sensors that capture motion dynamics, open new possibilities for 3D human pose tracking, that is, video-based 3D human pose estimation. However, existing work on pose tracking either requires additional gray-scale images to establish a solid starting pose, or ignores temporal dependencies altogether by collapsing segments of the event stream into static image frames. Meanwhile, although Artificial Neural Networks (ANNs, a.k.a. dense deep learning) have proven effective in many event-based tasks, they tend to neglect the fact that, compared with dense frame-based image sequences, events from an event camera are spatiotemporally much sparser. Motivated by these issues, we present a dedicated end-to-end sparse deep learning approach for event-based pose tracking: 1) to our knowledge, this is the first time that 3D human pose tracking is obtained from events only, eliminating the need for any frame-based images as input; 2) our approach is built entirely on Spiking Neural Networks (SNNs), and consists of a Spike-Element-Wise (SEW) ResNet and our proposed spiking spatiotemporal transformer; 3) we construct a large-scale synthetic dataset, named SynEventHPD, that features a broad and diverse set of annotated 3D human motions as well as longer hours of event stream data. Experiments demonstrate the superiority of our approach in both performance and efficiency. For example, with performance comparable to state-of-the-art ANN counterparts, our approach achieves a computation reduction of 20% in FLOPS. Our implementation is made available at this https URL and the dataset will be released upon paper acceptance.
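To make the sparsity argument concrete, here is a minimal sketch of a single leaky integrate-and-fire (LIF) neuron and an additive spike-element-wise residual connection, the two ideas behind an SNN built on SEW ResNet. This is not the authors' implementation: the function names, the time constant `tau`, the threshold, and the hard-reset rule are all assumptions chosen for illustration.

```python
def lif_neuron(inputs, tau=2.0, v_threshold=1.0, v_reset=0.0):
    """Simulate one LIF neuron over a sequence of input currents.

    Hypothetical parameters: tau, v_threshold and v_reset are illustrative
    defaults, not values taken from the paper.
    Returns a list of binary spikes, one per time step.
    """
    v = v_reset
    spikes = []
    for x in inputs:
        # Leaky integration: the membrane potential decays toward v_reset
        # while being driven by the input current x.
        v = v + (x - (v - v_reset)) / tau
        if v >= v_threshold:
            spikes.append(1)
            v = v_reset  # hard reset after firing (an assumption)
        else:
            spikes.append(0)
    return spikes


def sew_add(block_spikes, shortcut_spikes):
    """Additive spike-element-wise residual: sum the block's output spikes
    with the spikes carried by the shortcut, element by element."""
    return [a + s for a, s in zip(block_spikes, shortcut_spikes)]


def firing_rate(spikes):
    """Fraction of time steps at which the neuron fired."""
    return sum(spikes) / len(spikes)


# A sparse input: the neuron is driven only during two of eight time steps.
sparse_input = [0.0, 0.0, 1.5, 1.5, 0.0, 0.0, 0.0, 0.0]
out = lif_neuron(sparse_input)
print(out, firing_rate(out))
```

With a mostly silent input the neuron emits few spikes, and downstream layers only need to do work where spikes occur. That is the intuition behind the efficiency claim, although the 20% FLOPS figure itself comes from the authors' experiments and is not reproduced by this toy example.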



