Paper Reading AI Learner

Co-designing a Sub-millisecond Latency Event-based Eye Tracking System with Submanifold Sparse CNN

2024-04-22 15:28:42
Baoheng Zhang, Yizhao Gao, Jingyuan Li, Hayden Kwok-Hay So

Abstract

Eye-tracking technology is integral to numerous consumer electronics applications, particularly in virtual and augmented reality (VR/AR). These applications demand solutions that excel in three crucial aspects: low latency, low power consumption, and high precision. Yet achieving optimal performance on all three fronts presents a formidable challenge, necessitating a balance between sophisticated algorithms and efficient backend hardware implementations. In this study, we tackle this challenge through a synergistic software/hardware co-design of the system with an event camera. Leveraging the inherent sparsity of event-based input data, we integrate a novel sparse FPGA dataflow accelerator customized for submanifold sparse convolutional neural networks (SCNNs). The SCNN implemented on the accelerator efficiently extracts an embedding feature vector from each event-slice representation by processing only the non-zero activations. These vectors are then further processed by a gated recurrent unit (GRU) and a fully connected layer on the host CPU to generate the eye centers. Deployment and evaluation of our system reveal outstanding performance: on the Event-based Eye-Tracking-AIS2024 dataset, it achieves 81% p5 accuracy, 99.5% p10 accuracy, and a mean Euclidean distance of 3.71 with 0.7 ms latency, while consuming only 2.29 mJ per inference. Notably, our solution opens up opportunities for future eye-tracking systems. Code is available at this https URL.
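The key property the abstract relies on is that a submanifold sparse convolution produces outputs only at input-active sites, so sparsity is preserved across layers and only non-zero activations are ever touched. Below is a minimal NumPy sketch of that idea; the coordinate/weight layout and function name are illustrative assumptions, not the paper's FPGA implementation.

```python
import numpy as np

def submanifold_conv2d(feats, weights, kernel_size=3):
    """Sketch of a submanifold sparse 2D convolution.

    feats:   dict (y, x) -> input feature vector of shape (C_in,),
             defined only at active (non-zero) sites
    weights: dict (dy, dx) -> weight matrix of shape (C_out, C_in)

    Outputs are computed only at sites that are already active, so the
    active-site set (the "submanifold") never dilates between layers,
    and inactive neighbours are skipped entirely.
    """
    k = kernel_size // 2
    out = {}
    for (y, x), f in feats.items():
        # Every active site contributes to itself via the (0, 0) tap.
        acc = weights[(0, 0)] @ f
        for dy in range(-k, k + 1):
            for dx in range(-k, k + 1):
                if (dy, dx) == (0, 0):
                    continue
                nb = (y + dy, x + dx)
                if nb in feats:  # only gather from active neighbours
                    acc = acc + weights[(dy, dx)] @ feats[nb]
        out[(y, x)] = acc
    return out
```

With an identity center tap and zero off-center taps, the output equals the input at exactly the same active sites, which makes the sparsity-preserving behaviour easy to check.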

URL

https://arxiv.org/abs/2404.14279

PDF

https://arxiv.org/pdf/2404.14279.pdf

