MambaPupil: Bidirectional Selective Recurrent model for Event-based Eye tracking

2024-04-18 11:09:25
Zhong Wang, Zengyu Wan, Han Han, Bohao Liao, Yuliang Wu, Wei Zhai, Yang Cao, Zheng-jun Zha

Abstract

Event-based eye tracking has shown great promise thanks to the high temporal resolution and low redundancy provided by event cameras. However, the diversity and abruptness of eye movement patterns, including blinking, fixation, saccades, and smooth pursuit, pose significant challenges for eye localization. To achieve a stable event-based eye-tracking system, this paper proposes a bidirectional long-term sequence modeling and time-varying state selection mechanism that fully exploits contextual temporal information in response to the variability of eye movements. Specifically, the MambaPupil network is proposed, consisting of a multi-layer convolutional encoder that extracts features from the event representations, followed by a bidirectional Gated Recurrent Unit (GRU) and a Linear Time-Varying State Space Module (LTV-SSM) that selectively capture contextual correlations from forward and backward temporal relationships. Furthermore, Bina-rep is utilized as a compact event representation, and a tailor-made data augmentation, Event-Cutout, is proposed to enhance the model's robustness by applying spatial random masking to the event image. Evaluation on the ThreeET-plus benchmark shows the superior performance of MambaPupil, which secured 1st place in the CVPR 2024 AIS Event-based Eye Tracking challenge.
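
The abstract gives only a high-level view of the architecture. As a rough illustration, here is a minimal PyTorch-style sketch of such a pipeline (convolutional encoder, then bidirectional GRU, then LTV-SSM, then a coordinate head); all layer sizes, the regression head, and the simplified sequential selective scan inside LTVSSM are assumptions for illustration, not the authors' implementation.

import torch
import torch.nn as nn
import torch.nn.functional as F

class LTVSSM(nn.Module):
    # Minimal linear time-varying state space layer: the step size dt and the
    # projections B, C are computed from the input at every time step, so the
    # recurrence coefficients vary over time (Mamba-style selectivity).
    # The sequential scan below is illustrative, not an optimized kernel.
    def __init__(self, dim, state_dim=16):
        super().__init__()
        self.A_log = nn.Parameter(torch.zeros(dim, state_dim))  # learned log(-A)
        self.dt_proj = nn.Linear(dim, dim)
        self.B_proj = nn.Linear(dim, state_dim)
        self.C_proj = nn.Linear(dim, state_dim)

    def forward(self, x):                            # x: (batch, time, dim)
        b, t, d = x.shape
        A = -torch.exp(self.A_log)                   # negative => stable decay
        dt = F.softplus(self.dt_proj(x))             # (b, t, d), input-dependent
        Bp, Cp = self.B_proj(x), self.C_proj(x)      # (b, t, n) each
        h = x.new_zeros(b, d, A.shape[1])            # hidden state (b, d, n)
        ys = []
        for k in range(t):
            dA = torch.exp(dt[:, k, :, None] * A)        # discretized A
            dB = dt[:, k, :, None] * Bp[:, k, None, :]   # discretized B
            h = dA * h + dB * x[:, k, :, None]           # state update
            ys.append((h * Cp[:, k, None, :]).sum(-1))   # readout (b, d)
        return torch.stack(ys, dim=1)                # (b, t, d)

class MambaPupilSketch(nn.Module):
    def __init__(self, in_ch=2, feat=128):
        super().__init__()
        # Multi-layer convolutional encoder applied to each event frame.
        self.encoder = nn.Sequential(
            nn.Conv2d(in_ch, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(64, feat, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),
        )
        self.bigru = nn.GRU(feat, feat, batch_first=True, bidirectional=True)
        self.ssm = LTVSSM(2 * feat)
        self.head = nn.Linear(2 * feat, 2)           # hypothetical (x, y) head

    def forward(self, frames):                       # frames: (b, t, c, h, w)
        b, t = frames.shape[:2]
        f = self.encoder(frames.flatten(0, 1)).flatten(1).view(b, t, -1)
        f, _ = self.bigru(f)                         # forward + backward context
        f = self.ssm(f)                              # time-varying state selection
        return self.head(f)                          # (b, t, 2) pupil coordinates

# e.g. preds = MambaPupilSketch()(torch.randn(4, 30, 2, 64, 80))

In this reading, the bidirectional GRU supplies forward and backward context over the whole clip, while the input-dependent (dt, B, C) of the LTV-SSM let the state dynamics adapt per time step, which is what "time-varying state selection" suggests for coping with abrupt saccades versus smooth pursuit.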
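
Event-Cutout is described only as spatial random masking applied to the event image; a plausible minimal version (the single-rectangle policy and the max_frac size bound are assumptions) could look like:

import torch

def event_cutout(event_frames, max_frac=0.3):
    # Zero out one randomly placed rectangle across all time steps and
    # polarity channels of a sample, so the events in that spatial region
    # are dropped consistently over time. max_frac (an assumed knob) bounds
    # the mask size relative to the frame.
    _, _, h, w = event_frames.shape               # (time, channels, H, W)
    mh = torch.randint(1, max(2, int(h * max_frac)), (1,)).item()
    mw = torch.randint(1, max(2, int(w * max_frac)), (1,)).item()
    top = torch.randint(0, h - mh + 1, (1,)).item()
    left = torch.randint(0, w - mw + 1, (1,)).item()
    out = event_frames.clone()
    out[..., top:top + mh, left:left + mw] = 0
    return out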

URL

https://arxiv.org/abs/2404.12083

PDF

https://arxiv.org/pdf/2404.12083.pdf

