Abstract
Event-based eye tracking has shown great promise thanks to the high temporal resolution and low redundancy provided by event cameras. However, the diversity and abruptness of eye-movement patterns, including blinking, fixation, saccades, and smooth pursuit, pose significant challenges for eye localization. To achieve a stable event-based eye-tracking system, this paper proposes a bidirectional long-term sequence modeling and time-varying state selection mechanism that fully exploits contextual temporal information in response to the variability of eye movements. Specifically, the MambaPupil network is proposed, which consists of a multi-layer convolutional encoder that extracts features from the event representations, a bidirectional Gated Recurrent Unit (GRU), and a Linear Time-Varying State Space Module (LTV-SSM) that selectively captures contextual correlation from the forward and backward temporal relationships. Furthermore, Bina-rep is utilized as a compact event representation, and a tailor-made data augmentation, called Event-Cutout, is proposed to enhance the model's robustness by applying spatial random masking to the event image. Evaluation on the ThreeET-plus benchmark shows the superior performance of MambaPupil, which secured 1st place in the CVPR 2024 AIS Event-based Eye Tracking Challenge.
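The abstract describes Event-Cutout only as spatial random masking applied to the event image. A minimal sketch of that idea is shown below; the function name, the `mask_frac` parameter, and the choice of a single zeroed rectangle are illustrative assumptions, not the paper's actual implementation.

```python
import numpy as np

def event_cutout(event_frame, mask_frac=0.3, rng=None):
    """Hypothetical sketch of Event-Cutout: zero out one random
    spatial rectangle of an event representation (H x W [x C])."""
    rng = np.random.default_rng(rng)
    h, w = event_frame.shape[:2]
    # Rectangle side lengths as a fraction of the frame size (assumed).
    mh, mw = int(h * mask_frac), int(w * mask_frac)
    top = rng.integers(0, h - mh + 1)
    left = rng.integers(0, w - mw + 1)
    out = event_frame.copy()
    out[top:top + mh, left:left + mw] = 0  # mask the events in the patch
    return out

# Usage: mask a 64x64 event frame, removing a 16x16 patch of events.
frame = np.ones((64, 64), dtype=np.float32)
masked = event_cutout(frame, mask_frac=0.25, rng=0)
```

During training, such a mask would force the tracker to rely on surrounding context rather than a fixed spatial region, which is consistent with the robustness motivation stated above.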
URL
https://arxiv.org/abs/2404.12083