Paper Reading AI Learner

Spatio-Temporal State Space Model For Efficient Event-Based Optical Flow

2025-06-09 15:51:06
Muhammad Ahmed Humais, Xiaoqian Huang, Hussain Sajwani, Sajid Javed, Yahya Zweiri

Abstract

Event cameras unlock new frontiers that were previously unthinkable with standard frame-based cameras. One notable example is low-latency motion estimation (optical flow), which is critical for many real-time applications. In such applications, the computational efficiency of algorithms is paramount. Although recent deep learning paradigms such as CNNs, RNNs, and ViTs have shown remarkable performance, they often lack the desired computational efficiency. Conversely, asynchronous event-based methods, including SNNs and GNNs, are computationally efficient; however, these approaches fail to capture sufficient spatio-temporal information, which is essential for accurate optical flow estimation. In this work, we introduce the Spatio-Temporal State Space Model (STSSM) module, along with a novel network architecture, to develop an extremely efficient solution with competitive performance. Our STSSM module leverages state-space models to effectively capture spatio-temporal correlations in event data, offering higher performance with lower complexity than ViT- and CNN-based architectures in similar settings. Our model achieves 4.5x faster inference and 8x lower computation than TMA, and 2x lower computation than EV-FlowNet, with competitive performance on the DSEC benchmark. Our code will be available at this https URL
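The STSSM module builds on state-space models. As a minimal illustration of the underlying idea (a hedged sketch, not the authors' implementation — the matrices and dimensions below are made up for the example), a discrete linear SSM summarizes a sequence through the recurrence h_t = A h_{t-1} + B x_t with readout y_t = C h_t, which is what lets such models aggregate temporal context at low cost:

```python
import numpy as np

def ssm_scan(x, A, B, C):
    """Run a discrete linear state-space model over a (T, d_in) sequence:
    h_t = A @ h_{t-1} + B @ x_t,  y_t = C @ h_t.
    Returns the (T, d_out) output sequence."""
    n = A.shape[0]
    h = np.zeros(n)          # hidden state, initialized to zero
    ys = []
    for t in range(x.shape[0]):
        h = A @ h + B @ x[t]  # state update: decay old state, inject input
        ys.append(C @ h)      # linear readout of the current state
    return np.stack(ys)

# Toy example: a 2-state SSM accumulating a 1-D input stream over 5 steps.
A = np.array([[0.9, 0.0],
              [0.1, 0.8]])   # state transition (decaying memory)
B = np.array([[1.0],
              [0.0]])        # input projection
C = np.array([[0.0, 1.0]])   # output projection
x = np.ones((5, 1))          # 5 time steps of unit input
y = ssm_scan(x, A, B, C)     # shape (5, 1); grows as temporal context builds
```

Deep SSM architectures apply many such recurrences in parallel over learned (often structured) A, B, C, which is why they scale linearly in sequence length rather than quadratically like attention.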

URL

https://arxiv.org/abs/2506.07878

PDF

https://arxiv.org/pdf/2506.07878.pdf
