
MemFlow: Optical Flow Estimation and Prediction with Memory

2024-04-07 04:56:58
Qiaole Dong, Yanwei Fu

Abstract

Optical flow is a classical task that is important to the vision community. Classical optical flow estimation uses two frames as input, whilst some recent methods consider multiple frames to explicitly model long-range information. The former cannot fully leverage temporal coherence along the video sequence, and the latter incur heavy computational overhead that typically rules out real-time flow estimation. Some multi-frame approaches even require unseen future frames for the current estimate, compromising real-time applicability in safety-critical scenarios. To this end, we present MemFlow, a real-time method for optical flow estimation and prediction with memory. Our method uses memory read-out and update modules to aggregate historical motion information in real time. Furthermore, we integrate resolution-adaptive re-scaling to accommodate diverse video resolutions. Our approach also extends seamlessly to predicting future optical flow from past observations. Leveraging effective historical motion aggregation, our method outperforms VideoFlow in generalization performance on the Sintel and KITTI-15 datasets, with fewer parameters and faster inference. At the time of submission, MemFlow also leads in performance on the 1080p Spring dataset. Code and models will be made available.
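
The abstract does not spell out how the memory read-out and update modules work. As a rough illustration only, the sketch below shows one common way such a memory can be built: a bounded buffer of key/value feature vectors, read with scaled dot-product attention and updated by dropping the oldest entry. The class name `MotionMemory`, its methods, the FIFO update policy, and the use of plain Python lists as feature vectors are all assumptions made for this example; none of them is taken from the MemFlow paper or its code.

```python
import math
from collections import deque


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


class MotionMemory:
    """Hypothetical bounded memory of (key, value) feature vectors.

    Illustrative only: not the authors' implementation.
    """

    def __init__(self, capacity):
        # A deque with maxlen discards the oldest entry once it is full,
        # which keeps memory cost constant over a long video (assumed policy).
        self.entries = deque(maxlen=capacity)

    def update(self, key, value):
        """Store the features computed for the most recent frame pair."""
        self.entries.append((list(key), list(value)))

    def read(self, query):
        """Aggregate stored values with scaled dot-product attention."""
        if not self.entries:
            return None  # nothing has been observed yet
        scale = 1.0 / math.sqrt(len(query))
        scores = [
            scale * sum(q * k for q, k in zip(query, key))
            for key, _ in self.entries
        ]
        weights = softmax(scores)
        dim = len(self.entries[0][1])
        return [
            sum(w * value[i] for w, (_, value) in zip(weights, self.entries))
            for i in range(dim)
        ]
```

In this sketch, a caller would invoke `read` with the current frame's query features before estimating flow, then call `update` with the new key and value, so that only past frames are ever consulted. That ordering mirrors the abstract's point about not depending on unseen future frames.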

URL

https://arxiv.org/abs/2404.04808

PDF

https://arxiv.org/pdf/2404.04808.pdf

