Online Temporal Action Localization with Memory-Augmented Transformer

2024-08-06 04:55:33
Youngkil Song, Dongkeun Kim, Minsu Cho, Suha Kwak

Abstract

Online temporal action localization (On-TAL) is the task of identifying multiple action instances in a streaming video. Because existing methods take only a fixed-size video segment as input per iteration, they are limited in their ability to consider long-term context and require careful tuning of the segment size. To overcome these limitations, we propose the memory-augmented transformer (MATR). MATR uses a memory queue that selectively preserves past segment features, allowing it to leverage long-term context for inference. We also propose a novel action localization method that observes the current input segment to predict the end time of the ongoing action and accesses the memory queue to estimate the start time of the action. Our method outperformed existing methods on two datasets, THUMOS14 and MUSES, surpassing not only TAL methods in the online setting but also some offline TAL methods.
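
The idea of a memory queue that keeps only selected past segments can be illustrated with a small sketch. This is not the authors' implementation: the abstract does not say how MATR decides what to preserve, so the `score` field, the `keep_threshold` rule, the fixed `capacity`, and the `earliest_start` helper below are all assumptions made purely for illustration.

```python
from collections import deque
from dataclasses import dataclass
from typing import Deque, List, Optional


@dataclass
class Segment:
    start_time: float      # time (in seconds) at which the segment begins
    features: List[float]  # stand-in for the segment's feature vector
    score: float           # hypothetical "worth keeping" score in [0, 1]


class MemoryQueue:
    """Bounded FIFO of past segments, filtered by an assumed score rule."""

    def __init__(self, capacity: int, keep_threshold: float) -> None:
        self.keep_threshold = keep_threshold
        # deque(maxlen=...) silently drops the oldest item when full.
        self._items: Deque[Segment] = deque(maxlen=capacity)

    def update(self, segment: Segment) -> bool:
        """Store the segment if it passes the (assumed) selection rule."""
        if segment.score < self.keep_threshold:
            return False
        self._items.append(segment)
        return True

    def earliest_start(self) -> Optional[float]:
        """Start time of the oldest segment still held in memory."""
        return self._items[0].start_time if self._items else None

    def __len__(self) -> int:
        return len(self._items)
```

In a streaming loop, one would call `update` once per incoming segment and consult the queue when the current segment suggests that an action is ending. In MATR the start time is estimated by a learned model that accesses the memory, which this simple lookup does not attempt to reproduce.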

URL

https://arxiv.org/abs/2408.02957

PDF

https://arxiv.org/pdf/2408.02957.pdf

