Abstract
Online temporal action localization (On-TAL) is the task of identifying multiple action instances in a streaming video. Since existing methods take as input only a fixed-size video segment per iteration, they are limited in capturing long-term context and require careful tuning of the segment size. To overcome these limitations, we propose the memory-augmented transformer (MATR). MATR utilizes a memory queue that selectively preserves past segment features, allowing the model to leverage long-term context for inference. We also propose a novel action localization method that observes the current input segment to predict the end time of the ongoing action and accesses the memory queue to estimate the start time of that action. Our method outperforms existing methods on two datasets, THUMOS14 and MUSES, surpassing not only TAL methods in the online setting but also some offline TAL methods.
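The sketch below illustrates the streaming mechanism the abstract describes: encode each incoming segment, selectively push its features into a bounded memory queue, predict the end time from the current segment, and cross-attend to the memory to estimate the start time. This is a minimal illustration of the idea, not the authors' implementation; all module names, dimensions, and the "keep score" selection rule are illustrative assumptions.

```python
# Minimal sketch (assumed design, not the paper's code) of a memory queue
# plus a streaming localizer that predicts end times from the current
# segment and start times from cross-attention over stored past features.
import torch
import torch.nn as nn

class MemoryQueue:
    """Bounded FIFO of past segment features; the selection rule is hypothetical."""
    def __init__(self, capacity: int = 64):
        self.capacity = capacity
        self.feats: list[torch.Tensor] = []

    def push(self, feat: torch.Tensor, keep_score: float, threshold: float = 0.5):
        # Selectively preserve: only store segments deemed informative.
        if keep_score >= threshold:
            self.feats.append(feat)
            if len(self.feats) > self.capacity:
                self.feats.pop(0)  # discard the oldest entry

    def as_tensor(self) -> torch.Tensor:
        # Stack stored features into a (1, num_stored, dim) memory for attention.
        return torch.stack(self.feats, dim=0).unsqueeze(0)

class StreamingLocalizer(nn.Module):
    """Toy stand-in: current segment -> end logit; memory context -> start logit."""
    def __init__(self, dim: int = 256, raw_dim: int = 1024):
        super().__init__()
        self.encoder = nn.Linear(raw_dim, dim)       # per-segment feature encoder
        layer = nn.TransformerDecoderLayer(dim, nhead=4, batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, num_layers=2)
        self.end_head = nn.Linear(dim, 1)            # end time from current input
        self.start_head = nn.Linear(dim, 1)          # start time from memory context

    def forward(self, segment: torch.Tensor, memory: torch.Tensor):
        cur = self.encoder(segment)                  # (1, frames, dim)
        end_logit = self.end_head(cur.mean(dim=1))   # predicted from current segment
        ctx = self.decoder(tgt=cur, memory=memory)   # attend over past segment features
        start_logit = self.start_head(ctx.mean(dim=1))
        return start_logit, end_logit, cur.mean(dim=1)

# Streaming inference loop over dummy segments.
queue = MemoryQueue(capacity=64)
model = StreamingLocalizer()
queue.push(torch.zeros(256), keep_score=1.0)         # seed memory so it is non-empty
for step in range(5):
    segment = torch.randn(1, 8, 1024)                # (batch, frames, raw feature dim)
    start, end, summary = model(segment, queue.as_tensor())
    queue.push(summary.squeeze(0), keep_score=torch.rand(()).item())
```

Keeping the memory bounded and selective is what lets the model look arbitrarily far back without growing compute per step, in contrast to fixed-size-segment methods.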
URL
https://arxiv.org/abs/2408.02957