Abstract
In temporal action localization, given an input video, the goal is to predict which actions it contains, where they begin, and where they end. Training and testing current state-of-the-art deep learning models requires access to large amounts of data and computational power. However, gathering such data is challenging and computational resources might be limited. This work explores and measures how current deep temporal action localization models perform in settings constrained by the amount of data or computational power. We measure data efficiency by training each model on a subset of the training set. We find that TemporalMaxer outperforms other models in data-limited settings. Furthermore, we recommend TriDet when training time is limited. To test the efficiency of the models during inference, we pass videos of different lengths through each model. We find that TemporalMaxer requires the least computational resources, likely due to its simple architecture.
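The inference-efficiency test described above (passing videos of different lengths through each model and measuring the cost) can be sketched as a simple timing harness. Everything here is illustrative: `dummy_model` is a hypothetical stand-in, not TemporalMaxer, TriDet, or any model from the paper, and the chosen video lengths are arbitrary.

```python
import time

def dummy_model(features):
    # Hypothetical stand-in for a temporal action localization model:
    # emits one (start_frame, end_frame, label) proposal per 16-frame window.
    return [(i, min(i + 16, len(features)), "action")
            for i in range(0, len(features), 16)]

def measure_inference(model, video_lengths):
    """Time one forward pass per video length (in frames)."""
    timings = {}
    for n in video_lengths:
        features = [0.0] * n  # placeholder per-frame features
        start = time.perf_counter()
        model(features)
        timings[n] = time.perf_counter() - start
    return timings

# Example lengths (arbitrary): short, medium, and long clips.
timings = measure_inference(dummy_model, [256, 1024, 4096])
```

In practice one would replace the wall-clock timing with GPU-synchronized timing and also record memory or FLOPs, but the structure of the benchmark, sweeping input length while holding the model fixed, is the same.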
URL
https://arxiv.org/abs/2308.13082