Abstract
The success of deep learning models has led to their adaptation and adoption by prominent video understanding methods. The majority of these approaches encode features in a joint space-time modality for which the inner workings and learned representations are difficult to visually interpret. We propose LEArned Preconscious Synthesis (LEAPS), an architecture-agnostic method for synthesizing videos from the internal spatiotemporal representations of models. Using a stimulus video and a target class, we prime a fixed space-time model and iteratively optimize a video initialized with random noise. We incorporate additional regularizers to improve the feature diversity of the synthesized videos as well as the cross-frame temporal coherence of motions. We quantitatively and qualitatively evaluate the applicability of LEAPS by inverting a range of spatiotemporal convolutional and attention-based architectures trained on Kinetics-400, which, to the best of our knowledge, has not been previously accomplished.
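The optimization loop the abstract describes can be sketched roughly as follows. This is a minimal, hypothetical illustration assuming a PyTorch video classifier over (N, C, T, H, W) clips; the function name, hyperparameters, and the simple frame-difference penalty (a crude stand-in for the paper's feature-diversity and temporal-coherence regularizers, and omitting the stimulus-based priming of internal features) are our assumptions, not the authors' implementation.

```python
import torch
import torch.nn.functional as F

def invert_video_model(model, stimulus, target_class, steps=500, lr=0.05,
                       coherence_weight=1e-4, l2_weight=1e-5):
    """Hypothetical sketch: optimize a noise-initialized video so that a
    frozen spatiotemporal model classifies it as `target_class`."""
    model.eval()
    for p in model.parameters():
        p.requires_grad_(False)  # the space-time model stays fixed

    # Video tensor shaped like the stimulus clip: (N, C, T, H, W).
    # Here the stimulus supplies only the shape; LEAPS additionally
    # uses it to prime the model's internal representations.
    video = torch.randn_like(stimulus, requires_grad=True)
    optimizer = torch.optim.Adam([video], lr=lr)

    for _ in range(steps):
        optimizer.zero_grad()
        logits = model(video)
        target = torch.full((logits.shape[0],), target_class,
                            dtype=torch.long, device=logits.device)
        class_loss = F.cross_entropy(logits, target)

        # Penalize frame-to-frame differences along the time axis as a
        # simple proxy for cross-frame temporal coherence of motions.
        coherence = (video[:, :, 1:] - video[:, :, :-1]).abs().mean()

        loss = (class_loss
                + coherence_weight * coherence
                + l2_weight * video.pow(2).mean())
        loss.backward()
        optimizer.step()

    return video.detach()
```

As a usage example, such a sketch could be exercised with a Kinetics-400 pretrained backbone, e.g. torchvision's r3d_18, and a random 16-frame clip of shape (1, 3, 16, 112, 112) as the stimulus; the actual method applies the same idea across a range of convolutional and attention-based architectures.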
URL
https://arxiv.org/abs/2303.09941