Abstract
Real-world image sequences can often be naturally decomposed into a small number of frames depicting interesting, highly stochastic moments (their $\textit{keyframes}$) and the low-variance frames in between them. In image sequences depicting trajectories to a goal, keyframes can be seen as capturing the $\textit{subgoals}$ of the sequence, as they depict the high-variance moments of interest that ultimately lead to the goal. In this paper, we introduce a video prediction model that discovers the keyframe structure of image sequences in an unsupervised fashion. We do so using a hierarchical Keyframe-Intermediate model (KeyIn) that stochastically predicts keyframes and their offsets in time and then uses these predictions to deterministically predict the intermediate frames. We propose a differentiable formulation of this problem that allows us to train the full hierarchical model using a sequence reconstruction loss. We show that our model is able to find meaningful keyframe structure in a simulated dataset of robotic demonstrations and that these keyframes can serve as subgoals for planning. Our model outperforms other hierarchical prediction approaches for planning on a simulated pushing task.
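To make the "predict keyframes and offsets, then fill in the rest" idea concrete, here is a minimal numpy sketch of one way a differentiable soft keyframe placement could work. All function names and the composition scheme are our own illustrative assumptions, not the paper's actual architecture: each keyframe carries a softmax distribution over candidate time steps, and each time step of the output sequence is a normalized weighted mix of the keyframes, so the whole composition stays differentiable and could be trained with a sequence reconstruction loss.

```python
import numpy as np

def softmax(x):
    """Numerically stable softmax over a 1-D array."""
    e = np.exp(x - x.max())
    return e / e.sum()

def compose_sequence(keyframes, offset_logits):
    """Hypothetical differentiable placement sketch (not the paper's model).

    keyframes:     (K, D) predicted keyframe feature vectors
    offset_logits: (K, T) logits over candidate time steps per keyframe
    Returns a (T, D) composed sequence and the (K, T) placement weights.
    """
    # Soft temporal offsets: each keyframe gets a distribution over time.
    weights = np.stack([softmax(row) for row in offset_logits])  # (K, T)
    # Soft "paste": each time step is a weighted mix of keyframes,
    # normalized so a step dominated by one keyframe reproduces it.
    num = weights.T @ keyframes                                  # (T, D)
    den = weights.sum(axis=0, keepdims=True).T + 1e-8            # (T, 1)
    return num / den, weights

# Toy example: 3 keyframes placed into an 8-step sequence, with
# offset logits sharply peaked at steps 0, 3, and 7.
rng = np.random.default_rng(0)
kf = rng.normal(size=(3, 4))
logits = np.full((3, 8), -5.0)
logits[0, 0] = logits[1, 3] = logits[2, 7] = 5.0
seq, w = compose_sequence(kf, logits)
```

With peaked logits the composed frames at steps 0, 3, and 7 closely match the corresponding keyframes; with softer logits the gradient of a reconstruction loss flows into both the keyframes and their offsets, which is the property a differentiable formulation needs.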
URL
https://arxiv.org/abs/1904.05869