Abstract
Generating videos from the first-person perspective has broad applications in augmented reality and embodied intelligence. In this work, we explore the cross-view video prediction task: given an exo-centric video, the first frame of the corresponding ego-centric video, and textual instructions, the goal is to generate future frames of the ego-centric video. Inspired by the notion that hand-object interactions (HOI) in ego-centric videos represent the primary intentions and actions of the current actor, we present EgoExo-Gen, which explicitly models hand-object dynamics for cross-view video prediction. EgoExo-Gen consists of two stages. First, we design a cross-view HOI mask prediction model that anticipates the HOI masks in future ego-frames by modeling the spatio-temporal ego-exo correspondence. Next, we employ a video diffusion model to predict future ego-frames from the first ego-frame and textual instructions, while incorporating the HOI masks as structural guidance to enhance prediction quality. To facilitate training, we develop an automated pipeline that generates pseudo HOI masks for both ego- and exo-videos by exploiting vision foundation models. Extensive experiments demonstrate that EgoExo-Gen achieves better prediction performance than previous video prediction models on the Ego-Exo4D and H2O benchmark datasets, with the HOI masks significantly improving the generation of hands and interactive objects in the ego-centric videos.
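The two-stage pipeline described in the abstract can be sketched as follows. This is a minimal illustrative skeleton, not the authors' implementation: every function name, tensor shape, and internal computation here is a placeholder assumption, standing in for the learned mask-prediction and diffusion models.

```python
import numpy as np

def predict_hoi_masks(exo_video, first_ego_frame):
    """Stage 1 (placeholder): anticipate binary HOI masks for future
    ego-frames. A real model would exploit the spatio-temporal ego-exo
    correspondence; here we simply emit empty masks of the right shape."""
    t, h, w, _ = exo_video.shape
    return np.zeros((t, h, w), dtype=bool)

def diffuse_ego_frames(first_ego_frame, instruction, hoi_masks):
    """Stage 2 (placeholder): a video diffusion model conditioned on the
    first ego-frame, the text instruction, and the HOI masks as structural
    guidance. Here we just repeat the first frame at every time step."""
    t, h, w = hoi_masks.shape
    return np.broadcast_to(first_ego_frame, (t, h, w, 3)).copy()

def egoexo_gen(exo_video, first_ego_frame, instruction):
    """Chain the two stages: masks first, then mask-guided generation."""
    masks = predict_hoi_masks(exo_video, first_ego_frame)
    return diffuse_ego_frames(first_ego_frame, instruction, masks)

# Toy inputs: an 8-frame exo clip (T x H x W x C) and one ego frame.
exo = np.zeros((8, 64, 64, 3), dtype=np.float32)
ego0 = np.zeros((64, 64, 3), dtype=np.float32)
pred = egoexo_gen(exo, ego0, "cut the tomato")
print(pred.shape)  # (8, 64, 64, 3): one predicted future ego-frame per exo frame
```

The key design point the sketch mirrors is that the diffusion stage never sees the exo video directly; all cross-view information reaches it through the predicted HOI masks.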
URL
https://arxiv.org/abs/2504.11732