Paper Reading AI Learner

EgoExo-Gen: Ego-centric Video Prediction by Watching Exo-centric Videos

2025-04-16 03:12:39
Jilan Xu, Yifei Huang, Baoqi Pei, Junlin Hou, Qingqiu Li, Guo Chen, Yuejie Zhang, Rui Feng, Weidi Xie

Abstract

Generating videos in the first-person perspective has broad application prospects in the field of augmented reality and embodied intelligence. In this work, we explore the cross-view video prediction task, where given an exo-centric video, the first frame of the corresponding ego-centric video, and textual instructions, the goal is to generate future frames of the ego-centric video. Inspired by the notion that hand-object interactions (HOI) in ego-centric videos represent the primary intentions and actions of the current actor, we present EgoExo-Gen that explicitly models the hand-object dynamics for cross-view video prediction. EgoExo-Gen consists of two stages. First, we design a cross-view HOI mask prediction model that anticipates the HOI masks in future ego-frames by modeling the spatio-temporal ego-exo correspondence. Next, we employ a video diffusion model to predict future ego-frames using the first ego-frame and textual instructions, while incorporating the HOI masks as structural guidance to enhance prediction quality. To facilitate training, we develop an automated pipeline to generate pseudo HOI masks for both ego- and exo-videos by exploiting vision foundation models. Extensive experiments demonstrate that our proposed EgoExo-Gen achieves better prediction performance compared to previous video prediction models on the Ego-Exo4D and H2O benchmark datasets, with the HOI masks significantly improving the generation of hands and interactive objects in the ego-centric videos.
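
To make the two-stage design concrete, below is a minimal, illustrative PyTorch sketch of the pipeline as described in the abstract: a cross-view HOI mask predictor (stage 1) whose soft masks are fed, together with the first ego frame and a text embedding, into a mask-guided video denoiser (stage 2). All module names, layer choices, tensor shapes, and hyper-parameters here are assumptions made for illustration only and are not taken from the paper.

```python
# Illustrative sketch of a two-stage ego-video prediction pipeline guided by HOI masks.
# Everything below (shapes, layers, names) is assumed for demonstration, not from the paper.
import torch
import torch.nn as nn


class CrossViewHOIMaskPredictor(nn.Module):
    """Stage 1 (sketch): anticipate HOI masks for future ego frames from the exo video
    and the first ego frame, via cross-attention standing in for ego-exo correspondence."""

    def __init__(self, dim=256, num_future_frames=8):
        super().__init__()
        self.ego_enc = nn.Conv2d(3, dim, kernel_size=16, stride=16)                 # patchify ego frame
        self.exo_enc = nn.Conv3d(3, dim, kernel_size=(2, 16, 16), stride=(2, 16, 16))  # patchify exo video
        self.cross_attn = nn.MultiheadAttention(dim, num_heads=8, batch_first=True)
        self.mask_head = nn.Conv2d(dim, num_future_frames, kernel_size=1)           # one mask per future frame

    def forward(self, ego_frame, exo_video):
        # ego_frame: (B, 3, H, W); exo_video: (B, 3, T, H, W)
        ego = self.ego_enc(ego_frame)                                   # (B, C, h, w)
        B, C, h, w = ego.shape
        ego_tokens = ego.flatten(2).transpose(1, 2)                     # (B, h*w, C)
        exo_tokens = self.exo_enc(exo_video).flatten(2).transpose(1, 2) # (B, N_exo, C)
        fused, _ = self.cross_attn(ego_tokens, exo_tokens, exo_tokens)  # ego queries attend to exo
        fused = fused.transpose(1, 2).reshape(B, C, h, w)
        return torch.sigmoid(self.mask_head(fused))                     # (B, T_future, h, w) soft HOI masks


class MaskGuidedVideoDenoiser(nn.Module):
    """Stage 2 (sketch): a diffusion-style denoiser conditioned on the first ego frame,
    a text embedding, and the predicted HOI masks used as structural guidance."""

    def __init__(self, text_dim=512):
        super().__init__()
        # Input channels: noisy RGB (3) + repeated first frame (3) + HOI mask (1)
        self.denoiser = nn.Conv3d(7, 3, kernel_size=3, padding=1)
        self.text_proj = nn.Linear(text_dim, 3)

    def forward(self, noisy_video, first_frame, hoi_masks, text_emb):
        # noisy_video: (B, 3, T, H, W); first_frame: (B, 3, H, W)
        # hoi_masks: (B, T, h, w); text_emb: (B, text_dim)
        T = noisy_video.shape[2]
        cond_frame = first_frame.unsqueeze(2).expand(-1, -1, T, -1, -1)       # repeat first ego frame over time
        masks = nn.functional.interpolate(
            hoi_masks.unsqueeze(1), size=noisy_video.shape[2:], mode="trilinear")  # upsample masks to video size
        x = torch.cat([noisy_video, cond_frame, masks], dim=1)                 # concatenate guidance channels
        eps = self.denoiser(x)                                                 # predicted noise
        return eps + self.text_proj(text_emb)[:, :, None, None, None]          # inject text condition (toy)
```

Here the masks are simply concatenated as extra input channels to the denoiser; this is only one plausible realization of "structural guidance", and the paper may condition the diffusion model on the HOI masks differently.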

Abstract (translated)

Generating videos from a first-person perspective has broad application prospects in augmented reality and embodied intelligence. In this work, we study the cross-view video prediction task: given an exo-centric video, the first frame of the corresponding ego-centric video, and a textual instruction, the goal is to generate the subsequent ego-centric video frames. Since hand-object interactions (HOI) in ego-centric videos represent the primary intentions and actions of the current actor, we propose EgoExo-Gen, which explicitly models hand-object dynamics for cross-view video prediction. EgoExo-Gen consists of two stages: 1. We design a cross-view HOI mask prediction model that anticipates the HOI masks of future ego-centric frames by modeling the spatio-temporal correspondence between the ego- and exo-centric views. 2. Next, we employ a video diffusion model to predict future ego-centric frames from the first ego frame and the textual instruction, incorporating the HOI masks as structural guidance to improve generation quality. To facilitate training, we develop an automated pipeline that exploits vision foundation models to generate pseudo HOI masks for both ego- and exo-videos. Extensive experiments show that EgoExo-Gen outperforms previous video prediction models on the Ego-Exo4D and H2O benchmark datasets, and that the HOI masks significantly improve the generation of hands and interactive objects in ego-centric videos.

URL

https://arxiv.org/abs/2504.11732

PDF

https://arxiv.org/pdf/2504.11732.pdf

