
Eye2Eye: A Simple Approach for Monocular-to-Stereo Video Synthesis

2025-04-30 19:06:09
Michal Geyer, Omer Tov, Linyi Jin, Richard Tucker, Inbar Mosseri, Tali Dekel, Noah Snavely

Abstract

The rising popularity of immersive visual experiences has increased interest in stereoscopic 3D video generation. Despite significant advances in video synthesis, creating 3D videos remains challenging due to the relative scarcity of 3D video data. We propose a simple approach for transforming a text-to-video generator into a video-to-stereo generator. Given an input video, our framework automatically produces the video frames from a shifted viewpoint, enabling a compelling 3D effect. Prior and concurrent approaches for this task typically operate in multiple phases, first estimating video disparity or depth, then warping the video accordingly to produce a second view, and finally inpainting the disoccluded regions. This approach inherently fails when the scene involves specular surfaces or transparent objects. In such cases, single-layer disparity estimation is insufficient, resulting in artifacts and incorrect pixel shifts during warping. Our work bypasses these restrictions by directly synthesizing the new viewpoint, avoiding any intermediate steps. This is achieved by leveraging a pre-trained video model's priors on geometry, object materials, optics, and semantics, without relying on external geometry models or manually disentangling geometry from the synthesis process. We demonstrate the advantages of our approach in complex, real-world scenarios featuring diverse object materials and compositions. See videos on this https URL.
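For intuition, here is a minimal sketch of the multi-phase warp-and-inpaint baseline the abstract critiques, not the paper's method (which synthesizes the second view directly). It assumes a per-pixel disparity map estimated elsewhere; the helper name `warp_to_second_view` is hypothetical, and `cv2.inpaint` stands in for any generic hole-filling step. The single disparity value per pixel is exactly the assumption that breaks down for specular or transparent surfaces.

```python
import numpy as np
import cv2  # OpenCV, used here only for its classical inpainting routine


def warp_to_second_view(frame: np.ndarray, disparity: np.ndarray) -> np.ndarray:
    """Forward-warp one uint8 video frame (H, W, 3) to a shifted viewpoint.

    Phases of the baseline pipeline: (1) disparity is estimated elsewhere
    (not shown), (2) each pixel is shifted horizontally by its disparity,
    (3) disoccluded pixels are filled by inpainting.
    """
    h, w = disparity.shape
    warped = np.zeros_like(frame)
    # Z-buffer so that nearer pixels (larger disparity) win collisions.
    depth_buf = np.full((h, w), -np.inf)
    hole_mask = np.ones((h, w), dtype=np.uint8)  # 1 = disoccluded

    ys, xs = np.mgrid[0:h, 0:w]
    xs_new = np.clip(np.round(xs + disparity).astype(int), 0, w - 1)
    # Plain loop for clarity, not speed.
    for y, x, xn in zip(ys.ravel(), xs.ravel(), xs_new.ravel()):
        d = disparity[y, x]
        if d > depth_buf[y, xn]:
            depth_buf[y, xn] = d
            warped[y, xn] = frame[y, x]
            hole_mask[y, xn] = 0

    # Phase 3: fill disocclusions with a generic inpainting routine.
    return cv2.inpaint(warped, hole_mask, 3, cv2.INPAINT_TELEA)
```

A single disparity layer forces each output pixel to come from exactly one source pixel, so a reflection and the surface behind it cannot both be shifted correctly; this is the failure mode the direct-synthesis approach avoids.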

URL

https://arxiv.org/abs/2505.00135

PDF

https://arxiv.org/pdf/2505.00135.pdf

