Paper Reading AI Learner

Drag4D: Align Your Motion with Text-Driven 3D Scene Generation

2025-09-26 05:23:45
Minjun Kang, Inkyu Shin, Taeyeop Lee, In So Kweon, Kuk-Jin Yoon

Abstract

We introduce Drag4D, an interactive framework that integrates object motion control within text-driven 3D scene generation. This framework enables users to define 3D trajectories for the 3D objects generated from a single image, seamlessly integrating them into a high-quality 3D background. Our Drag4D pipeline consists of three stages. First, we enhance text-to-3D background generation by applying 2D Gaussian Splatting with panoramic images and inpainted novel views, resulting in dense and visually complete 3D reconstructions. In the second stage, given a reference image of the target object, we introduce a 3D copy-and-paste approach: the target instance is extracted as a full 3D mesh using an off-the-shelf image-to-3D model and seamlessly composited into the generated 3D scene. The object mesh is then positioned within the 3D scene via our physics-aware object position learning, ensuring precise spatial alignment. Lastly, the spatially aligned object is temporally animated along a user-defined 3D trajectory. To mitigate motion hallucination and ensure view-consistent temporal alignment, we develop a part-augmented, motion-conditioned video diffusion model that processes multiview image pairs together with their projected 2D trajectories. We demonstrate the effectiveness of our unified architecture through evaluations at each stage and of the final results, showcasing the harmonized alignment of user-controlled object motion within a high-quality 3D background.
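The abstract states that the video diffusion model is conditioned on multiview image pairs together with the user's 3D trajectory projected into each view as a 2D trajectory. As a minimal illustrative sketch (the pinhole camera model and all numeric values below are assumptions for illustration, not details from the paper), projecting a set of 3D waypoints into several camera views could look like this:

```python
import numpy as np

def project_trajectory(points_3d, K, R, t):
    """Project 3D trajectory waypoints into one camera view.

    points_3d: (N, 3) world-space waypoints along the object's path.
    K: (3, 3) camera intrinsics; R: (3, 3) rotation; t: (3,) translation.
    Returns (N, 2) pixel coordinates of the trajectory in this view.
    """
    cam = points_3d @ R.T + t           # world -> camera coordinates
    proj = cam @ K.T                    # apply intrinsics
    return proj[:, :2] / proj[:, 2:3]   # perspective divide

# Hypothetical user-defined 3D trajectory (three waypoints in front of the camera).
trajectory = np.array([[0.0, 0.0, 4.0],
                       [0.5, 0.0, 4.0],
                       [1.0, 0.2, 4.0]])

# Assumed intrinsics: focal length 500 px, principal point (320, 240).
K = np.array([[500.0,   0.0, 320.0],
              [  0.0, 500.0, 240.0],
              [  0.0,   0.0,   1.0]])

# A multiview pair: a reference view and a slightly translated second view.
views = [(np.eye(3), np.zeros(3)),
         (np.eye(3), np.array([-0.2, 0.0, 0.0]))]

# One 2D trajectory per view, usable as per-view motion conditioning.
tracks_2d = [project_trajectory(trajectory, K, R, t) for R, t in views]
```

The same world-space path yields a different 2D track in each view, which is what makes per-view 2D trajectory conditioning consistent across the multiview pair.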

Abstract (translated)

We introduce Drag4D, an interactive framework that integrates object motion control into text-driven 3D scene generation. The framework allows users to define 3D trajectories for 3D objects generated from a single image and blend them seamlessly into a high-quality 3D background. The Drag4D pipeline consists of three stages:

1. **Enhanced text-to-3D background generation**: We apply 2D Gaussian Splatting with panoramic images and inpainted novel views to strengthen text-driven 3D scene generation, yielding denser and visually complete 3D reconstructions.
2. **3D copy-and-paste of a target object from a reference image**: Given a reference image of the target object, we extract the instance as a full 3D mesh using an off-the-shelf image-to-3D model and composite it seamlessly into the generated 3D scene. The object mesh is then placed at the correct position within the scene via our physics-aware object position learning, ensuring precise spatial alignment.
3. **Temporal animation**: Finally, the spatially aligned object is animated over time along a user-defined 3D trajectory. To reduce motion hallucination and guarantee view-consistent temporal alignment, we develop a part-augmented, motion-conditioned video diffusion model that processes multiview image pairs together with their projected 2D trajectories.

Through evaluations at each stage and of the final results, we demonstrate the harmonized alignment of user-controlled object motion with a high-quality 3D background, confirming the effectiveness of our unified architecture.

URL

https://arxiv.org/abs/2509.21888

PDF

https://arxiv.org/pdf/2509.21888.pdf
