Abstract
We introduce Drag4D, an interactive framework that integrates object motion control within text-driven 3D scene generation. This framework enables users to define 3D trajectories for 3D objects generated from a single image, seamlessly integrating them into a high-quality 3D background. Our Drag4D pipeline consists of three stages. First, we enhance text-to-3D background generation by applying 2D Gaussian Splatting with panoramic images and inpainted novel views, resulting in dense and visually complete 3D reconstructions. In the second stage, given a reference image of the target object, we introduce a 3D copy-and-paste approach: the target instance is extracted as a full 3D mesh using an off-the-shelf image-to-3D model and seamlessly composited into the generated 3D scene. The object mesh is then positioned within the 3D scene via our physics-aware object position learning, ensuring precise spatial alignment. Lastly, the spatially aligned object is temporally animated along a user-defined 3D trajectory. To mitigate motion hallucination and ensure view-consistent temporal alignment, we develop a part-augmented, motion-conditioned video diffusion model that processes multiview image pairs together with their projected 2D trajectories. We demonstrate the effectiveness of our unified architecture through evaluations at each stage and in the final results, showcasing the harmonized alignment of user-controlled object motion within a high-quality 3D background.
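The third stage conditions on 2D trajectories obtained by projecting the user-defined 3D trajectory into each view. The abstract does not specify the camera model, so the following is only a minimal sketch that assumes an ideal pinhole camera with a look-at pose. The function names (`camera_axes`, `project_trajectory`, `project_to_views`) and the default focal length and image size are hypothetical and are not taken from the Drag4D implementation.

```python
import math

def _sub(a, b):
    return tuple(x - y for x, y in zip(a, b))

def _dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def _cross(a, b):
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])

def _normalize(a):
    n = math.sqrt(_dot(a, a))
    return tuple(x / n for x in a)

def camera_axes(eye, target, up=(0.0, 1.0, 0.0)):
    """Return the (right, up, forward) unit axes of a look-at camera."""
    forward = _normalize(_sub(target, eye))
    right = _normalize(_cross(forward, up))
    true_up = _cross(right, forward)
    return right, true_up, forward

def project_trajectory(points, eye, target, focal=500.0, width=512, height=512):
    """Project 3D world points to pixel coordinates (assumed pinhole model).

    Points at or behind the camera plane are returned as None.
    """
    right, true_up, forward = camera_axes(eye, target)
    cx, cy = width / 2.0, height / 2.0
    pixels = []
    for p in points:
        d = _sub(p, eye)
        depth = _dot(d, forward)          # distance along the viewing axis
        if depth <= 0.0:
            pixels.append(None)
            continue
        u = cx + focal * _dot(d, right) / depth
        v = cy - focal * _dot(d, true_up) / depth   # image y grows downward
        pixels.append((u, v))
    return pixels

def project_to_views(points, cameras):
    """Project one 3D trajectory into several views: {name: (eye, target)}."""
    return {name: project_trajectory(points, eye, target)
            for name, (eye, target) in cameras.items()}

if __name__ == "__main__":
    trajectory = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (1.0, 1.0, 0.0)]
    cameras = {
        "front": ((0.0, 0.0, 5.0), (0.0, 0.0, 0.0)),
        "side": ((5.0, 0.0, 0.0), (0.0, 0.0, 0.0)),
    }
    for name, track in project_to_views(trajectory, cameras).items():
        print(name, track)
```

Each view receives its own 2D track of the same 3D path. In a pipeline like the one described, per-view tracks of this kind would serve as the motion condition alongside the multiview image pairs.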
Abstract (translated)
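We introduce Drag4D, an interactive framework that integrates object motion control into text-driven 3D scene generation. It lets users define 3D trajectories for 3D objects generated from a single image and integrates those objects seamlessly into a high-quality 3D background. The Drag4D pipeline has three stages:

1. **Enhanced text-to-3D background generation**: We apply 2D Gaussian Splatting to panoramic images and inpainted novel views, which strengthens text-driven 3D scene generation and yields dense, visually complete 3D reconstructions.
2. **3D copy-and-paste of the target object from a reference image**: Given a reference image of the target object, the instance is extracted as a full 3D mesh using an off-the-shelf image-to-3D model and composited seamlessly into the generated 3D scene. Our physics-aware object position learning then places the object mesh within the scene, ensuring precise spatial alignment.
3. **Temporal animation**: Finally, the spatially aligned object is animated over time along a user-defined 3D trajectory. To mitigate motion hallucination and keep the result consistent across views, we develop a part-augmented, motion-conditioned video diffusion model that processes multiview image pairs together with their projected 2D trajectories.

Evaluations at each stage and on the final results demonstrate the effectiveness of our unified architecture, showing user-controlled object motion harmonized with a high-quality 3D background.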
URL
https://arxiv.org/abs/2509.21888