Abstract
We present HOIDiNi, a text-driven diffusion framework for synthesizing realistic and plausible human-object interaction (HOI). HOI generation is extremely challenging because it demands strict contact accuracy while spanning a diverse motion manifold. Whereas prior work trades off realism against physical correctness, HOIDiNi achieves both by optimizing directly in the noise space of a pretrained diffusion model using Diffusion Noise Optimization (DNO). This is made feasible by our observation that the problem can be separated into two phases: an object-centric phase, which primarily makes discrete choices of hand-object contact locations, and a human-centric phase, which refines the full-body motion to realize this blueprint. This structured approach allows for precise hand-object contact without compromising motion naturalness. Quantitative, qualitative, and subjective evaluations on the GRAB dataset alone indicate that HOIDiNi outperforms prior works and baselines in contact accuracy, physical validity, and overall quality. Our results demonstrate the ability to generate complex, controllable interactions, including grasping, placing, and full-body coordination, driven solely by textual prompts.
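To make the idea of optimizing in noise space concrete, the following is a minimal toy sketch, not HOIDiNi's implementation. It assumes a frozen generator (a hypothetical `frozen_generator` standing in for a pretrained diffusion sampler) and a hypothetical scalar `contact_loss`, and it updates only the input noise. Finite differences stand in for the backpropagation through the sampler that a real noise-optimization setup would use; all names and constants here are illustrative assumptions.

```python
import math


def frozen_generator(z):
    """Hypothetical stand-in for a frozen, pretrained sampler.

    Maps a 2-D noise vector to a 2-D 'motion' vector. The weights are
    fixed constants and are never updated during optimization.
    """
    return [
        math.tanh(0.8 * z[0] + 0.3 * z[1]),
        math.tanh(-0.2 * z[0] + 0.9 * z[1]),
    ]


def contact_loss(x, target):
    """Toy constraint: the first output coordinate should reach `target`."""
    return (x[0] - target) ** 2


def optimize_noise(z0, target, steps=200, lr=0.5, eps=1e-5):
    """Gradient descent on the noise z while the generator stays fixed.

    The gradient is estimated with central finite differences, which is
    an assumption made here to keep the sketch dependency-free.
    """
    z = list(z0)
    for _ in range(steps):
        grad = []
        for i in range(len(z)):
            z_plus = list(z)
            z_plus[i] += eps
            z_minus = list(z)
            z_minus[i] -= eps
            loss_plus = contact_loss(frozen_generator(z_plus), target)
            loss_minus = contact_loss(frozen_generator(z_minus), target)
            grad.append((loss_plus - loss_minus) / (2 * eps))
        z = [zi - lr * gi for zi, gi in zip(z, grad)]
    return z


target = 0.5
z_init = [1.5, -1.0]  # arbitrary starting noise for the illustration
z_opt = optimize_noise(z_init, target)

loss_before = contact_loss(frozen_generator(z_init), target)
loss_after = contact_loss(frozen_generator(z_opt), target)
print(f"loss before: {loss_before:.4f}, loss after: {loss_after:.2e}")
```

The point of the sketch is only that the constraint is enforced by moving the noise rather than by retraining or editing the generator, which is why the output stays within what the frozen model can produce.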
URL
https://arxiv.org/abs/2506.15625