Paper Reading AI Learner

HOIDiNi: Human-Object Interaction through Diffusion Noise Optimization

2025-06-18 16:54:56
Roey Ron, Guy Tevet, Haim Sawdayee, Amit H. Bermano

Abstract

We present HOIDiNi, a text-driven diffusion framework for synthesizing realistic and plausible human-object interaction (HOI). HOI generation is extremely challenging, as it demands strict contact accuracy alongside a diverse motion manifold. While the current literature trades off realism against physical correctness, HOIDiNi achieves both by optimizing directly in the noise space of a pretrained diffusion model using Diffusion Noise Optimization (DNO). This is made feasible by our observation that the problem can be separated into two phases: an object-centric phase, which primarily makes discrete choices of hand-object contact locations, and a human-centric phase, which refines the full-body motion to realize this blueprint. This structured approach allows precise hand-object contact without compromising motion naturalness. Quantitative, qualitative, and subjective evaluations on the GRAB dataset alone clearly indicate that HOIDiNi outperforms prior works and baselines in contact accuracy, physical validity, and overall quality. Our results demonstrate the ability to generate complex, controllable interactions, including grasping, placing, and full-body coordination, driven solely by textual prompts. Project page: this https URL.
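The noise-space optimization idea the abstract describes can be sketched with a toy example: instead of editing a generated sample directly, DNO-style methods keep the generator frozen and run gradient descent on its input noise until the output satisfies a task objective. The sketch below is purely illustrative and makes stand-in assumptions: a fixed orthogonal linear map plays the role of the pretrained diffusion sampler, and a target vector plays the role of the hand-object contact objective; neither reflects the paper's actual model.

```python
import numpy as np

# Toy sketch of Diffusion Noise Optimization (DNO): optimize the *input
# noise* z of a frozen generator G so that G(z) meets an objective.
# Assumption: G is a fixed orthogonal linear map standing in for the
# pretrained diffusion sampler; `target` stands in for contact constraints.

rng = np.random.default_rng(0)
dim = 8
W, _ = np.linalg.qr(rng.standard_normal((dim, dim)))  # frozen "sampler"

def generate(z):
    """Stand-in for running the full diffusion sampling chain on noise z."""
    return W @ z

target = rng.standard_normal(dim)  # stand-in for the contact objective

z = rng.standard_normal(dim)       # initial diffusion noise (the optimization variable)
lr = 0.1
for _ in range(300):
    residual = generate(z) - target      # objective: ||G(z) - target||^2
    grad_z = 2.0 * (W.T @ residual)      # gradient of the objective w.r.t. the noise
    z = z - lr * grad_z                  # update the noise, never the sample

final_err = float(np.linalg.norm(generate(z) - target))
print(f"final objective error: {final_err:.2e}")
```

Because the generator's weights never change, the output stays on the model's learned manifold; only the noise seed moves, which is what lets this family of methods trade constraint satisfaction against naturalness. In HOIDiNi this optimization is run through an actual diffusion sampler with contact-based objectives, which is far more expensive than this linear toy.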

Abstract (translated)

我们提出了HOIDiNi,这是一个用于合成逼真且合理的人-物交互(HOI)的文本驱动扩散框架。HOI生成极具挑战性,因为它既要求严格的接触精度,又涉及多样的运动流形。现有文献往往在真实感和物理正确性之间做出权衡,而HOIDiNi通过Diffusion Noise Optimization(DNO)直接在预训练扩散模型的噪声空间中进行优化,同时实现了两者。这之所以可行,是因为我们观察到该问题可以分为两个阶段:以物体为中心的阶段,主要对手-物接触位置做离散选择;以人为中心的阶段,细化全身运动以实现这一蓝图。这种结构化方法允许精确的手-物接触,而不损害动作的自然性。仅在GRAB数据集上的定量、定性和主观评估就清晰地表明,HOIDiNi在接触准确性、物理有效性和整体质量方面超越了先前的工作和基线方法。我们的结果展示了仅由文本提示驱动即可生成复杂且可控的交互(包括抓取、放置和全身协调)的能力。项目页面:this https URL。

URL

https://arxiv.org/abs/2506.15625

PDF

https://arxiv.org/pdf/2506.15625.pdf

