Diff-IP2D: Diffusion-Based Hand-Object Interaction Prediction on Egocentric Videos

2024-05-07 14:51:05
Junyi Ma, Jingyi Xu, Xieyuanli Chen, Hesheng Wang

Abstract

Understanding how humans behave during hand-object interaction is vital for applications in service robot manipulation and extended reality. To this end, several recent works predict hand trajectories and object affordances simultaneously from human egocentric videos. These predictions serve as a representation of future hand-object interactions, indicating potential human motion and motivation. However, existing approaches mostly adopt an autoregressive paradigm for unidirectional prediction, which lacks mutual constraints across the holistic future sequence and accumulates errors along the time axis. Moreover, these works largely overlook the effect of camera egomotion on first-person-view predictions. To address these limitations, we propose a novel diffusion-based interaction prediction method, Diff-IP2D, which forecasts future hand trajectories and object affordances concurrently in an iterative, non-autoregressive manner. We transform the sequential 2D images into a latent feature space and design a denoising diffusion model to predict future latent interaction features conditioned on past ones. Motion features are further integrated into the conditional denoising process to make Diff-IP2D aware of the camera wearer's dynamics for more accurate interaction prediction. Experimental results show that our method significantly outperforms state-of-the-art baselines on both off-the-shelf metrics and our proposed new evaluation protocol, highlighting the efficacy of a generative paradigm for 2D hand-object interaction prediction. The code of Diff-IP2D will be released at this https URL.
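
To make the pipeline above concrete, the following is a minimal PyTorch sketch of the kind of conditional denoising step the abstract describes: a network predicts the noise added to future latent interaction features, conditioned on past latent features and camera egomotion features. This is not the authors' implementation; every module name, tensor shape, and the toy noise schedule are illustrative assumptions.

import torch
import torch.nn as nn

class ConditionalDenoiser(nn.Module):
    """Predicts the noise added to future latent interaction features,
    conditioned on past latents and camera-wearer (egomotion) features.
    Hypothetical architecture; the paper's actual network may differ."""
    def __init__(self, dim=256, heads=4, layers=4):
        super().__init__()
        block = nn.TransformerEncoderLayer(d_model=dim, nhead=heads, batch_first=True)
        self.backbone = nn.TransformerEncoder(block, num_layers=layers)
        self.time_embed = nn.Sequential(nn.Linear(1, dim), nn.SiLU(), nn.Linear(dim, dim))
        self.motion_proj = nn.Linear(dim, dim)  # injects egomotion into the condition

    def forward(self, noisy_future, past, motion, t):
        # noisy_future: (B, T_f, dim) noised future latents
        # past:         (B, T_p, dim) latents of observed frames
        # motion:       (B, T_p, dim) egomotion features of observed frames
        # t:            (B,) diffusion time in [0, 1]
        cond = past + self.motion_proj(motion)
        noisy = noisy_future + self.time_embed(t[:, None, None])
        out = self.backbone(torch.cat([cond, noisy], dim=1))
        return out[:, past.size(1):]  # predicted noise for the future steps only

# One DDPM-style training step under these assumptions:
B, T_p, T_f, D = 8, 10, 4, 256
model = ConditionalDenoiser(dim=D)
past, motion = torch.randn(B, T_p, D), torch.randn(B, T_p, D)
future = torch.randn(B, T_f, D)          # ground-truth future latents (placeholder)
t = torch.rand(B)                        # continuous diffusion time
alpha = (1.0 - t).view(B, 1, 1)          # toy linear schedule, not the paper's
noise = torch.randn_like(future)
noisy = alpha.sqrt() * future + (1.0 - alpha).sqrt() * noise
loss = nn.functional.mse_loss(model(noisy, past, motion, t), noise)
loss.backward()

At inference, the same network would be applied iteratively to denoise a randomly initialized latent sequence into future interaction features for all horizons jointly, which is what makes the paradigm non-autoregressive rather than step-by-step.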

URL

https://arxiv.org/abs/2405.04370

PDF

https://arxiv.org/pdf/2405.04370.pdf

