Abstract
Understanding how humans would behave during hand-object interaction is vital for applications in service robot manipulation and extended reality. To this end, several recent works jointly predict hand trajectories and object affordances from egocentric videos; together these serve as a representation of future hand-object interaction, indicating prospective human motion and intention. However, existing approaches mostly adopt an autoregressive paradigm for unidirectional prediction, which lacks mutual constraints within the holistic future sequence and accumulates errors along the time axis. Moreover, these works largely overlook the effect of camera egomotion on first-person-view predictions. To address these limitations, we propose Diff-IP2D, a novel diffusion-based interaction prediction method that forecasts future hand trajectories and object affordances concurrently in an iterative, non-autoregressive manner. We transform sequential 2D images into a latent feature space and design a denoising diffusion model to predict future latent interaction features conditioned on past ones. Motion features are further integrated into the conditional denoising process to make Diff-IP2D aware of the camera wearer's dynamics, yielding more accurate interaction prediction. Experimental results show that our method significantly outperforms state-of-the-art baselines on both off-the-shelf metrics and our newly proposed evaluation protocol, highlighting the efficacy of a generative paradigm for 2D hand-object interaction prediction. The code of Diff-IP2D will be released at this https URL.
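The abstract describes conditional denoising diffusion over latent interaction features: all future latents are sampled jointly from noise, conditioned on past latents and egomotion features, rather than frame by frame. The following is a minimal toy sketch of such a non-autoregressive reverse-diffusion loop; the noise predictor `denoiser`, the conditioning scheme, and all shapes are illustrative assumptions, not the paper's actual architecture.

```python
import numpy as np

rng = np.random.default_rng(0)

T = 50                               # number of diffusion steps (assumed)
betas = np.linspace(1e-4, 0.02, T)   # standard linear noise schedule
alphas = 1.0 - betas
alpha_bars = np.cumprod(alphas)

def denoiser(x_t, cond, t):
    # Stub noise predictor. In the actual method this would be a learned
    # network conditioned on past interaction features, motion features,
    # and the timestep t; here it is a fixed toy function.
    return 0.1 * x_t + 0.01 * cond

def sample_future_latents(past_latents, motion_feats, dim=8):
    """Non-autoregressive sampling: all future latents denoised jointly,
    so later steps cannot accumulate errors from earlier predictions."""
    cond = past_latents.mean() + motion_feats.mean()  # toy conditioning signal
    x = rng.standard_normal(dim)                      # start from pure noise
    for t in reversed(range(T)):
        eps = denoiser(x, cond, t)
        # DDPM posterior-mean update
        x = (x - betas[t] / np.sqrt(1.0 - alpha_bars[t]) * eps) / np.sqrt(alphas[t])
        if t > 0:
            x = x + np.sqrt(betas[t]) * rng.standard_normal(dim)
    return x

past = rng.standard_normal((4, 8))    # latents of 4 observed past frames (assumed)
motion = rng.standard_normal((4, 3))  # egomotion features per frame (assumed)
future = sample_future_latents(past, motion)
print(future.shape)  # (8,)
```

The key contrast with an autoregressive predictor is that the whole future sequence is refined together across denoising steps, letting earlier and later timesteps constrain each other.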
URL
https://arxiv.org/abs/2405.04370