Look Ma, No Hands! Agent-Environment Factorization of Egocentric Videos

Abstract

The analysis and use of egocentric videos for robotic tasks is made challenging by occlusion due to the hand and by the visual mismatch between the human hand and a robot end-effector. In this sense, the human hand is a nuisance. However, hands often also provide a valuable signal, e.g., the hand pose may suggest what kind of object is being held. In this work, we propose to extract a factored representation of the scene that separates the agent (human hand) from the environment. This alleviates both occlusion and mismatch while preserving the signal, thereby easing the design of models for downstream robotics tasks. At the heart of this factorization is our proposed Video Inpainting via Diffusion Model (VIDM), which leverages both a prior on real-world images (through a large-scale pre-trained diffusion model) and the appearance of the object in earlier frames of the video (through attention). Our experiments demonstrate the effectiveness of VIDM at improving inpainting quality on egocentric videos and the power of our factored representation for numerous tasks: object detection, 3D reconstruction of manipulated objects, and learning of reward functions, policies, and affordances from videos.
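
To make the idea of a factored representation concrete, the following is a minimal sketch and not the authors' implementation. It assumes a binary hand mask is already available and uses a trivial stand-in inpainter that copies pixels from an earlier frame; VIDM itself is a diffusion model that attends to earlier frames. The names `factorize_frame` and `copy_from_previous` are hypothetical and introduced only for illustration.

```python
import numpy as np

def factorize_frame(frame, hand_mask, inpaint_fn):
    """Split a frame into an agent layer and an agent-free environment layer.

    frame:      (H, W, 3) image array
    hand_mask:  (H, W) boolean array, True where the hand is visible
    inpaint_fn: callable (frame, hand_mask) -> (H, W, 3) image with the
                masked region filled in (assumed interface, not from the paper)
    """
    # Agent layer: keep only hand pixels, zero everywhere else.
    agent = np.where(hand_mask[..., None], frame, 0).astype(frame.dtype)
    # Environment layer: the scene with the hand region filled in.
    environment = inpaint_fn(frame, hand_mask)
    return agent, environment

def copy_from_previous(previous_frame):
    """Build a naive inpainter that copies masked pixels from an earlier frame.

    This is only a placeholder for a learned video inpainting model.
    """
    def inpaint(frame, hand_mask):
        out = frame.copy()
        out[hand_mask] = previous_frame[hand_mask]
        return out
    return inpaint

# Toy example: a uniform background with a bright "hand" patch in the centre.
previous = np.full((4, 4, 3), 10, dtype=np.uint8)   # earlier frame, no hand
current = previous.copy()
hand_mask = np.zeros((4, 4), dtype=bool)
hand_mask[1:3, 1:3] = True
current[hand_mask] = 255                            # hand occludes the scene

agent, environment = factorize_frame(
    current, hand_mask, copy_from_previous(previous)
)
```

In this toy setting the environment layer recovers the unoccluded background, while the agent layer retains the hand pixels so that the signal they carry (for example, hand pose) is not discarded.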

URL

https://arxiv.org/abs/2305.16301

PDF

https://arxiv.org/pdf/2305.16301.pdf