Abstract
In this paper, we model complex activities that occur in a typical household. We propose to use programs, i.e., sequences of atomic actions and interactions, as a high-level representation of complex tasks. Programs are attractive because they describe a task unambiguously and can be executed by agents. However, no existing database provides this type of information. To fill this gap, we first crowd-source programs for a variety of activities that happen in people's homes, via a game-like interface used for teaching kids how to code. Using the collected dataset, we show how to learn to extract programs directly from natural language descriptions or from videos. We then implement the most common atomic (inter)actions in the Unity3D game engine and use our programs to "drive" an artificial agent to execute tasks in a simulated household environment. Our VirtualHome simulator allows us to create a large activity video dataset with rich ground truth, enabling training and testing of video understanding models. We further showcase examples of our agent performing tasks in VirtualHome based on language descriptions.
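To make the idea of a program as a sequence of atomic (inter)actions concrete, here is a minimal Python sketch. It is not the VirtualHome API or the dataset's actual file format: the `Step` class, the `parse_program` helper, the bracket notation, and the "watch TV" example are all hypothetical, chosen only to illustrate the representation.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class Step:
    """One atomic action applied to zero or more objects (hypothetical schema)."""
    action: str
    objects: Tuple[str, ...] = ()

    def __str__(self) -> str:
        args = " ".join(f"<{name}>" for name in self.objects)
        return f"[{self.action}] {args}".strip()


def parse_step(line: str) -> Step:
    """Parse a line such as '[Walk] <television>' into a Step.

    Assumption: object names are single tokens without spaces.
    """
    line = line.strip()
    if not line.startswith("[") or "]" not in line:
        raise ValueError(f"not a program step: {line!r}")
    action, _, rest = line[1:].partition("]")
    objects = tuple(token.strip("<>") for token in rest.split())
    return Step(action=action.strip(), objects=objects)


def parse_program(text: str) -> List[Step]:
    """Turn a multi-line description into an ordered list of steps."""
    return [parse_step(line) for line in text.splitlines() if line.strip()]


# Illustrative activity, invented for this sketch (not taken from the dataset).
WATCH_TV = """
[Walk] <television>
[SwitchOn] <television>
[Walk] <sofa>
[Sit] <sofa>
[Watch] <television>
"""

program = parse_program(WATCH_TV)
for step in program:
    print(step)
```

Because each step names exactly one action and its target objects, the sequence leaves no room for interpretation, which is the property that lets an agent execute it step by step.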
URL
https://arxiv.org/abs/1806.07011