PLAICraft: Large-Scale Time-Aligned Vision-Speech-Action Dataset for Embodied AI

2025-05-19 05:00:47
Yingchen He, Christian D. Weilbach, Martyna E. Wojciechowska, Yuxuan Zhang, Frank Wood

Abstract

Advances in deep generative modelling have made it increasingly plausible to train human-level embodied agents. Yet progress has been limited by the absence of large-scale, real-time, multi-modal, and socially interactive datasets that reflect the sensory-motor complexity of natural environments. To address this, we present PLAICraft, a novel data collection platform and dataset capturing multiplayer Minecraft interactions across five time-aligned modalities: video, game output audio, microphone input audio, mouse, and keyboard actions. Each modality is logged with millisecond time precision, enabling the study of synchronous, embodied behaviour in a rich, open-ended world. The dataset comprises over 10,000 hours of gameplay from more than 10,000 global participants. (We have done a privacy review for the public release of an initial 200-hour subset of the dataset, with plans to release most of the dataset over time.) Alongside the dataset, we provide an evaluation suite for benchmarking model capabilities in object recognition, spatial awareness, language grounding, and long-term memory. PLAICraft opens a path toward training and evaluating agents that act fluently and purposefully in real time, paving the way for truly embodied artificial intelligence.
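The abstract does not describe the on-disk format of the released data, so the following is only an illustrative sketch: a hypothetical Python container for one time-aligned sample and a toy routine that buckets per-modality events into shared millisecond windows. All names here (PLAICraftSample, bucket_events, the field names) are assumptions made for illustration, not the dataset's actual schema.

from dataclasses import dataclass
from typing import Optional, Tuple

@dataclass
class PLAICraftSample:
    # Hypothetical container for one time-aligned sample; field names are
    # illustrative only and do not reflect the released file format.
    timestamp_ms: int                      # millisecond-precision timestamp
    video_frame: Optional[bytes] = None    # encoded game video frame
    game_audio: Optional[bytes] = None     # chunk of game output audio
    mic_audio: Optional[bytes] = None      # chunk of microphone input audio
    mouse_event: Optional[Tuple[int, int, int]] = None  # e.g. (dx, dy, buttons)
    key_event: Optional[str] = None        # e.g. "W_down", "SPACE_up"

def bucket_events(events, window_ms=50):
    # Group (timestamp_ms, modality, payload) tuples into windows of
    # `window_ms` milliseconds so the five streams can be consumed together.
    windows = {}
    for ts, modality, payload in events:
        windows.setdefault(ts // window_ms, {})[modality] = payload
    return windows

def window_to_sample(window_start_ms, modalities):
    # Build a PLAICraftSample from one bucket of per-modality payloads.
    return PLAICraftSample(timestamp_ms=window_start_ms, **modalities)

# Toy usage with synthetic events; real logs would come from the dataset files.
toy_events = [
    (12, "key_event", "W_down"),
    (13, "mouse_event", (4, -1, 0)),
    (30, "video_frame", b"<frame bytes>"),
    (61, "mic_audio", b"<pcm chunk>"),
]
windows = bucket_events(toy_events)
samples = [window_to_sample(bucket * 50, mods) for bucket, mods in sorted(windows.items())]
print(samples)

Any real loader would instead follow whatever schema ships with the released 200-hour subset.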

URL

https://arxiv.org/abs/2505.12707

PDF

https://arxiv.org/pdf/2505.12707.pdf

