Abstract
The development of artificial intelligence systems is transitioning from creating static, task-specific models to dynamic, agent-based systems capable of performing well across a wide range of applications. We propose an Interactive Agent Foundation Model that uses a novel multi-task agent training paradigm to train AI agents across diverse domains, datasets, and tasks. Our training paradigm unifies pre-training strategies including visual masked auto-encoders, language modeling, and next-action prediction, enabling a versatile and adaptable AI framework. We demonstrate our framework in three separate domains -- Robotics, Gaming AI, and Healthcare -- where our model generates meaningful and contextually relevant outputs in each area. The strength of our approach lies in its generality: it leverages a variety of data sources, such as robotics sequences, gameplay data, large-scale video datasets, and textual information, for effective multimodal and multi-task learning. Our approach offers a promising avenue for developing generalist, action-taking, multimodal systems.
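
As an illustration only (the abstract gives no implementation details), below is a minimal sketch of how a single training step might combine the three named objectives -- masked visual reconstruction, language modeling, and next-action prediction -- into one loss. All function names, loss weights, and tensor shapes here are hypothetical assumptions, not the paper's actual method.

import torch
import torch.nn.functional as F

def unified_loss(pred_patches, target_patches, mask,
                 lm_logits, lm_targets,
                 action_logits, action_targets,
                 w_mae=1.0, w_lm=1.0, w_action=1.0):
    # Masked auto-encoder term: mean squared error on masked patches only.
    per_patch = ((pred_patches - target_patches) ** 2).mean(dim=-1)
    mae = (per_patch * mask).sum() / mask.sum().clamp(min=1)
    # Language-modeling term: next-token cross-entropy.
    lm = F.cross_entropy(lm_logits.flatten(0, 1), lm_targets.flatten())
    # Next-action prediction term: cross-entropy over a discrete action set.
    act = F.cross_entropy(action_logits, action_targets)
    # Weighted sum of the three objectives (weights are hypothetical).
    return w_mae * mae + w_lm * lm + w_action * act

# Toy usage with random tensors, just to show the expected shapes.
B, N, D = 2, 16, 32   # batch size, visual patches, patch dimension
V, T = 100, 8         # text vocabulary size, text sequence length
A = 10                # number of discrete actions
loss = unified_loss(
    torch.randn(B, N, D, requires_grad=True),   # predicted patches
    torch.randn(B, N, D),                       # ground-truth patches
    torch.randint(0, 2, (B, N)).float(),        # 1 = patch was masked
    torch.randn(B, T, V, requires_grad=True),   # next-token logits
    torch.randint(0, V, (B, T)),                # token targets
    torch.randn(B, A, requires_grad=True),      # action logits
    torch.randint(0, A, (B,)),                  # action targets
)
loss.backward()  # in a real model, gradients reach a shared backbone

The single weighted sum means one shared backbone can receive gradients from visual, textual, and action supervision at once, which is the unification the abstract describes at a high level.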
URL
https://arxiv.org/abs/2402.05929