Abstract
Recent advances in deep reinforcement learning have showcased its potential for tackling complex tasks. However, experiments on visual control tasks have revealed that state-of-the-art reinforcement learning models struggle with out-of-distribution generalization. Conversely, expressing higher-level concepts and global contexts is relatively easy in language. Building upon the recent success of large language models, our main objective is to improve the state abstraction technique in reinforcement learning by leveraging language for robust action selection. Specifically, we focus on learning language-grounded visual features to enhance world model learning, a model-based reinforcement learning technique. To enforce our hypothesis explicitly, we mask out the bounding boxes of a few objects in the image observation and provide text prompts describing these masked objects. Subsequently, we predict the masked objects along with the surrounding regions as a pixel reconstruction, similar to the transformer-based masked autoencoder approach. Our proposed LanGWM (Language Grounded World Model) achieves state-of-the-art performance on the out-of-distribution test of the 100K interaction-step benchmark for iGibson point-navigation tasks. Furthermore, our proposed technique of explicit language-grounded visual representation learning has the potential to improve models for human-robot interaction, because the extracted visual features are language grounded.
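To make the masking-and-reconstruction idea concrete, the sketch below masks object bounding boxes in an image observation, encodes a text prompt describing the masked objects, and reconstructs the missing pixels with a transformer, in the spirit of a masked autoencoder. This is a minimal illustrative sketch, not the paper's actual architecture: every module name, dimension, and the toy text encoder are assumptions.

```python
# Minimal sketch (illustrative assumptions, not the paper's architecture) of
# language-grounded masked prediction: bounding-box regions of an image are
# masked, a text prompt describing the masked objects is encoded, and a small
# transformer predicts the missing pixels, MAE-style.

import torch
import torch.nn as nn


def mask_bounding_boxes(images, boxes):
    """Zero out the given bounding boxes (x1, y1, x2, y2) in a batch of images."""
    masked = images.clone()
    for b, (x1, y1, x2, y2) in enumerate(boxes):
        masked[b, :, y1:y2, x1:x2] = 0.0
    return masked


class LanguageGroundedMaskedPredictor(nn.Module):
    """Predict masked image patches conditioned on a text description (hypothetical)."""

    def __init__(self, img_size=64, patch=8, dim=128, vocab=1000, depth=2):
        super().__init__()
        self.num_patches = (img_size // patch) ** 2
        self.patch_embed = nn.Conv2d(3, dim, kernel_size=patch, stride=patch)
        self.pos_embed = nn.Parameter(torch.zeros(1, self.num_patches, dim))
        # Toy text encoder; a real pipeline would use a pretrained language model.
        self.text_embed = nn.Embedding(vocab, dim)
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=depth)
        self.decoder = nn.Linear(dim, patch * patch * 3)  # per-patch pixel reconstruction

    def forward(self, masked_images, prompt_tokens):
        # Tokenize the (already masked) image into patch embeddings.
        x = self.patch_embed(masked_images).flatten(2).transpose(1, 2) + self.pos_embed
        # Prepend language tokens so visual features can attend to the description.
        t = self.text_embed(prompt_tokens)
        h = self.encoder(torch.cat([t, x], dim=1))
        # Reconstruct pixels only for the image tokens.
        return self.decoder(h[:, t.size(1):])


if __name__ == "__main__":
    imgs = torch.rand(2, 3, 64, 64)              # dummy RGB observations
    boxes = [(8, 8, 32, 32), (16, 24, 48, 56)]   # dummy object bounding boxes
    prompts = torch.randint(0, 1000, (2, 6))     # dummy prompt token ids
    model = LanguageGroundedMaskedPredictor()
    recon = model(mask_bounding_boxes(imgs, boxes), prompts)
    print(recon.shape)  # (2, 64 patches, 8*8*3 pixels per patch)
```

In this sketch the reconstruction target could be restricted to patches overlapping the masked boxes (plus their surroundings), which mirrors the abstract's description of predicting the masked objects along with nearby regions.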
URL
https://arxiv.org/abs/2311.17593