
LanGWM: Language Grounded World Model

2023-11-29 12:41:55
Rudra P.K. Poudel, Harit Pandya, Chao Zhang, Roberto Cipolla

Abstract

Recent advances in deep reinforcement learning have showcased its potential in tackling complex tasks. However, experiments on visual control tasks have revealed that state-of-the-art reinforcement learning models struggle with out-of-distribution generalization. Conversely, expressing higher-level concepts and global contexts is relatively easy using language. Building upon the recent success of large language models, our main objective is to improve the state abstraction technique in reinforcement learning by leveraging language for robust action selection. Specifically, we focus on learning language-grounded visual features to enhance world model learning, a model-based reinforcement learning technique. To enforce our hypothesis explicitly, we mask out the bounding boxes of a few objects in the image observation and provide text prompts describing these masked objects. Subsequently, we predict the masked objects along with the surrounding regions as pixel reconstruction, similar to the transformer-based masked autoencoder approach. Our proposed LanGWM: Language Grounded World Model achieves state-of-the-art performance on the out-of-distribution test of the 100K interaction-step benchmark of iGibson point navigation tasks. Furthermore, our proposed technique of explicit language-grounded visual representation learning has the potential to improve models for human-robot interaction, because our extracted visual features are language grounded.
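
The sketch below illustrates the masking-and-reconstruction idea the abstract describes: zero out object bounding boxes in the image observation, condition on a text description of the masked objects, and reconstruct the masked pixels, in the spirit of a transformer-based masked autoencoder. This is not the authors' implementation; the module names, layer sizes, and the toy text encoder (a simple embedding bag standing in for a large language model) are assumptions for illustration only.

```python
# Minimal sketch of language-conditioned masked reconstruction (assumptions noted above).
import torch
import torch.nn as nn
import torch.nn.functional as F


def mask_bounding_boxes(images, boxes):
    """Zero out bounding-box regions and return (masked_images, mask).

    images: (B, C, H, W) float tensor
    boxes:  list of (x1, y1, x2, y2) pixel boxes, one per batch element
    """
    mask = torch.zeros_like(images[:, :1])          # (B, 1, H, W), 1 = masked
    for b, (x1, y1, x2, y2) in enumerate(boxes):
        mask[b, :, y1:y2, x1:x2] = 1.0
    masked_images = images * (1.0 - mask)           # drop pixels inside the boxes
    return masked_images, mask


class LanguageGroundedReconstructor(nn.Module):
    """Toy encoder-decoder that reconstructs masked pixels conditioned on text."""

    def __init__(self, vocab_size=1000, text_dim=64, img_channels=3):
        super().__init__()
        # Stand-in for a pretrained language-model encoder of the object descriptions.
        self.text_embed = nn.EmbeddingBag(vocab_size, text_dim)
        self.img_encoder = nn.Sequential(
            nn.Conv2d(img_channels, 32, 4, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 4, stride=2, padding=1), nn.ReLU(),
        )
        self.fuse = nn.Conv2d(64 + text_dim, 64, 1)  # inject text features into image features
        self.decoder = nn.Sequential(
            nn.ConvTranspose2d(64, 32, 4, stride=2, padding=1), nn.ReLU(),
            nn.ConvTranspose2d(32, img_channels, 4, stride=2, padding=1),
        )

    def forward(self, masked_images, token_ids):
        feat = self.img_encoder(masked_images)       # (B, 64, H/4, W/4)
        txt = self.text_embed(token_ids)             # (B, text_dim)
        txt = txt[:, :, None, None].expand(-1, -1, feat.size(2), feat.size(3))
        fused = self.fuse(torch.cat([feat, txt], dim=1))
        return self.decoder(fused)                   # reconstructed image


# Usage: supervise only the masked (object) regions, as in a masked autoencoder.
images = torch.rand(2, 3, 64, 64)
boxes = [(8, 8, 24, 24), (32, 16, 56, 40)]
token_ids = torch.randint(0, 1000, (2, 5))           # tokenized object descriptions
masked_images, mask = mask_bounding_boxes(images, boxes)
model = LanguageGroundedReconstructor()
recon = model(masked_images, token_ids)
loss = F.mse_loss(recon * mask, images * mask)        # pixel loss on masked regions only
```

In the paper's setting, the visual features learned this way would feed a world model for action selection; the sketch only shows the language-grounded reconstruction objective.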


URL

https://arxiv.org/abs/2311.17593

PDF

https://arxiv.org/pdf/2311.17593.pdf
