
Multimodal Human-Autonomous Agents Interaction Using Pre-Trained Language and Visual Foundation Models

2024-03-18 21:41:09
Linus Nwankwo, Elmar Rueckert

Abstract

In this paper, we extended the method proposed in [17] to enable humans to interact naturally with autonomous agents through vocal and textual conversations. Our extended method exploits the inherent capabilities of pre-trained large language models (LLMs), multimodal visual language models (VLMs), and speech recognition (SR) models to decode high-level natural language conversations and the semantic understanding of the robot's task environment, and to abstract them into actionable robot commands or queries. We performed a quantitative evaluation of our framework's understanding of natural vocal conversations with participants of different racial backgrounds and English-language accents. The participants interacted with the robot using both spoken and textual instructional commands. Based on the logged interaction data, our framework achieved 87.55% accuracy in decoding vocal commands, an 86.27% command execution success rate, and an average latency of 0.89 seconds from receiving a participant's vocal command to initiating the robot's physical action. The video demonstrations of this paper can be found at this https URL.
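The abstract describes a pipeline of the form speech recognition -> LLM decoding -> actionable robot command. The sketch below is a minimal illustration of that idea, not the authors' implementation: it assumes the open-source `whisper` package for speech recognition, while `call_llm`, `SYSTEM_PROMPT`, `send_to_robot`, and the file `user_request.wav` are hypothetical placeholders standing in for whatever LLM API and robot interface (e.g., a ROS action client) the framework actually uses.

```python
# Minimal sketch of a voice-to-robot-command pipeline:
# speech recognition (SR) -> LLM decoding -> actionable command.
# Assumes the open-source `whisper` package; the LLM call and the
# robot interface below are hypothetical placeholders.
import json

import whisper

sr_model = whisper.load_model("base")  # pre-trained SR model

SYSTEM_PROMPT = (
    "You translate natural-language instructions into robot commands. "
    'Reply with JSON like {"action": "navigate", "target": "kitchen"}.'
)


def transcribe(audio_path: str) -> str:
    """Convert a spoken utterance into text with the SR model."""
    return sr_model.transcribe(audio_path)["text"]


def call_llm(system: str, user: str) -> str:
    """Stand-in for any chat-completion API; returns a canned reply here."""
    return '{"action": "navigate", "target": "kitchen"}'


def decode_command(utterance: str) -> dict:
    """Map free-form text to a structured command via the (placeholder) LLM."""
    return json.loads(call_llm(system=SYSTEM_PROMPT, user=utterance))


def send_to_robot(command: dict) -> None:
    """Placeholder for the robot's action interface."""
    print(f"Executing: {command}")


if __name__ == "__main__":
    text = transcribe("user_request.wav")  # e.g. "Please go to the kitchen"
    send_to_robot(decode_command(text))
```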

URL

https://arxiv.org/abs/2403.12273

PDF

https://arxiv.org/pdf/2403.12273.pdf

