Paper Reading AI Learner

Human Latency Conversational Turns for Spoken Avatar Systems

2024-04-11 20:20:48
Derek Jacoby, Tianyi Zhang, Aanchan Mohan, Yvonne Coady

Abstract

A problem with many current Large Language Model (LLM)-driven spoken dialogue systems is response time. Some efforts, such as Groq, address this issue with lightning-fast LLM processing, but we know from the cognitive psychology literature that in human-to-human dialogue, responses often begin before the speaker has completed their utterance. No amount of delay for LLM processing is acceptable if we wish to maintain human dialogue latencies. In this paper, we discuss methods for understanding an utterance in close to real time and generating a response so that the system can comply with human-level conversational turn delays. This means that the information content of the final part of the speaker's utterance is lost to the LLM. Using the Google NaturalQuestions (NQ) dataset, our results show that GPT-4 can effectively fill in the missing context from a word dropped at the end of a question over 60% of the time. We also provide examples of utterances and the impact of this information loss on the quality of the LLM's response, in the context of an avatar currently under development. These results indicate that a simple classifier could be used to determine whether a question is semantically complete or requires a filler phrase to allow a response to be generated within human dialogue time constraints.
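The experimental setup the abstract describes can be sketched in a few lines: drop the final word of a question (simulating audio the system has not yet heard), then decide whether the remaining fragment is semantically complete or needs a filler phrase while the LLM responds. The function names and the trailing-function-word heuristic below are illustrative assumptions, not the paper's actual classifier, and the LLM call itself is omitted.

```python
def truncate_last_word(question: str) -> str:
    """Remove the final word, mimicking an utterance cut off before completion."""
    words = question.strip().rstrip("?").split()
    return " ".join(words[:-1]) if len(words) > 1 else ""


def looks_semantically_complete(partial: str) -> bool:
    """Toy stand-in for the completeness classifier the abstract proposes.

    A real classifier would be trained on data; this heuristic only checks
    that the fragment does not end in a function word that usually precedes
    a noun (in which case a filler phrase would buy time for the LLM).
    """
    trailing_function_words = {"the", "a", "an", "of", "in", "to", "for", "with"}
    words = partial.lower().split()
    return bool(words) and words[-1] not in trailing_function_words


# Example with an NQ-style question:
question = "who won the world series in 2019"
partial = truncate_last_word(question)
print(partial)                              # "who won the world series in"
print(looks_semantically_complete(partial)) # False: ends in "in", needs a filler
```

In a full pipeline, fragments judged complete would be sent straight to the LLM, while incomplete ones would trigger a filler phrase ("Let me think about that...") so the spoken turn begins within human latency.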

URL

https://arxiv.org/abs/2404.16053

PDF

https://arxiv.org/pdf/2404.16053.pdf

