Paper Reading AI Learner

Paralinguistics-Enhanced Large Language Modeling of Spoken Dialogue

2023-12-23 18:14:56
Guan-Ting Lin, Prashanth Gurunath Shivakumar, Ankur Gandhe, Chao-Han Huck Yang, Yile Gu, Shalini Ghosh, Andreas Stolcke, Hung-yi Lee, Ivan Bulyko

Abstract

Large Language Models (LLMs) have demonstrated superior abilities in tasks such as chatting, reasoning, and question-answering. However, standard LLMs may ignore crucial paralinguistic information, such as sentiment, emotion, and speaking style, which are essential for achieving natural, human-like spoken conversation, especially when such information is conveyed by acoustic cues. We therefore propose Paralinguistics-enhanced Generative Pretrained Transformer (ParalinGPT), an LLM that utilizes the text and speech modalities to better model the linguistic content and paralinguistic attributes of spoken responses. The model takes the conversational context of text, speech embeddings, and paralinguistic attributes as input prompts within a serialized multitasking multi-modal framework. Specifically, our framework serializes tasks in the order of current paralinguistic attribute prediction, response paralinguistic attribute prediction, and response text generation, with autoregressive conditioning. We use the Switchboard-1 corpus as our spoken dialogue dataset, with its sentiment labels serving as the paralinguistic attribute. Experimental results indicate the proposed serialized multitasking method outperforms typical sequence classification techniques on current and response sentiment classification. Furthermore, leveraging conversational context and speech embeddings significantly improves both response text generation and sentiment prediction. Our proposed framework achieves relative improvements of 6.7%, 12.0%, and 3.5% in current sentiment accuracy, response sentiment accuracy, and response text BLEU score, respectively.
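The serialized multitasking described above can be sketched as building one autoregressive target sequence per turn, so that each sub-task conditions on the previous ones. The special-token names and layout below are illustrative assumptions, not taken from the paper.

```python
# Hypothetical sketch of the serialized multitasking target: the model emits
# the current-turn sentiment, then the predicted response sentiment, then the
# response text, in that order. Marker tokens (<cur_sent>, <res_sent>,
# <res_text>) are assumed names for illustration only.

def build_serialized_target(current_sentiment: str,
                            response_sentiment: str,
                            response_text: str) -> str:
    """Concatenate the three sub-tasks in the order the abstract specifies,
    so each later prediction is autoregressively conditioned on earlier ones."""
    return (f"<cur_sent> {current_sentiment} "
            f"<res_sent> {response_sentiment} "
            f"<res_text> {response_text}")

# Example target for one dialogue turn:
target = build_serialized_target("neutral", "positive", "sounds good to me")
```

At inference time, the same ordering means the response text is generated only after both sentiment predictions are already in the context.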


URL

https://arxiv.org/abs/2312.15316

PDF

https://arxiv.org/pdf/2312.15316.pdf

