Abstract
Large Language Models (LLMs) have demonstrated superior abilities in tasks such as chatting, reasoning, and question answering. However, standard LLMs may ignore crucial paralinguistic information, such as sentiment, emotion, and speaking style, which is essential for achieving natural, human-like spoken conversation, especially when such information is conveyed by acoustic cues. We therefore propose Paralinguistics-enhanced Generative Pretrained Transformer (ParalinGPT), an LLM that utilizes both text and speech modalities to better model the linguistic content and paralinguistic attributes of spoken responses. The model takes the conversational context of text, speech embeddings, and paralinguistic attributes as input prompts within a serialized multitasking multi-modal framework. Specifically, our framework serializes tasks in the order of current paralinguistic attribute prediction, response paralinguistic attribute prediction, and response text generation, with autoregressive conditioning across tasks. We use the Switchboard-1 corpus as our spoken dialogue dataset, with its sentiment labels serving as the paralinguistic attribute. Experimental results indicate that the proposed serialized multitasking method outperforms typical sequence classification techniques on both current and response sentiment classification. Furthermore, leveraging conversational context and speech embeddings significantly improves both response text generation and sentiment prediction. Our proposed framework achieves relative improvements of 6.7%, 12.0%, and 3.5% in current sentiment accuracy, response sentiment accuracy, and response text BLEU score, respectively.
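The serialized multitasking order described above can be sketched as follows. This is a minimal illustrative sketch, not the paper's actual implementation: the special-token names, function names, and plain-string serialization are assumptions, and the real model additionally injects speech embeddings and paralinguistic attributes as prompt embeddings rather than plain text.

```python
# Hypothetical sketch of the serialized multitask target sequence:
# current sentiment -> response sentiment -> response text, so each later
# task is autoregressively conditioned on the earlier predictions.
# Token names like [CUR_SENT] are illustrative, not the paper's vocabulary.

def serialize_targets(current_sentiment: str,
                      response_sentiment: str,
                      response_text: str) -> str:
    """Concatenate the three tasks into one autoregressive target string."""
    return (f"[CUR_SENT] {current_sentiment} "
            f"[RESP_SENT] {response_sentiment} "
            f"[RESP_TEXT] {response_text}")

def build_prompt(context_turns: list[str], target: str) -> str:
    """Prepend the text side of the conversational context to the targets."""
    context = " ".join(f"[TURN] {turn}" for turn in context_turns)
    return f"{context} {target}"

example = build_prompt(
    ["How was the concert?", "It was amazing, honestly."],
    serialize_targets("positive", "positive", "Glad you enjoyed it!"),
)
print(example)
```

Serializing the tasks this way lets a single decoder-only LLM be trained with the ordinary next-token objective while the response text generation step still sees the predicted sentiments in its left context.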
URL
https://arxiv.org/abs/2312.15316