Abstract
Interactions with virtual assistants typically start with a predefined trigger phrase followed by the user command. To make interactions with the assistant more intuitive, we explore whether it is feasible to drop the requirement that users must begin each command with a trigger phrase. We explore this task in three ways: First, we train classifiers using only acoustic information obtained from the audio waveform. Second, we take the decoder outputs of an automatic speech recognition (ASR) system, such as 1-best hypotheses, as input features to a large language model (LLM). Finally, we explore a multimodal system that combines acoustic and lexical features, as well as ASR decoder signals, in an LLM. Using multimodal information yields relative equal-error-rate improvements over text-only and audio-only models of up to 39% and 61%, respectively. Increasing the size of the LLM and training with low-rank adaptation leads to further relative EER reductions of up to 18% on our dataset.
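A minimal sketch of the multimodal idea described above, under stated assumptions: acoustic features are projected into the embedding space of a Hugging Face-style decoder-only LLM backbone (one that accepts inputs_embeds and returns last_hidden_state) and concatenated with the token embeddings of the ASR 1-best hypothesis before a binary scoring head. The class and parameter names (MultimodalInvocationDetector, audio_dim, llm_dim) are illustrative assumptions, not the paper's implementation.

import torch
import torch.nn as nn

class MultimodalInvocationDetector(nn.Module):
    """Sketch: score whether an utterance is device-directed without a trigger phrase."""

    def __init__(self, llm, audio_dim=512, llm_dim=4096):
        super().__init__()
        self.llm = llm                                   # frozen or LoRA-adapted decoder-only LLM backbone
        self.audio_proj = nn.Linear(audio_dim, llm_dim)  # map acoustic features into the LLM embedding space
        self.score = nn.Linear(llm_dim, 1)               # binary logit: device-directed vs. not

    def forward(self, audio_feats, text_embeds):
        # audio_feats: (B, T_a, audio_dim) acoustic representations of the utterance
        # text_embeds: (B, T_t, llm_dim) token embeddings of the ASR 1-best hypothesis
        audio_tokens = self.audio_proj(audio_feats)                # (B, T_a, llm_dim)
        fused = torch.cat([audio_tokens, text_embeds], dim=1)     # prepend audio "tokens" to the text
        hidden = self.llm(inputs_embeds=fused).last_hidden_state  # (B, T_a + T_t, llm_dim)
        return self.score(hidden[:, -1])                          # logit from the final position

ASR decoder signals such as hypothesis confidences could be injected the same way, as additional projected feature vectors; the equal error rate reported in the abstract would then be computed from these logits on a held-out set.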
URL
https://arxiv.org/abs/2403.14438