Robot systems in education can leverage the natural language understanding capabilities of large language models (LLMs) to provide assistance and facilitate learning. This paper proposes a multimodal interactive robot (PhysicsAssistant) built on YOLOv8 object detection, cameras, speech recognition, and an LLM-based chatbot to assist students in physics labs. We conduct a user study with ten 8th-grade students to empirically evaluate the performance of PhysicsAssistant against a human expert. The expert rates the assistant's responses to student queries on a 0-4 scale, based on Bloom's taxonomy, for their educational support. We compared PhysicsAssistant (YOLOv8+GPT-3.5-turbo) with GPT-4 and found that the human expert rated both systems the same for factual understanding. However, GPT-4's ratings for conceptual and procedural knowledge (3 and 3.2 vs 2.2 and 2.6, respectively) are significantly higher than PhysicsAssistant's (p < 0.05). At the same time, GPT-4's response time is significantly longer than PhysicsAssistant's (3.54 vs 1.64 sec, p < 0.05). Hence, despite its lower response quality relative to GPT-4, PhysicsAssistant has shown potential as a real-time lab assistant that provides timely responses and can offload teachers' repetitive tasks. To the best of our knowledge, this is the first attempt to build such an interactive multimodal robotic assistant for K-12 science (physics) education.
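As a rough illustration of the perception-to-LLM loop described above (not the authors' implementation), the sketch below assumes the ultralytics YOLOv8 API and the OpenAI chat API; the checkpoint, prompt, and file names are placeholders.

```python
# Illustrative only: detect lab objects with YOLOv8, then ask an LLM about them.
from ultralytics import YOLO          # pip install ultralytics
from openai import OpenAI             # pip install openai

detector = YOLO("yolov8n.pt")         # small pretrained COCO model as a stand-in
client = OpenAI()                     # assumes OPENAI_API_KEY is set

def answer_lab_question(image_path: str, question: str) -> str:
    # Run object detection on the current camera frame.
    result = detector(image_path)[0]
    labels = [result.names[int(box.cls)] for box in result.boxes]
    # Ground the student's (speech-recognized) question in the detected objects.
    prompt = (
        f"Objects visible on the lab bench: {', '.join(labels) or 'none'}.\n"
        f"A student asks: {question}\n"
        "Answer briefly and at an 8th-grade level."
    )
    reply = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[{"role": "user", "content": prompt}],
    )
    return reply.choices[0].message.content

print(answer_lab_question("bench.jpg", "Why does the cart speed up on the ramp?"))
```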
https://arxiv.org/abs/2403.18721
We present ZAEBUC-Spoken, a multilingual multidialectal Arabic-English speech corpus. The corpus comprises twelve hours of Zoom meetings involving multiple speakers role-playing a work situation where Students brainstorm ideas for a certain topic and then discuss them with an Interlocutor. The meetings cover different topics and are divided into phases with different language setups. The corpus presents a challenging set for automatic speech recognition (ASR), including two languages (Arabic and English) with Arabic spoken in multiple variants (Modern Standard Arabic, Gulf Arabic, and Egyptian Arabic) and English used with various accents. Adding to the complexity of the corpus, there is also code-switching between these languages and dialects. As part of our work, we take inspiration from established sets of transcription guidelines to present a set of guidelines handling issues of conversational speech, code-switching, and orthography of both languages. We further enrich the corpus with two layers of annotations: (1) dialectness level annotation for the portion of the corpus where mixing occurs between different variants of Arabic, and (2) automatic morphological annotations, including tokenization, lemmatization, and part-of-speech tagging.
https://arxiv.org/abs/2403.18182
End-to-end automatic speech recognition (E2E ASR) systems often suffer from mistranscription of domain-specific phrases, such as named entities, sometimes leading to catastrophic failures in downstream tasks. A family of fast and lightweight named entity correction (NEC) models for ASR has recently been proposed; these models normally build on phonetic-level edit distance algorithms and have shown impressive NEC performance. However, as the named entity (NE) list grows, the problems of phonetic confusion in the NE list are exacerbated; for example, homophone ambiguities increase substantially. In view of this, we propose a novel Description Augmented Named entity CorrEctoR (dubbed DANCER), which leverages entity descriptions to provide additional information that helps mitigate phonetic confusion for NEC on ASR transcriptions. To this end, an efficient entity description augmented masked language model (EDA-MLM), comprising a dense retrieval model, is introduced, enabling the MLM to adapt swiftly to domain-specific entities for the NEC task. A series of experiments conducted on the AISHELL-1 and Homophone datasets confirm the effectiveness of our modeling approach. DANCER outperforms a strong baseline, the phonetic edit-distance-based NEC model (PED-NEC), with a relative character error rate (CER) reduction of about 7% on AISHELL-1 for named entities. More notably, when tested on Homophone, which contains named entities with high phonetic confusion, DANCER offers a more pronounced relative CER reduction of 46% over PED-NEC for named entities.
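For context, a toy sketch of the phonetic edit-distance matching that PED-NEC-style correctors build on (DANCER's description-augmented model is not reproduced here); the phoneme sequences and entity list are made up.

```python
# Toy sketch of phonetic-edit-distance named-entity correction (PED-NEC style).
# Phoneme sequences are assumed to come from some G2P front end.
def edit_distance(a, b):
    d = [[i + j if i * j == 0 else 0 for j in range(len(b) + 1)] for i in range(len(a) + 1)]
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            d[i][j] = min(d[i-1][j] + 1, d[i][j-1] + 1, d[i-1][j-1] + (a[i-1] != b[j-1]))
    return d[len(a)][len(b)]

def correct_entity(hyp_phones, ne_list, max_dist=2):
    """Replace a mistranscribed span with the closest named entity, if close enough."""
    best = min(ne_list, key=lambda ne: edit_distance(hyp_phones, ne["phones"]))
    return best["text"] if edit_distance(hyp_phones, best["phones"]) <= max_dist else None

ne_list = [{"text": "Zhang Wei", "phones": ["zh", "ang", "w", "ei"]}]
print(correct_entity(["zh", "ang", "w", "ai"], ne_list))   # -> "Zhang Wei"
```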
https://arxiv.org/abs/2403.17645
Automatic Speech Recognition (ASR) technology is fundamental in transcribing spoken language into text, with considerable applications in the clinical realm, including streamlining medical transcription and integrating with Electronic Health Record (EHR) systems. Nevertheless, challenges persist, especially when transcriptions contain noise, leading to significant drops in performance when Natural Language Processing (NLP) models are applied. Named Entity Recognition (NER), an essential clinical task, is particularly affected by such noise, often termed the ASR-NLP gap. Prior works have primarily studied ASR's efficiency in clean recordings, leaving a research gap concerning the performance in noisy environments. This paper introduces a novel dataset, BioASR-NER, designed to bridge the ASR-NLP gap in the biomedical domain, focusing on extracting adverse drug reactions and mentions of entities from the Brief Test of Adult Cognition by Telephone (BTACT) exam. Our dataset offers a comprehensive collection of almost 2,000 clean and noisy recordings. In addressing the noise challenge, we present an innovative transcript-cleaning method using GPT4, investigating both zero-shot and few-shot methodologies. Our study further delves into an error analysis, shedding light on the types of errors in transcription software, corrections by GPT4, and the challenges GPT4 faces. This paper aims to foster improved understanding and potential solutions for the ASR-NLP gap, ultimately supporting enhanced healthcare documentation practices.
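A minimal sketch of few-shot transcript cleaning with the OpenAI chat API, assuming GPT-4 access; the system instruction and example pair below are illustrative, not the prompts used for BioASR-NER.

```python
# Illustrative few-shot transcript cleaning; the paper's actual prompts differ.
from openai import OpenAI

client = OpenAI()

FEW_SHOT = [  # (noisy ASR output, manually cleaned reference) -- made-up example
    ("patient reports nauseau and head ake after the new med",
     "Patient reports nausea and headache after the new medication."),
]

def clean_transcript(noisy: str) -> str:
    messages = [{"role": "system",
                 "content": "You correct ASR errors in clinical transcripts without adding content."}]
    for bad, good in FEW_SHOT:
        messages.append({"role": "user", "content": bad})
        messages.append({"role": "assistant", "content": good})
    messages.append({"role": "user", "content": noisy})
    out = client.chat.completions.create(model="gpt-4", messages=messages)
    return out.choices[0].message.content
```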
https://arxiv.org/abs/2403.17363
Parameter-efficient adaptation methods have become a key mechanism for adapting large pre-trained models to downstream tasks. However, their per-task parameter overhead is still considered high when the number of downstream tasks to adapt for is large. We introduce an adapter module that is more efficient in large-scale multi-task adaptation scenarios. Our adapter is hierarchical in terms of how the adapter parameters are allocated: it consists of a single shared controller network and multiple task-level adapter heads, reducing the per-task parameter overhead without performance regression on downstream tasks. The adapter is also recurrent, so the entire set of adapter parameters is reused across different layers of the pre-trained model. Our Hierarchical Recurrent Adapter (HRA) outperforms previous adapter-based approaches as well as the full model fine-tuning baseline in both single- and multi-task adaptation settings when evaluated on automatic speech recognition tasks.
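A rough PyTorch reading of the idea, assuming a frozen backbone whose per-layer hidden states pass through one shared controller and cheap task-level heads; the dimensions and wiring are guesses, not the paper's exact architecture.

```python
# Rough sketch: one shared controller + cheap per-task heads, reused at every layer.
import torch
import torch.nn as nn

class HierarchicalRecurrentAdapter(nn.Module):
    def __init__(self, d_model: int, d_ctrl: int, num_tasks: int):
        super().__init__()
        # Shared controller: reused across tasks and across layers (hence "recurrent").
        self.controller = nn.GRUCell(d_model, d_ctrl)
        # Task-level heads: only these grow with the number of tasks.
        self.heads = nn.ModuleList(nn.Linear(d_ctrl, d_model) for _ in range(num_tasks))

    def forward(self, h: torch.Tensor, state: torch.Tensor, task_id: int):
        # h: (batch, d_model) hidden state from one layer of the frozen backbone.
        state = self.controller(h, state)
        return h + self.heads[task_id](state), state   # residual adapter output

adapter = HierarchicalRecurrentAdapter(d_model=256, d_ctrl=64, num_tasks=8)
h = torch.randn(4, 256)
state = torch.zeros(4, 64)
h, state = adapter(h, state, task_id=3)   # called again at the next backbone layer
```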
https://arxiv.org/abs/2403.19709
Text remains a relevant form of representation for information. Text documents are created either on digital-native platforms or through the conversion of other media, such as images and speech. While digital-native text is invariably obtained through physical or virtual keyboards, technologies such as OCR and speech recognition are used to transform images and speech signals into text content. All of these text-generation mechanisms also introduce errors into the captured text. This project aims to analyze the different kinds of errors that occur in text documents. The work employs two advanced deep neural network-based language models, namely BART and MarianMT, to rectify the anomalies present in the text. Transfer learning of these models on the available dataset is performed to fine-tune their capacity for error correction. A comparative study is conducted to investigate the effectiveness of these models in handling each of the defined error categories. It is observed that while both models can reduce the number of erroneous sentences by more than 20%, BART handles spelling errors far better (24.6%) than grammatical errors (8.8%).
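A minimal Hugging Face Transformers sketch of framing error correction as sequence-to-sequence generation; the generic facebook/bart-base checkpoint shown here would still need the fine-tuning on (noisy, clean) sentence pairs described above.

```python
# Minimal seq2seq correction sketch; facebook/bart-base must first be fine-tuned
# on (noisy sentence, corrected sentence) pairs as described in the project.
from transformers import BartForConditionalGeneration, BartTokenizer

name = "facebook/bart-base"
tokenizer = BartTokenizer.from_pretrained(name)
model = BartForConditionalGeneration.from_pretrained(name)

def correct(sentence: str) -> str:
    inputs = tokenizer(sentence, return_tensors="pt", truncation=True)
    ids = model.generate(**inputs, max_new_tokens=64, num_beams=4)
    return tokenizer.decode(ids[0], skip_special_tokens=True)

print(correct("Ths sentense contanes severel speling erors."))
```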
https://arxiv.org/abs/2403.16655
Interactions with virtual assistants typically start with a predefined trigger phrase followed by the user command. To make interactions with the assistant more intuitive, we explore whether it is feasible to drop the requirement that users must begin each command with a trigger phrase. We explore this task in three ways: First, we train classifiers using only acoustic information obtained from the audio waveform. Second, we take the decoder outputs of an automatic speech recognition (ASR) system, such as 1-best hypotheses, as input features to a large language model (LLM). Finally, we explore a multimodal system that combines acoustic and lexical features, as well as ASR decoder signals, in an LLM. Using multimodal information yields relative equal-error-rate (EER) improvements of up to 39% and 61% over text-only and audio-only models, respectively. Increasing the size of the LLM and training with low-rank adaptation leads to further relative EER reductions of up to 18% on our dataset.
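A sketch of the low-rank adaptation step using the peft library on a small text classifier over ASR 1-best hypotheses; the base checkpoint and target modules are stand-ins, not the paper's LLM or training setup.

```python
# Sketch: low-rank adaptation (LoRA) of a text classifier over ASR 1-best hypotheses.
# The base checkpoint and target modules are placeholders, not the paper's setup.
from transformers import AutoModelForSequenceClassification, AutoTokenizer
from peft import LoraConfig, TaskType, get_peft_model

base = "distilbert-base-uncased"          # stand-in for the LLM used in the paper
tokenizer = AutoTokenizer.from_pretrained(base)
model = AutoModelForSequenceClassification.from_pretrained(base, num_labels=2)

lora = LoraConfig(task_type=TaskType.SEQ_CLS, r=8, lora_alpha=16,
                  target_modules=["q_lin", "v_lin"], lora_dropout=0.1)
model = get_peft_model(model, lora)       # only the low-rank matrices are trainable
model.print_trainable_parameters()

batch = tokenizer(["set a timer for ten minutes"], return_tensors="pt")
logits = model(**batch).logits            # device-directed vs. not
```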
https://arxiv.org/abs/2403.14438
Speech recognition and translation systems perform poorly on noisy inputs, which are frequent in realistic environments. Augmenting these systems with visual signals has the potential to improve robustness to noise. However, audio-visual (AV) data is only available in limited amounts and for fewer languages than audio-only resources. To address this gap, we present XLAVS-R, a cross-lingual audio-visual speech representation model for noise-robust speech recognition and translation in over 100 languages. It is designed to maximize the benefits of limited multilingual AV pre-training data, by building on top of audio-only multilingual pre-training and simplifying existing pre-training schemes. Extensive evaluation on the MuAViC benchmark shows the strength of XLAVS-R on downstream audio-visual speech recognition and translation tasks, where it outperforms the previous state of the art by up to 18.5% WER and 4.7 BLEU given noisy AV inputs, and enables strong zero-shot audio-visual ability with audio-only fine-tuning.
https://arxiv.org/abs/2403.14402
Publishing open-source academic video recordings is an emergent and prevalent approach to sharing knowledge online. Such videos carry rich multimodal information including speech, the facial and body movements of the speakers, as well as the text and pictures in the slides and possibly even the papers. Although multiple academic video datasets have been constructed and released, few of them support both multimodal content recognition and understanding tasks, which is partially due to the lack of high-quality human annotations. In this paper, we propose a novel multimodal, multigenre, and multipurpose audio-visual academic lecture dataset (M$^3$AV), which has almost 367 hours of videos from five sources covering computer science, mathematics, and medical and biology topics. With high-quality human annotations of the spoken and written words, in particular high-value named entities, the dataset can be used for multiple audio-visual recognition and understanding tasks. Evaluations performed on contextual speech recognition, speech synthesis, and slide and script generation tasks demonstrate that the diversity of M$^3$AV makes it a challenging dataset.
https://arxiv.org/abs/2403.14168
This paper presents a new software framework for HRI experimentation with the sixth version of the widely used NAO robot produced by the United Robotics Group. Responding to the common demand from researchers for better performance and new features for NAO, the authors took advantage of the ability to run ROS2 onboard the NAO to develop a framework independent of the APIs provided by the manufacturer. Such a system provides NAO not only with the basic skills of a humanoid robot, such as walking and reproducing movements of interest, but also with features often used in HRI, such as speech recognition/synthesis, face and object detection, and the use of Generative Pre-trained Transformer (GPT) models for conversation. The developed code is therefore configured as a ready-to-use yet highly expandable and improvable tool, thanks to the possibilities provided by the ROS community.
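A minimal rclpy node sketch of the kind of glue such a framework provides (recognized speech in, chat reply out); the topic names and wiring here are hypothetical, not the framework's actual interfaces.

```python
# Minimal ROS2 node sketch; topic names are hypothetical, not the framework's API.
import rclpy
from rclpy.node import Node
from std_msgs.msg import String

class NaoChatBridge(Node):
    def __init__(self):
        super().__init__('nao_chat_bridge')
        # Recognized speech comes in as text ...
        self.create_subscription(String, 'speech/recognized_text', self.on_text, 10)
        # ... and the reply to be synthesized goes back out.
        self.reply_pub = self.create_publisher(String, 'speech/reply_text', 10)

    def on_text(self, msg: String):
        reply = String()
        reply.data = f"You said: {msg.data}"   # a GPT call would go here
        self.reply_pub.publish(reply)

def main():
    rclpy.init()
    rclpy.spin(NaoChatBridge())

if __name__ == '__main__':
    main()
```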
https://arxiv.org/abs/2403.13960
In this paper, we propose two novel approaches that integrate long-content information into the factorized neural transducer (FNT) architecture in both non-streaming (referred to as LongFNT) and streaming (referred to as SLongFNT) scenarios. We first investigate whether long-content transcriptions can improve vanilla conformer transducer (C-T) models. Our experiments indicate that vanilla C-T models do not exhibit improved performance when utilizing long-content transcriptions, possibly because the predictor network of C-T models does not function as a pure language model. Instead, FNT shows its potential in utilizing long-content information, so we propose the LongFNT model and explore the impact of long-content information in both text (LongFNT-Text) and speech (LongFNT-Speech). The proposed LongFNT-Text and LongFNT-Speech models further complement each other to achieve better performance, with transcription history proving more valuable to the model. The effectiveness of our LongFNT approach is evaluated on the LibriSpeech and GigaSpeech corpora, where it obtains relative word error rate (WER) reductions of 19% and 12%, respectively. Furthermore, we extend the LongFNT model to the streaming scenario, named SLongFNT, consisting of SLongFNT-Text and SLongFNT-Speech approaches that utilize long-content text and speech information. Experiments show that the proposed SLongFNT model achieves relative WER reductions of 26% and 17% on LibriSpeech and GigaSpeech, respectively, while keeping good latency compared to the FNT baseline. Overall, our proposed LongFNT and SLongFNT highlight the significance of considering long-content speech and transcription knowledge for improving both non-streaming and streaming speech recognition systems.
https://arxiv.org/abs/2403.13423
The traditional Automatic Video Dubbing (AVD) pipeline consists of three key modules, namely Automatic Speech Recognition (ASR), Neural Machine Translation (NMT), and Text-to-Speech (TTS). Within AVD pipelines, isometric-NMT algorithms are employed to regulate the length of the synthesized output text, guaranteeing that video and audio remain synchronized after the dubbing process. Previous approaches have focused on aligning the number of characters and words in the source and target language texts of machine translation models. However, our approach aims to align the number of phonemes instead, as they are closely associated with speech duration. In this paper, we present the development of an isometric NMT system using Reinforcement Learning (RL), with a focus on optimizing the alignment of phoneme counts in the source and target language sentence pairs. To evaluate our models, we propose the Phoneme Count Compliance (PCC) score, a measure of length compliance. Our approach demonstrates a substantial improvement of approximately 36% in the PCC score compared to state-of-the-art models when applied to English-Hindi language pairs. Moreover, we propose a student-teacher architecture within the framework of our RL approach to maintain a trade-off between the phoneme count and translation quality.
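The abstract does not give the exact formula for the PCC score, so the following is one plausible formulation under stated assumptions: the fraction of sentence pairs whose target phoneme count stays within a tolerance of the source count, with phoneme sequences coming from some external G2P tool.

```python
# One plausible form of a Phoneme Count Compliance (PCC) score -- an assumption,
# not the paper's exact definition. Phoneme sequences come from some G2P tool.
def pcc_score(pairs, tolerance=0.1):
    """pairs: list of (source_phonemes, target_phonemes); returns fraction compliant."""
    compliant = 0
    for src, tgt in pairs:
        ratio = len(tgt) / max(len(src), 1)
        if abs(ratio - 1.0) <= tolerance:      # target length within +/-10% of source
            compliant += 1
    return compliant / len(pairs) if pairs else 0.0

pairs = [(["HH", "AH", "L", "OW"], ["n", "a", "m", "a", "s", "t", "e"])]
print(f"PCC = {pcc_score(pairs):.2f}")   # 0.00: 7 vs 4 phonemes is out of tolerance
```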
https://arxiv.org/abs/2403.15469
The success of a specific neural network architecture is closely tied to the dataset and task it tackles; there is no one-size-fits-all solution. Thus, considerable efforts have been made to quickly and accurately estimate the performance of neural architectures, without full training or evaluation, for given tasks and datasets. Neural architecture encoding has played a crucial role in this estimation, and graph-based methods, which treat an architecture as a graph, have shown prominent performance. For enhanced representation learning of neural architectures, we introduce FlowerFormer, a powerful graph transformer that incorporates the information flows within a neural architecture. FlowerFormer consists of two key components: (a) bidirectional asynchronous message passing, inspired by the flows; (b) global attention built on flow-based masking. Our extensive experiments demonstrate the superiority of FlowerFormer over existing neural encoding methods, and its effectiveness extends beyond computer vision models to include graph neural networks and automatic speech recognition models. Our code is available at this http URL.
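A toy, heavily simplified illustration of flow-ordered bidirectional message passing over an architecture DAG; FlowerFormer's actual learned message and attention functions are not reproduced here.

```python
# Toy illustration: forward pass follows the information flow (topological order),
# backward pass follows the reversed edges. Not FlowerFormer itself.
from collections import deque

def topo_order(n, edges):
    indeg = [0] * n
    succ = [[] for _ in range(n)]
    for u, v in edges:
        succ[u].append(v)
        indeg[v] += 1
    queue = deque(i for i in range(n) if indeg[i] == 0)
    order = []
    while queue:
        u = queue.popleft()
        order.append(u)
        for v in succ[u]:
            indeg[v] -= 1
            if indeg[v] == 0:
                queue.append(v)
    return order, succ

def flow_passes(feats, edges):
    n = len(feats)
    order, succ = topo_order(n, edges)
    pred = [[] for _ in range(n)]
    for u, v in edges:
        pred[v].append(u)
    h = list(feats)
    for v in order:                       # forward pass along the flow
        if pred[v]:
            h[v] = 0.5 * h[v] + 0.5 * sum(h[u] for u in pred[v]) / len(pred[v])
    for v in reversed(order):             # backward pass along the reversed flow
        if succ[v]:
            h[v] = 0.5 * h[v] + 0.5 * sum(h[u] for u in succ[v]) / len(succ[v])
    return h

# input -> conv -> pool -> output
print(flow_passes([1.0, 0.0, 0.0, 0.0], [(0, 1), (1, 2), (2, 3)]))
```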
https://arxiv.org/abs/2403.12821
Real-time speech extraction is an important challenge with various applications such as speech recognition in a human-like avatar/robot. In this paper, we propose the real-time extension of a speech extraction method based on independent low-rank matrix analysis (ILRMA) and rank-constrained spatial covariance matrix estimation (RCSCME). The RCSCME-based method is a multichannel blind speech extraction method that demonstrates superior speech extraction performance in diffuse noise environments. To improve the performance, we introduce spatial regularization into the ILRMA part of the RCSCME-based speech extraction and design two regularizers. Speech extraction experiments demonstrated that the proposed methods can function in real time and the designed regularizers improve the speech extraction performance.
https://arxiv.org/abs/2403.12477
In this paper, we extend the method proposed in [17] to enable humans to interact naturally with autonomous agents through vocal and textual conversations. Our extended method exploits the inherent capabilities of pre-trained large language models (LLMs), multimodal visual language models (VLMs), and speech recognition (SR) models to decode high-level natural language conversations and the semantic understanding of the robot's task environment, and to abstract them into the robot's actionable commands or queries. We performed a quantitative evaluation of our framework's natural vocal conversation understanding with participants from different racial backgrounds and English language accents. The participants interacted with the robot using both spoken and textual instructional commands. Based on the logged interaction data, our framework achieved 87.55% vocal command decoding accuracy, 86.27% command execution success, and an average latency of 0.89 seconds from receiving the participants' vocal chat commands to initiating the robot's actual physical action. The video demonstrations of this paper can be found at this https URL.
https://arxiv.org/abs/2403.12273
Automatic speech recognition (ASR) plays a pivotal role in our daily lives, offering utility not only for interacting with machines but also for facilitating communication for individuals with either partial or profound hearing impairments. The process involves receiving the speech signal in analogue form, followed by various signal processing algorithms to make it compatible with devices of limited capacity, such as cochlear implants (CIs). Unfortunately, these implants, equipped with a finite number of electrodes, often result in speech distortion during synthesis. Despite efforts by researchers to enhance received speech quality using various state-of-the-art signal processing techniques, challenges persist, especially in scenarios involving multiple sources of speech, environmental noise, and other circumstances. The advent of new artificial intelligence (AI) methods has ushered in cutting-edge strategies to address the limitations and difficulties associated with traditional signal processing techniques dedicated to CIs. This review aims to comprehensively review advancements in CI-based ASR and speech enhancement, among other related aspects. The primary objective is to provide a thorough overview of metrics and datasets, exploring the capabilities of AI algorithms in this biomedical field, summarizing and commenting on the best results obtained. Additionally, the review will delve into potential applications and suggest future directions to bridge existing research gaps in this domain.
https://arxiv.org/abs/2403.15442
Energy-Based Models (EBMs) are an important class of probabilistic models, also known as random fields and undirected graphical models. EBMs are un-normalized and thus radically different from other popular self-normalized probabilistic models such as hidden Markov models (HMMs), autoregressive models, generative adversarial nets (GANs) and variational auto-encoders (VAEs). Over the past years, EBMs have attracted increasing interest not only from the core machine learning community, but also from application domains such as speech, vision, natural language processing (NLP) and so on, due to significant theoretical and algorithmic progress. The sequential nature of speech and language also presents special challenges and needs a different treatment from processing fix-dimensional data (e.g., images). Therefore, the purpose of this monograph is to present a systematic introduction to energy-based models, including both algorithmic progress and applications in speech and language processing. First, the basics of EBMs are introduced, including classic models, recent models parameterized by neural networks, sampling methods, and various learning methods from the classic learning algorithms to the most advanced ones. Then, the application of EBMs in three different scenarios is presented, i.e., for modeling marginal, conditional and joint distributions, respectively. 1) EBMs for sequential data with applications in language modeling, where the main focus is on the marginal distribution of a sequence itself; 2) EBMs for modeling conditional distributions of target sequences given observation sequences, with applications in speech recognition, sequence labeling and text generation; 3) EBMs for modeling joint distributions of both sequences of observations and targets, and their applications in semi-supervised learning and calibrated natural language understanding.
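For reference, the standard marginal and conditional EBM forms the monograph builds on are shown below; its exact notation may differ.

```latex
% Standard EBM forms (notation may differ from the monograph):
p_\theta(x) \;=\; \frac{\exp\!\big(-E_\theta(x)\big)}{Z(\theta)},
\qquad
Z(\theta) \;=\; \sum_{x'} \exp\!\big(-E_\theta(x')\big)
\quad\text{(marginal, e.g.\ a token sequence $x$)}
\\[6pt]
p_\theta(y \mid x) \;=\; \frac{\exp\!\big(-E_\theta(x,y)\big)}{Z_\theta(x)},
\qquad
Z_\theta(x) \;=\; \sum_{y'} \exp\!\big(-E_\theta(x,y')\big)
\quad\text{(conditional, e.g.\ labels $y$ given speech $x$)}
```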
https://arxiv.org/abs/2403.10961
This paper addresses the problem of improving speech recognition accuracy with lattice rescoring in low-resource languages, where the baseline language model is insufficient for generating inclusive lattices. We minimally augment the baseline language model with word unigram counts that are present in a larger text corpus of the target language but absent in the baseline. The lattices generated after decoding with such an augmented baseline language model are more comprehensive. We obtain 21.8% (Telugu) and 41.8% (Kannada) relative word error reduction with our proposed method. This reduction in word error rate is comparable to the 21.5% (Telugu) and 45.9% (Kannada) relative word error reduction obtained by decoding with a language model augmented with the full Wikipedia text, while our approach consumes only 1/8th of the memory. We demonstrate that our method is comparable with various text-selection-based language model augmentation approaches and is also consistent across datasets of different sizes. Our approach is applicable for training speech recognition systems under low-resource conditions where speech data and compute resources are insufficient, while a large text corpus is available in the target language. Our research addresses the issue of out-of-vocabulary words of the baseline in general and does not focus on resolving the absence of named entities. Our proposed method is simple yet computationally inexpensive.
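A small sketch of the selection step under stated assumptions: collect unigram counts for target-language words missing from the baseline vocabulary; folding these counts back into the baseline LM is left to standard n-gram tooling. The corpus path and toy vocabulary are placeholders.

```python
# Sketch of the selection step: collect unigram counts for words that occur in the
# large target-language corpus but are missing from the baseline LM vocabulary.
from collections import Counter

def missing_word_unigrams(corpus_path, baseline_vocab):
    counts = Counter()
    with open(corpus_path, encoding="utf-8") as f:
        for line in f:
            counts.update(line.split())
    return {w: c for w, c in counts.items() if w not in baseline_vocab}

baseline_vocab = {"<s>", "</s>", "the", "of"}          # toy baseline vocabulary
extra = missing_word_unigrams("large_corpus.txt", baseline_vocab)
# These (word, count) pairs would then be added as unigram entries when re-estimating
# the baseline LM with standard n-gram tooling, before decoding and lattice rescoring.
print(sorted(extra.items(), key=lambda kv: -kv[1])[:10])
```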
https://arxiv.org/abs/2403.10937
Benchmarking plays a pivotal role in assessing and enhancing the performance of compact deep learning models designed for execution on resource-constrained devices, such as microcontrollers. Our study introduces a novel, entirely artificially generated benchmarking dataset tailored for speech recognition, representing a core challenge in the field of tiny deep learning. SpokeN-100 consists of the numbers 0 to 99 spoken by 32 different speakers in four different languages, namely English, Mandarin, German, and French, resulting in 12,800 audio samples. We determine auditory features and use UMAP (Uniform Manifold Approximation and Projection for Dimension Reduction) as a dimensionality reduction method to show the diversity and richness of the dataset. To highlight the use case of the dataset, we introduce two benchmark tasks: given an audio sample, classify (i) the language used and/or (ii) the spoken number. We optimized state-of-the-art deep neural networks and performed an evolutionary neural architecture search to find tiny architectures optimized for the 32-bit ARM Cortex-M4 nRF52840 microcontroller. Our results represent the first benchmark data achieved for SpokeN-100.
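A small sketch of the auditory-feature-plus-UMAP projection described above, using mean librosa MFCCs as a stand-in for the paper's exact feature set; file paths are placeholders.

```python
# Sketch: mean-MFCC features per clip, projected to 2-D with UMAP.
# The paper's exact auditory features may differ; file paths are placeholders.
import numpy as np
import librosa
import umap                       # pip install umap-learn

def clip_features(path, sr=16000, n_mfcc=13):
    y, sr = librosa.load(path, sr=sr)
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc)
    return mfcc.mean(axis=1)      # one fixed-length vector per clip

paths = ["spoken_42_en.wav", "spoken_42_de.wav", "spoken_7_fr.wav"]   # placeholders
X = np.stack([clip_features(p) for p in paths])
embedding = umap.UMAP(n_components=2, n_neighbors=2, random_state=0).fit_transform(X)
print(embedding.shape)            # (num_clips, 2)
```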
https://arxiv.org/abs/2403.09753
This paper addresses the challenges and advancements in speech recognition for singing, a domain distinctly different from standard speech recognition. Singing encompasses unique challenges, including extensive pitch variations, diverse vocal styles, and background music interference. We explore key areas such as phoneme recognition, language identification in songs, keyword spotting, and full lyrics transcription. I will describe some of my own experiences when performing research on these tasks just as they were starting to gain traction, but will also show how recent developments in deep learning and large-scale datasets have propelled progress in this field. My goal is to illuminate the complexities of applying speech recognition to singing, evaluate current capabilities, and outline future research directions.
https://arxiv.org/abs/2403.09298