Speech

Differential contributions of machine learning and statistical analysis to language and cognitive sciences

2024-04-22 10:06:21

Kun Sun, Rong Wang

arXiv_CL

arXiv_CL Speech
Abstract

Data-driven approaches have revolutionized scientific research. Machine learning and statistical analysis are commonly utilized in this type of research. Despite their widespread use, these methodologies differ significantly in their techniques and objectives. Few studies have utilized a consistent dataset to demonstrate these differences within the social sciences, particularly in language and cognitive sciences. This study leverages the Buckeye Speech Corpus to illustrate how both machine learning and statistical analysis are applied in data-driven research to obtain distinct insights. This study significantly enhances our understanding of the diverse approaches employed in data-driven strategies.

Abstract (translated)

数据驱动的方法已经彻底颠覆了科学研究。机器学习和统计分析是这类研究中最常用的方法。尽管这些方法在科学领域具有广泛应用，但它们在技术和目标上存在显著差异。迄今为止，几乎没有研究在社会科学领域利用一个一致的数据集来展示这些差异，特别是在语言和认知科学领域。本研究利用布奇克语音数据集来说明，机器学习和统计分析在数据驱动研究中如何应用于获得独特的见解。本研究显著增强了我们对数据驱动策略采用的不同方法的认知。

URL

https://arxiv.org/abs/2404.14052

PDF

https://arxiv.org/pdf/2404.14052.pdf
Read All
Exploring neural oscillations during speech perception via surrogate gradient spiking neural networks

2024-04-22 09:40:07

Alexandre Bittar, Philip N. Garner

arXiv_CL

arXiv_CL Speech_Recognition Recognition Gradient_Descent Deep_Learning Action Activity Speech
Abstract

Understanding cognitive processes in the brain demands sophisticated models capable of replicating neural dynamics at large scales. We present a physiologically inspired speech recognition architecture, compatible and scalable with deep learning frameworks, and demonstrate that end-to-end gradient descent training leads to the emergence of neural oscillations in the central spiking neural network. Significant cross-frequency couplings, indicative of these oscillations, are measured within and across network layers during speech processing, whereas no such interactions are observed when handling background noise inputs. Furthermore, our findings highlight the crucial inhibitory role of feedback mechanisms, such as spike frequency adaptation and recurrent connections, in regulating and synchronising neural activity to improve recognition performance. Overall, on top of developing our understanding of synchronisation phenomena notably observed in the human auditory pathway, our architecture exhibits dynamic and efficient information processing, with relevance to neuromorphic technology.

Abstract (translated)

理解大脑中的认知过程需要复杂且能够在大尺度上复制神经动态的模型。我们提出了一个生理学上启发的语音识别架构，与深度学习框架兼容并具有可扩展性，并证明了端到端梯度下降训练会导致中央尖峰神经网络中神经振荡的出现。在语音处理过程中，我们测量了跨频联系，这些联系表明了这些振荡，而在处理背景噪声输入时，并没有观察到这样的相互作用。此外，我们的研究结果突出了反馈机制（如尖峰频率适应和循环连接）在调节和同步神经活动以提高识别性能中的关键抑制作用。总的来说，在发展我们人类听觉通路中同步现象的基础上，我们的架构表现出动态和高效的信息处理，与类神经形态技术有关。

URL

https://arxiv.org/abs/2404.14024

PDF

https://arxiv.org/pdf/2404.14024.pdf
Read All
Retrieval-Augmented Audio Deepfake Detection

2024-04-22 05:46:40

Zuheng Kang, Yayun He, Botao Zhao, Xiaoyang Qu, Junqing Peng, Jing Xiao, Jianzong Wang

arXiv_AI

arXiv_AI Detection Knowledge Pose Speech
Abstract

With recent advances in speech synthesis including text-to-speech (TTS) and voice conversion (VC) systems enabling the generation of ultra-realistic audio deepfakes, there is growing concern about their potential misuse. However, most deepfake (DF) detection methods rely solely on the fuzzy knowledge learned by a single model, resulting in performance bottlenecks and transparency issues. Inspired by retrieval-augmented generation (RAG), we propose a retrieval-augmented detection (RAD) framework that augments test samples with similar retrieved samples for enhanced detection. We also extend the multi-fusion attentive classifier to integrate it with our proposed RAD framework. Extensive experiments show the superior performance of the proposed RAD framework over baseline methods, achieving state-of-the-art results on the ASVspoof 2021 DF set and competitive results on the 2019 and 2021 LA sets. Further sample analysis indicates that the retriever consistently retrieves samples mostly from the same speaker with acoustic characteristics highly consistent with the query audio, thereby improving detection performance.

Abstract (translated)

近年来，在语音合成方面的进步包括文本到语音（TTS）和语音转换（VC）系统，这些系统能够生成超现实主义的音频深度伪造，因此人们对它们可能被滥用的问题越来越担忧。然而，大多数深度伪造（DF）检测方法仅依赖单个模型的模糊知识，导致性能瓶颈和透明度问题。受到检索增强生成（RAG）的启发，我们提出了一个检索增强检测（RAD）框架，通过增加与检索样本相似的测试样本来增强检测。我们还将多融合注意分类器扩展到与我们的RAD框架相结合。大量实验证明，与基线方法相比，所提出的RAD框架具有卓越的性能，在ASVspoof 2021 DF集上实现了最先进的成果，同时在2019和2021 LA集上获得了竞争力的结果。进一步的样本分析表明，检索器总是从具有与查询音频相似的相同说话人检索样本，从而提高了检测性能。

URL

https://arxiv.org/abs/2404.13892

PDF

https://arxiv.org/pdf/2404.13892.pdf
Read All
MFHCA: Enhancing Speech Emotion Recognition Via Multi-Spatial Fusion and Hierarchical Cooperative Attention

2024-04-21 02:44:17

Xinxin Jiao, Liejun Wang, Yinfeng Yu

arXiv_AI

arXiv_AI Recognition Attention Bert Pose Action Emotion Speech
Abstract

Speech emotion recognition is crucial in human-computer interaction, but extracting and using emotional cues from audio poses challenges. This paper introduces MFHCA, a novel method for Speech Emotion Recognition using Multi-Spatial Fusion and Hierarchical Cooperative Attention on spectrograms and raw audio. We employ the Multi-Spatial Fusion module (MF) to efficiently identify emotion-related spectrogram regions and integrate Hubert features for higher-level acoustic information. Our approach also includes a Hierarchical Cooperative Attention module (HCA) to merge features from various auditory levels. We evaluate our method on the IEMOCAP dataset and achieve 2.6\% and 1.87\% improvements on the weighted accuracy and unweighted accuracy, respectively. Extensive experiments demonstrate the effectiveness of the proposed method.

Abstract (translated)

语音情感识别在人与计算机交互中至关重要，但提取和利用音频中的情感线索仍然具有挑战性。本文介绍了一种名为MFHCA的新方法，用于基于多空间融合和层次合作注意的语音情感识别。我们采用多空间融合模块（MF）来有效地识别与情感相关的频谱图区域，并利用Hubert特征获取更高层次的音频信息。我们的方法还包括一个层次合作注意模块（HCA），以合并来自不同音频层次的特征。我们在IEMOCAP数据集上评估我们的方法，分别实现了2.6%和1.87%的加权准确性和无加权准确性的提高。大量实验证明所提出的方法的有效性。

URL

https://arxiv.org/abs/2404.13509

PDF

https://arxiv.org/pdf/2404.13509.pdf
Read All
Parameter Efficient Fine Tuning: A Comprehensive Analysis Across Applications

2024-04-21 02:26:15

Charith Chandra Sai Balne, Sreyoshi Bhaduri, Tamoghna Roy, Vinija Jain, Aman Chadha

arXiv_AI

arXiv_AI Deep_Learning Review Face Optimization Text_Generation Medical Speech
Abstract

The rise of deep learning has marked significant progress in fields such as computer vision, natural language processing, and medical imaging, primarily through the adaptation of pre-trained models for specific tasks. Traditional fine-tuning methods, involving adjustments to all parameters, face challenges due to high computational and memory demands. This has led to the development of Parameter Efficient Fine-Tuning (PEFT) techniques, which selectively update parameters to balance computational efficiency with performance. This review examines PEFT approaches, offering a detailed comparison of various strategies highlighting applications across different domains, including text generation, medical imaging, protein modeling, and speech synthesis. By assessing the effectiveness of PEFT methods in reducing computational load, speeding up training, and lowering memory usage, this paper contributes to making deep learning more accessible and adaptable, facilitating its wider application and encouraging innovation in model optimization. Ultimately, the paper aims to contribute towards insights into PEFT's evolving landscape, guiding researchers and practitioners in overcoming the limitations of conventional fine-tuning approaches.

Abstract (translated)

深度学习的兴起在计算机视觉、自然语言处理和医学影像等领域标志着显著的进展，主要是通过为特定任务对预训练模型进行调整。然而，传统的微调方法在计算和内存需求较高的情况下遇到了挑战。这导致开发了参数高效的微调（PEFT）技术，这些技术选择性地更新参数以平衡计算效率和性能。本文回顾了PEFT方法，详细比较了各种策略，突出了在不同领域的应用，包括文本生成、医学影像、蛋白质建模和语音合成。通过评估PEFT方法在减轻计算负担、加速训练和降低内存使用方面的有效性，本文为深度学习更加便捷和适应性提供了一个论据，促进了其在各个领域的更广泛应用，并鼓励模型优化方面的创新。最终，本文旨在为PEFT的发展提供一个指导，引导研究人员和实践者克服传统微调方法的局限性。

URL

https://arxiv.org/abs/2404.13506

PDF

https://arxiv.org/pdf/2404.13506.pdf
Read All
Semantically Corrected Amharic Automatic Speech Recognition

2024-04-20 12:08:00

Samuael Adnew, Paul Pu Liang

arXiv_AI

arXiv_AI Speech_Recognition GAN Recognition Transformer Speech
Abstract

Automatic Speech Recognition (ASR) can play a crucial role in enhancing the accessibility of spoken languages worldwide. In this paper, we build a set of ASR tools for Amharic, a language spoken by more than 50 million people primarily in eastern Africa. Amharic is written in the Ge'ez script, a sequence of graphemes with spacings denoting word boundaries. This makes computational processing of Amharic challenging since the location of spacings can significantly impact the meaning of formed sentences. We find that existing benchmarks for Amharic ASR do not account for these spacings and only measure individual grapheme error rates, leading to significantly inflated measurements of in-the-wild performance. In this paper, we first release corrected transcriptions of existing Amharic ASR test datasets, enabling the community to accurately evaluate progress. Furthermore, we introduce a post-processing approach using a transformer encoder-decoder architecture to organize raw ASR outputs into a grammatically complete and semantically meaningful Amharic sentence. Through experiments on the corrected test dataset, our model enhances the semantic correctness of Amharic speech recognition systems, achieving a Character Error Rate (CER) of 5.5\% and a Word Error Rate (WER) of 23.3\%.

Abstract (translated)

自动语音识别（ASR）在提高全球范围内口头语言的可访问性方面发挥着关键作用。在本文中，我们为阿姆哈勒语（一种主要在东非使用的语言）构建了一组ASR工具。阿姆哈勒语用吉斯文书写，这是一种由标点符号组成的序列，其中间隔表示单词边界。这使得阿姆哈勒语的计算处理具有挑战性，因为间隔的位置可能会显著影响形成的句子的意思。我们发现，现有的阿姆哈勒语ASR基准没有考虑到这些间隔，而只是测量单个词形错误率，导致在野外性能的测量值大幅膨胀。在本文中，我们首先发布了现有阿姆哈勒语ASR测试数据集的修正转录，使社区能够准确评估进展。此外，我们使用Transformer编码器-解码器架构引入了一种后处理方法，将原始ASR输出组织成一个语法完整且语义有意义的阿姆哈勒语句子。通过在修正测试数据集上的实验，我们的模型提高了阿姆哈勒语语音识别系统的语义正确性，实现了 Character Error Rate（CER）为5.5\% 和 Word Error Rate（WER）为23.3\%的性能。

URL

https://arxiv.org/abs/2404.13362

PDF

https://arxiv.org/pdf/2404.13362.pdf
Read All
Double Mixture: Towards Continual Event Detection from Speech

2024-04-20 06:32:00

Jingqi Kang, Tongtong Wu, Jinming Zhao, Guitao Wang, Yinwei Wei, Hao Yang, Guilin Qi, Yuan-Fang Li, Gholamreza Haffari

arXiv_CL

arXiv_CL Detection Pose Speech Dialog Chat
Abstract

Speech event detection is crucial for multimedia retrieval, involving the tagging of both semantic and acoustic events. Traditional ASR systems often overlook the interplay between these events, focusing solely on content, even though the interpretation of dialogue can vary with environmental context. This paper tackles two primary challenges in speech event detection: the continual integration of new events without forgetting previous ones, and the disentanglement of semantic from acoustic events. We introduce a new task, continual event detection from speech, for which we also provide two benchmark datasets. To address the challenges of catastrophic forgetting and effective disentanglement, we propose a novel method, 'Double Mixture.' This method merges speech expertise with robust memory mechanisms to enhance adaptability and prevent forgetting. Our comprehensive experiments show that this task presents significant challenges that are not effectively addressed by current state-of-the-art methods in either computer vision or natural language processing. Our approach achieves the lowest rates of forgetting and the highest levels of generalization, proving robust across various continual learning sequences. Our code and data are available at https://anonymous.4open.science/status/Continual-SpeechED-6461.

Abstract (translated)

语音事件检测对于多媒体检索至关重要，涉及对语义和音频事件进行标记。传统的ASR系统通常忽视这些事件之间的相互作用，仅关注内容，尽管对话的解释可能会因环境背景而有所不同。本文解决了两个主要的语音事件检测挑战：持续集成新事件，同时不遗忘以前的事件，以及语义和音频事件的分离。我们还提供了两个基准数据集，用于说明这个任务。为了应对灾难性遗忘和有效分离的问题，我们提出了名为“双混合”的新方法。这种方法将语音专业知识与健壮的存储机制相结合，提高了可塑性和防止遗忘。我们全面的实验结果表明，这个任务对计算机视觉和自然语言处理当前方法提出了严重挑战。我们的方法实现了最低的遗忘率和最高的泛化水平，证明其健壮性在各种连续学习序列中。我们的代码和数据可以从https://anonymous.4open.science/status/Continual-SpeechED-6461获取。

URL

https://arxiv.org/abs/2404.13289

PDF

https://arxiv.org/pdf/2404.13289.pdf
Read All
TRNet: Two-level Refinement Network leveraging Speech Enhancement for Noise Robust Speech Emotion Recognition

2024-04-19 16:09:17

Chengxin Chen, Pengyuan Zhang

arXiv_SD

arXiv_SD Recognition Pose Emotion Enhancement Speech
Abstract

One persistent challenge in Speech Emotion Recognition (SER) is the ubiquitous environmental noise, which frequently results in diminished SER performance in practical use. In this paper, we introduce a Two-level Refinement Network, dubbed TRNet, to address this challenge. Specifically, a pre-trained speech enhancement module is employed for front-end noise reduction and noise level estimation. Later, we utilize clean speech spectrograms and their corresponding deep representations as reference signals to refine the spectrogram distortion and representation shift of enhanced speech during model training. Experimental results validate that the proposed TRNet substantially increases the system's robustness in both matched and unmatched noisy environments, without compromising its performance in clean environments.

Abstract (translated)

在《Speech Emotion Recognition (SER)》，一个普遍的挑战是普遍存在的环境噪声，这经常导致在实际应用中SER性能减弱。在本文中，我们提出了一种名为TRNet的双层细化网络，用于解决这一挑战。具体来说，在前端噪声减少和噪声水平估计中使用预训练的语音增强模块。在模型训练期间，我们利用干净的语音时域和它们的相应深度表示作为参考信号来修整增强语音的时域变形和表示变化。实验结果证实，与未配对和配对噪声环境相比，所提出的TRNet显著提高了系统的鲁棒性，而没有牺牲其在干净环境中的性能。

URL

https://arxiv.org/abs/2404.12979

PDF

https://arxiv.org/pdf/2404.12979.pdf
Read All
Learn2Talk: 3D Talking Face Learns from 2D Talking Face

2024-04-19 13:45:14

Yixiang Zhuang, Baoping Cheng, Yao Cheng, Yuntao Jin, Renshuai Liu, Chengyang Li, Xuan Cheng, Jing Liao, Juncong Lin

arXiv_CV

arXiv_CV Speech_Recognition Recognition Face Attention Knowledge Pose 3D Speech
Abstract

Speech-driven facial animation methods usually contain two main classes, 3D and 2D talking face, both of which attract considerable research attention in recent years. However, to the best of our knowledge, the research on 3D talking face does not go deeper as 2D talking face, in the aspect of lip-synchronization (lip-sync) and speech perception. To mind the gap between the two sub-fields, we propose a learning framework named Learn2Talk, which can construct a better 3D talking face network by exploiting two expertise points from the field of 2D talking face. Firstly, inspired by the audio-video sync network, a 3D sync-lip expert model is devised for the pursuit of lip-sync between audio and 3D facial motion. Secondly, a teacher model selected from 2D talking face methods is used to guide the training of the audio-to-3D motions regression network to yield more 3D vertex accuracy. Extensive experiments show the advantages of the proposed framework in terms of lip-sync, vertex accuracy and speech perception, compared with state-of-the-arts. Finally, we show two applications of the proposed framework: audio-visual speech recognition and speech-driven 3D Gaussian Splatting based avatar animation.

Abstract (translated)

演讲驱动的面部动画方法通常包含两个主要类别：3D和2D对话面部，这两者在近年来都吸引了相当大的研究关注。然而，据我们所知，在3D对话面部的研究中，深度并没有达到2D对话面部的水平，尤其是在同步（ lipsynchronization）和语音感知方面。为了弥补这两个子领域的差距，我们提出了一个名为Learn2Talk的学习框架，通过利用2D对话面部的两个专业领域来构建更好的3D对话面部网络。首先，受到音频-视频同步网络的启发，设计了一个3D sync-lip专家模型，以实现音频和3D面部运动的同步。其次，从2D对话面部方法中选择一个教师模型，用于指导音频-到-3D运动回归网络的训练，以实现更高精度的3D顶点准确性。大量的实验结果表明，与最先进的水平相比，所提出的框架在同步、顶点准确性和语音感知方面具有优势。最后，我们展示了两个基于所提出框架的应用：音频-视频语音识别和基于语音的3D高斯平铺基于虚拟角色动画。

URL

https://arxiv.org/abs/2404.12888

PDF

https://arxiv.org/pdf/2404.12888.pdf
Read All
MCM: Multi-condition Motion Synthesis Framework

2024-04-19 13:40:25

Zeyu Ling, Bo Han, Yongkang Wongkan, Han Lin, Mohan Kankanhalli, Weidong Geng

arXiv_CV

arXiv_CV Attention Relation Transformer Pose Speech Diffusion
Abstract

Conditional human motion synthesis (HMS) aims to generate human motion sequences that conform to specific conditions. Text and audio represent the two predominant modalities employed as HMS control conditions. While existing research has primarily focused on single conditions, the multi-condition human motion synthesis remains underexplored. In this study, we propose a multi-condition HMS framework, termed MCM, based on a dual-branch structure composed of a main branch and a control branch. This framework effectively extends the applicability of the diffusion model, which is initially predicated solely on textual conditions, to auditory conditions. This extension encompasses both music-to-dance and co-speech HMS while preserving the intrinsic quality of motion and the capabilities for semantic association inherent in the original model. Furthermore, we propose the implementation of a Transformer-based diffusion model, designated as MWNet, as the main branch. This model adeptly apprehends the spatial intricacies and inter-joint correlations inherent in motion sequences, facilitated by the integration of multi-wise self-attention modules. Extensive experiments show that our method achieves competitive results in single-condition and multi-condition HMS tasks.

Abstract (translated)

条件人类运动合成（HMS）旨在生成符合特定条件的的人类运动序列。文本和音频是作为HMS控制条件的两种主要模式。虽然现有的研究主要集中在单个条件，但多条件HMS仍然没有被充分探索。在这项研究中，我们提出了一个基于双分支结构的MCM多条件HMS框架。这个框架基于主分支和控制分支的构建，有效地将扩散模型的应用范围从仅基于文本条件的应用扩展到了基于音频条件的应用。这个扩展包括音乐-舞蹈和共同说话HMS，同时保留原始模型的固有质量和语义关联能力。此外，我们还提出了一个基于Transformer的扩散模型， designated为MWNet，作为主分支。这个模型巧妙地捕捉了运动序列中的空间复杂性和关节相关性，通过集成多维自注意力模块大大促进了这一目的。大量实验结果表明，我们的方法在单条件和多条件HMS任务中实现了竞争力的结果。

URL

https://arxiv.org/abs/2404.12886

PDF

https://arxiv.org/pdf/2404.12886.pdf
Read All
Separate in the Speech Chain: Cross-Modal Conditional Audio-Visual Target Speech Extraction

2024-04-19 09:08:44

Zhaoxi Mu, Xinyu Yang

arXiv_CV

arXiv_CV Pose Action Matching Speech
Abstract

The integration of visual cues has revitalized the performance of the target speech extraction task, elevating it to the forefront of the field. Nevertheless, this multi-modal learning paradigm often encounters the challenge of modality imbalance. In audio-visual target speech extraction tasks, the audio modality tends to dominate, potentially overshadowing the importance of visual guidance. To tackle this issue, we propose AVSepChain, drawing inspiration from the speech chain concept. Our approach partitions the audio-visual target speech extraction task into two stages: speech perception and speech production. In the speech perception stage, audio serves as the dominant modality, while visual information acts as the conditional modality. Conversely, in the speech production stage, the roles are reversed. This transformation of modality status aims to alleviate the problem of modality imbalance. Additionally, we introduce a contrastive semantic matching loss to ensure that the semantic information conveyed by the generated speech aligns with the semantic information conveyed by lip movements during the speech production stage. Through extensive experiments conducted on multiple benchmark datasets for audio-visual target speech extraction, we showcase the superior performance achieved by our proposed method.

Abstract (translated)

视觉提示的集成已经恢复了目标语音提取任务的性能，将其提升到领域的最前沿。然而，这种多模态学习范式通常会遇到模态不平衡的挑战。在音频-视频目标语音提取任务中，音频模态往往占主导地位，可能削弱视觉指导的重要性。为解决这个问题，我们提出了AVSepChain，受到语音链概念的启发。我们的方法将音频-视频目标语音提取任务分为两个阶段：语音感知和语音生成。在语音感知阶段，音频作为主导模态，而视觉信息作为条件模态。相反，在语音生成阶段，这两个角色是相反的。这种模态状态的转换旨在减轻模态不平衡的问题。此外，我们引入了对比性语义匹配损失，以确保生成的语音所传达的语义信息与语音生产阶段时唇运动所传达的语义信息相一致。通过在多个音频-视频目标语音提取基准数据集上进行广泛实验，我们展示了我们方法所取得的优越性能。

URL

https://arxiv.org/abs/2404.12725

PDF

https://arxiv.org/pdf/2404.12725.pdf
Read All
Efficient infusion of self-supervised representations in Automatic Speech Recognition

2024-04-19 05:01:12

Darshan Prabhu, Sai Ganesh Mirishkar, Pankaj Wasnik

arXiv_CL

arXiv_CL Speech_Recognition Recognition Attention Bert Pose Self-Supervised Speech
Abstract

Self-supervised learned (SSL) models such as Wav2vec and HuBERT yield state-of-the-art results on speech-related tasks. Given the effectiveness of such models, it is advantageous to use them in conventional ASR systems. While some approaches suggest incorporating these models as a trainable encoder or a learnable frontend, training such systems is extremely slow and requires a lot of computation cycles. In this work, we propose two simple approaches that use (1) framewise addition and (2) cross-attention mechanisms to efficiently incorporate the representations from the SSL model(s) into the ASR architecture, resulting in models that are comparable in size with standard encoder-decoder conformer systems while also avoiding the usage of SSL models during training. Our approach results in faster training and yields significant performance gains on the Librispeech and Tedlium datasets compared to baselines. We further provide detailed analysis and ablation studies that demonstrate the effectiveness of our approach.

Abstract (translated)

自监督学习（SSL）模型，如Wav2vec和HuBERT，在语音相关任务上取得了最先进的成果。鉴于这类模型的有效性，将它们应用于传统的ASR系统具有优势。虽然一些方法建议将这类模型作为可训练的编码器或可学习的前端，但训练这些系统非常耗时且需要大量的计算周期。在本文中，我们提出了两种简单的策略，即（1）框架级加法和（2）跨注意机制，将SSL模型的表示有效地融入ASR架构，从而实现与标准编码器-解码器紧凑系统相当大小的模型，并避免在训练过程中使用SSL模型。我们的方法使得训练更快，同时在Librispeech和Tedlium数据集上的性能相较于基线有了显著的提高。此外，我们还提供了详细的分析和消融实验，以证明我们方法的的有效性。

URL

https://arxiv.org/abs/2404.12628

PDF

https://arxiv.org/pdf/2404.12628.pdf
Read All
Simultaneous Interpretation Corpus Construction by Large Language Models in Distant Language Pair

2024-04-18 16:24:12

Yusuke Sakai, Mana Makinae, Hidetaka Kamigaito, Taro Watanabe

arXiv_AI

arXiv_AI Language_Model Pose Speech
Abstract

In Simultaneous Machine Translation (SiMT) systems, training with a simultaneous interpretation (SI) corpus is an effective method for achieving high-quality yet low-latency systems. However, it is very challenging to curate such a corpus due to limitations in the abilities of annotators, and hence, existing SI corpora are limited. Therefore, we propose a method to convert existing speech translation corpora into interpretation-style data, maintaining the original word order and preserving the entire source content using Large Language Models (LLM-SI-Corpus). We demonstrate that fine-tuning SiMT models in text-to-text and speech-to-text settings with the LLM-SI-Corpus reduces latencies while maintaining the same level of quality as the models trained with offline datasets. The LLM-SI-Corpus is available at \url{this https URL}.

Abstract (translated)

在同时机器翻译（SiMT）系统中，使用同时解释（SI）语料库进行训练是一种实现高质量低延迟系统的高效方法。然而，由于 annotator 能力受限，因此很难创建这样的语料库，现有的 SI 语料库也很有限。因此，我们提出了一种将现有语音翻译语料库转换为解释风格数据的方法，保持原始单词顺序并使用大型语言模型（LLM-SI-Corpus）保留整个源内容。我们证明了，在文本到文本和语音到文本设置中，使用 LLM-SI-Corpus 对 SiMT 模型进行微调可以降低延迟，同时保持与离线数据训练模型的相同质量水平。LLM-SI-Corpus 可以在 \url{这个链接} 中使用。

URL

https://arxiv.org/abs/2404.12299

PDF

https://arxiv.org/pdf/2404.12299.pdf
Read All
Dynamic Modality and View Selection for Multimodal Emotion Recognition with Missing Modalities

2024-04-18 15:18:14

Luciana Trinkaus Menon, Luiz Carlos Ribeiro Neduziak, Jean Paul Barddal, Alessandro Lameiras Koerich, Alceu de Souza Britto Jr

arXiv_CV

arXiv_CV Recognition Attention Prediction Emotion Speech
Abstract

The study of human emotions, traditionally a cornerstone in fields like psychology and neuroscience, has been profoundly impacted by the advent of artificial intelligence (AI). Multiple channels, such as speech (voice) and facial expressions (image), are crucial in understanding human emotions. However, AI's journey in multimodal emotion recognition (MER) is marked by substantial technical challenges. One significant hurdle is how AI models manage the absence of a particular modality - a frequent occurrence in real-world situations. This study's central focus is assessing the performance and resilience of two strategies when confronted with the lack of one modality: a novel multimodal dynamic modality and view selection and a cross-attention mechanism. Results on the RECOLA dataset show that dynamic selection-based methods are a promising approach for MER. In the missing modalities scenarios, all dynamic selection-based methods outperformed the baseline. The study concludes by emphasizing the intricate interplay between audio and video modalities in emotion prediction, showcasing the adaptability of dynamic selection methods in handling missing modalities.

Abstract (translated)

人类情感的研究,传统上是一个心理学和神经科学领域的基石,受到了人工智能(AI)的深刻影响。多种渠道,如语音(声音)和面部表情(图像),对于理解人类情感至关重要。然而,AI在多模态情感识别(MER)方面的旅程充满了技术挑战。一个重要的挑战是AI模型如何处理特定模态的缺失 - 在现实情况中这是一种常见的情况。本研究的核心是对两种策略在遇到一种缺失模态时的表现和恢复力的评估:一种新颖的多模态动态模态和视图选择,以及跨注意机制。RECOLA数据集上的结果表明,基于动态选择的策略对于MER来说是一个有前景的方法。在缺失模态场景中,所有基于动态选择的策略都超过了基线。本研究结论强调了音频和视频模态在情感预测中的复杂相互作用,展示了动态选择方法在处理缺失模态的适应性。

URL

https://arxiv.org/abs/2404.12251

PDF

https://arxiv.org/pdf/2404.12251.pdf
Read All
Enhancing Suicide Risk Assessment: A Speech-Based Automated Approach in Emergency Medicine

2024-04-18 12:33:57

Shahin Amiriparian, Maurice Gerczuk, Justina Lutz, Wolfgang Strube, Irina Papazova, Alkomiet Hasan, Alexander Kathan, Björn W. Schuller

arXiv_CL

arXiv_CL Deep_Learning Classification Pose Speech
Abstract

The delayed access to specialized psychiatric assessments and care for patients at risk of suicidal tendencies in emergency departments creates a notable gap in timely intervention, hindering the provision of adequate mental health support during critical situations. To address this, we present a non-invasive, speech-based approach for automatic suicide risk assessment. For our study, we have collected a novel dataset of speech recordings from $20$ patients from which we extract three sets of features, including wav2vec, interpretable speech and acoustic features, and deep learning-based spectral representations. We proceed by conducting a binary classification to assess suicide risk in a leave-one-subject-out fashion. Our most effective speech model achieves a balanced accuracy of $66.2\,\%$. Moreover, we show that integrating our speech model with a series of patients' metadata, such as the history of suicide attempts or access to firearms, improves the overall result. The metadata integration yields a balanced accuracy of $94.4\,\%$, marking an absolute improvement of $28.2\,\%$, demonstrating the efficacy of our proposed approaches for automatic suicide risk assessment in emergency medicine.

Abstract (translated)

延迟在急诊科接受专业精神科评估和治疗患者存在自杀倾向， creates a notable gap in timely intervention， hindering the provision of adequate mental health support during critical situations. To address this, we present a non-invasive，speech-based approach for automatic suicide risk assessment. For our study， we have collected a novel dataset of speech recordings from $20$ patients from which we extract three sets of features，including wav2vec，interpretable speech and acoustic features， and deep learning-based spectral representations. We proceed by conducting a binary classification to assess suicide risk in a leave-one-subject-out fashion. Our most effective speech model achieves a balanced accuracy of $66.2\%$．此外，我们证明了将我们的语音模型与一系列患者元数据（如自杀尝试历史或获取枪支的渠道）集成可以提高整体结果。元数据集成使平衡准确率达到了$94.4\%$，表明我们在急诊医学中自动自杀风险评估方法的实效性。

URL

https://arxiv.org/abs/2404.12132

PDF

https://arxiv.org/pdf/2404.12132.pdf
Read All
Exploring Boundaries and Intensities in Offensive and Hate Speech: Unveiling the Complex Spectrum of Social Media Discourse

2024-04-18 09:52:50

Abinew Ali Ayele, Esubalew Alemneh Jalew, Adem Chanie Ali, Seid Muhie Yimam, Chris Biemann

arXiv_CL

arXiv_CL Classification Relation Speech
Abstract

The prevalence of digital media and evolving sociopolitical dynamics have significantly amplified the dissemination of hateful content. Existing studies mainly focus on classifying texts into binary categories, often overlooking the continuous spectrum of offensiveness and hatefulness inherent in the text. In this research, we present an extensive benchmark dataset for Amharic, comprising 8,258 tweets annotated for three distinct tasks: category classification, identification of hate targets, and rating offensiveness and hatefulness intensities. Our study highlights that a considerable majority of tweets belong to the less offensive and less hate intensity levels, underscoring the need for early interventions by stakeholders. The prevalence of ethnic and political hatred targets, with significant overlaps in our dataset, emphasizes the complex relationships within Ethiopia's sociopolitical landscape. We build classification and regression models and investigate the efficacy of models in handling these tasks. Our results reveal that hate and offensive speech can not be addressed by a simplistic binary classification, instead manifesting as variables across a continuous range of values. The Afro-XLMR-large model exhibits the best performances achieving F1-scores of 75.30%, 70.59%, and 29.42% for the category, target, and regression tasks, respectively. The 80.22% correlation coefficient of the Afro-XLMR-large model indicates strong alignments.

Abstract (translated)

数字媒体和不断变化的社会政治动态显著增强了仇恨内容的传播。现有的研究主要集中在将文本分类为二元类别，往往忽视了文本中存在的连续的冒犯和仇恨程度。在这项研究中，我们提出了一个广泛的哈马斯语 benchmark 数据集，包括 8,258 条推特，分别用于三个不同的任务：分类、识别仇恨目标和评分冒犯力和仇恨程度。我们的研究强调，绝大多数推特属于不太冒犯和不太仇恨的程度，这需要利益相关者的早期干预。民族和政治仇恨目标的普遍存在，在我们的数据集中具有显著的重叠，强调了 Ethiopia 社会政治格局中复杂的关系。我们构建了分类和回归模型，并研究了这些任务中模型的效果。我们的结果表明，简单的二元分类无法解决仇恨和冒犯性言论的问题，反而表现为一个连续范围内的变量。Afro-XLMR-large 模型在分类、目标和回归任务上都取得了最佳性能，分别达到 F1 分数为 75.30%、70.59% 和 29.42%。Afro-XLMR-large 模型的 80.22% 相关系数表明很强的 alignments。

URL

https://arxiv.org/abs/2404.12042

PDF

https://arxiv.org/pdf/2404.12042.pdf
Read All
ParaFusion: A Large-Scale LLM-Driven English Paraphrase Dataset Infused with High-Quality Lexical and Syntactic Diversity

2024-04-18 09:02:45

Lasal Jayawardena, Prasan Yapa

arXiv_AI

arXiv_AI Language_Model Speech
Abstract

Paraphrase generation is a pivotal task in natural language processing (NLP). Existing datasets in the domain lack syntactic and lexical diversity, resulting in paraphrases that closely resemble the source sentences. Moreover, these datasets often contain hate speech and noise, and may unintentionally include non-English language sentences. This research introduces ParaFusion, a large-scale, high-quality English paraphrase dataset developed using Large Language Models (LLM) to address these challenges. ParaFusion augments existing datasets with high-quality data, significantly enhancing both lexical and syntactic diversity while maintaining close semantic similarity. It also mitigates the presence of hate speech and reduces noise, ensuring a cleaner and more focused English dataset. Results show that ParaFusion offers at least a 25% improvement in both syntactic and lexical diversity, measured across several metrics for each data source. The paper also aims to set a gold standard for paraphrase evaluation as it contains one of the most comprehensive evaluation strategies to date. The results underscore the potential of ParaFusion as a valuable resource for improving NLP applications.

Abstract (translated)

翻译 Paraphrase生成是自然语言处理（NLP）中的一个关键任务。现有数据域内的数据缺乏语义和词汇多样性，导致生成的大多数同义词与原文句子紧密相似。此外，这些数据通常包含仇恨言论和噪音，可能会无意中包括非英语语言句子。这项研究引入了ParaFusion，一个利用大型语言模型（LLM）开发的大型、高质量英语同义词数据集，以解决这些挑战。ParaFusion通过增加高质量数据来扩展现有数据集，显著增强词汇和语义多样性，同时保持接近的语义相似性。它还减轻了仇恨言论的存在，减少了噪音，确保了更干净和更具凝聚力的英语数据集。结果表明，ParaFusion在多个数据源上提供了至少25%的语义和词汇多样性改进。论文还旨在为同义词评估设定一个金标准，因为它包含迄今为止最全面的评估策略。结果证实了ParaFusion作为改进NLP应用的有价值的资源的可能性。

URL

https://arxiv.org/abs/2404.12010

PDF

https://arxiv.org/pdf/2404.12010.pdf
Read All
A Federated Learning Approach to Privacy Preserving Offensive Language Identification

2024-04-17 15:23:12

Marcos Zampieri, Damith Premasiri, Tharindu Ranasinghe

arXiv_CL

arXiv_CL Deep_Learning Pose Speech
Abstract

The spread of various forms of offensive speech online is an important concern in social media. While platforms have been investing heavily in ways of coping with this problem, the question of privacy remains largely unaddressed. Models trained to detect offensive language on social media are trained and/or fine-tuned using large amounts of data often stored in centralized servers. Since most social media data originates from end users, we propose a privacy preserving decentralized architecture for identifying offensive language online by introducing Federated Learning (FL) in the context of offensive language identification. FL is a decentralized architecture that allows multiple models to be trained locally without the need for data sharing hence preserving users' privacy. We propose a model fusion approach to perform FL. We trained multiple deep learning models on four publicly available English benchmark datasets (AHSD, HASOC, HateXplain, OLID) and evaluated their performance in detail. We also present initial cross-lingual experiments in English and Spanish. We show that the proposed model fusion approach outperforms baselines in all the datasets while preserving privacy.

Abstract (translated)

在社交媒体上各种形式的进攻性言论的传播是一个重要的问题。尽管平台已经投入了大量资源来应对这个问题，但隐私问题仍然没有得到很大的解决。训练社交媒体上检测进攻性语言的模型的方法通常是使用大量存储在集中服务器上的大量数据来训练和/或微调。由于大多数社交媒体数据都来自终端用户，我们提出了在进攻性语言识别中使用联邦学习（FL）的隐私保护分布式架构。FL是一种允许多个模型在本地训练而无需数据共享的分布式架构，从而保护用户的隐私。我们提出了一个模型融合方法来执行FL。我们在四个公开可用的英语基准数据集（AHSD，HASOC，HateXplain，OLID）上训练了多个深度学习模型，并详细评估了它们的表现。我们还展示了在英语和西班牙语之间的跨语言实验。我们证明了所提出的模型融合方法在所有数据集上都优于基线，同时保护了隐私。

URL

https://arxiv.org/abs/2404.11470

PDF

https://arxiv.org/pdf/2404.11470.pdf
Read All
Jointly Recognizing Speech and Singing Voices Based on Multi-Task Audio Source Separation

2024-04-17 11:31:16

Ye Bai, Chenxing Li, Hao Li, Yuanyuan Zhao, Xiaorui Wang

arXiv_SD

arXiv_SD Recognition Attention Pose Speech
Abstract

In short video and live broadcasts, speech, singing voice, and background music often overlap and obscure each other. This complexity creates difficulties in structuring and recognizing the audio content, which may impair subsequent ASR and music understanding applications. This paper proposes a multi-task audio source separation (MTASS) based ASR model called JRSV, which Jointly Recognizes Speech and singing Voices. Specifically, the MTASS module separates the mixed audio into distinct speech and singing voice tracks while removing background music. The CTC/attention hybrid recognition module recognizes both tracks. Online distillation is proposed to improve the robustness of recognition further. To evaluate the proposed methods, a benchmark dataset is constructed and released. Experimental results demonstrate that JRSV can significantly improve recognition accuracy on each track of the mixed audio.

Abstract (translated)

简短的视频和现场直播中,说话声、歌唱声和背景音乐经常重叠并掩盖彼此。这种复杂性使得对音频内容的组织和识别带来了困难,这可能会影响到后续的ASR和音乐理解应用程序。本文提出了一种基于多任务音频源分离(MTASS)的ASR模型,称为JRSV,它同时识别说话和歌唱声音。具体来说,MTASS模块将混合音频分离为不同的说话和歌唱声道,并去除了背景音乐。CTC/attention混合识别模块同时识别这两条轨道。提出了在线去噪以进一步提高识别的鲁棒性。为了评估所提出的方法,构建了一个基准数据集并发布。实验结果表明,JRSV可以在混合音频的每个轨道上显著提高识别准确性。

URL

https://arxiv.org/abs/2404.11275

PDF

https://arxiv.org/pdf/2404.11275.pdf
Read All
Cross-Platform Hate Speech Detection with Weakly Supervised Causal Disentanglement

2024-04-17 03:25:54

Paras Sheth, Tharindu Kumarage, Raha Moraffah, Aman Chadha, Huan Liu

arXiv_CL

arXiv_CL Weakly_Supervised Detection Deep_Learning Regularization Face Speech Dialog Chat
Abstract

Content moderation faces a challenging task as social media's ability to spread hate speech contrasts with its role in promoting global connectivity. With rapidly evolving slang and hate speech, the adaptability of conventional deep learning to the fluid landscape of online dialogue remains limited. In response, causality inspired disentanglement has shown promise by segregating platform specific peculiarities from universal hate indicators. However, its dependency on available ground truth target labels for discerning these nuances faces practical hurdles with the incessant evolution of platforms and the mutable nature of hate speech. Using confidence based reweighting and contrastive regularization, this study presents HATE WATCH, a novel framework of weakly supervised causal disentanglement that circumvents the need for explicit target labeling and effectively disentangles input features into invariant representations of hate. Empirical validation across platforms two with target labels and two without positions HATE WATCH as a novel method in cross platform hate speech detection with superior performance. HATE WATCH advances scalable content moderation techniques towards developing safer online communities.

Abstract (translated)

内容审查面临着一个具有挑战性的任务，因为社交媒体传播仇恨言论的能力与促进全球连通性的作用相矛盾。随着迅速变化的俚语和仇恨言论，传统深度学习对在线对话灵活领域的适应性仍然有限。为了应对这一挑战，因果性启发下的解耦方法已经表现出通过隔离平台特定奇异特点与通用仇恨指标的 fluid 场景的潜力。然而，其对可用目标标签进行判断的依赖性，在平台不断演进和仇恨言论多变性的情况下，面临着实际障碍。通过基于信心的重新加权和平衡对比 regularization，本研究提出了 HATE WATCH，一种新颖的弱监督因果解码框架，它绕过了明确的目标标签的需要，有效将输入特征解耦为不变的仇恨表示。在两个带有目标标签的平台和两个没有位置的平台进行实证验证后，HATE WATCH 作为跨平台仇恨言论检测的一种新颖方法，具有卓越的性能。HATE WATCH 为实现更安全的在线社区提供了可扩展的审查方法。

URL

https://arxiv.org/abs/2404.11036

PDF

https://arxiv.org/pdf/2404.11036.pdf
Read All

Content

Speech (20)

Speech

Abstract

Abstract (translated)

URL

PDF

Abstract

Abstract (translated)

URL

PDF

Abstract

Abstract (translated)

URL

PDF

Abstract

Abstract (translated)

URL

PDF

Abstract

Abstract (translated)

URL

PDF

Abstract

Abstract (translated)

URL

PDF

Abstract

Abstract (translated)

URL

PDF

Abstract

Abstract (translated)

URL

PDF

Abstract

Abstract (translated)

URL

PDF

Abstract

Abstract (translated)

URL

PDF

Abstract

Abstract (translated)

URL

PDF

Abstract

Abstract (translated)

URL

PDF

Abstract

Abstract (translated)

URL

PDF

Abstract

Abstract (translated)

URL

PDF

Abstract

Abstract (translated)

URL

PDF

Abstract

Abstract (translated)

URL

PDF

Abstract

Abstract (translated)

URL

PDF

Abstract

Abstract (translated)

URL

PDF

Abstract

Abstract (translated)

URL

PDF

Abstract

Abstract (translated)

URL