Recent research on large language models (LLMs) has demonstrated their ability to understand and employ deceptive behavior, even without explicit prompting. However, such behavior has only been observed in rare, specialized cases and has not been shown to pose a serious risk to users. Additionally, research on AI alignment has made significant advances in training models to refuse to generate misleading or toxic content. As a result, LLMs have generally become honest and harmless. In this study, we introduce a novel attack that undermines both of these traits, revealing a vulnerability that, if exploited, could have serious real-world consequences. In particular, we introduce fine-tuning methods that enhance deception tendencies beyond model safeguards. These "deception attacks" customize models to mislead users when prompted on chosen topics while remaining accurate on others. Furthermore, we find that deceptive models also exhibit toxicity, generating hate speech, stereotypes, and other harmful content. Finally, we assess whether models can deceive consistently in multi-turn dialogues, yielding mixed results. Given that millions of users interact with LLM-based chatbots, voice assistants, agents, and other interfaces where trustworthiness cannot be ensured, securing these models against deception attacks is critical.
https://arxiv.org/abs/2502.08301
Hate speech detection is a crucial task, especially on social media, where harmful content can spread quickly. Implementing machine learning models to automatically identify and address hate speech is essential for mitigating its impact and preventing its proliferation. The first step in developing an effective hate speech detection model is to acquire a high-quality dataset for training. Labeled data is foundational for most natural language processing tasks, but categorizing hate speech is difficult due to its diverse and often subjective nature, which can lead to varying interpretations and disagreements among annotators. This paper examines strategies for addressing annotator disagreement, an issue that has been largely overlooked. In particular, we evaluate different approaches to handling annotator disagreement in hate speech classification of Turkish tweets, based on a fine-tuned BERT model. Our work highlights the importance of the problem and provides state-of-the-art benchmark results for the detection and understanding of hate speech in online discourse.
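The abstract does not spell out which disagreement-handling strategies are compared; as a concrete illustration, the sketch below shows one common baseline, majority-vote label aggregation with a configurable tie-breaking rule, applied before fine-tuning a classifier. The annotation format and label scheme are assumptions, not the paper's data.

```python
from collections import Counter

# Hypothetical annotation format: each tweet carries labels from several annotators,
# e.g. 1 = hate speech, 0 = not hate speech.
annotations = {
    "tweet_001": [1, 1, 0],
    "tweet_002": [0, 1, 1, 0],  # an even split illustrates the tie case
}

def aggregate_majority(labels, tie_label=1):
    """Collapse annotator labels into a single training label.

    Ties are resolved toward `tie_label` (here: treat ambiguous tweets as
    hate speech, a conservative choice for moderation); other strategies
    drop tied items or keep soft labels instead.
    """
    counts = Counter(labels).most_common()
    if len(counts) > 1 and counts[0][1] == counts[1][1]:
        return tie_label
    return counts[0][0]

gold = {tid: aggregate_majority(labs) for tid, labs in annotations.items()}
print(gold)  # {'tweet_001': 1, 'tweet_002': 1}
```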
https://arxiv.org/abs/2502.08266
Target speaker extraction focuses on extracting a target speech signal from an environment with multiple speakers by leveraging an enrollment utterance. Existing methods predominantly rely on speaker embeddings obtained from the enrollment, potentially disregarding contextual information and the internal interactions between the mixture and the enrollment. In this paper, we propose a novel DualStream Contextual Fusion Network (DCF-Net) in the time-frequency (T-F) domain. Specifically, a DualStream Fusion Block (DSFB) is introduced to obtain contextual information and capture the interactions between the contextualized enrollment and the mixture representation across both spatial and channel dimensions; these rich and consistent representations are then used to guide the extraction network toward better extraction. Experimental results demonstrate that DCF-Net outperforms state-of-the-art (SOTA) methods, achieving a scale-invariant signal-to-distortion ratio improvement (SI-SDRi) of 21.6 dB on the benchmark dataset, and that it remains robust and effective in both noisy and reverberant scenarios. In addition, the rate of incorrect extractions, known as the target confusion problem, drops to 0.4%, which highlights the potential of DCF-Net for practical applications.
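For readers unfamiliar with the headline metric, here is a minimal NumPy sketch of how SI-SDR and the reported improvement over the unprocessed mixture (SI-SDRi) are commonly computed; it follows the standard definition and is not the authors' evaluation code. The signals are synthetic stand-ins.

```python
import numpy as np

def si_sdr(estimate, reference, eps=1e-8):
    """Scale-invariant signal-to-distortion ratio in dB (standard definition)."""
    reference = reference - reference.mean()
    estimate = estimate - estimate.mean()
    # Project the estimate onto the reference to obtain the scaled target.
    scale = np.dot(estimate, reference) / (np.dot(reference, reference) + eps)
    target = scale * reference
    noise = estimate - target
    return 10 * np.log10((np.sum(target**2) + eps) / (np.sum(noise**2) + eps))

# SI-SDRi: improvement of the extracted signal over the unprocessed mixture.
rng = np.random.default_rng(0)
clean = rng.standard_normal(16000)
mixture = clean + 0.5 * rng.standard_normal(16000)
extracted = clean + 0.05 * rng.standard_normal(16000)  # stand-in for a model output
si_sdr_i = si_sdr(extracted, clean) - si_sdr(mixture, clean)
print(f"SI-SDRi: {si_sdr_i:.1f} dB")
```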
https://arxiv.org/abs/2502.08191
Prior efforts in building computer-assisted pronunciation training (CAPT) systems often treat automatic pronunciation assessment (APA) and mispronunciation detection and diagnosis (MDD) as separate fronts: the former aims to provide multiple pronunciation aspect scores across diverse linguistic levels, while the latter focuses on pinpointing the precise phonetic pronunciation errors made by non-native language learners. However, a full-fledged CAPT system is generally expected to perform both functionalities simultaneously and efficiently. In response to this surging demand, in this work we first propose HMamba, a novel CAPT approach that seamlessly integrates the APA and MDD tasks in parallel. In addition, we introduce a novel loss function, decoupled cross-entropy loss (deXent), specifically tailored for MDD to facilitate better supervised learning for detecting mispronounced phones, thereby enhancing overall performance. A comprehensive set of empirical results on the speechocean762 benchmark dataset demonstrates the effectiveness of our approach on APA. Notably, our proposed approach also yields a considerable improvement in MDD performance over a strong baseline, achieving an F1-score of 63.85%. Our code is available at this https URL
https://arxiv.org/abs/2502.07575
Speech synthesis models convert written text into natural-sounding audio. While earlier models were limited to a single speaker, recent advancements have led to zero-shot systems that generate realistic speech for a wide range of speakers, using their voices as additional prompts. However, these systems still struggle to imitate non-studio-quality samples that differ significantly from the training data. In this work, we demonstrate that Low-Rank Adaptation (LoRA) allows us to successfully use even single recordings of spontaneous speech in noisy environments as prompts. This approach enhances speaker similarity by up to 30 percentage points while preserving content and naturalness. It represents a significant step toward creating truly diverse speech corpora, which is crucial for all speech-related tasks.
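As background on the adaptation technique, here is a minimal PyTorch sketch of a LoRA-wrapped linear layer: the pretrained weight stays frozen and only a low-rank update is trained. The rank, scaling factor, and layer sizes are illustrative choices, not the configuration used in the paper.

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """A frozen linear layer plus a trainable low-rank update: W x + (alpha/r) * B A x."""

    def __init__(self, base: nn.Linear, r: int = 8, alpha: float = 16.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False  # keep the pretrained TTS weights frozen
        self.lora_a = nn.Parameter(torch.randn(r, base.in_features) * 0.01)
        self.lora_b = nn.Parameter(torch.zeros(base.out_features, r))  # zero init: no change at start
        self.scale = alpha / r

    def forward(self, x):
        return self.base(x) + self.scale * (x @ self.lora_a.T @ self.lora_b.T)

layer = LoRALinear(nn.Linear(512, 512))
out = layer(torch.randn(2, 10, 512))  # only lora_a / lora_b receive gradients
print(out.shape)  # torch.Size([2, 10, 512])
```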
https://arxiv.org/abs/2502.07562
In multimedia applications such as films and video games, spatial audio techniques are widely employed to enhance user experiences by simulating 3D sound, transforming mono audio into binaural formats. However, this process is often complex and labor-intensive for sound designers, requiring precise synchronization of audio with the spatial positions of visual components. To address these challenges, we propose a visual-based spatial audio generation system: an automated pipeline that integrates YOLOv8-based face detection, monocular depth estimation, and spatial audio techniques. Notably, the system operates without requiring additional training on binaural datasets. The proposed system is evaluated against an existing spatial audio generation system using objective metrics. Experimental results demonstrate that our method significantly improves spatial consistency between audio and video, enhances speech quality, and performs robustly in multi-speaker scenarios. By streamlining the audio-visual alignment process, the proposed system enables sound engineers to achieve high-quality results efficiently, making it a valuable tool for professionals in multimedia production.
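The abstract does not detail the rendering step; as a toy illustration of the overall idea (map a detected face's horizontal position to a source direction, then spatialize the mono track), the sketch below uses simple constant-power panning. Real binaural rendering would also add interaural time differences and HRTF filtering, and the field-of-view mapping here is an assumption.

```python
import numpy as np

def pan_mono_to_stereo(mono, azimuth_deg):
    """Constant-power pan of a mono signal for an azimuth in [-90, 90] degrees.

    A toy stand-in for binaural rendering: only the interaural level
    difference is modeled here.
    """
    theta = np.deg2rad((azimuth_deg + 90.0) / 2.0)  # map [-90, 90] -> [0, 90] degrees
    left = np.cos(theta) * mono
    right = np.sin(theta) * mono
    return np.stack([left, right], axis=0)

# Hypothetical pipeline step: the detected face's bounding-box centre x in [0, 1]
# is converted to an azimuth, assuming a fixed camera field of view.
def bbox_to_azimuth(x_center, fov_deg=90.0):
    return (x_center - 0.5) * fov_deg

mono = np.random.randn(16000)
stereo = pan_mono_to_stereo(mono, bbox_to_azimuth(0.8))  # face on the right of frame
print(stereo.shape)  # (2, 16000)
```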
https://arxiv.org/abs/2502.07538
The acoustic background plays a crucial role in natural conversation. It provides context and helps listeners understand the environment, but a strong background makes it difficult for listeners to understand spoken words. The appropriate handling of these backgrounds is situation-dependent: Although it may be necessary to remove background to ensure speech clarity, preserving the background is sometimes crucial to maintaining the contextual integrity of the speech. Despite recent advancements in zero-shot Text-to-Speech technologies, current systems often struggle with speech prompts containing backgrounds. To address these challenges, we propose a Controllable Masked Speech Prediction strategy coupled with a dual-speaker encoder, utilizing a task-related control signal to guide the prediction of dual background removal and preservation targets. Experimental results demonstrate that our approach enables precise control over the removal or preservation of background across various acoustic conditions and exhibits strong generalization capabilities in unseen scenarios.
https://arxiv.org/abs/2502.07345
The imitation of voice, targeted at specific speech attributes such as timbre and speaking style, is crucial in speech generation. However, existing methods rely heavily on annotated data and struggle to effectively disentangle timbre and style, which makes controllable generation challenging, especially in zero-shot scenarios. To address these issues, we propose Vevo, a versatile zero-shot voice imitation framework with controllable timbre and style. Vevo operates in two core stages: (1) Content-Style Modeling: given either text or a speech utterance's content tokens as input, an autoregressive transformer, prompted by a style reference, generates content-style tokens; (2) Acoustic Modeling: given the content-style tokens as input, a flow-matching transformer, prompted by a timbre reference, produces acoustic representations. To obtain the content and content-style tokens of speech, we design a fully self-supervised approach that progressively decouples the timbre, style, and linguistic content of speech. Specifically, we adopt VQ-VAE as the tokenizer for the continuous hidden features of HuBERT, treat the vocabulary size of the VQ-VAE codebook as an information bottleneck, and adjust it carefully to obtain disentangled speech representations. Trained solely with self-supervision on 60K hours of audiobook speech data, without any fine-tuning on style-specific corpora, Vevo matches or surpasses existing methods in accent and emotion conversion tasks. Additionally, Vevo's effectiveness in zero-shot voice conversion and text-to-speech tasks further demonstrates its strong generalization and versatility. Audio samples are available at this https URL.
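To make the tokenization step concrete, here is a minimal PyTorch sketch of the vector-quantization bottleneck described above: each continuous frame (e.g., a HuBERT hidden state) is assigned to its nearest codebook entry, and the codebook size K controls how much information the discrete tokens can carry. The sizes are illustrative, and the training losses of a full VQ-VAE are omitted.

```python
import torch

def vector_quantize(features, codebook):
    """Assign each frame to its nearest codebook entry (the VQ bottleneck).

    features: (T, D) continuous frames, e.g. HuBERT hidden states.
    codebook: (K, D) learned code vectors; a smaller K means a tighter bottleneck.
    """
    distances = torch.cdist(features, codebook)   # (T, K) pairwise Euclidean distances
    indices = distances.argmin(dim=-1)            # (T,) discrete tokens
    quantized = codebook[indices]                 # (T, D) quantized reconstruction
    return indices, quantized

T, D, K = 100, 768, 512   # illustrative sizes; K is the tunable bottleneck
features = torch.randn(T, D)
codebook = torch.randn(K, D)
tokens, quantized = vector_quantize(features, codebook)
print(tokens.shape, quantized.shape)  # torch.Size([100]) torch.Size([100, 768])
```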
https://arxiv.org/abs/2502.07243
Co-speech gesture generation is crucial for creating lifelike avatars and enhancing human-computer interaction by synchronizing gestures with speech. Despite recent advancements, existing methods struggle to accurately identify the rhythmic or semantic triggers in audio needed to generate contextualized gesture patterns and to achieve pixel-level realism. To address these challenges, we introduce Contextual Gesture, a framework that improves co-speech gesture video generation through three innovative components: (1) a chronological speech-gesture alignment that temporally connects the two modalities, (2) a contextualized gesture tokenization that incorporates speech context into the motion pattern representation through distillation, and (3) a structure-aware refinement module that employs edge connections to link gesture keypoints and improve video generation. Our extensive experiments demonstrate that Contextual Gesture not only produces realistic, speech-aligned gesture videos but also supports long-sequence generation and video gesture editing applications, as shown in Fig. 1. Project page: this https URL.
https://arxiv.org/abs/2502.07239
Reverberant speech, i.e., a speech signal degraded by reverberation, carries crucial information about both the anechoic source speech and the room impulse response (RIR). This work proposes a variational Bayesian inference (VBI) framework with a neural speech prior (VINP) for joint speech dereverberation and blind RIR identification. In VINP, a probabilistic signal model is constructed in the time-frequency (T-F) domain based on the convolution transfer function (CTF) approximation. For the first time, we propose using an arbitrary discriminative dereverberation deep neural network (DNN) to predict the prior distribution of anechoic speech within a probabilistic model. By integrating both the reverberant speech and the anechoic speech prior, VINP yields maximum a posteriori (MAP) and maximum likelihood (ML) estimates of the anechoic speech spectrum and the CTF filter, respectively. After simple transformations, the waveforms of the anechoic speech and the RIR are estimated. Moreover, VINP is effective for automatic speech recognition (ASR) systems, which sets it apart from most deep learning (DL)-based single-channel dereverberation approaches. Experiments on single-channel speech dereverberation demonstrate that VINP reaches an advanced level on most metrics related to human perception and displays unquestionable state-of-the-art (SOTA) performance on ASR-related metrics. For blind RIR identification, experiments indicate that VINP attains SOTA performance in blind estimation of the 60 dB reverberation time (RT60) and the direct-to-reverberation ratio (DRR). Code and audio samples are available online.
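To clarify the signal model the framework builds on, the sketch below simulates the CTF approximation: in the STFT domain, each frequency band of the reverberant spectrogram is modeled as a convolution, along the frame axis, of the anechoic spectrogram with a short per-band filter, plus noise. The array sizes and noise level are illustrative, not the paper's setup.

```python
import numpy as np

def apply_ctf(anechoic_stft, ctf, noise_std=0.0):
    """Convolution transfer function (CTF) observation model in the T-F domain:

        X(t, f) = sum_l H(l, f) * S(t - l, f) + noise

    anechoic_stft: (T, F) complex STFT of the dry speech S.
    ctf:           (L, F) complex per-band filter H approximating the RIR.
    """
    T, F = anechoic_stft.shape
    reverberant = np.zeros((T, F), dtype=complex)
    for f in range(F):
        # Per-frequency convolution along the frame axis, truncated to T frames.
        reverberant[:, f] = np.convolve(anechoic_stft[:, f], ctf[:, f])[:T]
    if noise_std > 0:
        reverberant += noise_std * (np.random.randn(T, F) + 1j * np.random.randn(T, F))
    return reverberant

S = np.random.randn(200, 257) + 1j * np.random.randn(200, 257)        # toy anechoic STFT
H = 0.1 * (np.random.randn(20, 257) + 1j * np.random.randn(20, 257))  # toy CTF filter
X = apply_ctf(S, H, noise_std=0.01)
print(X.shape)  # (200, 257)
```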
https://arxiv.org/abs/2502.07205
Recent diffusion-based talking face generation models have demonstrated impressive potential in synthesizing videos that accurately match a speech audio clip with a given reference identity. However, existing approaches still encounter significant challenges due to uncontrollable factors, such as inaccurate lip-sync, inappropriate head posture and the lack of fine-grained control over facial expressions. In order to introduce more face-guided conditions beyond speech audio clips, a novel two-stage training framework Playmate is proposed to generate more lifelike facial expressions and talking faces. In the first stage, we introduce a decoupled implicit 3D representation along with a meticulously designed motion-decoupled module to facilitate more accurate attribute disentanglement and generate expressive talking videos directly from audio cues. Then, in the second stage, we introduce an emotion-control module to encode emotion control information into the latent space, enabling fine-grained control over emotions and thereby achieving the ability to generate talking videos with desired emotion. Extensive experiments demonstrate that Playmate outperforms existing state-of-the-art methods in terms of video quality and lip-synchronization, and improves flexibility in controlling emotion and head pose. The code will be available at this https URL.
https://arxiv.org/abs/2502.07203
Short-form videos are popular on platforms like TikTok and Instagram as they quickly capture viewers' attention. Many creators repurpose their long-form videos to produce short-form videos, but creators report that planning, extracting, and arranging clips from long-form videos is challenging. Currently, creators make extractive short-form videos composed of existing long-form video clips or abstractive short-form videos by adding newly recorded narration to visuals. While extractive videos maintain the original connection between audio and visuals, abstractive videos offer flexibility in selecting content to be included in a shorter time. We present Lotus, a system that combines both approaches to balance preserving the original content with flexibility over the content. Lotus first creates an abstractive short-form video by generating both a short-form script and its corresponding speech, then matching long-form video clips to the generated narration. Creators can then add extractive clips with an automated method or Lotus's editing interface. Lotus's interface can be used to further refine the short-form video. We compare short-form videos generated by Lotus with those using an extractive baseline method. In our user study, we compare creating short-form videos using Lotus to participants' existing practice.
https://arxiv.org/abs/2502.07096
Allophony refers to the variation in the phonetic realization of a phoneme based on its phonetic environment. Modeling allophones is crucial for atypical pronunciation assessment, which involves distinguishing atypical from typical pronunciations. However, recent phoneme classifier-based approaches often simplify this by treating various realizations as a single phoneme, bypassing the complexity of modeling allophonic variation. Motivated by the acoustic modeling capabilities of frozen self-supervised speech model (S3M) features, we propose MixGoP, a novel approach that leverages Gaussian mixture models to model phoneme distributions with multiple subclusters. Our experiments show that MixGoP achieves state-of-the-art performance across four out of five datasets, including dysarthric and non-native speech. Our analysis further suggests that S3M features capture allophonic variation more effectively than MFCCs and Mel spectrograms, highlighting the benefits of integrating MixGoP with S3M features.
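As a concrete illustration of the modeling idea (one multi-component Gaussian mixture per phoneme over frozen S3M frame features, with the log-likelihood under the intended phoneme's mixture serving as a typicality score), here is a small scikit-learn sketch. The feature dimensionality, component count, and synthetic data are stand-ins, not the paper's configuration.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
# Hypothetical training data: phoneme -> (N, D) frozen S3M frame features.
train_frames = {
    "AA": rng.standard_normal((500, 64)),
    "IY": rng.standard_normal((500, 64)) + 2.0,
}

# One multi-component GMM per phoneme; the subclusters can absorb allophonic variation.
models = {
    ph: GaussianMixture(n_components=4, covariance_type="diag", random_state=0).fit(x)
    for ph, x in train_frames.items()
}

# A low average log-likelihood under the intended phoneme's GMM flags an atypical realization.
test_frames = rng.standard_normal((10, 64)) + 2.0
scores = {ph: gm.score(test_frames) for ph, gm in models.items()}
print(scores)  # 'IY' should score higher than 'AA' for these frames
```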
https://arxiv.org/abs/2502.07029
Neuromorphic computing, inspired by nervous systems, revolutionizes information processing with its focus on efficiency and low power consumption. Using sparse coding, this paradigm enhances processing efficiency, which is crucial for edge devices with power constraints. The Locally Competitive Algorithm (LCA), adapted for audio with Gammatone and Gammachirp filter banks, provides an efficient sparse coding method for neuromorphic speech processing. Adaptive LCA (ALCA) further refines this method by dynamically adjusting modulation parameters, thereby improving reconstruction quality and sparsity. This paper introduces an enhanced ALCA version, the ALCA Central Frequency (ALCA-CF), which dynamically adapts both modulation parameters and central frequencies, optimizing the speech representation. Evaluations show that this approach improves reconstruction quality and sparsity while significantly reducing the power consumption of speech classification, without compromising classification accuracy, particularly on Intel's Loihi 2 neuromorphic chip.
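For readers unfamiliar with the underlying sparse-coding step, below is a minimal NumPy sketch of a basic LCA iteration with a soft threshold. In the audio setting described above, the dictionary columns would be built from Gammatone or Gammachirp kernels, and ALCA/ALCA-CF additionally adapt the kernels' modulation parameters and central frequencies, which this sketch does not do.

```python
import numpy as np

def lca_sparse_code(x, dictionary, lam=0.1, tau=10.0, n_steps=200):
    """Basic Locally Competitive Algorithm (LCA) with a soft-threshold activation.

    x:          (D,) input frame.
    dictionary: (D, N) unit-norm dictionary (e.g. Gammatone/Gammachirp kernels).
    Returns the sparse activation vector of length N.
    """
    b = dictionary.T @ x                                              # driving input
    gram = dictionary.T @ dictionary - np.eye(dictionary.shape[1])    # lateral inhibition
    u = np.zeros_like(b)                                              # membrane potentials
    soft = lambda v: np.sign(v) * np.maximum(np.abs(v) - lam, 0.0)
    for _ in range(n_steps):
        a = soft(u)
        u += (b - u - gram @ a) / tau                                 # leaky integration
    return soft(u)

rng = np.random.default_rng(0)
D = rng.standard_normal((128, 256))
D /= np.linalg.norm(D, axis=0, keepdims=True)
a = lca_sparse_code(rng.standard_normal(128), D)
print(np.count_nonzero(a), "active units out of", a.size)
```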
https://arxiv.org/abs/2502.06989
The NLP community has broadly focused on text-only approaches of cognitive state tasks, but audio can provide vital missing cues through prosody. We posit that text-to-speech models learn to track aspects of cognitive state in order to produce naturalistic audio, and that the signal audio models implicitly identify is orthogonal to the information that language models exploit. We present Synthetic Audio Data fine-tuning (SAD), a framework where we show that 7 tasks related to cognitive state modeling benefit from multimodal training on both text and zero-shot synthetic audio data from an off-the-shelf TTS system. We show an improvement over the text-only modality when adding synthetic audio data to text-only corpora. Furthermore, on tasks and corpora that do contain gold audio, we show our SAD framework achieves competitive performance with text and synthetic audio compared to text and gold audio.
https://arxiv.org/abs/2502.06922
Graph Neural Networks (GNNs) are vital for learning from graph-structured data, enabling applications in network analysis, recommendation systems, and speech analytics. Deploying them on edge devices like client PCs and laptops enhances real-time processing, privacy, and cloud independence. GNNs aid Retrieval-Augmented Generation (RAG) for Large Language Models (LLMs) and enable event-based vision tasks. However, irregular memory access, sparsity, and dynamic structures cause high latency and energy overhead on resource-constrained devices. While modern edge processors integrate CPUs, GPUs, and NPUs, NPUs designed for data-parallel tasks struggle with irregular GNN computations. We introduce GraNNite, the first hardware-aware framework optimizing GNN execution on commercial-off-the-shelf (COTS) SOTA DNN accelerators via a structured three-step methodology: (1) enabling NPU execution, (2) optimizing performance, and (3) trading accuracy for efficiency gains. Step 1 employs GraphSplit for workload distribution and StaGr for static aggregation, while GrAd and NodePad handle dynamic graphs. Step 2 boosts performance using EffOp for control-heavy tasks and GraSp for sparsity exploitation. Graph Convolution optimizations PreG, SymG, and CacheG reduce redundancy and memory transfers. Step 3 balances quality versus efficiency, where QuantGr applies INT8 quantization, and GrAx1, GrAx2, and GrAx3 accelerate attention, broadcast-add, and SAGE-max aggregation. On Intel Core Ultra AI PCs, GraNNite achieves 2.6X to 7.6X speedups over default NPU mappings and up to 8.6X energy gains over CPUs and GPUs, delivering 10.8X and 6.7X higher performance than CPUs and GPUs, respectively, across GNN models.
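The abstract names the individual components without detail; as one hedged interpretation of why static aggregation helps an NPU, the sketch below pads each node's neighbor list to a fixed degree so that message aggregation becomes a fixed-shape gather-and-mean instead of an irregular scatter. This illustrates the general idea only and is not an implementation of StaGr or NodePad.

```python
import numpy as np

def padded_mean_aggregate(node_feats, neighbor_lists, max_degree):
    """Dense, fixed-shape neighbor aggregation (a rough stand-in for restructuring
    irregular GNN gathers into static tensor ops that data-parallel NPUs prefer).

    node_feats:     (N, D) node features.
    neighbor_lists: list of variable-length neighbor index lists, one per node.
    max_degree:     neighbor lists are truncated/padded to this fixed size.
    """
    N, D = node_feats.shape
    idx = np.zeros((N, max_degree), dtype=np.int64)
    mask = np.zeros((N, max_degree), dtype=np.float32)
    for i, nbrs in enumerate(neighbor_lists):
        nbrs = nbrs[:max_degree]
        idx[i, :len(nbrs)] = nbrs
        mask[i, :len(nbrs)] = 1.0
    gathered = node_feats[idx]                                   # (N, max_degree, D), static shape
    summed = (gathered * mask[..., None]).sum(axis=1)
    return summed / np.maximum(mask.sum(axis=1, keepdims=True), 1.0)

feats = np.random.randn(5, 8).astype(np.float32)
nbrs = [[1, 2], [0], [0, 1, 3, 4], [2], []]
print(padded_mean_aggregate(feats, nbrs, max_degree=3).shape)  # (5, 8)
```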
https://arxiv.org/abs/2502.06921
Effectively steering hearable devices requires understanding the acoustic environment around the user. In the computational analysis of sound scenes, foundation models have emerged as the state of the art to produce high-performance, robust, multi-purpose audio representations. We introduce and release Deep Evaluation of Audio Representations (DEAR), the first dataset and benchmark to evaluate the efficacy of foundation models in capturing essential acoustic properties for hearables. The dataset includes 1,158 audio tracks, each 30 seconds long, created by spatially mixing proprietary monologues with commercial, high-quality recordings of everyday acoustic scenes. Our benchmark encompasses eight tasks that assess the general context, speech sources, and technical acoustic properties of the audio scenes. Through our evaluation of four general-purpose audio representation models, we demonstrate that the BEATs model significantly surpasses its counterparts. This superiority underscores the advantage of models trained on diverse audio collections, confirming their applicability to a wide array of auditory tasks, including encoding the environment properties necessary for hearable steering. The DEAR dataset and associated code are available at this https URL.
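The abstract does not describe the evaluation protocol in detail; the sketch below shows a common way such representation benchmarks are run, training a lightweight linear probe per task on frozen clip-level embeddings. The random embeddings stand in for outputs of a model such as BEATs, and the label set is hypothetical.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
embeddings = rng.standard_normal((1158, 768))   # one frozen embedding per 30 s track
labels = rng.integers(0, 4, size=1158)          # e.g. a scene-context classification task

# Lightweight probe: if the frozen representation encodes the property,
# even a linear classifier should recover it.
x_tr, x_te, y_tr, y_te = train_test_split(embeddings, labels, test_size=0.2, random_state=0)
probe = LogisticRegression(max_iter=1000).fit(x_tr, y_tr)
print("probe accuracy:", probe.score(x_te, y_te))
```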
https://arxiv.org/abs/2502.06664
Model distillation -- using outputs from a large teacher model to teach a small student model -- is a practical means of creating efficient models for a particular task. We ask: can we identify a student's teacher based on its outputs? Such "footprints" left by teacher LLMs would be interesting artifacts. Beyond this, reliable teacher inference may have practical implications as actors seek to distill specific capabilities of massive proprietary LLMs into deployed smaller LMs, potentially violating terms of service. We consider practical task distillation targets including summarization, question answering, and instruction following. We assume a finite set of candidate teacher models, which we treat as black boxes, and we design discriminative models that operate over lexical features. We find that $n$-gram similarity alone is unreliable for identifying teachers, but the part-of-speech (PoS) templates preferred by student models mimic those of their teachers.
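As an illustration of the kind of lexical signal involved, the sketch below extracts part-of-speech trigram profiles from model outputs with NLTK and compares a student's profile against candidate teachers by cosine similarity. The example sentences are hypothetical, and the paper's actual discriminative models go beyond this simple comparison.

```python
import nltk
from collections import Counter

# Resource names differ across NLTK versions; downloading both variants is harmless.
for res in ("punkt", "punkt_tab", "averaged_perceptron_tagger", "averaged_perceptron_tagger_eng"):
    nltk.download(res, quiet=True)

def pos_trigram_profile(texts):
    """Count part-of-speech trigram 'templates' over a set of model outputs."""
    counts = Counter()
    for text in texts:
        tags = [tag for _, tag in nltk.pos_tag(nltk.word_tokenize(text))]
        counts.update(zip(tags, tags[1:], tags[2:]))
    return counts

def cosine(c1, c2):
    keys = set(c1) | set(c2)
    dot = sum(c1[k] * c2[k] for k in keys)
    n1 = sum(v * v for v in c1.values()) ** 0.5
    n2 = sum(v * v for v in c2.values()) ** 0.5
    return dot / (n1 * n2 + 1e-8)

# Hypothetical outputs: compare the student's profile with each candidate teacher's.
student = pos_trigram_profile(["The model summarizes the document clearly and briefly."])
teacher_a = pos_trigram_profile(["The system summarizes each report clearly and concisely."])
teacher_b = pos_trigram_profile(["Honestly, who even reads these reports anymore?"])
print(cosine(student, teacher_a), cosine(student, teacher_b))
```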
https://arxiv.org/abs/2502.06659
The rapid advancement of speech generation technologies in the era of large language models (LLMs) has established discrete speech tokens as a foundational paradigm for speech representation. These tokens, characterized by their discrete, compact, and concise nature, are not only advantageous for efficient transmission and storage, but also inherently compatible with the language modeling framework, enabling seamless integration of speech into text-dominated LLM architectures. Current research categorizes discrete speech tokens into two principal classes: acoustic tokens and semantic tokens, each of which has evolved into a rich research domain characterized by unique design philosophies and methodological approaches. This survey systematically synthesizes the existing taxonomy and recent innovations in discrete speech tokenization, conducts a critical examination of the strengths and limitations of each paradigm, and presents systematic experimental comparisons across token types. Furthermore, we identify persistent challenges in the field and propose potential research directions, aiming to offer actionable insights to inspire future advancements in the development and application of discrete speech tokens.
https://arxiv.org/abs/2502.06490
We introduce DebateBench, a novel dataset consisting of an extensive collection of transcripts and metadata from some of the world's most prestigious competitive debates. The dataset consists of British Parliamentary debates from prestigious debating tournaments on diverse topics, annotated with detailed speech-level scores and house rankings sourced from official adjudication data. We curate 256 speeches across 32 debates; each debate is over an hour long, and each input averages 32,000 tokens. Designed to capture long-context, large-scale reasoning tasks, DebateBench provides a benchmark for evaluating modern large language models (LLMs) on their ability to engage in argumentation and deliberation and to align with human experts. To do well on DebateBench, LLMs must perform in-context learning to understand the rules and evaluation criteria of the debates, then analyze eight seven-minute speeches and reason about the arguments presented by all speakers to produce the final results. Our preliminary evaluation using GPT o1, GPT-4o, and Claude Haiku shows that LLMs struggle to perform well on DebateBench, highlighting the need for more sophisticated techniques to improve their performance.
https://arxiv.org/abs/2502.06279