Research in auditory, visual, and audiovisual speech recognition (ASR, VSR, and AVSR, respectively) has traditionally been conducted independently. Even recent self-supervised studies addressing two or all three tasks simultaneously tend to yield separate models, leading to disjoint inference pipelines with increased memory requirements and redundancies. This paper proposes unified training strategies for these systems. We demonstrate that training a single model for all three tasks enhances VSR and AVSR performance, overcoming typical optimisation challenges when training from scratch. Moreover, we introduce a greedy pseudo-labelling approach to more effectively leverage unlabelled samples, addressing shortcomings in related self-supervised methods. Finally, we develop a self-supervised pre-training method within our framework, proving its effectiveness alongside our semi-supervised approach. Despite using a single model for all tasks, our unified approach achieves state-of-the-art performance compared to recent methods on LRS3 and LRS2 for ASR, VSR, and AVSR, as well as on the newly released WildVSR dataset. Code and models are available at this https URL.
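As a rough illustration of the pseudo-labelling idea, below is a minimal Python sketch of a confidence-filtered greedy pseudo-labelling pass over unlabelled clips; the `transcribe` callable, the confidence definition, and the 0.9 threshold are illustrative assumptions rather than the paper's exact recipe.

```python
from typing import Callable, Iterable, List, Tuple

def greedy_pseudo_label(
    transcribe: Callable[[object], Tuple[str, float]],
    unlabelled_clips: Iterable[object],
    threshold: float = 0.9,
) -> List[Tuple[object, str]]:
    """Keep only hypotheses whose greedy-decoding confidence clears the threshold."""
    kept = []
    for clip in unlabelled_clips:
        hypothesis, confidence = transcribe(clip)  # e.g. mean token probability
        if confidence >= threshold:
            kept.append((clip, hypothesis))
    return kept

# Stand-in transcriber; the accepted pairs would be mixed into the labelled training set.
print(greedy_pseudo_label(lambda clip: ("hello world", 0.95), ["clip_001.mp4"]))
```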
https://arxiv.org/abs/2411.02256
Despite the current popularity of end-to-end diarization systems, modular systems composed of voice activity detection (VAD), speaker embedding extraction plus clustering, and overlapped speech detection (OSD) plus handling still attain competitive performance in many conditions. However, one of the main drawbacks of modular systems is the need to run (and train) the different modules independently. In this work, we propose an approach to jointly train a single model to produce speaker embeddings, VAD, and OSD simultaneously, reaching competitive performance at a fraction of the inference time of a standard approach. Furthermore, the joint inference leads to a simplified overall pipeline, which brings us one step closer to a unified clustering-based method that can be trained end-to-end towards a diarization-specific objective.
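For intuition, a minimal PyTorch sketch of the joint setup: one shared encoder with three heads producing per-frame speaker embeddings, VAD logits, and OSD logits; the LSTM encoder and all layer sizes are placeholder assumptions, not the authors' architecture.

```python
import torch
import torch.nn as nn

class JointDiarizationFrontend(nn.Module):
    def __init__(self, feat_dim=80, hidden=256, emb_dim=192):
        super().__init__()
        self.encoder = nn.LSTM(feat_dim, hidden, num_layers=2,
                               batch_first=True, bidirectional=True)
        self.embedding_head = nn.Linear(2 * hidden, emb_dim)  # per-frame speaker embeddings
        self.vad_head = nn.Linear(2 * hidden, 1)              # speech / non-speech logit
        self.osd_head = nn.Linear(2 * hidden, 1)              # overlapped-speech logit

    def forward(self, feats):                  # feats: (batch, frames, feat_dim)
        h, _ = self.encoder(feats)
        return (self.embedding_head(h),        # (batch, frames, emb_dim)
                self.vad_head(h).squeeze(-1),  # (batch, frames)
                self.osd_head(h).squeeze(-1))  # (batch, frames)

emb, vad_logits, osd_logits = JointDiarizationFrontend()(torch.randn(2, 300, 80))
```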
https://arxiv.org/abs/2411.02165
Zero-shot voice conversion (VC) aims to transform the timbre of a source speaker into that of any previously unseen target speaker, while preserving the original linguistic content. Despite notable progress, attaining a degree of speaker similarity and naturalness on par with ground-truth recordings continues to pose a great challenge. In this paper, we propose CTEFM-VC, a zero-shot VC framework that leverages Content-aware Timbre Ensemble modeling and Flow Matching. Specifically, CTEFM-VC disentangles utterances into linguistic content and timbre representations, subsequently utilizing a conditional flow matching model and a vocoder to reconstruct the mel-spectrogram and waveform. To enhance its timbre modeling capability and the naturalness of generated speech, we propose a context-aware timbre ensemble modeling approach that adaptively integrates diverse speaker verification embeddings and enables the joint utilization of linguistic and timbre features through a cross-attention module. Experiments show that our CTEFM-VC system surpasses state-of-the-art VC methods in speaker similarity and naturalness by at least 18.5% and 7.0%, respectively.
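A small PyTorch sketch of the general idea of fusing several speaker-verification embeddings with the content sequence through cross-attention; the single attention layer, the dimensions, and the residual fusion are assumptions made only for illustration.

```python
import torch
import torch.nn as nn

class TimbreEnsembleFusion(nn.Module):
    def __init__(self, content_dim=256, spk_dim=192, n_heads=4):
        super().__init__()
        self.spk_proj = nn.Linear(spk_dim, content_dim)
        self.attn = nn.MultiheadAttention(content_dim, n_heads, batch_first=True)

    def forward(self, content, spk_embs):
        # content:  (batch, frames, content_dim)  -- linguistic features
        # spk_embs: (batch, n_sv_models, spk_dim) -- embeddings from several SV models
        timbre = self.spk_proj(spk_embs)
        fused, _ = self.attn(query=content, key=timbre, value=timbre)
        return content + fused                   # residual fusion of content and timbre

out = TimbreEnsembleFusion()(torch.randn(2, 120, 256), torch.randn(2, 3, 192))
```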
https://arxiv.org/abs/2411.02026
Deep learning-based speech enhancement (SE) methods often face significant computational challenges when needing to meet low-latency requirements because of the increased number of frames to be processed. This paper introduces the SlowFast framework which aims to reduce computation costs specifically when low-latency enhancement is needed. The framework consists of a slow branch that analyzes the acoustic environment at a low frame rate, and a fast branch that performs SE in the time domain at the needed higher frame rate to match the required latency. Specifically, the fast branch employs a state space model where its state transition process is dynamically modulated by the slow branch. Experiments on an SE task with a 2 ms algorithmic latency requirement using the Voice Bank + Demand dataset show that our approach reduces computation cost by 70% compared to a baseline single-branch network with equivalent parameters, without compromising enhancement performance. Furthermore, by leveraging the SlowFast framework, we implemented a network that achieves an algorithmic latency of just 60 μs (one sample point at 16 kHz sample rate) with a computation cost of 100 M MACs/s, while scoring a PESQ-NB of 3.12 and SISNR of 16.62.
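A toy sketch of the slow/fast control flow: a slow branch runs once every few samples and updates a smoothing coefficient that the fast branch applies at every sample; the real model uses a learned state space layer, so this mirrors only the structure, not the method.

```python
import numpy as np

def slowfast_filter(x, stride=8):
    """x: (samples,) noisy signal; returns a smoothed signal."""
    y, state, alpha = np.zeros_like(x), 0.0, 0.5
    for t, sample in enumerate(x):
        if t % stride == 0:                        # slow branch: cheap, low frame rate
            local_energy = float(np.mean(np.abs(x[t:t + stride])))
            alpha = float(np.clip(1.0 - local_energy, 0.1, 0.95))
        state = alpha * state + (1.0 - alpha) * sample   # fast branch: every sample
        y[t] = state
    return y

print(slowfast_filter(np.random.randn(32)).shape)
```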
https://arxiv.org/abs/2411.02019
While textless Spoken Language Models (SLMs) have shown potential in end-to-end speech-to-speech modeling, they still lag behind text-based Large Language Models (LLMs) in terms of semantic coherence and relevance. This work introduces the Align-SLM framework, which leverages preference optimization inspired by Reinforcement Learning with AI Feedback (RLAIF) to enhance the semantic understanding of SLMs. Our approach generates multiple speech continuations from a given prompt and uses semantic metrics to create preference data for Direct Preference Optimization (DPO). We evaluate the framework using ZeroSpeech 2021 benchmarks for lexical and syntactic modeling, the spoken version of the StoryCloze dataset for semantic coherence, and other speech generation metrics, including the GPT-4o score and human evaluation. Experimental results show that our method achieves state-of-the-art performance for SLMs on most benchmarks, highlighting the importance of preference optimization to improve the semantics of SLMs.
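The DPO objective itself is standard; below is a minimal sketch of the loss on preferred/rejected continuation pairs, where the sequence log-probabilities are assumed to come from the SLM being trained and a frozen reference copy.

```python
import torch
import torch.nn.functional as F

def dpo_loss(logp_chosen, logp_rejected, ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """All inputs are sequence log-probabilities of shape (batch,)."""
    chosen_margin = logp_chosen - ref_logp_chosen        # how much the policy moved on the winner
    rejected_margin = logp_rejected - ref_logp_rejected  # ... and on the loser
    return -F.logsigmoid(beta * (chosen_margin - rejected_margin)).mean()

loss = dpo_loss(torch.tensor([-12.0]), torch.tensor([-15.0]),
                torch.tensor([-13.0]), torch.tensor([-14.0]))
```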
https://arxiv.org/abs/2411.01834
Spurred by the demand for interpretable models, research on eXplainable AI for language technologies has experienced significant growth, with feature attribution methods emerging as a cornerstone of this progress. While prior work in NLP explored such methods for classification tasks and textual applications, explainability intersecting generation and speech is lagging, with existing techniques failing to account for the autoregressive nature of state-of-the-art models and to provide fine-grained, phonetically meaningful explanations. We address this gap by introducing Spectrogram Perturbation for Explainable Speech-to-text Generation (SPES), a feature attribution technique applicable to sequence generation tasks with autoregressive models. SPES provides explanations for each predicted token based on both the input spectrogram and the previously generated tokens. Extensive evaluation on speech recognition and translation demonstrates that SPES generates explanations that are faithful and plausible to humans.
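To make the perturbation idea concrete, here is a bare-bones occlusion sketch that masks time-frequency patches of the input spectrogram and records how much the score of a predicted token drops; the patch grid and the scoring function are assumptions, and SPES additionally attributes each token to the previously generated tokens.

```python
import numpy as np

def occlusion_map(spec, score_fn, patch=(8, 8)):
    """spec: (freq, time); score_fn(spec) -> scalar score of the token of interest."""
    base = score_fn(spec)
    heat = np.zeros_like(spec)
    for f0 in range(0, spec.shape[0], patch[0]):
        for t0 in range(0, spec.shape[1], patch[1]):
            masked = spec.copy()
            masked[f0:f0 + patch[0], t0:t0 + patch[1]] = spec.mean()
            heat[f0:f0 + patch[0], t0:t0 + patch[1]] = base - score_fn(masked)
    return heat  # large values mark regions the prediction relied on

heat = occlusion_map(np.random.rand(80, 64), lambda s: float(s[:40].mean()))
```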
https://arxiv.org/abs/2411.01710
Large language models (LLMs) are trained on data assumed to include natural language pragmatics, but do they actually behave like pragmatic speakers? We attempt to answer this question using the Rational Speech Act (RSA) framework, which models pragmatic reasoning in human communication. Using the paradigm of a reference game constructed from the TUNA corpus, we score candidate referential utterances in both a state-of-the-art LLM (Llama3-8B-Instruct) and in the RSA model, comparing and contrasting these scores. Given that RSA requires defining alternative utterances and a truth-conditional meaning function, we explore such comparisons for different choices of each of these requirements. We find that while scores from the LLM have some positive correlation with those from RSA, there isn't sufficient evidence to claim that it behaves like a pragmatic speaker. This initial study paves the way for further targeted efforts exploring different models and settings, including human-subject evaluation, to see whether LLMs truly can behave, or can be made to behave, like pragmatic speakers.
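For reference, a textbook RSA speaker on a tiny invented lexicon: a literal listener over truth-conditional meanings, then a speaker that soft-maximizes informativity; the paper scores TUNA reference-game utterances in this spirit, but the lexicon below is purely illustrative.

```python
import numpy as np

def rsa_speaker(lexicon, alpha=1.0):
    """lexicon: (n_utterances, n_referents) truth values in {0, 1}.
    Returns P(utterance | referent) under the pragmatic speaker S1 (uniform referent prior)."""
    L0 = lexicon / lexicon.sum(axis=1, keepdims=True)   # literal listener P(referent | utterance)
    S1 = np.exp(alpha * np.log(L0 + 1e-12))             # speaker utility = log-informativity
    return S1 / S1.sum(axis=0, keepdims=True)           # normalize over utterances per referent

# Two referents: "blue" is true of both, "blue chair" only of referent 0,
# so the speaker prefers the more informative utterance when referring to referent 0.
print(rsa_speaker(np.array([[1, 1], [1, 0]])))
```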
https://arxiv.org/abs/2411.01562
We introduce SinaTools, an open-source Python package for Arabic natural language processing and understanding. SinaTools is a unified package that users can integrate into their system workflows, offering solutions for various tasks such as flat and nested Named Entity Recognition (NER), full-fledged Word Sense Disambiguation (WSD), Semantic Relatedness, Synonymy Extraction and Evaluation, Lemmatization, Part-of-Speech Tagging, Root Tagging, and additional helper utilities such as corpus processing, text stripping methods, and diacritic-aware word matching. This paper presents SinaTools and its benchmarking results, demonstrating that SinaTools outperforms all similar tools on the aforementioned tasks, such as Flat NER (87.33%), Nested NER (89.42%), WSD (82.63%), Semantic Relatedness (0.49 Spearman rank), Lemmatization (90.5%), POS tagging (97.5%), among others. SinaTools can be downloaded from (this https URL).
https://arxiv.org/abs/2411.01523
This paper develops theory and algorithms for interacting large language model agents (LLMAs) using methods from statistical signal processing and microeconomics. While both fields are mature, their application to decision-making by interacting LLMAs remains unexplored. Motivated by Bayesian sentiment analysis on online platforms, we construct interpretable models and stochastic control algorithms that enable LLMAs to interact and perform Bayesian inference. Because interacting LLMAs learn from prior decisions and external inputs, they exhibit bias and herding behavior. Thus, developing interpretable models and stochastic control algorithms is essential to understand and mitigate these behaviors. This paper has three main results. First, we show using Bayesian revealed preferences from microeconomics that an individual LLMA satisfies the sufficient conditions for rationally inattentive (bounded rationality) utility maximization and, given an observation, the LLMA chooses an action that maximizes a regularized utility. Second, we utilize Bayesian social learning to construct interpretable models for LLMAs that interact sequentially with each other and the environment while performing Bayesian inference. Our models capture the herding behavior exhibited by interacting LLMAs. Third, we propose a stochastic control framework to delay herding and improve state estimation accuracy under two settings: (a) centrally controlled LLMAs and (b) autonomous LLMAs with incentives. Throughout the paper, we demonstrate the efficacy of our methods on real datasets for hate speech classification and product quality assessment, using open-source models like Mistral and closed-source models like ChatGPT. The main takeaway of this paper, based on substantial empirical analysis and mathematical formalism, is that LLMAs act as rationally bounded Bayesian agents that exhibit social learning when interacting.
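As background intuition for the herding result, a toy sequential Bayesian social-learning simulation (not the paper's model): each agent combines the public log-odds with a private signal, and once the public belief outweighs any single signal, actions stop revealing private information and a cascade forms.

```python
import numpy as np

rng = np.random.default_rng(0)
true_state, signal_acc = 1, 0.7                     # binary state, P(signal == state)
llr_signal = np.log(signal_acc / (1 - signal_acc))  # evidence carried by one private signal
log_prior, actions = 0.0, []                        # public log-odds that state == 1

for agent in range(12):
    signal = true_state if rng.random() < signal_acc else 1 - true_state
    private = llr_signal if signal == 1 else -llr_signal
    action = int(log_prior + private > 0)           # myopic Bayesian decision
    # The action informs observers only if the two possible signals imply different actions.
    reveals_signal = int(log_prior + llr_signal > 0) != int(log_prior - llr_signal > 0)
    if reveals_signal:
        log_prior += llr_signal if action == 1 else -llr_signal
    actions.append(action)                          # otherwise: a cascade (herding)

print(actions, "public log-odds:", round(log_prior, 2))
```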
https://arxiv.org/abs/2411.01271
Text-to-Speech (TTS) systems face ongoing challenges in processing complex linguistic features, handling polyphonic expressions, and producing natural-sounding multilingual speech - capabilities that are crucial for future AI applications. In this paper, we present Fish-Speech, a novel framework that implements a serial fast-slow Dual Autoregressive (Dual-AR) architecture to enhance the stability of Grouped Finite Scalar Vector Quantization (GFSQ) in sequence generation tasks. This architecture improves codebook processing efficiency while maintaining high-fidelity outputs, making it particularly effective for AI interactions and voice cloning. Fish-Speech leverages Large Language Models (LLMs) for linguistic feature extraction, eliminating the need for traditional grapheme-to-phoneme (G2P) conversion and thereby streamlining the synthesis pipeline and enhancing multilingual support. Additionally, we developed FF-GAN through GFSQ to achieve superior compression ratios and near 100% codebook utilization. Our approach addresses key limitations of current TTS systems while providing a foundation for more sophisticated, context-aware speech synthesis. Experimental results show that Fish-Speech significantly outperforms baseline models in handling complex linguistic scenarios and voice cloning tasks, demonstrating its potential to advance TTS technology in AI applications. The implementation is open source at this https URL.
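A minimal sketch of plain finite scalar quantization with a straight-through gradient, the building block that a grouped scheme such as GFSQ organizes into groups; the number of levels here is an arbitrary illustrative choice.

```python
import torch

def finite_scalar_quantize(z, levels=8):
    """z: (..., dim) real-valued latents -> latents snapped to a fixed grid in [-1, 1]."""
    bounded = torch.tanh(z)                          # squash each dimension to (-1, 1)
    half = (levels - 1) / 2
    quantized = torch.round(bounded * half) / half   # snap to `levels` grid points
    return bounded + (quantized - bounded).detach()  # straight-through gradient

codes = finite_scalar_quantize(torch.randn(2, 50, 8))
```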
https://arxiv.org/abs/2411.01156
Individuals with cerebral palsy (CP) and amyotrophic lateral sclerosis (ALS) frequently face challenges with articulation, leading to dysarthria and resulting in atypical speech patterns. In healthcare settings, communication breakdowns reduce the quality of care. While building an augmentative and alternative communication (AAC) tool to enable fluid communication, we found that state-of-the-art (SOTA) automatic speech recognition (ASR) technology like Whisper and Wav2vec2.0 marginalizes atypical speakers, largely due to the lack of training data. Our work looks to leverage SOTA ASR followed by domain-specific error correction. English dysarthric ASR performance is often evaluated on the TORGO dataset. Prompt overlap is a well-known issue with this dataset, where phrases overlap between training and test speakers. Our work proposes an algorithm to break this prompt overlap. After reducing prompt overlap, SOTA ASR models produce extremely high word error rates for speakers with mild and severe dysarthria. Furthermore, to improve ASR, our work examines the impact of n-gram language models and large language model (LLM)-based multi-modal generative error-correction algorithms, such as Whispering-LLaMA, for a second ASR pass. Our work highlights how much more needs to be done to improve ASR for atypical speakers so as to enable equitable healthcare access both in person and in e-health settings.
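One simple way to break prompt overlap in a speaker-held-out split is to drop every test utterance whose normalized prompt also occurs among the training prompts, as in the sketch below; the dictionary fields are assumptions about how the TORGO metadata is loaded, and the paper's algorithm may differ in how the retained data is re-balanced.

```python
def normalize(prompt: str) -> str:
    return " ".join(prompt.lower().split())

def break_prompt_overlap(train_utts, test_utts):
    """Each utterance is a dict with a 'prompt' key; returns the filtered test list."""
    train_prompts = {normalize(u["prompt"]) for u in train_utts}
    return [u for u in test_utts if normalize(u["prompt"]) not in train_prompts]

clean_test = break_prompt_overlap(
    [{"prompt": "The quick brown fox"}],
    [{"prompt": "the quick brown fox"}, {"prompt": "pa pa pa"}],
)
print(len(clean_test))  # 1: the overlapping prompt is removed
```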
https://arxiv.org/abs/2411.00980
The rapid development of large language models has brought many new smart applications; in particular, the excellent multimodal human-computer interaction of GPT-4o has delivered an impressive experience to users. Against this background, researchers have recently proposed many multimodal LLMs that can achieve speech-to-speech dialogue. In this paper, we propose a speech-text multimodal LLM architecture called Freeze-Omni. Our main contribution is that the speech input and output modalities can be connected to the LLM while keeping the LLM frozen throughout the training process. We designed three-stage training strategies for modeling both speech input and speech output, enabling Freeze-Omni to acquire speech-to-speech dialogue ability using text-speech paired data (such as ASR and TTS data) and only 60,000 multi-round text Q&A samples on 8 GPUs. Moreover, we can effectively ensure that the intelligence of Freeze-Omni in the speech modality is at the same level as that in the text modality of its backbone LLM, while the end-to-end latency of the spoken response remains low. In addition, we designed a method to achieve duplex dialogue ability through multi-task training, giving Freeze-Omni a more natural dialogue style with users. Freeze-Omni mainly offers researchers the possibility of building multimodal LLMs on top of a frozen LLM, avoiding the catastrophic forgetting that limited data and training resources can otherwise cause.
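The frozen-backbone setup can be sketched in a few lines of PyTorch: freeze every backbone parameter and hand only the adapter parameters to the optimizer; the modules below are generic placeholders, not Freeze-Omni's actual speech encoder/decoder components.

```python
import torch.nn as nn

def freeze_backbone(llm: nn.Module, adapters: nn.Module):
    for p in llm.parameters():
        p.requires_grad = False      # the backbone is never updated, so it cannot forget
    return [p for p in adapters.parameters() if p.requires_grad]  # give these to the optimizer

llm = nn.TransformerEncoder(nn.TransformerEncoderLayer(256, 4, batch_first=True), num_layers=2)
speech_adapter = nn.Sequential(nn.Linear(80, 256), nn.GELU(), nn.Linear(256, 256))
trainable_params = freeze_backbone(llm, speech_adapter)
```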
https://arxiv.org/abs/2411.00774
Narratives are key interpretative devices by which humans make sense of political reality. As the significance of narratives for understanding current societal issues such as polarization and misinformation becomes increasingly evident, there is a growing demand for methods that support their empirical analysis. To this end, we propose a graph-based formalism and machine-guided method for extracting, representing, and analyzing selected narrative signals from digital textual corpora, based on Abstract Meaning Representation (AMR). The formalism and method introduced here specifically cater to the study of political narratives that figure in texts from digital media such as archived political speeches, social media posts, political manifestos and transcripts of parliamentary debates. We conceptualize these political narratives as a type of ontological narratives: stories by which actors position themselves as political beings, and which are akin to political worldviews in which actors present their normative vision of the world, or aspects thereof. We approach the study of such political narratives as a problem of information retrieval: starting from a textual corpus, we first extract a graph-like representation of the meaning of each sentence in the corpus using AMR. Drawing on transferable concepts from narratology, we then apply a set of heuristics to filter these graphs for representations of 1) actors, 2) the events in which these actors figure, and 3) traces of the perspectivization of these events. We approach these references to actors, events, and instances of perspectivization as core narrative signals that initiate a further analysis by alluding to larger political narratives. By means of a case study of State of the European Union addresses, we demonstrate how the formalism can be used to inductively surface signals of political narratives from public discourse.
https://arxiv.org/abs/2411.00702
Neural contextual biasing allows speech recognition models to leverage contextually relevant information, leading to improved transcription accuracy. However, the biasing mechanism is typically based on a cross-attention module between the audio and a catalogue of biasing entries, which means computational complexity can pose severe practical limitations on the size of the biasing catalogue and consequently on accuracy improvements. This work proposes an approximation to cross-attention scoring based on vector quantization and enables compute- and memory-efficient use of large biasing catalogues. We propose to use this technique jointly with a retrieval based contextual biasing approach. First, we use an efficient quantized retrieval module to shortlist biasing entries by grounding them on audio. Then we use retrieved entries for biasing. Since the proposed approach is agnostic to the biasing method, we investigate using full cross-attention, LLM prompting, and a combination of the two. We show that retrieval based shortlisting allows the system to efficiently leverage biasing catalogues of several thousands of entries, resulting in up to 71% relative error rate reduction in personal entity recognition. At the same time, the proposed approximation algorithm reduces compute time by 20% and memory usage by 85-95%, for lists of up to one million entries, when compared to standard dot-product cross-attention.
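A small sketch of the quantized scoring idea: each biasing-entry key is assigned to a codeword (offline in practice), the query-codeword products are computed once, and per-entry scores become table lookups instead of full dot products; the codebook size and the clustering step are illustrative assumptions.

```python
import torch

def approx_biasing_scores(queries, entry_keys, codebook):
    """queries: (T, d) audio queries; entry_keys: (N, d); codebook: (K, d) with K << N."""
    assign = torch.cdist(entry_keys, codebook).argmin(dim=1)  # (N,) nearest codeword per entry
    q_times_code = queries @ codebook.T                       # (T, K): one small matmul
    return q_times_code[:, assign]                            # (T, N) approximate scores

scores = approx_biasing_scores(torch.randn(100, 64),   # 100 audio frames
                               torch.randn(5000, 64),  # 5,000 biasing entries
                               torch.randn(256, 64))   # 256-codeword codebook
```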
https://arxiv.org/abs/2411.00664
Student mental health is a sensitive issue that necessitates special attention. A primary concern is the student-to-counselor ratio, which surpasses the recommended standard of 250:1 in most universities. This imbalance results in extended waiting periods for in-person consultations, which leads to suboptimal treatment. Significant efforts have been directed toward developing mental health dialogue systems utilizing existing open-source mental health-related datasets. However, currently available datasets either discuss general topics or various strategies that may not be viable for direct application due to numerous ethical constraints inherent in this research domain. To address this issue, this paper introduces a specialized mental health dataset that emphasizes the active listening strategy employed in counseling conversations, named ConvCounsel. This dataset comprises both speech and text data, which can facilitate the development of a reliable pipeline for mental health dialogue systems. To demonstrate the utility of the proposed dataset, this paper also presents NYCUKA, a spoken mental health dialogue system designed using the ConvCounsel dataset. The results show the merit of using this dataset.
https://arxiv.org/abs/2411.00604
Transformers and their derivatives have achieved state-of-the-art performance across text, vision, and speech recognition tasks. However, minimal effort has been made to train transformers capable of evaluating the output quality of other models. This paper examines SwinV2-based reward models, called the Input-Output Transformer (IO Transformer) and the Output Transformer. These reward models can be leveraged for tasks such as inference quality evaluation, data categorization, and policy optimization. Our experiments demonstrate highly accurate model output quality assessment across domains where the output is entirely dependent on the input, with the IO Transformer achieving perfect evaluation accuracy on the Change Dataset 25 (CD25). We also explore modified SwinV2 architectures. Ultimately, SwinV2 remains on top with a score of 95.41% on the IO Segmentation Dataset, outperforming the IO Transformer in scenarios where the output is not entirely dependent on the input. Our work expands the application of transformer architectures to reward modeling in computer vision and provides critical insights into optimizing these models for various tasks.
https://arxiv.org/abs/2411.00252
Spoken language models (SLMs) have gained increasing attention with advancements in text-based, decoder-only language models. SLMs process text and speech, enabling simultaneous speech understanding and generation. This paper presents Double-Codebook Speaker-invariant Clustering (DC-Spin), which aims to improve speech tokenization by bridging audio signals and SLM tokens. DC-Spin extracts speaker-invariant tokens rich in phonetic information and resilient to input variations, enhancing zero-shot SLM tasks and speech resynthesis. We propose a chunk-wise approach to enable streamable DC-Spin without retraining and degradation. Comparisons of tokenization methods (self-supervised and neural audio codecs), model scalability, and downstream task proxies show that tokens easily modeled by an n-gram LM or aligned with phonemes offer strong performance, providing insights for designing speech tokenizers for SLMs.
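A minimal sketch of chunk-wise (streamable) tokenization, with a fixed codebook standing in for the learned speaker-invariant clusters; the chunk size and codebook size are arbitrary choices here.

```python
import numpy as np

def stream_tokenize(features, codebook, chunk=40):
    """features: (frames, dim); codebook: (K, dim). Yields nearest-cluster ids per chunk."""
    for start in range(0, len(features), chunk):
        block = features[start:start + chunk]
        dists = ((block[:, None, :] - codebook[None, :, :]) ** 2).sum(-1)  # (chunk, K)
        yield dists.argmin(axis=1)                                         # token ids

tokens = np.concatenate(list(stream_tokenize(np.random.randn(200, 256),
                                             np.random.randn(500, 256))))
```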
https://arxiv.org/abs/2410.24177
Probabilistic topic models are a powerful tool for extracting latent themes from large text datasets. In many text datasets, we also observe per-document covariates (e.g., source, style, political affiliation) that act as environments that modulate a "global" (environment-agnostic) topic representation. Accurately learning these representations is important for prediction on new documents in unseen environments and for estimating the causal effect of topics on real-world outcomes. To this end, we introduce the Multi-environment Topic Model (MTM), an unsupervised probabilistic model that separates global and environment-specific terms. Through experimentation on various political content, from ads to tweets and speeches, we show that the MTM produces interpretable global topics with distinct environment-specific words. On multi-environment data, the MTM outperforms strong baselines both in- and out-of-distribution. It also enables the discovery of accurate causal effects.
https://arxiv.org/abs/2410.24126
The prosody of a spoken utterance, including features like stress, intonation and rhythm, can significantly affect the underlying semantics, and as a consequence can also affect its textual translation. Nevertheless, prosody is rarely studied within the context of speech-to-text translation (S2TT) systems. In particular, end-to-end (E2E) systems have been proposed as well-suited for prosody-aware translation because they have direct access to the speech signal when making translation decisions, but the understanding of whether this is successful in practice is still limited. A main challenge is the difficulty of evaluating prosody awareness in translation. To address this challenge, we introduce an evaluation methodology and a focused benchmark (named ContraProST) aimed at capturing a wide range of prosodic phenomena. Our methodology uses large language models and controllable text-to-speech (TTS) to generate contrastive examples. Through experiments in translating English speech into German, Spanish, and Japanese, we find that (a) S2TT models possess some internal representation of prosody, but the prosody signal is often not strong enough to affect the translations, (b) E2E systems outperform cascades of speech recognition and text translation systems, confirming their theoretical advantage in this regard, and (c) certain cascaded systems also capture prosodic information in the translation, but only to a lesser extent that depends on the particulars of the transcript's surface form.
https://arxiv.org/abs/2410.24019
Several attempts have been made to handle multiple source separation tasks such as speech enhancement, speech separation, sound event separation, music source separation (MSS), or cinematic audio source separation (CASS) with a single model. These models are trained on large-scale data including speech, instruments, or sound events and can often successfully separate a wide range of sources. However, it is still challenging for such models to cover all separation tasks because some of them are contradictory (e.g., musical instruments are separated in MSS while they have to be grouped in CASS). To overcome this issue and support all the major separation tasks, we propose a task-aware unified source separation (TUSS) model. The model uses a variable number of learnable prompts to specify which source to separate, and changes its behavior depending on the given prompts, enabling it to handle all the major separation tasks including contradictory ones. Experimental results demonstrate that the proposed TUSS model successfully handles the five major separation tasks mentioned earlier. We also provide some audio examples, including both synthetic mixtures and real recordings, to demonstrate how flexibly the TUSS model changes its behavior at inference depending on the prompts.
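A toy PyTorch sketch of prompt-conditioned separation: one learnable prompt embedding per requested source type is prepended to the mixture representation and one output vector is read back per prompt; the sizes, the encoder, and the readout are placeholders rather than the TUSS architecture.

```python
import torch
import torch.nn as nn

class PromptConditionedSeparator(nn.Module):
    def __init__(self, dim=256, n_prompt_types=6):
        super().__init__()
        self.prompts = nn.Embedding(n_prompt_types, dim)   # e.g. speech, music, sfx, ...
        self.encoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(dim, 4, batch_first=True), num_layers=2)
        self.readout = nn.Linear(dim, dim)                 # stand-in for a per-source mask/decoder

    def forward(self, mixture, prompt_ids):
        # mixture: (batch, frames, dim); prompt_ids: (batch, n_sources), variable per call
        p = self.prompts(prompt_ids)
        h = self.encoder(torch.cat([p, mixture], dim=1))
        return self.readout(h[:, :p.shape[1]])             # one output per requested prompt

out = PromptConditionedSeparator()(torch.randn(2, 100, 256), torch.tensor([[0, 2], [1, 3]]))
```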
https://arxiv.org/abs/2410.23987