Abstract
We introduce an unsupervised approach to Speech Segmentation that builds on previously studied tasks, e.g., Speaker Diarization, while applying to a broad set of acoustic-semantic distinctions, paving a path towards general Unsupervised Speech Segmentation. Unlike traditional speech and audio segmentation, which mainly focuses on spectral changes in the input signal, e.g., phone segmentation, our approach aims to segment a spoken utterance into chunks with differing acoustic-semantic styles, focusing on acoustic-semantic information that does not translate well into text, e.g., emotion or speaker. While most Speech Segmentation tasks handle only one style change, e.g., emotion diarization, our approach aims to handle multiple acoustic-semantic style changes. Leveraging recent advances in Speech Language Models (SLMs), we propose a simple unsupervised method to segment a given speech utterance. We empirically demonstrate the effectiveness of the proposed approach across several setups. Results suggest that the proposed method outperforms the evaluated baselines on boundary detection, segment purity, and over-segmentation. Code is available at this https URL.
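To make the evaluation criteria concrete, the following is a minimal sketch of how boundary detection and over-segmentation are commonly scored in the segmentation literature. It is not taken from the paper: the function name `boundary_scores`, the tolerance parameter `tol`, the greedy matching rule, and the over-segmentation definition (number of predicted boundaries divided by number of reference boundaries, minus one) are all assumptions made for illustration. Segment purity is not covered here.

```python
from typing import Dict, Sequence


def boundary_scores(
    ref: Sequence[float], hyp: Sequence[float], tol: float = 0.5
) -> Dict[str, float]:
    """Score predicted boundaries (seconds) against reference boundaries.

    Hypothetical helper: a predicted boundary counts as a hit if it lies
    within `tol` seconds of a reference boundary that has not already been
    matched (greedy one-to-one matching, an assumed convention).
    """
    ref_sorted = sorted(ref)
    hyp_sorted = sorted(hyp)
    used = [False] * len(ref_sorted)
    hits = 0

    for h in hyp_sorted:
        best = None
        for i, r in enumerate(ref_sorted):
            if used[i] or abs(h - r) > tol:
                continue
            # Keep the closest unused reference boundary within tolerance.
            if best is None or abs(h - r) < abs(h - ref_sorted[best]):
                best = i
        if best is not None:
            used[best] = True
            hits += 1

    precision = hits / len(hyp_sorted) if hyp_sorted else 0.0
    recall = hits / len(ref_sorted) if ref_sorted else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    # Assumed definition: positive values mean more predicted boundaries
    # than reference boundaries (over-segmentation), negative means fewer.
    over_seg = len(hyp_sorted) / len(ref_sorted) - 1 if ref_sorted else 0.0

    return {
        "precision": precision,
        "recall": recall,
        "f1": f1,
        "over_segmentation": over_seg,
    }


if __name__ == "__main__":
    reference = [1.0, 2.0, 3.0]
    predicted = [1.1, 2.6, 3.0, 4.0]
    print(boundary_scores(reference, predicted, tol=0.2))
```

Under these assumptions, a method that places many spurious boundaries can still reach high recall, which is why an over-segmentation term is typically reported alongside precision, recall, and F1.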
URL
https://arxiv.org/abs/2501.03711