Unsupervised Speech Segmentation: A General Approach Using Speech Language Models

2025-01-07 11:32:13
Avishai Elmakies, Omri Abend, Yossi Adi

Abstract

In this paper, we introduce an unsupervised approach for Speech Segmentation, which builds on previously researched approaches, e.g., Speaker Diarization, while being applicable to an inclusive set of acoustic-semantic distinctions, paving a path towards a general Unsupervised Speech Segmentation approach. Unlike traditional speech and audio segmentation, which mainly focuses on spectral changes in the input signal, e.g., phone segmentation, our approach tries to segment the spoken utterance into chunks with differing acoustic-semantic styles, focusing on acoustic-semantic information that does not translate well into text, e.g., emotion or speaker. While most Speech Segmentation tasks only handle one style change, e.g., emotion diarization, our approach tries to handle multiple acoustic-semantic style changes. Leveraging recent advances in Speech Language Models (SLMs), we propose a simple unsupervised method to segment a given speech utterance. We empirically demonstrate the effectiveness of the proposed approach by considering several setups. Results suggest that the proposed method is superior to the evaluated baselines on boundary detection, segment purity, and over-segmentation. Code is available at this https URL.

URL

https://arxiv.org/abs/2501.03711

PDF

https://arxiv.org/pdf/2501.03711.pdf

