Paper Reading AI Learner

Neural Target Speech Extraction: An Overview

2023-01-31 00:26:52
Katerina Zmolikova, Marc Delcroix, Tsubasa Ochiai, Keisuke Kinoshita, Jan Černocký, Dong Yu

Abstract

Humans can listen to a target speaker even in challenging acoustic conditions containing noise, reverberation, and interfering speakers. This phenomenon is known as the cocktail-party effect. For decades, researchers have worked to approach the listening ability of humans. One critical issue is handling interfering speakers: because the target and non-target speech signals share similar characteristics, discriminating between them is difficult. Target speech/speaker extraction (TSE) isolates the speech signal of a target speaker from a mixture of several speakers, with or without noise and reverberation, using clues that identify the speaker in the mixture. Such clues might be a spatial clue indicating the direction of the target speaker, a video of the speaker's lips, or a pre-recorded enrollment utterance from which their voice characteristics can be derived. TSE is an emerging field of research that has received increased attention in recent years because it offers a practical approach to the cocktail-party problem and spans several areas of signal processing and machine learning, such as audio and visual processing, array processing, and deep learning. This paper focuses on recent neural-based approaches and presents an in-depth overview of TSE. We guide readers through the different major approaches, emphasizing the similarities among frameworks and discussing potential future directions.
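As a rough illustration of the enrollment-clue conditioning described above, the toy sketch below uses a pre-recorded "enrollment" utterance from the target speaker to derive an embedding, then applies an embedding-conditioned mask to a mixture. Everything here is hypothetical: the feature vectors, the averaging-based "embedding", and the cosine-similarity "mask estimator" are stand-ins for the learned neural components that actual TSE systems train end-to-end.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy feature frames: each frame is a noisy copy of one speaker's
# (hypothetical) spectral signature.
def toy_frames(speaker_mean, n=50):
    return rng.normal(speaker_mean, 0.3, size=(n, speaker_mean.shape[0]))

spk_a = rng.normal(0.0, 1.0, 16)   # target speaker's signature (made up)
spk_b = rng.normal(0.0, 1.0, 16)   # interfering speaker's signature (made up)

# "Mixture": first 50 frames dominated by the target, last 50 by the interferer.
mixture = np.vstack([toy_frames(spk_a), toy_frames(spk_b)])

# Enrollment clue: embedding derived from a separate utterance of the target
# speaker (here, simply the mean of its frames).
enrollment_embedding = toy_frames(spk_a).mean(axis=0)

# Embedding-conditioned "mask estimator": per-frame cosine similarity to the
# enrollment embedding, squashed to (0, 1). A neural TSE model learns this
# mapping; a fixed similarity is used here only to show the conditioning idea.
def extraction_mask(frames, embedding):
    sims = frames @ embedding / (
        np.linalg.norm(frames, axis=1) * np.linalg.norm(embedding) + 1e-8)
    return 1.0 / (1.0 + np.exp(-10.0 * sims))  # sigmoid sharpening

mask = extraction_mask(mixture, enrollment_embedding)

# Frames belonging to the target get masks near 1, interferer frames lower;
# the masked mixture approximates the extracted target signal.
extracted = mask[:, None] * mixture
```

Note that, unlike blind source separation, this pipeline outputs only the one source selected by the clue, which is exactly what distinguishes TSE from general speech separation.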


URL

https://arxiv.org/abs/2301.13341

PDF

https://arxiv.org/pdf/2301.13341.pdf

