
Cross-domain Neural Pitch and Periodicity Estimation

2023-01-28 17:30:47
Max Morrison, Caedon Hsieh, Nathan Pruyne, Bryan Pardo

Abstract

Pitch is a foundational aspect of our perception of audio signals. Pitch contours are commonly used to analyze speech and music signals and as input features for many audio tasks, including music transcription, singing voice synthesis, and prosody editing. In this paper, we describe a set of techniques for improving the accuracy of state-of-the-art neural pitch and periodicity estimators. We also introduce a novel entropy-based method for extracting periodicity and per-frame voiced-unvoiced classifications from statistical inference-based pitch estimators (e.g., neural networks), and show how to train a neural pitch estimator to handle speech and music simultaneously without performance degradation. While neural pitch trackers have historically been significantly slower than signal-processing-based pitch trackers, our estimator implementations approach the speed of state-of-the-art DSP-based pitch estimators on a standard CPU while estimating pitch and periodicity significantly more accurately. Our experiments show that an accurate, cross-domain pitch and periodicity estimator written in PyTorch with a hopsize of ten milliseconds can run 11.2x faster than real-time on an Intel i9-9820X 10-core 3.30 GHz CPU or 408x faster than real-time on an NVIDIA GeForce RTX 3090 GPU without hardware optimization. We release all of our code and models as Pitch-Estimating Neural Networks (penn), an open-source, pip-installable Python module for training, evaluating, and performing inference with pitch- and periodicity-estimating neural networks. The code for penn is available at this https URL.
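
The entropy-based idea in the abstract can be illustrated with a small sketch. The abstract does not give the formula, so this assumes one plausible reading: a neural pitch estimator outputs a categorical distribution over pitch bins for each frame, and periodicity is taken as one minus the entropy of that distribution, normalized by its maximum possible value. The function names and the 0.5 voicing threshold are hypothetical and are not taken from penn.

```python
import math


def softmax(logits):
    """Convert per-bin logits for one frame into a probability distribution."""
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def entropy_periodicity(probs):
    """Assumed formula: 1 - H(p) / log(number of bins).

    A sharply peaked distribution (confident pitch) gives a value near 1.
    A flat distribution (no clear pitch) gives a value near 0.
    """
    bins = len(probs)
    if bins < 2:
        raise ValueError("need at least two pitch bins")
    entropy = -sum(p * math.log(p) for p in probs if p > 0.0)
    return 1.0 - entropy / math.log(bins)


def voiced_frames(frame_logits, threshold=0.5):
    """Per-frame voiced/unvoiced decision; the threshold is a hypothetical value."""
    return [
        entropy_periodicity(softmax(logits)) >= threshold
        for logits in frame_logits
    ]


if __name__ == "__main__":
    frames = [
        [0.0, 0.0, 0.0, 0.0],   # flat: no bin is preferred
        [20.0, 0.0, 0.0, 0.0],  # peaked: one bin dominates
    ]
    print(voiced_frames(frames))
```

In practice the threshold would be tuned on held-out data, since the entropy of a frame depends on the number of pitch bins and on how the network was trained.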


URL

https://arxiv.org/abs/2301.12258

PDF

https://arxiv.org/pdf/2301.12258.pdf

