Cross-domain Neural Pitch and Periodicity Estimation

Abstract

Pitch is a foundational aspect of our perception of audio signals. Pitch contours are commonly used to analyze speech and music signals and as input features for many audio tasks, including music transcription, singing voice synthesis, and prosody editing. In this paper, we describe a set of techniques for improving the accuracy of state-of-the-art neural pitch and periodicity estimators. We also introduce a novel entropy-based method for extracting periodicity and per-frame voiced-unvoiced classifications from statistical inference-based pitch estimators (e.g., neural networks), and show how to train a neural pitch estimator to simultaneously handle speech and music without performance degradation. While neural pitch trackers have historically been significantly slower than signal-processing-based pitch trackers, our estimator implementations approach the speed of state-of-the-art DSP-based pitch estimators on a standard CPU, but with significantly more accurate pitch and periodicity estimation. Our experiments show that an accurate, cross-domain pitch and periodicity estimator written in PyTorch with a hop size of ten milliseconds can run 11.2x faster than real-time on an Intel i9-9820X 10-core 3.30 GHz CPU or 408x faster than real-time on an NVIDIA GeForce RTX 3090 GPU without hardware optimization. We release all of our code and models as Pitch-Estimating Neural Networks (penn), an open-source, pip-installable Python module for training, evaluating, and performing inference with pitch- and periodicity-estimating neural networks. The code for penn is available at this https URL.
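The abstract does not spell out the entropy-based periodicity method, so the following is only a minimal sketch of one plausible reading: treat the estimator's per-frame output as a categorical distribution over pitch bins, and map low entropy (a confident, peaked distribution) to high periodicity. The normalization by log(N), the function names, and the 0.5 voicing threshold are all assumptions made for illustration and are not taken from penn.

```python
import math


def softmax(logits):
    """Convert raw per-bin scores into a probability distribution."""
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def entropy_periodicity(logits):
    """Hypothetical periodicity score in [0, 1] for one frame.

    Assumption: periodicity = 1 - H(p) / log(N), where p is the softmax
    over N pitch bins. A uniform distribution gives about 0 and a
    one-hot distribution gives about 1.
    """
    probs = softmax(logits)
    entropy = -sum(p * math.log(p) for p in probs if p > 0.0)
    return 1.0 - entropy / math.log(len(probs))


def is_voiced(logits, threshold=0.5):
    """Per-frame voiced/unvoiced decision; the threshold is illustrative."""
    return entropy_periodicity(logits) > threshold


# A flat frame (no clear pitch) versus a sharply peaked frame.
flat_frame = [0.0] * 100
peaked_frame = [0.0] * 99 + [50.0]
print(entropy_periodicity(flat_frame), is_voiced(flat_frame))
print(entropy_periodicity(peaked_frame), is_voiced(peaked_frame))
```

In practice a threshold like this would be tuned on held-out data rather than fixed in advance.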

URL

https://arxiv.org/abs/2301.12258

PDF

https://arxiv.org/pdf/2301.12258.pdf