Abstract
Supervised learning is a mainstream approach to audio signal enhancement (SE) and requires parallel training data consisting of noisy signals and the corresponding clean signals. Such data can only be synthesised and are thus mismatched with real data, which can result in poor performance. Moreover, clean signals are often difficult or even impossible to obtain, which makes the approach inapplicable in such cases. Here we explore SE using non-parallel training data consisting of noisy signal clips and noise clips, which can easily be recorded. We define the positive (P) and negative (N) classes as signal absence and presence, respectively. We observe that the spectrogram patches of noise clips can be used as P data and those of noisy signal clips as unlabelled data. These data enable a convolutional neural network to learn to classify each spectrogram patch as P or N, and thereby to perform SE, through learning from positive and unlabelled data.
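The abstract does not state which PU-learning objective is used; a common choice for training a classifier from positive and unlabelled data is the non-negative PU (nnPU) risk estimator of Kiryo et al. (2017). The sketch below is an assumption-laden illustration of that estimator, not the paper's confirmed method: `pi` (the assumed class prior of the P class, i.e. signal absence) and the sigmoid surrogate loss are choices made here for illustration; the classifier scores would come from the CNN applied to spectrogram patches.

```python
import numpy as np

def sigmoid_loss(z):
    # Surrogate loss l(z) = 1 / (1 + exp(z)); satisfies l(z) + l(-z) = 1.
    return 1.0 / (1.0 + np.exp(z))

def nnpu_risk(scores_p, scores_u, pi):
    """Non-negative PU risk (nnPU-style, an assumed objective).

    scores_p : classifier scores on P patches (here: patches from noise clips)
    scores_u : classifier scores on unlabelled patches (from noisy signal clips)
    pi       : assumed class prior P(y = P); a hyperparameter in this sketch
    """
    r_p_pos = np.mean(sigmoid_loss(scores_p))    # P data treated as positive
    r_p_neg = np.mean(sigmoid_loss(-scores_p))   # P data treated as negative
    r_u_neg = np.mean(sigmoid_loss(-scores_u))   # unlabelled data as negative
    r_neg = r_u_neg - pi * r_p_neg               # estimated negative-class risk
    # Clip the negative-risk estimate at zero to prevent it going negative
    # (the "non-negative" correction that stabilises training).
    return pi * r_p_pos + max(r_neg, 0.0)
```

With well-separated scores (large positive on P patches, large negative on unlabelled patches dominated by signal presence), the risk is near zero; the clipping guarantees it never drops below zero even when the unlabelled batch happens to contain mostly P-like patches.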
URL
https://arxiv.org/abs/2210.15143