Abstract
Low-latency models are critical for real-time speech enhancement applications, such as hearing aids and hearables. However, the sub-millisecond latency space for resource-constrained hearables remains underexplored. We demonstrate speech enhancement using a computationally efficient minimum-phase FIR filter, enabling sample-by-sample processing that achieves a mean algorithmic latency of 0.32 ms to 1.25 ms. With a single microphone, we observe a mean SI-SDRi of 4.1 dB. The approach generalizes to unseen audio recordings, with a DNSMOS increase of 0.2. We use a lightweight LSTM-based model of 644k parameters to generate the FIR taps. We benchmark our system on a low-power DSP, where it runs at 388 MIPS with a mean end-to-end latency of 3.35 ms. We also provide a comparison with baseline low-latency spectral masking techniques. We hope this work leads to a better understanding of latency and helps improve the comfort and usability of hearables.
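As a rough illustration of the pipeline the abstract describes, the sketch below converts a model-predicted magnitude response into minimum-phase FIR taps using the standard real-cepstrum (homomorphic) method and then applies the taps sample by sample. The tap count, FFT size, placeholder gain curve, and all function names are illustrative assumptions, not code or values from the paper.

```python
# Minimal sketch (assumptions noted above): minimum-phase FIR taps from a
# predicted magnitude response, applied with a sample-by-sample delay line.
import numpy as np

def min_phase_fir(magnitude: np.ndarray, n_taps: int) -> np.ndarray:
    """Convert a one-sided magnitude response (n_fft//2 + 1 bins) into
    minimum-phase FIR taps via the real-cepstrum method."""
    n_fft = 2 * (len(magnitude) - 1)
    # Build the full symmetric spectrum and floor it to avoid log(0).
    mag = np.maximum(np.concatenate([magnitude, magnitude[-2:0:-1]]), 1e-8)
    # Real cepstrum of the log-magnitude.
    cepstrum = np.fft.ifft(np.log(mag)).real
    # Fold the cepstrum: keep c[0] and c[N/2], double the causal part.
    window = np.zeros(n_fft)
    window[0] = 1.0
    window[1:n_fft // 2] = 2.0
    window[n_fft // 2] = 1.0
    min_phase_spec = np.exp(np.fft.fft(cepstrum * window))
    # Impulse response of the minimum-phase system, truncated to n_taps.
    return np.fft.ifft(min_phase_spec).real[:n_taps]

def process_stream(x: np.ndarray, taps: np.ndarray) -> np.ndarray:
    """Sample-by-sample FIR filtering: one output sample per input sample."""
    state = np.zeros(len(taps))
    y = np.empty_like(x)
    for n, sample in enumerate(x):
        state = np.roll(state, 1)   # shift the delay line
        state[0] = sample
        y[n] = np.dot(taps, state)  # y[n] = sum_k taps[k] * x[n-k]
    return y

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    noisy = rng.standard_normal(16000)      # 1 s of noise at 16 kHz (placeholder input)
    gains = np.linspace(1.0, 0.1, 257)      # stand-in for LSTM-predicted per-bin gains
    taps = min_phase_fir(gains, n_taps=32)  # a short filter keeps latency low
    enhanced = process_stream(noisy, taps)
```

In a real system, the per-bin gains would come from the enhancement model on each frame and the resulting short minimum-phase filter would be applied to the incoming samples, which is what allows the algorithmic latency to drop below the frame length of mask-based approaches.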
URL
https://arxiv.org/abs/2409.18239