Abstract
We propose TRAMBA, a hybrid transformer and Mamba architecture for acoustic and bone conduction speech enhancement, suitable for mobile and wearable platforms. Bone conduction speech enhancement has been impractical to adopt on mobile and wearable platforms for two reasons: (i) data collection is labor-intensive, resulting in data scarcity; (ii) there is a performance gap between state-of-the-art models, whose memory footprints reach hundreds of MBs, and methods better suited to resource-constrained systems. To adapt TRAMBA to vibration-based sensing modalities, we pre-train it on widely available audio speech datasets. Users then fine-tune it with a small amount of bone conduction data. TRAMBA outperforms state-of-the-art GANs by up to 7.3% in PESQ and 1.8% in STOI, with an order of magnitude smaller memory footprint and an inference speedup of up to 465 times. We integrate TRAMBA into real systems and show that it (i) improves the battery life of wearables by up to 160% by requiring less data sampling and transmission; (ii) generates higher-quality voice in noisy environments than over-the-air speech; (iii) requires a memory footprint of less than 20.0 MB.
URL
https://arxiv.org/abs/2405.01242