Abstract
We propose Hierarchical Audio Codec (HAC), a unified neural speech codec that factorizes its bottleneck into three linguistic levels (acoustic, phonetic, and lexical) within a single model. HAC leverages two knowledge distillation objectives: one from a pre-trained speech encoder (HuBERT) for phoneme-level structure, and another from a text-based encoder (LaBSE) for lexical cues. Experiments on English and multilingual data show that HAC's factorized bottleneck yields disentangled token sets: one aligns with phonemes, while another captures word-level semantics. Quantitative evaluations confirm that HAC tokens preserve naturalness and provide interpretable linguistic information, outperforming single-level baselines in both disentanglement and reconstruction quality. These findings underscore HAC's potential as a unified discrete speech representation that bridges acoustic detail and lexical meaning for downstream speech generation and understanding tasks.
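To make the described factorization concrete, below is a minimal PyTorch sketch of a three-level quantized bottleneck trained with the two distillation objectives the abstract mentions (HuBERT frame features for the phonetic level, a LaBSE sentence embedding for the lexical level). All module names, codebook sizes, the residual factorization, and the exact loss forms here are illustrative assumptions, not the authors' implementation.

```python
# Hypothetical sketch of a three-level factorized bottleneck with
# knowledge-distillation losses (illustrative only, not the paper's code).
import torch
import torch.nn as nn
import torch.nn.functional as F

class SimpleVQ(nn.Module):
    """Nearest-neighbour vector quantizer with a straight-through estimator."""
    def __init__(self, codebook_size: int, dim: int):
        super().__init__()
        self.codebook = nn.Embedding(codebook_size, dim)

    def forward(self, z):                                   # z: (B, T, dim)
        d = torch.cdist(z, self.codebook.weight.unsqueeze(0))  # (B, T, K)
        idx = d.argmin(dim=-1)                               # discrete tokens
        q = self.codebook(idx)
        q = z + (q - z).detach()                             # straight-through gradient
        return q, idx

class HACBottleneckSketch(nn.Module):
    """Acoustic / phonetic / lexical quantizers applied to one encoder output."""
    def __init__(self, dim=256, hubert_dim=768, labse_dim=768):
        super().__init__()
        self.vq_acoustic = SimpleVQ(1024, dim)
        self.vq_phonetic = SimpleVQ(512, dim)
        self.vq_lexical  = SimpleVQ(512, dim)
        # Projections used only to compare against the distillation targets.
        self.to_hubert = nn.Linear(dim, hubert_dim)
        self.to_labse  = nn.Linear(dim, labse_dim)

    def forward(self, enc, hubert_feats, labse_emb):
        # enc:          (B, T, dim)        encoder features
        # hubert_feats: (B, T, hubert_dim) frame-level HuBERT targets (assumed aligned)
        # labse_emb:    (B, labse_dim)     utterance-level LaBSE target
        q_ac, _ = self.vq_acoustic(enc)
        q_ph, _ = self.vq_phonetic(enc - q_ac.detach())          # residual factorization
        q_lx, _ = self.vq_lexical(enc - (q_ac + q_ph).detach())

        # Phoneme-level distillation: match HuBERT frame features.
        loss_phonetic = F.mse_loss(self.to_hubert(q_ph), hubert_feats)
        # Lexical distillation: match a pooled LaBSE sentence embedding.
        pooled = self.to_labse(q_lx).mean(dim=1)
        loss_lexical = 1 - F.cosine_similarity(pooled, labse_emb, dim=-1).mean()

        z = q_ac + q_ph + q_lx          # combined bottleneck fed to the decoder
        return z, loss_phonetic, loss_lexical
```

The residual subtraction between quantizers is one plausible way to push each level toward complementary information; the paper's actual factorization, loss weighting, and frame-to-word alignment for the lexical target may differ.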
URL
https://arxiv.org/abs/2506.15456