Abstract
This work introduces a lightweight input-level adapter for the F5-TTS model that enables Romanian language support. To preserve the model's existing capabilities (voice cloning, English and Chinese support), we keep the original weights frozen, append a sub-network to the model, and train it as an extension of the text encoder's textual embedding matrix. For simplicity, we rely on the ConvNeXt module implemented in F5-TTS to also model the co-dependencies between the new character-level embeddings. The module serves as a "soft" letter-to-sound layer, converting Romanian text into a continuous representation that the F5-TTS model uses to produce natural-sounding Romanian utterances. We evaluate the model with a pool of 20 human listeners across three tasks: (a) audio similarity between reference and generated speech, (b) pronunciation and naturalness, and (c) Romanian-English code-switching. The results indicate that our approach maintains voice cloning capabilities and enables, to a certain extent, code-switching within the same utterance; however, residual English accent characteristics remain. We open-source our code and provide example audio samples at this https URL.
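The frozen-weights idea described above can be illustrated with a minimal, dependency-free sketch. It is not the authors' implementation: the vocabulary, the embedding width, the choice of which Romanian letters count as "new", and every name in it (`ExtendedEmbedding`, `sgd_step`, and so on) are hypothetical, and the ConvNeXt stage that models co-dependencies between embeddings is omitted. It only shows how lookups might be routed between a frozen base table and a small trainable extension, with updates touching the extension alone.

```python
import random

# Hypothetical toy sizes; the real F5-TTS vocabulary and embedding width differ.
EMBED_DIM = 4
BASE_VOCAB = list("abcdefghijklmnopqrstuvwxyz ")  # stand-in for the pretrained vocabulary
# Romanian letters assumed (for illustration only) to be absent from the base vocabulary:
# a-breve, a-circumflex, i-circumflex, s-comma, t-comma.
NEW_CHARS = list("\u0103\u00e2\u00ee\u0219\u021b")


class ExtendedEmbedding:
    """Frozen base table plus a small trainable table for newly added characters."""

    def __init__(self, base_vocab, new_chars, dim, seed=0):
        rng = random.Random(seed)
        self.dim = dim
        self.base_index = {ch: i for i, ch in enumerate(base_vocab)}
        self.new_index = {ch: i for i, ch in enumerate(new_chars)}
        # Tuples are immutable, mimicking frozen pretrained weights.
        self.base_rows = tuple(
            tuple(rng.uniform(-1, 1) for _ in range(dim)) for _ in base_vocab
        )
        # Lists are mutable: these are the only parameters training may change.
        self.new_rows = [[0.0] * dim for _ in new_chars]

    def lookup(self, text):
        """Return one embedding vector per character, routed to the right table."""
        out = []
        for ch in text:
            if ch in self.new_index:
                out.append(list(self.new_rows[self.new_index[ch]]))
            else:
                out.append(list(self.base_rows[self.base_index[ch]]))
        return out

    def sgd_step(self, text, grads, lr=0.1):
        """Apply gradients to adapter rows only; base rows are skipped."""
        for ch, grad in zip(text, grads):
            if ch in self.new_index:
                row = self.new_rows[self.new_index[ch]]
                for k in range(self.dim):
                    row[k] -= lr * grad[k]


emb = ExtendedEmbedding(BASE_VOCAB, NEW_CHARS, EMBED_DIM)
text = "\u021bar\u0103"  # the Romanian word for "country"
before_base = emb.lookup("r")[0]
emb.sgd_step(text, grads=[[1.0] * EMBED_DIM for _ in text])
after_base = emb.lookup("r")[0]
```

After the update step, the vector for `r` is unchanged while the rows for the two new letters have moved, which is the property that lets the pretrained behaviour survive while the extension learns.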
URL
https://arxiv.org/abs/2512.12297