Paper Reading AI Learner

F5-TTS-RO: Extending F5-TTS to Romanian TTS via Lightweight Input Adaptation

2025-12-13 11:41:54
Radu-Gabriel Chivereanu, Tiberiu Boros

Abstract

This work introduces a lightweight input-level adapter for the F5-TTS model that enables Romanian language support. To preserve the existing capabilities of the model (voice cloning, English and Chinese support), we keep the original weights frozen, append a sub-network to the model, and train it as an extension of the textual embedding matrix of the text encoder. For simplicity, we rely on the ConvNeXt module implemented in F5-TTS to also model the co-dependencies between the new character-level embeddings. The module serves as a "soft" letter-to-sound layer, converting Romanian text into a continuous representation that the F5-TTS model uses to produce natural-sounding Romanian utterances. We evaluate the model with a pool of 20 human listeners across three tasks: (a) audio similarity between reference and generated speech, (b) pronunciation and naturalness, and (c) Romanian-English code-switching. The results indicate that our approach maintains voice cloning capabilities and enables, to a certain extent, code-switching within the same utterance; however, residual English accent characteristics remain. We open-source our code and provide example audio samples at this https URL.
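The frozen-weights-plus-extension idea in the abstract can be illustrated with a minimal pure-Python sketch. It is not the authors' implementation: the vocabulary, the choice of new characters, the embedding size, and the names `ExtendedEmbedding`, `lookup`, and `sgd_step` are all hypothetical, and the ConvNeXt module that the paper applies to the new embeddings is omitted. The sketch only shows how rows for new characters can be trained while the pretrained rows stay untouched.

```python
import random

# Hypothetical stand-in for the pretrained text vocabulary (assumption).
BASE_VOCAB = list("abcdefghijklmnopqrstuvwxyz ")
# Romanian letters assumed to be missing from the base vocabulary:
# a-breve, a-circumflex, i-circumflex, s-comma, t-comma.
NEW_CHARS = list("\u0103\u00e2\u00ee\u0219\u021b")
DIM = 4  # toy embedding size, chosen only for illustration


class ExtendedEmbedding:
    """Frozen base embedding table plus a trainable extension table."""

    def __init__(self, base_vocab, new_chars, dim, seed=0):
        rng = random.Random(seed)
        self.base_index = {ch: i for i, ch in enumerate(base_vocab)}
        self.new_index = {ch: i for i, ch in enumerate(new_chars)}
        # Base rows are stored as tuples, so they cannot be modified in place.
        self.base_table = tuple(
            tuple(rng.uniform(-1.0, 1.0) for _ in range(dim)) for _ in base_vocab
        )
        # Extension rows start at zero and are the only trainable parameters.
        self.new_table = [[0.0] * dim for _ in new_chars]

    def lookup(self, text):
        """Return one embedding vector (a list of floats) per character."""
        vectors = []
        for ch in text:
            if ch in self.new_index:
                vectors.append(list(self.new_table[self.new_index[ch]]))
            else:
                # Raises KeyError for characters in neither vocabulary.
                vectors.append(list(self.base_table[self.base_index[ch]]))
        return vectors

    def sgd_step(self, text, grads, lr=0.1):
        """Apply per-character gradients, updating extension rows only."""
        for ch, grad in zip(text, grads):
            if ch in self.new_index:
                row = self.new_table[self.new_index[ch]]
                for k in range(len(row)):
                    row[k] -= lr * grad[k]
            # Gradients for base characters are dropped: those rows are frozen.
```

Calling `sgd_step` with a gradient for every character of a Romanian word changes only the rows of the new letters, so the vectors that the pretrained model already relies on for English text are left exactly as they were.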

Abstract (translated)

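This work introduces a lightweight input-level adapter for the F5-TTS model that adds Romanian language support. To preserve the model's existing capabilities (such as voice cloning and English and Chinese support), we freeze the original weights, add a sub-network to the model, and train it as an extension of the text encoder's text embedding matrix. For simplicity, we rely on the ConvNeXt module implemented in F5-TTS to model the interdependencies between the new character-level embeddings. The module acts as a "soft" letter-to-sound layer, converting Romanian text into a continuous representation that the F5-TTS model uses to generate natural-sounding Romanian speech. We tested the model on three tasks with an evaluation panel of 20 human listeners: (a) similarity between reference and generated audio; (b) pronunciation and naturalness; (c) Romanian-English code-switching. The results show that our approach preserves voice cloning ability while also supporting, to some extent, language switching within the same sentence (that is, between Romanian and English), although some English accent characteristics remain. We open-source our code and provide example audio samples at [this link](https://example.com).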

URL

https://arxiv.org/abs/2512.12297

PDF

https://arxiv.org/pdf/2512.12297.pdf

