Nana-HDR: A Non-attentive Non-autoregressive Hybrid Model for TTS

2021-09-28 12:45:14

Shilun Lin, Wenchao Su, Li Meng, Fenglong Xie, Xinhui Li, Li Lu

arXiv_CL

Abstract

This paper presents Nana-HDR, a new non-attentive, non-autoregressive TTS model that combines a Transformer-based Dense-fuse encoder with an RNN-based decoder. It consists of three main parts: first, a novel Dense-fuse encoder that uses dense connections between basic Transformer blocks for coarse feature fusion and a multi-head attention layer for fine feature fusion; second, a single-layer non-autoregressive RNN-based decoder; third, a duration predictor, used in place of an attention model, that connects the hybrid encoder and decoder. Experiments indicate that Nana-HDR exploits the advantages of each component: the strong text-encoding ability of the Transformer-based encoder, stateful decoding free of exposure bias and local information preference, and the stable alignment provided by the duration predictor. Owing to these advantages, Nana-HDR achieves competitive naturalness and robustness on two Mandarin corpora.
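
The abstract does not spell out how the duration predictor bridges the encoder and decoder. A common approach in duration-based TTS, assumed here purely for illustration, is to repeat each token-level encoder state for its predicted number of output frames, so that the decoder receives a frame-level sequence without any attention alignment. The function name `length_regulate`, the toy states, and the durations below are hypothetical and not taken from the paper.

```python
from typing import List, Sequence


def length_regulate(
    encoder_states: Sequence[Sequence[float]],
    durations: Sequence[int],
) -> List[List[float]]:
    """Expand token-level states to frame level by repeating each state.

    Assumption: durations[i] is the (already rounded) number of acoustic
    frames predicted for token i. This is a generic sketch, not the
    paper's implementation.
    """
    if len(encoder_states) != len(durations):
        raise ValueError("one duration is required per encoder state")
    frames: List[List[float]] = []
    for state, n_frames in zip(encoder_states, durations):
        if n_frames < 0:
            raise ValueError("durations must be non-negative")
        # Copy the state once per predicted frame; a duration of 0 skips the token.
        for _ in range(n_frames):
            frames.append(list(state))
    return frames


# Toy example: three tokens with 2-dimensional hypothetical encoder states.
states = [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]]
durations = [2, 0, 3]
frames = length_regulate(states, durations)
```

Because the output length is fixed by the sum of the durations, all frames can be prepared up front, which is what allows a non-autoregressive decoder to process the whole expanded sequence in one pass.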

URL

https://arxiv.org/abs/2109.13673

PDF

https://arxiv.org/pdf/2109.13673.pdf