A Lightweight End-to-End Speech Synthesis System with Pitch Prediction

Liang Ting, Askar Hamdulla, Liu Huang, Xu Ying

Journal of Nanjing Normal University (Engineering and Technology Edition), 2023, Vol. 23, Issue 4: 37-42, 6. DOI: 10.3969/j.issn.1672-1292.2023.04.005

A Lightweight End-to-End Speech Synthesis System with Pitch Prediction

Liang Ting ¹, Askar Hamdulla ¹, Liu Huang ², Xu Ying ²

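Author information

1. School of Information Science and Engineering, Xinjiang University, Urumqi, Xinjiang 830046
2. Shanghai Gezi Interactive Information Technology Co., Ltd., Shanghai 200000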

Abstract

This paper proposes a lightweight end-to-end speech synthesis model with pitch prediction. The model is based on VITS, an end-to-end speech generation model that combines a VAE-based posterior encoder, a prior encoder based on normalizing flows, and an adversarial decoder. Three improvements are made so that the synthesized speech is more rhythmical and more stable and is generated more efficiently. First, to improve pronunciation accuracy and the naturalness of speech, a length regulator and a frame prior network are introduced to obtain the frame-level mean and variance of the acoustic features, modeling the rich acoustic variation in speech, and a phone predictor with CTC loss is introduced to improve pronunciation stability. Second, the ground-truth phoneme durations are used to align text with frames in the model, and an F0 predictor is added to enhance the rhythm of the speech. Third, the decoder of the original VITS model is replaced with multi-band generation and the inverse short-time Fourier transform, which effectively improves the inference speed of the model. Experiments show that, compared with the original VITS, the proposed model improves naturalness and expressiveness by 5% in terms of MOS (mean opinion score) and improves inference speed by 3 times in terms of RTF (real-time factor).
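
As an illustration of the length regulator mentioned in the abstract, the following is a minimal sketch of the usual idea: each phoneme-level vector is repeated for as many frames as its duration, so that text-side features line up with acoustic frames. The function name `length_regulate`, the plain-list representation, and the toy values are assumptions made for this example; the abstract does not describe the authors' implementation, which would normally operate on batched tensors.

```python
from typing import List, Sequence


def length_regulate(phoneme_states: Sequence[Sequence[float]],
                    durations: Sequence[int]) -> List[List[float]]:
    """Expand phoneme-level vectors to frame level.

    phoneme_states: one feature vector per phoneme (hypothetical toy data).
    durations: number of acoustic frames for each phoneme, e.g. the
        ground-truth durations used for text-to-frame alignment.
    """
    if len(phoneme_states) != len(durations):
        raise ValueError("need exactly one duration per phoneme")

    frames: List[List[float]] = []
    for state, n_frames in zip(phoneme_states, durations):
        if n_frames < 0:
            raise ValueError("durations must be non-negative")
        # Repeat the phoneme vector once per frame; copy it so that later
        # per-frame changes do not alias the same list object.
        frames.extend(list(state) for _ in range(n_frames))
    return frames


# Three phonemes with 2-dimensional states; the middle one lasts 0 frames.
states = [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]]
durations = [2, 0, 3]
frames = length_regulate(states, durations)
print(len(frames))  # 5
```

The output length equals the sum of the durations, which is why frame-level quantities such as the mean and variance of a frame prior can be predicted after this step.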

Key words

end-to-end speech synthesis/prosodic prediction/ISTFT/VAE/flow/sub-band

Classification

Information Technology and Security Science

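Cite this article

Liang Ting, Askar Hamdulla, Liu Huang, Xu Ying. A lightweight end-to-end speech synthesis system with pitch prediction[J]. Journal of Nanjing Normal University (Engineering and Technology Edition), 2023, 23(4): 37-42, 6.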

Journal of Nanjing Normal University (Engineering and Technology Edition)

ISSN: 1672-1292
