Autoregressive Zero-shot Speech Synthesis Based on Phoneme-level Prosody Modeling

Yue Huanjing¹, Wang Jiawei¹, Yang Jingyu¹

Author information

1. School of Electrical and Information Engineering, Tianjin University, Tianjin 300072, China

Journal of Hunan University (Natural Sciences), 2025, Vol. 52, Issue 4: 114-123, 10. DOI: 10.16339/j.cnki.hdxbzkb.2025271

Abstract

To improve the naturalness and robustness of synthesized prosody, an autoregressive speech synthesis model based on phoneme-level prosody modeling is proposed. The model enhances prosody modeling in two respects: inter-word pauses and phoneme durations. To improve the diversity and accuracy of inter-word pauses, a pause prediction module is introduced at the text frontend. It predicts multiple pause labels from the original text, providing accurate references for pause duration modeling during synthesis. To improve the naturalness of phoneme durations, a duration prediction module is introduced. It predicts a Gaussian mixture distribution for each phoneme and obtains diverse phoneme durations through random sampling. To stabilize phoneme duration modeling in the autoregressive model, an attention-based discrimination module is proposed. It is applied at each time step of the autoregressive process and avoids alignment disorder through attention and discrimination mechanisms. Experimental results demonstrate that the three proposed modules effectively enhance the naturalness and robustness of prosody modeling, thereby improving the quality of the synthesized speech.
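The duration sampling step described in the abstract can be illustrated with a minimal sketch. It assumes each phoneme already has Gaussian mixture parameters (weights, means, standard deviations) and that the mixture models log-duration in frames; that choice of domain, the function name `sample_duration`, and all numeric values below are hypothetical and are not taken from the paper, where a learned module would predict the parameters.

```python
import math
import random


def sample_duration(weights, means, stds, rng, min_frames=1):
    """Draw one duration (in frames) from a 1-D Gaussian mixture."""
    # Pick a mixture component in proportion to its weight.
    k = rng.choices(range(len(weights)), weights=weights, k=1)[0]
    # Sample from the chosen Gaussian (assumed here to model log-duration).
    log_d = rng.gauss(means[k], stds[k])
    # Map back to frames and clamp so every phoneme lasts at least min_frames.
    return max(min_frames, round(math.exp(log_d)))


# Hypothetical mixture parameters per phoneme: (weights, means, stds).
phoneme_params = {
    "HH": ([0.7, 0.3], [1.2, 1.8], [0.2, 0.3]),
    "AH": ([0.5, 0.5], [1.6, 2.3], [0.3, 0.2]),
    "L": ([0.6, 0.4], [1.4, 2.0], [0.2, 0.2]),
}

rng = random.Random(0)  # fixed seed so the sketch is reproducible
durations = {
    phoneme: sample_duration(w, m, s, rng)
    for phoneme, (w, m, s) in phoneme_params.items()
}
print(durations)
```

Because a component is chosen at random before the Gaussian draw, repeated calls for the same phoneme can yield clearly different durations, which is the kind of diversity the abstract attributes to random sampling.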

Key words

speech synthesis / prosody modeling / pause prediction

Classification

Information Technology and Security Science

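Cite this article

Yue Huanjing, Wang Jiawei, Yang Jingyu. Autoregressive zero-shot speech synthesis based on phoneme-level prosody modeling[J]. Journal of Hunan University (Natural Sciences), 2025, 52(4): 114-123, 10.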

Funding

National Natural Science Foundation of China (61672378)

Journal of Hunan University (Natural Sciences)

Open access; Peking University Core Journal

ISSN: 1674-2974
