Journal of Hunan University (Natural Sciences), 2025, Vol. 52, Issue 4: 114-123 (10 pages). DOI: 10.16339/j.cnki.hdxbzkb.2025271
基于音素级韵律建模的自回归零样本语音合成
Autoregressive Zero-shot Speech Synthesis Based on Phoneme-level Prosody Modeling
Abstract
To improve the naturalness and robustness of synthesized prosody, an autoregressive speech synthesis model based on phoneme-level prosody modeling is proposed. The model strengthens prosody modeling in two respects: inter-word pauses and phoneme durations. To improve the diversity and accuracy of inter-word pauses, a pause prediction module is introduced at the text frontend; it predicts multiple pause labels from the original text, providing accurate references for pause duration modeling during synthesis. To make phoneme durations more natural, a duration prediction module predicts a Gaussian mixture distribution for each phoneme and obtains diverse durations through random sampling. To stabilize phoneme duration modeling in the autoregressive model, an attention-based discrimination module is applied at each time step of the autoregressive process, avoiding alignment disorder through attention and discrimination mechanisms. Experimental results demonstrate that the three proposed modules effectively improve the naturalness and robustness of prosody modeling, and thereby the quality of the synthesized speech.
Keywords
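speech synthesis / prosody modeling / pause prediction
Classification
Information Technology and Security Science
Cite this article
YUE Huanjing (岳焕景), WANG Jiawei (王嘉玮), YANG Jingyu (杨敬钰). Autoregressive Zero-shot Speech Synthesis Based on Phoneme-level Prosody Modeling [J]. Journal of Hunan University (Natural Sciences), 2025, 52(4): 114-123.
Funding
National Natural Science Foundation of China (61672378)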
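Illustrative note
The abstract states that the duration prediction module predicts a Gaussian mixture distribution for each phoneme and obtains durations by random sampling. The following is only a minimal sketch of that sampling step, not the authors' implementation. The function name `sample_durations`, the array layout (one row per phoneme, one column per mixture component), the frame-count unit, and the `min_frames` floor are all assumptions made for illustration; the paper's network that would produce the mixture parameters is not shown.

```python
import numpy as np


def sample_durations(weights, means, stds, rng, min_frames=1):
    """Draw one duration (in frames) per phoneme from its Gaussian mixture.

    Hypothetical layout: weights, means, stds all have shape
    (num_phonemes, num_components).
    """
    weights = np.asarray(weights, dtype=float)
    means = np.asarray(means, dtype=float)
    stds = np.asarray(stds, dtype=float)

    # Normalize so each phoneme's mixture weights sum to 1.
    weights = weights / weights.sum(axis=1, keepdims=True)
    num_phonemes, num_components = weights.shape

    # Step 1: pick one mixture component per phoneme according to its weights.
    comps = np.array([rng.choice(num_components, p=w) for w in weights])
    rows = np.arange(num_phonemes)

    # Step 2: sample from the Gaussian of the chosen component.
    raw = rng.normal(means[rows, comps], stds[rows, comps])

    # Assumption: durations are integer frame counts with a lower bound.
    return np.maximum(np.rint(raw), min_frames).astype(int)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    # Toy parameters for three phonemes with two components each (made up).
    w = [[0.7, 0.3], [0.5, 0.5], [0.2, 0.8]]
    mu = [[6.0, 14.0], [4.0, 9.0], [5.0, 20.0]]
    sigma = [[1.0, 2.0], [0.5, 1.5], [1.0, 3.0]]
    print(sample_durations(w, mu, sigma, rng))
```

Because each call draws a fresh component and a fresh Gaussian sample, repeated synthesis of the same text yields different duration sequences, which is the source of diversity the abstract refers to. Fixing the generator seed makes a run reproducible.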