Journal of Signal Processing (信号处理), 2025, Vol. 41, Issue 9: 1537-1546 (10 pages). DOI: 10.12466/xhcl.2025.09.007
VALL-E R:利用单调对齐策略的鲁棒且高效零样本语音合成
VALL-E R: Robust and Efficient Zero-Shot Text-to-Speech Synthesis via Monotonic Alignment
Abstract
With the aid of discrete neural audio codecs, large language models have emerged as a promising approach to zero-shot text-to-speech (TTS) synthesis. However, sampling-based decoding strategies, while offering high diversity, often suffer from robustness issues such as typos, omissions, and repetitions. To address these challenges, we propose VALL-E R, a robust and efficient zero-shot TTS system built upon the VALL-E framework. Specifically, we introduce a monotonic phoneme alignment strategy that reinforces the correspondence between phonemes and acoustic sequences by constraining acoustic tokens to their associated phonemes. Additionally, we propose a codec-merging technique that downsamples the discrete codes in the shallow quantization layer, significantly accelerating decoding without compromising speech quality. These enhancements give VALL-E R improved phoneme-level controllability and robustness, with word error rates (WER) close to those of the ground truth. Furthermore, the model requires fewer autoregressive steps, reducing inference time by more than 60%.
Keywords
zero-shot text-to-speech / monotonic alignment / codec merging / robustness / efficiency
Classification
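Information Technology and Security Science
Cite this article
HAN Bing (韩冰), QIAN Yanmin (钱彦旻). VALL-E R: Robust and Efficient Zero-Shot Text-to-Speech Synthesis via Monotonic Alignment [J]. Journal of Signal Processing (信号处理), 2025, 41(9): 1537-1546.
Funding
National Natural Science Foundation of China (62122050, 62071288)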