VALL-E R: Robust and Efficient Zero-Shot Text-to-Speech Synthesis via Monotonic Alignment

Bing Han (韩冰), Yanmin Qian (钱彦旻)

Journal of Signal Processing, 2025, Vol. 41, Issue 9: 1537-1546 (10 pages). DOI: 10.12466/xhcl.2025.09.007

Bing Han¹, Yanmin Qian¹

Author information

1. School of Computer Science, Shanghai Jiao Tong University, Shanghai 200240

Abstract

With the aid of discrete neural audio codecs, large language models have emerged as a promising approach for zero-shot text-to-speech (TTS) synthesis. However, sampling-based decoding strategies, while offering high diversity, often suffer from robustness issues such as typos, omissions, and repetitions. To address these challenges, we propose VALL-E R, a robust and efficient zero-shot TTS system built upon the VALL-E framework. Specifically, we introduce a monotonic phoneme alignment strategy that reinforces the correspondence between phonemes and acoustic sequences by constraining acoustic tokens to their associated phonemes. Additionally, we propose a codec-merging technique to downsample discrete codes in the shallow quantization layer, significantly accelerating decoding without compromising speech quality. These enhancements grant VALL-E R improved phoneme-level controllability and robustness, achieving word error rates (WER) close to those of ground truth. Furthermore, the model reduces the number of autoregressive steps, cutting inference time by over 60%.
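
The monotonic alignment constraint described in the abstract can be illustrated with a toy decoding loop. This is only a minimal sketch under stated assumptions, not the authors' implementation: it assumes that each decoding step yields an acoustic token plus a decision to stay on the current phoneme or move to the next one. The names `monotonic_decode`, `StepFn`, and `toy_step` are hypothetical, and `toy_step` is a stand-in for a real model.

```python
from typing import Callable, List, Tuple

# Hypothetical step signature (an assumption, not taken from the paper):
# given the current phoneme and the tokens generated so far, return
# (next acoustic token, whether to advance to the next phoneme).
StepFn = Callable[[str, List[int]], Tuple[int, bool]]


def monotonic_decode(
    phonemes: List[str], step: StepFn, max_steps: int = 1000
) -> Tuple[List[int], List[int]]:
    """Generate tokens while a pointer moves left to right over the phonemes."""
    tokens: List[int] = []
    alignment: List[int] = []  # phoneme index attached to each generated token
    pointer = 0
    while pointer < len(phonemes) and len(tokens) < max_steps:
        token, advance = step(phonemes[pointer], tokens)
        tokens.append(token)
        alignment.append(pointer)
        if advance:
            pointer += 1  # the pointer may only stay or move forward by one
    return tokens, alignment


def toy_step(phoneme: str, history: List[int]) -> Tuple[int, bool]:
    # Toy stand-in for a model: emit a dummy token and advance on every
    # second call, so each phoneme receives exactly two tokens.
    return len(history), len(history) % 2 == 1


tokens, alignment = monotonic_decode(["HH", "AH", "L", "OW"], toy_step)
print(alignment)
```

Because the pointer never moves backward and never skips ahead, every token in this sketch is tied to exactly one phoneme, and no phoneme can be omitted or revisited. That is the intuition behind using such a constraint to reduce omissions and repetitions.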

Keywords

zero-shot text-to-speech / monotonic alignment / merge codec / robust / efficient

Classification

Information Technology and Security Science

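Cite this article

Han Bing, Qian Yanmin. VALL-E R: Robust and Efficient Zero-Shot Text-to-Speech Synthesis via Monotonic Alignment [J]. Journal of Signal Processing, 2025, 41(9): 1537-1546 (10 pages).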

Funding

The National Natural Science Foundation of China (62122050, 62071288)

Journal of Signal Processing

Open access; listed in the Peking University core journal index

ISSN: 1003-0530
