
Multi-level Disentangled Personalized Speech Synthesis for Out-of-Domain Speakers Adaptation Scenarios

GAO Shengxiang, YANG Yuanzhang, WANG Linqin, MO Shangbin, YU Zhengtao, DONG Ling

Journal of Guangxi Normal University (Natural Science Edition), 2024, Vol. 42, Issue 4: 11-21, 11. DOI: 10.16088/j.issn.1001-6600.2023111303

Multi-level Disentangled Personalized Speech Synthesis for Out-of-Domain Speakers Adaptation Scenarios

GAO Shengxiang¹, YANG Yuanzhang², WANG Linqin², MO Shangbin², YU Zhengtao¹, DONG Ling¹

Author information

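  • 1. Faculty of Information Engineering and Automation, Kunming University of Science and Technology, Kunming, Yunnan 650500; Yunnan Key Laboratory of Artificial Intelligence (Kunming University of Science and Technology), Kunming, Yunnan 650500; Yunnan Key Laboratory of Media Convergence (Yunnan Daily Press Group), Kunming, Yunnan 650228
  • 2. Faculty of Information Engineering and Automation, Kunming University of Science and Technology, Kunming, Yunnan 650500; Yunnan Key Laboratory of Artificial Intelligence (Kunming University of Science and Technology), Kunming, Yunnan 650500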

Abstract

Personalized speech synthesis aims to generate speech with the characteristics of a specific speaker. Traditional approaches often show noticeable timbre discrepancies when synthesizing speech for unseen speakers, and they struggle to disentangle speaker-specific timbre features. This paper proposes a multi-level disentangled personalized speech synthesis approach for out-of-domain speakers. By fusing features at different granularities, the method improves synthesis performance for unseen speakers under zero-resource conditions. At the sentence level, fast Fourier convolution extracts global speaker features, which improves generalization to unseen speakers and enables sentence-level speaker disentanglement. At the phoneme level, a speech recognition model is used to decouple speaker features, and an attention mechanism captures phoneme-level timbre features, achieving phoneme-level speaker disentanglement. Experiments on the public AISHELL3 dataset show that, in cross-speaker adaptation, the proposed approach reaches a cosine similarity of 0.697 between speaker feature vectors, a 6.25% improvement over the baseline model. This result indicates that the method models the timbre of unseen speakers more effectively in cross-speaker adaptation scenarios.
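The evaluation metric named in the abstract, cosine similarity between speaker feature vectors, can be illustrated with a minimal sketch. The abstract does not say which speaker encoder produces the vectors or how scores are aggregated, so the function names `cosine_similarity` and `mean_speaker_similarity`, the averaging step, and the toy vectors below are assumptions for illustration only, not the authors' evaluation code.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length embedding vectors."""
    if len(u) != len(v):
        raise ValueError("embeddings must have the same dimension")
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_u * norm_v)


def mean_speaker_similarity(pairs):
    """Average cosine similarity over (reference, synthesized) pairs.

    Averaging over pairs is an assumption; the abstract reports a single
    score without describing the aggregation.
    """
    scores = [cosine_similarity(ref, syn) for ref, syn in pairs]
    return sum(scores) / len(scores)


# Toy 3-dimensional vectors standing in for real speaker embeddings.
reference = [1.0, 0.0, 0.0]
synthesized = [1.0, 1.0, 0.0]
print(round(cosine_similarity(reference, synthesized), 3))  # prints 0.707
```

A score of 1.0 means the two embeddings point in the same direction, so a higher value suggests that the synthesized speech is closer in timbre to the reference speaker, as judged by whichever encoder produced the vectors.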

Key words

speech synthesis/zero-shot/speaker representation/out-of-domain speaker/feature disentanglement

Classification

Information Technology and Security Science

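Cite this article

GAO Shengxiang, YANG Yuanzhang, WANG Linqin, MO Shangbin, YU Zhengtao, DONG Ling. Multi-level Disentangled Personalized Speech Synthesis for Out-of-Domain Speakers Adaptation Scenarios [J]. Journal of Guangxi Normal University (Natural Science Edition), 2024, 42(4): 11-21, 11.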

Funding

National Natural Science Foundation of China (62376111, U23A20388, 61972186, U21B2027)

Yunnan High-Tech Industry Development Project (201606)

Yunnan Fundamental Research Program (202001AS070014)

Yunnan Science and Technology Talent and Platform Program (202105AC160018)

Open Project of the Yunnan Key Laboratory of Media Convergence (220225702)

Yunnan Key Research and Development Program (202303AP140008, 202103AA080015)

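Journal of Guangxi Normal University (Natural Science Edition)

Open access; indexed in the Peking University Core Journals list and CSTPCD

ISSN 1001-6600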
