
Multi-source Feature Decoupling Voice Conversion Model Based on Diffusion Models

Zhang Yedong, Wen Shuangbing, Tan Hanzhong, Huang Haifeng, Hu Tao

Journal of Hubei Minzu University (Natural Science Edition), 2026, Vol. 44, Issue (1): 57-61, 5. DOI: 10.13501/j.cnki.42-1908/n.2026.03.002

Multi-source Feature Decoupling Voice Conversion Model Based on Diffusion Models

Zhang Yedong (张业东)¹, Wen Shuangbing (文双兵)², Tan Hanzhong (谭寒钟)¹, Huang Haifeng (黄海峰)¹, Hu Tao (胡涛)³

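Author information

1. School of Intelligent Science and Engineering, Hubei Minzu University, Enshi, Hubei 445000
2. School of Mathematics and Statistics, Hubei Minzu University, Enshi, Hubei 445000
3. School of Intelligent Science and Engineering, Hubei Minzu University, Enshi, Hubei 445000; Hubei Engineering Research Center of Selenium Food Nutrition and Health Intelligent Technology, Hubei Minzu University, Enshi, Hubei 445000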

Abstract

To address timbre leakage, loss of prosodic detail, and insufficient naturalness in generated speech, a multi-source feature decoupling voice conversion model based on diffusion models (MFD-VC) was proposed. First, speech was decomposed into subspaces corresponding to different attributes, each processed independently within the model, so that high-fidelity voice conversion was achieved through collaborative control of multiple attributes. Second, a style encoder (SE) was designed to extract speaker timbre features from reference speech, while a content feature encoder (CFE) and a wave network-based encoder (WN) processed content and fundamental frequency information, respectively. Finally, a one-dimensional multi-scale fusion (MSF1D) module was integrated into the U-Net diffusion model (UNetDiff) architecture to enhance multi-scale feature expression in the skip connections, allowing more detailed Mel-spectrogram representations to be generated. On the LibriTTS (LibriSpeech for text-to-speech) dataset, MFD-VC achieved an equal error rate of 12.2% and a similarity mean opinion score of 4.35. On the VCTK (Voice Cloning Toolkit) dataset, it achieved an equal error rate of 15.2% and a similarity mean opinion score of 4.24. MFD-VC delivered superior performance in terms of sound quality, similarity, and content clarity.
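The abstract reports equal error rates but does not say how they were computed. As a reminder of what the metric means in general, the sketch below computes an equal error rate from speaker-verification similarity scores. It is a minimal illustration, not the authors' evaluation pipeline: the function name `equal_error_rate` and the toy scores are hypothetical.

```python
def equal_error_rate(genuine, impostor):
    """Return (eer, threshold) for two lists of similarity scores.

    genuine:  scores for trials that should be accepted (same speaker)
    impostor: scores for trials that should be rejected (different speaker)
    Hypothetical helper for illustration; not taken from the paper.
    """
    best = None
    # Try every observed score as a decision threshold.
    for t in sorted(set(genuine) | set(impostor)):
        far = sum(s >= t for s in impostor) / len(impostor)  # false acceptance rate
        frr = sum(s < t for s in genuine) / len(genuine)     # false rejection rate
        gap = abs(far - frr)
        # Keep the threshold where the two error rates are closest.
        if best is None or gap < best[0]:
            best = (gap, (far + frr) / 2, t)
    return best[1], best[2]


# Toy scores, invented purely for this example.
eer, threshold = equal_error_rate([0.9, 0.8, 0.7, 0.4], [0.6, 0.3, 0.2, 0.1])
print(eer, threshold)  # prints "0.25 0.6"
```

With the toy scores, one genuine trial falls below the threshold 0.6 and one impostor trial reaches it, so both error rates equal 25%.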

Key words

voice conversion/MFD-VC/diffusion model/wave network/UNetDiff

Classification

Information Technology and Security Science

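Cite this article

Zhang Yedong, Wen Shuangbing, Tan Hanzhong, Huang Haifeng, Hu Tao. Multi-source Feature Decoupling Voice Conversion Model Based on Diffusion Models[J]. Journal of Hubei Minzu University (Natural Science Edition), 2026, 44(1): 57-61, 5.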

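Funding

Enshi Innovation and Development Joint Fund of the Natural Science Foundation of Hubei Province (2025AFD161, 2023AFD061)

Outstanding Young and Middle-aged Science and Technology Innovation Team Program of Higher Education Institutions of Hubei Province (T2023013)

Graduate Education Innovation Project of Hubei Minzu University (MYK2025068)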

Journal of Hubei Minzu University (Natural Science Edition)

ISSN: 2096-7594
