Journal of Hubei Minzu University (Natural Science Edition), 2026, Vol. 44, Issue (1): 57-61, 5. DOI: 10.13501/j.cnki.42-1908/n.2026.03.002
基于扩散模型的多源特征解耦语音转换模型
Multi-source Feature Decoupling Voice Conversion Model Based on Diffusion Models
Abstract
To address timbre leakage, loss of prosodic detail, and insufficient naturalness in generated speech, a multi-source feature decoupling voice conversion model based on diffusion models (MFD-VC) is proposed. First, speech is decomposed into subspaces corresponding to distinct attributes, each processed independently within the model, so that high-fidelity voice conversion is achieved through collaborative multi-attribute control. Second, a style encoder (SE) is designed to extract speaker timbre features from reference speech, while a content feature encoder (CFE) and a WaveNet-based encoder (WN) process content and fundamental frequency information, respectively. Finally, a one-dimensional multi-scale fusion (MSF1D) module is integrated into the U-Net diffusion model (UNetDiff) architecture to enhance multi-scale feature expression in the skip connections, allowing more detailed Mel-spectrogram representations to be generated. The results show that on the LibriTTS dataset, MFD-VC achieves an equal error rate of 12.2% and a similarity mean opinion score of 4.35; on the Voice Cloning Toolkit (VCTK) dataset, it achieves an equal error rate of 15.2% and a similarity mean opinion score of 4.24. MFD-VC delivers superior performance in sound quality, similarity, and content clarity.
Key words
voice conversion/MFD-VC/diffusion model/wave network/UNetDiff
Classification
Information Technology and Security Science
Cite this article
张业东, 文双兵, 谭寒钟, 黄海峰, 胡涛. Multi-source Feature Decoupling Voice Conversion Model Based on Diffusion Models [J]. Journal of Hubei Minzu University (Natural Science Edition), 2026, 44(1): 57-61, 5.
Funding
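Hubei Provincial Natural Science Foundation, Enshi Innovation and Development Joint Fund (2025AFD161, 2023AFD061)
Outstanding Young and Middle-aged Science and Technology Innovation Team Program of Hubei Higher Education Institutions (T2023013)
Hubei Minzu University Graduate Education Innovation Project (MYK2025068)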