Multi-source Feature Decoupling Voice Conversion Model Based on Diffusion Models

Journal of Hubei Minzu University (Natural Science Edition), 2026, Vol. 44, Issue (1): 57-61, 5. DOI: 10.13501/j.cnki.42-1908/n.2026.03.002

Zhang Yedong (张业东)¹, Wen Shuangbing (文双兵)², Tan Hanzhong (谭寒钟)¹, Huang Haifeng (黄海峰)¹, Hu Tao (胡涛)³

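Author information

  • 1. School of Intelligent Science and Engineering, Hubei Minzu University, Enshi, Hubei 445000
  • 2. School of Mathematics and Statistics, Hubei Minzu University, Enshi, Hubei 445000
  • 3. School of Intelligent Science and Engineering, Hubei Minzu University, Enshi, Hubei 445000; Hubei Provincial Engineering Research Center for Intelligent Technology of Selenium Food Nutrition and Health, Hubei Minzu University, Enshi, Hubei 445000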

Abstract

To address timbre leakage, loss of prosodic detail, and insufficient naturalness in generated speech, a multi-source feature decoupling voice conversion model based on diffusion models (MFD-VC) was proposed. First, speech was decomposed into subspaces corresponding to different attributes, each processed independently within the model, so that high-fidelity voice conversion was achieved through collaborative control of multiple attributes. Second, a style encoder (SE) was designed to extract speaker timbre features from reference speech, while a content feature encoder (CFE) and a wave network-based encoder (WN) were used to process content and fundamental frequency information, respectively. Finally, a one-dimensional multi-scale fusion (MSF1D) module was integrated into the U-Net diffusion model (UNetDiff) architecture to enhance multi-scale feature expression in the skip connections, allowing more detailed Mel-spectrogram representations to be generated. On the LibriSpeech for text-to-speech (LibriTTS) dataset, MFD-VC achieved an equal error rate of 12.2% and a mean opinion score of 4.35 for similarity. On the voice cloning toolkit (VCTK) dataset, it achieved an equal error rate of 15.2% and a mean opinion score of 4.24 for similarity. MFD-VC delivered superior performance in terms of sound quality, similarity, and content clarity.
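
For readers unfamiliar with the equal error rate reported in the abstract, the following is a minimal sketch of how such a figure is commonly computed from speaker-verification similarity scores. It is not the authors' evaluation code: the function name, the score lists, and the threshold sweep are illustrative assumptions, and the abstract does not state how the verification trials were constructed.

```python
def equal_error_rate(genuine_scores, impostor_scores):
    """Return (eer, threshold) for two lists of similarity scores.

    Hypothetical helper for illustration only.
    genuine_scores: scores for trials where the two speakers truly match.
    impostor_scores: scores for trials where the two speakers differ.
    A trial is accepted when its score is >= the threshold.
    """
    # Candidate thresholds: every distinct score that was observed.
    thresholds = sorted(set(genuine_scores) | set(impostor_scores))
    best_gap, best_eer, best_thr = None, None, None
    for thr in thresholds:
        # False rejection rate: genuine trials scored below the threshold.
        frr = sum(s < thr for s in genuine_scores) / len(genuine_scores)
        # False acceptance rate: impostor trials scored at or above it.
        far = sum(s >= thr for s in impostor_scores) / len(impostor_scores)
        gap = abs(far - frr)
        # Keep the threshold where the two error rates are closest.
        if best_gap is None or gap < best_gap:
            best_gap, best_eer, best_thr = gap, (far + frr) / 2, thr
    return best_eer, best_thr


if __name__ == "__main__":
    # Toy scores, invented purely for the example.
    genuine = [0.9, 0.8, 0.7, 0.4]
    impostor = [0.6, 0.3, 0.2, 0.1]
    eer, thr = equal_error_rate(genuine, impostor)
    print(f"EER = {eer:.2%} at threshold {thr}")
```

The sweep simply picks the observed threshold at which the false acceptance and false rejection rates are closest and reports their average. Production toolkits usually interpolate along the ROC curve instead, so their values can differ slightly on small trial sets.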

Key words

voice conversion / MFD-VC / diffusion model / wave network / UNetDiff

Classification

Information technology and security science

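Cite this article

Zhang Yedong, Wen Shuangbing, Tan Hanzhong, Huang Haifeng, Hu Tao. Multi-source feature decoupling voice conversion model based on diffusion models [J]. Journal of Hubei Minzu University (Natural Science Edition), 2026, 44(1): 57-61, 5.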

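Funding

Enshi Joint Fund for Innovation and Development of the Hubei Provincial Natural Science Foundation (2025AFD161, 2023AFD061)

Outstanding Young and Middle-aged Science and Technology Innovation Team Program of Hubei Provincial Higher Education Institutions (T2023013)

Graduate Education Innovation Project of Hubei Minzu University (MYK2025068)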

Journal of Hubei Minzu University (Natural Science Edition)

ISSN 2096-7594
