Adversarial Learning and Enhanced Optimization Based Restoration Method for VC-Generated Speeches

Acta Electronica Sinica (电子学报), 2025, Vol. 53, Issue 6: 1815-1828 (14 pages). DOI: 10.12263/DZXB.20240819

Adversarial Learning and Enhanced Optimization Based Restoration Method for VC-Generated Speeches

Su Zhaopin (苏兆品)¹, Zhou Xiaolin (周晓琳)², Zhang Guofu (张国富)¹, Lian Chensi (廉晨思)³, Wang Niansong (王年松)³, Yue Feng (岳峰)⁴

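Author information

1. School of Computer Science and Information Engineering, Hefei University of Technology, Hefei, Anhui 230601; Intelligent Interconnected Systems Laboratory of Anhui Province (Hefei University of Technology), Hefei, Anhui 230009; Joint Laboratory for Intelligent Audio-Video Anti-Forgery and Identification (音视频智能防识联合实验室), Hefei, Anhui 230000
2. School of Computer Science and Information Engineering, Hefei University of Technology, Hefei, Anhui 230601
3. Physical Evidence Identification Management Office, Anhui Provincial Public Security Department, Hefei, Anhui 230000; Joint Laboratory for Intelligent Audio-Video Anti-Forgery and Identification (音视频智能防识联合实验室), Hefei, Anhui 230000
4. School of Computer Science and Information Engineering, Hefei University of Technology, Hefei, Anhui 230601; Intelligent Interconnected Systems Laboratory of Anhui Province (Hefei University of Technology), Hefei, Anhui 230009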
Abstract

Voice conversion (VC) is an artificial intelligence technology that uses deep learning to convert the voice of a source speaker into that of a target speaker. It is widely used in movie dubbing, personalized voice customization, and similar applications, but it is also exploited by malicious actors for telecom fraud, identity forgery, and political and social manipulation, posing serious threats to personal privacy, social stability, and even national security. Compared with detecting VC-generated speech, restoring the source speech from VC-generated speech (VC-generated speech restoration) has greater research significance and practical value for tracing real speakers and preventing the illegal use of VC technologies. However, related studies remain scarce. This paper proposes a restoration method for VC-generated speech based on adversarial learning and enhanced optimization. Specifically, the similarity of VC-generated speech to the source and target speech is first analyzed, and a two-stage restoration framework (preliminary restoration followed by further optimization) is presented. Then, an adversarial restoration network is designed based on dynamic convolution and attention mechanisms, aiming to learn as much source speaker information as possible from VC-generated speech through adversarial learning among a generator, a classifier, and a discriminator. After that, an enhanced optimization network, consisting of a timbre extractor, a content extractor, and a sound encoder, is designed to generate optimized restored speech by deeply fusing the timbre information in the preliminarily restored speech with the content information in the VC-generated speech. Finally, the effectiveness of the proposed method is validated on datasets from three high-performance voice conversion models: BNE-PPG-VC, TriAAN-VC, and FreeVC. Comparative experimental results show that, for the three VC models, the restored speech improves the mean cosine similarity with the source speech by 11.9, 8.7, and 7.1 percentage points, respectively, and reduces the mean equal error rate (EER) of the speaker verification system by 4.30, 3.40, and 3.98 percentage points, respectively. These results indicate that the proposed method not only effectively recovers the source speaker's speech but is also applicable to unknown VC-generated speech.
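The two evaluation metrics named in the abstract, cosine similarity and equal error rate, can be illustrated with a minimal sketch. The abstract does not state which speaker encoder or verification protocol the authors use, so the function names, the threshold sweep, and the toy vectors and scores below are hypothetical and serve only to show how such metrics are commonly computed.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def equal_error_rate(genuine_scores, impostor_scores):
    """Approximate EER by sweeping every observed score as a threshold.

    A trial is accepted when its score is >= the threshold. The EER is
    taken at the threshold where the false acceptance rate (FAR) and the
    false rejection rate (FRR) are closest, as the mean of the two.
    """
    best_gap, eer = float("inf"), None
    for t in sorted(set(genuine_scores) | set(impostor_scores)):
        far = sum(s >= t for s in impostor_scores) / len(impostor_scores)
        frr = sum(s < t for s in genuine_scores) / len(genuine_scores)
        gap = abs(far - frr)
        if gap < best_gap:
            best_gap, eer = gap, (far + frr) / 2
    return eer


# Hypothetical toy speaker embeddings (not data from the paper).
source_embedding = [1.0, 0.0, 1.0]
restored_embedding = [1.0, 0.0, 1.0]
unrelated_embedding = [0.0, 1.0, 0.0]

same_speaker_score = cosine_similarity(source_embedding, restored_embedding)
other_speaker_score = cosine_similarity(source_embedding, unrelated_embedding)

# Hypothetical verification scores: same-speaker trials vs. other-speaker trials.
toy_eer = equal_error_rate([0.9, 0.4], [0.6, 0.1])
```

In this reading, a higher cosine similarity between the restored speech and the source speech suggests that more of the source speaker's timbre has been recovered. Production evaluations usually interpolate the FAR/FRR curves rather than picking the nearest observed threshold.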

Key words

voice conversion / VC-generated speech / restored speech / adversarial learning / enhanced optimization / deep neural network

Classification

Information technology and security science

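Cite this article

Su Zhaopin, Zhou Xiaolin, Zhang Guofu, Lian Chensi, Wang Niansong, Yue Feng. Adversarial Learning and Enhanced Optimization Based Restoration Method for VC-Generated Speeches [J]. Acta Electronica Sinica, 2025, 53(6): 1815-1828.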

Funding

MOE (Ministry of Education in China) Project of Humanities and Social Sciences (No. 24YJA870011)

Anhui Province Key Research and Development Program Project (No. 202104d07020001)

Acta Electronica Sinica (电子学报)

Open access; listed in the Peking University Core Journals index (北大核心)

ISSN: 0372-2112
