Acta Electronica Sinica, 2025, Vol. 53, Issue 6: 1815-1828, 14 pages. DOI: 10.12263/DZXB.20240819
基于对抗学习和增强优化的深度转换语音还原方法
Adversarial Learning and Enhanced Optimization Based Restoration Method for VC-Generated Speeches
Abstract
Voice conversion (VC) is an artificial intelligence technology that uses deep learning to convert the voice of a source speaker into that of a target speaker. It is widely used in movie dubbing, personalized voice customization, and similar applications, but it is also exploited by malicious actors for telecom fraud, identity forgery, and political and social manipulation, posing serious threats to personal privacy, social stability, and even national security. Compared with detecting VC-generated speech, restoring the source speech from VC-generated speech (VC-generated speech restoration) has greater research significance and practical value for tracing the real speaker and preventing the illegal use of VC technologies. However, related studies remain scarce. This paper proposes a restoration method for VC-generated speech based on adversarial learning and enhanced optimization. Specifically, the similarity of VC-generated speech to the source and target speech is first analyzed, and a two-stage restoration framework, preliminary restoration followed by further optimization, is presented. Then, an adversarial restoration network based on dynamic convolution and attention mechanisms is designed to learn as much source speaker information as possible from VC-generated speech through adversarial learning among a generator, a classifier, and a discriminator. Next, an enhanced optimization network consisting of a timbre extractor, a content extractor, and a sound encoder is designed to generate optimized restored speech by deeply fusing the timbre information in the preliminarily restored speech with the content information in the VC-generated speech. Finally, the effectiveness of the proposed method is validated on datasets for three high-performance voice conversion models: BNE-PPG-VC, TriAAN-VC, and FreeVC. Comparative experiments show that, for the three VC models, the restored speech improves the mean cosine similarity with the source speech by 11.9, 8.7, and 7.1 percentage points, respectively, and reduces the mean equal error rate of the speaker verification system by 4.30, 3.40, and 3.98 percentage points, respectively. These results indicate that the proposed method not only effectively recovers the source speaker's speech but is also applicable to unknown VC-generated speech.
Keywords
voice conversion / voice conversion generated speech / restored speech / adversarial learning / enhanced optimization / deep neural network
Classification
Information Technology and Security Science
Cite this article
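SU Zhaopin, ZHOU Xiaolin, ZHANG Guofu, LIAN Chensi, WANG Niansong, YUE Feng. Adversarial Learning and Enhanced Optimization Based Restoration Method for VC-Generated Speeches[J]. Acta Electronica Sinica, 2025, 53(6): 1815-1828.
Funding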
MOE (Ministry of Education in China) Project of Humanities and Social Sciences (No. 24YJA870011)
Anhui Province Key Research and Development Program Project (No. 202104d07020001)