信号处理, 2025, Vol. 41, Issue (9): 1494-1512, 19. DOI: 10.12466/xhcl.2025.09.004
Unified Speech Enhancement Under Varying Input Conditions
Abstract
Intelligent speech interaction systems are increasingly deployed across a range of real-world applications. However, their performance often degrades significantly in complex acoustic environments because of diverse and challenging conditions. These conditions include non-linear speech distortions, varying levels of reverberation in indoor spaces, and different types of background noise in public areas. Additionally, hardware differences, such as variations in sampling rates, microphone types, number of channels, and array geometries, further complicate the problem. In contrast, traditional deep learning-based speech enhancement (SE) techniques are typically designed with narrow specialization, focusing on specific scenarios or hardware configurations. For instance, many models are trained exclusively for either single-channel or fixed multi-channel setups, or for particular sampling rates. This specialization creates challenges in real-world deployment, where multiple device configurations may coexist, leading to increased system complexity and resource requirements. Recent advances in signal processing and deep learning offer new opportunities to address these limitations. One promising direction is the development of unified SE techniques capable of handling speech signals with varying input conditions in a single model. Such a unified approach can overcome the limited scope of conventional methods by enabling models to automatically adapt to different input characteristics without explicit re-configuration or model switching. Despite its practical importance, this area remains underexplored: most existing SE research focuses on constrained, scenario-specific conditions. Motivated by this gap, we present a comprehensive study on unified SE techniques and propose the first unified model, Unconstrained Speech Enhancement and Separation (USES), designed to operate under diverse input conditions. USES can effectively process speech signals with varying sampling rates, microphone numbers and array geometries, durations, and acoustic environments in a unified manner. Compared with prior work, this is the first SE model to support such a wide range of input formats, incorporating innovations in multi-domain data preparation, model architecture, and training framework. Extensive experiments on standard SE benchmarks (e.g., VoiceBank+DEMAND, DNS-2020, CHiME-4) and the URGENT 2025 Challenge dataset demonstrate that USES not only achieves state-of-the-art performance on simulated evaluation data but also significantly improves robustness in real-world conditions. For example, USES outperformed leading models on both the WSJ0-2mix speech separation task and the DNS-2020 denoising benchmark, while successfully unifying support for varied sampling rates and microphone setups. Additionally, the unified model reduced computational costs by 52% and 51% when processing 16 kHz and 48 kHz inputs, respectively, compared with the high-performing TF-GridNet baseline, achieving similar or better performance with lower complexity.
Key words
speech enhancement / speech separation / dereverberation / multi-microphone / unified modeling
Classification
Information technology and security science
Cite this article
ZHANG Wangyou, QIAN Yanmin. Unified Speech Enhancement Under Varying Input Conditions [J]. 信号处理, 2025, 41(9): 1494-1512, 19.
Funding
China STI 2030-Major Projects (2021ZD0201500)