Journal of South China University of Technology (Natural Science Edition), 2026, Vol. 54, Issue (1): 70-82, 13. DOI: 10.12141/j.issn.1000-565X.250054
A Single-Channel Speech Separation Model Based on Time-Domain Comprehensive Attention Mechanism
Abstract
Single-channel speech separation aims to extract the clean speech of a target speaker from a mixed audio signal recorded by a single microphone, and has significant application value in scenarios such as smart homes, conference systems, and hearing aids. With the rapid development of deep learning, self-attention network-based approaches to single-channel speech separation have achieved remarkable progress. While self-attention networks excel at capturing contextual information in long sequences, they still exhibit limitations in capturing detailed features such as temporal/spectral continuity, spectral structure, and timbre in real-world speech scenarios. Moreover, existing separation architectures based on a single attention paradigm struggle to achieve effective multi-scale feature fusion. To address these challenges, this paper proposes a Temporal Comprehensive Attention Network (TCANet), built on a synergistic design of local and global attention modules. Local modeling employs an S&C-SENet-enhanced Conformer structure to capture short-term features such as spectral structure and timbre in detail, while global modeling incorporates a modified Transformer module with relative position embedding to explicitly learn long-term dependencies in speech. Furthermore, TCANet achieves cross-scale fusion of intra-block local features and inter-block global correlations through a dimension transformation mechanism. Experimental results on three benchmark datasets (LRS2-2Mix, Libri2Mix, and EchoSet) demonstrate that the proposed method outperforms existing end-to-end speech separation approaches in terms of scale-invariant signal-to-noise ratio improvement (SI-SNRi) and signal-to-distortion ratio improvement (SDRi).
Keywords
deep learning/speech separation/Transformer module/Conformer structure/comprehensive attention
Classification
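Information Technology and Security Science
Cite this article
YANG Junmei, ZHANG Bangcheng, YANG Lu, ZENG Delu. A Single-Channel Speech Separation Model Based on Time-Domain Comprehensive Attention Mechanism [J]. Journal of South China University of Technology (Natural Science Edition), 2026, 54(1): 70-82, 13.
Funding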
Supported by the Natural Science Foundation of Guangdong Province (2023A1515011281)