A Single-Channel Speech Separation Model Based on Time-Domain Comprehensive Attention Mechanism

Yang Junmei, Zhang Bangcheng, Yang Lu, Zeng Delu

Journal of South China University of Technology (Natural Science Edition), 2026, Vol. 54, Issue (1): 70-82, 13.

DOI: 10.12141/j.issn.1000-565X.250054

A Single-Channel Speech Separation Model Based on Time-Domain Comprehensive Attention Mechanism

Yang Junmei¹, Zhang Bangcheng¹, Yang Lu¹, Zeng Delu¹

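Author information

1. School of Electronic and Information Engineering, South China University of Technology, Guangzhou 510640, Guangdong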
Abstract

Single-channel speech separation aims to extract clean target-speaker speech from a mixed audio signal recorded by a single microphone, and has significant application value in scenarios such as smart homes, conference systems, and hearing aids. With the rapid development of deep learning, self-attention-based approaches to single-channel speech separation have made remarkable progress. While self-attention networks excel at capturing contextual information in long sequences, they remain limited in capturing fine-grained features of real-world speech, such as temporal and spectral continuity, spectral structure, and timbre. Moreover, existing separation architectures built on a single attention paradigm struggle to fuse multi-scale features effectively. To address these challenges, this paper proposes a Temporal Comprehensive Attention Network (TCANet) built on a synergistic design of local and global attention modules. Local modeling employs an S&C-SENet-enhanced Conformer structure to capture short-term features such as spectral structure and timbre in detail, while global modeling uses a modified Transformer module with relative position embedding to explicitly learn long-term dependencies in speech. Furthermore, TCANet fuses intra-block local features and inter-block global correlations across scales through a dimension transformation mechanism. Experimental results on three benchmark datasets (LRS2-2Mix, Libri2Mix, and EchoSet) demonstrate that the proposed method outperforms existing end-to-end speech separation approaches in terms of scale-invariant signal-to-noise ratio improvement (SI-SNRi) and signal-to-distortion ratio improvement (SDRi).
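
For readers unfamiliar with the SI-SNRi metric named in the abstract, the following is a minimal sketch of its commonly used definition, not code from the paper. It assumes equal-length mono signals given as plain Python lists; the function names, the `eps` stabilizer, and the toy sine-wave signals are all illustrative assumptions.

```python
import math


def _zero_mean(signal):
    """Return a copy of the signal with its mean removed."""
    mean = sum(signal) / len(signal)
    return [v - mean for v in signal]


def si_snr(estimate, reference, eps=1e-8):
    """Scale-invariant SNR in dB of an estimate against a reference signal."""
    if len(estimate) != len(reference):
        raise ValueError("signals must have the same length")
    est = _zero_mean(estimate)
    ref = _zero_mean(reference)
    # Project the estimate onto the reference to get the target component.
    dot = sum(e * r for e, r in zip(est, ref))
    ref_energy = sum(r * r for r in ref) + eps
    scale = dot / ref_energy
    target = [scale * r for r in ref]
    # Everything not explained by the reference counts as noise.
    noise = [e - t for e, t in zip(est, target)]
    target_energy = sum(t * t for t in target)
    noise_energy = sum(n * n for n in noise)
    return 10.0 * math.log10((target_energy + eps) / (noise_energy + eps))


def si_snr_improvement(estimate, mixture, reference):
    """SI-SNRi: gain of the separated estimate over the unprocessed mixture."""
    return si_snr(estimate, reference) - si_snr(mixture, reference)


# Toy signals (hypothetical): two orthogonal sine waves stand in for speakers.
n = 1600
reference = [math.sin(2 * math.pi * 5 * t / n) for t in range(n)]
interferer = [math.sin(2 * math.pi * 13 * t / n) for t in range(n)]
mixture = [r + i for r, i in zip(reference, interferer)]
# A pretend separator output that attenuates the interferer by a factor of 10.
estimate = [r + 0.1 * i for r, i in zip(reference, interferer)]

improvement = si_snr_improvement(estimate, mixture, reference)
print(f"SI-SNRi: {improvement:.2f} dB")  # roughly 20 dB for this toy case
```

Because the estimate is projected onto the reference before the energies are compared, rescaling the estimate leaves the score unchanged, which is the property that makes the metric scale-invariant.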

Key words

deep learning/speech separation/Transformer module/Conformer structure/comprehensive attention

Classification

Information Technology and Security Science

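Cite this article

Yang Junmei, Zhang Bangcheng, Yang Lu, Zeng Delu. A Single-Channel Speech Separation Model Based on Time-Domain Comprehensive Attention Mechanism[J]. Journal of South China University of Technology (Natural Science Edition), 2026, 54(1): 70-82, 13.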

Funding

Supported by the Natural Science Foundation of Guangdong Province (2023A1515011281)

Journal of South China University of Technology (Natural Science Edition)

ISSN: 1000-565X
