信号处理 (Journal of Signal Processing), 2025, Vol. 41, Issue 9: 1558-1569 (12 pages). DOI: 10.12466/xhcl.2025.09.009
低资源条件下的藏语语音情感识别
Tibetan-Speech-Emotion Recognition Under Low-Resource Conditions
Abstract
In recent years, although significant progress has been made in speech emotion recognition (SER) research for major languages, studies focusing on low-resource languages still face numerous challenges in dataset construction, feature extraction, and recognition model design. To address the problem of Tibetan speech emotion recognition under low-resource conditions, our study first constructed the Tibetan Emotion Speech Dataset-2500 (TESD-2500) through video clipping, audio extraction and enhancement, and manual annotation and verification. The dataset covers four emotion types (anger, sadness, happiness, and neutral) and contains 2500 speech samples. The emotion categories and sample size are still being expanded. Subsequently, we designed a multi-feature fusion speech emotion recognition model incorporating cross-attention and co-attention mechanisms. A Bidirectional Long Short-Term Memory network (BiLSTM) was employed to model the temporal dynamics of Mel-frequency cepstral coefficients (MFCCs) and extract dynamic temporal representations from the speech signal. AlexNet was used to extract time-frequency features from spectrograms and capture the joint time-frequency distribution patterns of the speech signal. A cross-attention mechanism was used to compute the correlation weights between these two types of heterogeneous features. The large-scale pre-trained model WavLM was introduced to extract deep semantic features from the speech signal. Using the results of the cross-attention calculation as weight vectors, a co-attention mechanism was applied to perform weighted reconstruction of the deep features. The MFCC temporal features, the spectrogram time-frequency features, and the weighted deep pre-trained model features were concatenated to form a multi-level fused feature representation. This fused representation was then mapped to the emotion category space via fully connected layers to accomplish Tibetan speech emotion classification. Experimental results demonstrated that the proposed model achieved a Weighted Accuracy (WA) of 76.56% and an Unweighted Accuracy (UA) of 75.42% on the TESD-2500 dataset, thus significantly outperforming baseline models. The study also evaluated the model's generalization capability on the IEMOCAP and EmoDB datasets, achieving 74.27% WA and 73.60% UA on IEMOCAP, and 92.61% WA and 91.68% UA on EmoDB. The methodology and results presented in the paper may also serve as a reference for speech emotion recognition research in other low-resource languages.
Key words
speech emotion recognition / low resources / multi-feature fusion / pretrained model / Tibetan
Classification
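Information Technology and Security Science
Cite this article
张维昭, 李皓渊, 杨鸿武. 低资源条件下的藏语语音情感识别 (Tibetan-Speech-Emotion Recognition Under Low-Resource Conditions) [J]. 信号处理, 2025, 41(9): 1558-1569.
Funding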
The National Natural Science Foundation of China (62067008)
The Young Teachers Research Capacity Enhancement Program of Northwest Normal University (NWNU-LKQN2024-11)
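The abstract reports results as Weighted Accuracy (WA) and Unweighted Accuracy (UA) without defining them. Under the convention commonly used in the SER literature, WA is the overall fraction of correctly classified samples and UA is the mean of the per-class recalls, so UA is less sensitive to class imbalance. The sketch below assumes that convention; the function names and the toy labels are illustrative and are not taken from the paper.

```python
from collections import Counter


def weighted_accuracy(y_true, y_pred):
    """Overall accuracy: correct predictions divided by all samples."""
    if len(y_true) != len(y_pred) or len(y_true) == 0:
        raise ValueError("y_true and y_pred must be non-empty and equal in length")
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)


def unweighted_accuracy(y_true, y_pred):
    """Mean of per-class recalls: every class counts equally."""
    if len(y_true) != len(y_pred) or len(y_true) == 0:
        raise ValueError("y_true and y_pred must be non-empty and equal in length")
    totals = Counter(y_true)  # samples per true class
    hits = Counter(t for t, p in zip(y_true, y_pred) if t == p)  # correct per class
    return sum(hits[c] / totals[c] for c in totals) / len(totals)


# Toy, deliberately imbalanced labels over the four TESD-2500 emotion types.
y_true = ["anger"] * 4 + ["sadness"] * 2 + ["happiness"] * 2 + ["neutral"] * 2
y_pred = [
    "anger", "anger", "anger", "sadness",  # 3 of 4 anger correct
    "sadness", "neutral",                  # 1 of 2 sadness correct
    "happiness", "happiness",              # 2 of 2 happiness correct
    "neutral", "anger",                    # 1 of 2 neutral correct
]

wa = weighted_accuracy(y_true, y_pred)
ua = unweighted_accuracy(y_true, y_pred)
```

On a perfectly balanced test set the two numbers coincide; a gap between them, as in the reported figures, usually indicates that some classes are recognized better than others.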
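The data flow of the fusion stage described in the abstract can be illustrated with a small NumPy sketch. This is not the authors' implementation. Random arrays stand in for the BiLSTM, AlexNet, and WavLM outputs. The three streams are assumed to share the same number of frames and feature width, the attention is assumed to be scaled dot-product, the streams are mean-pooled over time before concatenation, and the classifier weights are random. The abstract specifies none of these details, and all names here are hypothetical.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def cross_attention_weights(query, context):
    """Scaled dot-product weights of query frames over context frames.

    query: (Tq, d), context: (Tc, d) -> weights: (Tq, Tc), rows sum to 1.
    """
    d = query.shape[-1]
    return softmax(query @ context.T / np.sqrt(d), axis=-1)


def fuse_and_classify(mfcc_feat, spec_feat, deep_feat, W, b):
    """Illustrative fusion: attention weights, re-weighting, concat, linear layer."""
    # Correlation weights between the two handcrafted-feature streams.
    weights = cross_attention_weights(mfcc_feat, spec_feat)  # (T, T)
    # Reuse those weights to re-weight the pre-trained (deep) features.
    deep_weighted = weights @ deep_feat  # (T, d)
    # Assumption: mean-pool each stream over time, then concatenate.
    fused = np.concatenate(
        [mfcc_feat.mean(axis=0), spec_feat.mean(axis=0), deep_weighted.mean(axis=0)]
    )  # (3 * d,)
    # Single linear layer plus softmax as a stand-in for the FC classifier.
    probs = softmax(fused @ W + b)
    return weights, fused, probs


rng = np.random.default_rng(0)
T, d, n_classes = 50, 64, 4  # frames, feature width, emotion classes (assumed sizes)
mfcc_feat = rng.standard_normal((T, d))  # stand-in for BiLSTM output on MFCCs
spec_feat = rng.standard_normal((T, d))  # stand-in for AlexNet spectrogram features
deep_feat = rng.standard_normal((T, d))  # stand-in for WavLM features
W = rng.standard_normal((3 * d, n_classes)) * 0.01  # untrained classifier weights
b = np.zeros(n_classes)

weights, fused, probs = fuse_and_classify(mfcc_feat, spec_feat, deep_feat, W, b)
```

In a real system the three streams would generally have different frame rates and widths, so projection and alignment layers would be needed before the attention step. The sketch only shows how attention weights computed on one pair of streams can be reused to re-weight a third.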