
Tibetan Speech Emotion Recognition Under Low-Resource Conditions

张维昭¹, 李皓渊¹, 杨鸿武²

信号处理 (Journal of Signal Processing), 2025, Vol. 41, Issue 9: 1558-1569 (12 pages). DOI: 10.12466/xhcl.2025.09.009

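Author information

  • 1. College of Physics and Electronic Engineering, Northwest Normal University, Lanzhou, Gansu 730070
  • 2. School of Educational Technology, Northwest Normal University, Lanzhou, Gansu 730070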

Abstract

In recent years, although significant progress has been made in speech emotion recognition (SER) research for major languages, studies focusing on low-resource languages still face numerous challenges in dataset construction, feature extraction, and recognition model design. To address the problem of Tibetan speech emotion recognition under low-resource conditions, our study first constructed the Tibetan Emotion Speech Dataset-2500 (TESD-2500) through video clipping, audio extraction and enhancement, and manual annotation and verification. This dataset covers four emotion types (anger, sadness, happiness, and neutral) and contains 2500 speech samples. The emotion categories and sample size are still being expanded. Subsequently, we designed a multi-feature fusion speech emotion recognition model incorporating cross-attention and co-attention mechanisms. A bidirectional long short-term memory network (BiLSTM) was employed to model the temporal dynamics of Mel-frequency cepstral coefficients (MFCCs) and extract dynamic temporal representations from the speech signal. AlexNet was utilized to extract time-frequency features from spectrograms and capture the joint time-frequency distribution patterns of the speech signal. A cross-attention mechanism was used to compute the correlation weights between these two types of heterogeneous features. The large-scale pre-trained model WavLM was introduced to extract deep semantic features from the speech signal. Using the results of the cross-attention calculation as weight vectors, a co-attention mechanism was applied to perform weighted reconstruction of the deep features. The MFCC temporal features, spectrogram time-frequency features, and the weighted deep pre-trained model features were concatenated to form a multi-level fused feature representation. This fused representation was then mapped to the emotion category space via fully connected layers to accomplish Tibetan speech emotion classification. Experimental results demonstrated that the proposed model achieved a Weighted Accuracy (WA) of 76.56% and an Unweighted Accuracy (UA) of 75.42% on the TESD-2500 dataset, thus significantly outperforming baseline models. The study also evaluated the model's generalization capability on the IEMOCAP and EmoDB datasets, achieving 74.27% WA and 73.60% UA on IEMOCAP, and 92.61% WA and 91.68% UA on EmoDB. The methodology and results presented in the paper may also serve as a reference for speech emotion recognition research in other low-resource languages.
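
The abstract reports both WA and UA but does not define them. As a minimal sketch, assuming the convention common in SER work (WA is the overall fraction of correctly classified samples; UA is the mean of per-class recalls, so each emotion counts equally regardless of how many samples it has), the two metrics could be computed as follows. The function names and the toy labels are hypothetical illustrations, not taken from the paper or from TESD-2500.

```python
from collections import defaultdict


def weighted_accuracy(y_true, y_pred):
    """WA (assumed definition): correct predictions over all samples."""
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)


def unweighted_accuracy(y_true, y_pred):
    """UA (assumed definition): mean of per-class recalls."""
    total = defaultdict(int)  # number of samples per true class
    hit = defaultdict(int)    # correctly predicted samples per true class
    for t, p in zip(y_true, y_pred):
        total[t] += 1
        hit[t] += int(t == p)
    recalls = [hit[c] / total[c] for c in total]
    return sum(recalls) / len(recalls)


# Hypothetical, deliberately imbalanced toy labels over the four emotion types.
y_true = ["anger"] * 4 + ["sadness"] * 2 + ["happiness"] * 2 + ["neutral"] * 2
y_pred = [
    "anger", "anger", "anger", "sadness",  # anger: 3 of 4 correct
    "sadness", "neutral",                  # sadness: 1 of 2 correct
    "happiness", "happiness",              # happiness: 2 of 2 correct
    "neutral", "anger",                    # neutral: 1 of 2 correct
]

print(f"WA = {weighted_accuracy(y_true, y_pred):.4f}")    # 7 / 10
print(f"UA = {unweighted_accuracy(y_true, y_pred):.4f}")  # (0.75 + 0.5 + 1.0 + 0.5) / 4
```

Under these definitions the two numbers diverge whenever classes are imbalanced and per-class recall differs, which is why SER papers typically report both.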

Key words

speech emotion recognition/low resources/multi-feature fusion/pretrained model/Tibetan

Classification

Information Technology and Security Science

Cite this article

张维昭, 李皓渊, 杨鸿武. Tibetan Speech Emotion Recognition Under Low-Resource Conditions [J]. 信号处理, 2025, 41(9): 1558-1569.

Funding

The National Natural Science Foundation of China (62067008)

The Young Teachers Research Capacity Enhancement Program of Northwest Normal University (NWNU-LKQN2024-11)

信号处理 (Journal of Signal Processing)

Open access (OA); listed in 北大核心 (Peking University Core Journals)

ISSN 1003-0530
