信号处理 (Journal of Signal Processing), 2025, Vol. 41, Issue 12: 1980-1991, 12. DOI: 10.12466/xhcl.2025.12.010
A Study on Tibetan Dialect Speech Recognition Based on Whisper (基于Whisper的藏语方言语音识别研究)
Abstract
Although the Whisper large-scale speech model was trained on 680,000 hours of multilingual data, its original architecture does not support Tibetan speech recognition. Directly fine-tuning Whisper for Tibetan speech recognition therefore still faces the following issues: (1) the shared representation space is dominated by high-resource languages such as English, leading to insufficient learning of Tibetan-specific characteristics; (2) the model's default byte-level tokenization indiscriminately splits Tibetan syllables, resulting in broken character structures and loss of semantic information; (3) the scarcity of Tibetan training data makes it difficult for the model to adequately capture Tibetan linguistic patterns; (4) because different Tibetan dialects share the same writing system, training on mixed dialects makes it hard to distinguish pronunciation variations of the same syllable across dialects, causing severe cross-dialect confusion and increased recognition errors. To address these issues, this paper proposes an improved Tibetan dialect speech recognition method within the Whisper multilingual pre-training framework, aiming to help the model learn the commonalities and differences among dialects and thereby enhance recognition robustness and accuracy across dialect scenarios. First, a Tibetan-specific byte pair encoding (BPE) model was constructed, and the Whisper vocabulary was expanded with different modeling units (such as letters, BPE subwords, and phonemes) to systematically compare the impact of different encoding strategies on final recognition performance. Second, a dialect discrimination auxiliary mechanism was introduced alongside the original speech recognition task to enhance the model's ability to distinguish Tibetan dialects. Finally, based on an analysis of the recognition results, an external language model was incorporated using rescoring and shallow fusion to improve decoding outcomes and further enhance audio-text consistency. The experimental results show that fine-tuning with the proposed method using BPE-100 modeling units and incorporating a language model to optimize decoding reduces the character error rate (CER) from 45.80% for direct full-parameter fine-tuning to 9.56%. Additionally, the model's ability to process long Tibetan text sequences improves, with the maximum processable sequence length increasing by approximately three times.
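The evaluation metric and the rescoring step mentioned in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation. It assumes CER is the Levenshtein distance over Unicode code points divided by the reference length (the abstract does not state the exact counting unit), and it shows only n-best rescoring, where each hypothesis is ranked by its ASR log-probability plus a weighted language-model log-probability. Shallow fusion applies the same interpolation at every beam-search step instead of after decoding. The names `edit_distance`, `cer`, `rescore_nbest`, and `lm_weight`, as well as the toy scores, are hypothetical.

```python
from typing import Callable, List, Sequence, Tuple


def edit_distance(ref: Sequence[str], hyp: Sequence[str]) -> int:
    """Levenshtein distance; substitution, insertion, and deletion each cost 1."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1]


def cer(reference: str, hypothesis: str) -> float:
    """Character error rate: edit distance divided by reference length.

    Assumption: a "character" is one Unicode code point.
    """
    if not reference:
        raise ValueError("reference must be non-empty")
    return edit_distance(list(reference), list(hypothesis)) / len(reference)


def rescore_nbest(
    nbest: List[Tuple[str, float]],
    lm_logprob: Callable[[str], float],
    lm_weight: float = 0.3,
) -> str:
    """Pick the hypothesis maximizing asr_logprob + lm_weight * lm_logprob(text)."""
    return max(nbest, key=lambda item: item[1] + lm_weight * lm_logprob(item[0]))[0]


if __name__ == "__main__":
    # Toy placeholder scores, not data from the paper.
    toy_lm = {"abcd": -1.0, "abxd": -6.0}
    nbest = [("abxd", -2.0), ("abcd", -2.5)]
    best = rescore_nbest(nbest, lambda t: toy_lm.get(t, -10.0), lm_weight=0.3)
    print(best, cer("abcd", best))
```

In this toy case the acoustically preferred hypothesis loses to the one the language model favors, which is the effect rescoring is meant to produce. In practice `lm_weight` would be tuned on a development set.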
Key words
speech recognition/Tibetan/Whisper/modeling unit/low-resource
Classification
Information Technology and Security Science
Cite this article
马立克, 李冠宇. 基于Whisper的藏语方言语音识别研究[J]. 信号处理, 2025, 41(12): 1980-1991, 12.
Funding
Gansu Provincial Science and Technology Major Project in 2024 (24ZDFA004)