Multi-Encoder Transformer for End-to-End Speech Recognition
The widely used Transformer model captures global dependencies well, but its shallow layers tend to overlook local feature information. To address this problem, this study proposes a method that uses multiple encoders to improve speech feature extraction. An additional convolutional encoder branch is attached to strengthen the capture of local features, compensating for the shallow Transformer layers' neglect of local information and effectively fusing the global and local dependencies of the audio feature sequence; that is, a Transformer-based multi-encoder model is proposed. Experiments on the open-source Mandarin Chinese dataset Aishell-1 show that, without an external language model, the proposed multi-encoder model achieves a 4.00% relative reduction in character error rate compared with the baseline Transformer. On an internal, non-public Shanghainese dialect dataset, the improvement is more pronounced: the character error rate drops from 19.92% to 10.31%, a relative reduction of 48.24%.
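The abstract describes pairing a global self-attention encoder with a convolutional branch that captures local feature information, then fusing the two. As a rough illustration of that idea only (the paper's actual layer sizes, kernel widths, and fusion rule are not given here), a minimal numpy sketch with a hypothetical additive fusion might look like:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention_branch(x):
    # Single-head self-attention: every frame attends to all frames,
    # modelling global dependencies across the utterance.
    scores = x @ x.T / np.sqrt(x.shape[-1])
    return softmax(scores) @ x

def conv_branch(x, kernel=3):
    # 1-D convolution over time: each frame mixes only its local
    # neighbourhood, modelling local feature information.
    pad = kernel // 2
    xp = np.pad(x, ((pad, pad), (0, 0)), mode="edge")
    w = np.ones((kernel, 1)) / kernel  # fixed averaging weights for the sketch
    return np.stack([(xp[t:t + kernel] * w).sum(axis=0)
                     for t in range(x.shape[0])])

def multi_encoder(x):
    # Hypothetical fusion: sum the global (attention) and local
    # (convolution) branch outputs.
    return attention_branch(x) + conv_branch(x)

T, D = 50, 16          # 50 acoustic frames, 16-dim features
feats = np.random.default_rng(0).normal(size=(T, D))
out = multi_encoder(feats)
print(out.shape)       # (50, 16)
```

The fused output keeps the input's time-feature shape, so it can feed a standard attention decoder unchanged; the real model would of course use learned convolution and projection weights rather than fixed averaging.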
Pang Jiangfei; Sun Zhanquan
School of Optical-Electrical and Computer Engineering, University of Shanghai for Science and Technology, Shanghai 200093, China
Electronic Information Engineering
Transformer; speech recognition; end-to-end; deep neural networks; multi-encoder; multi-head attention; feature fusion; convolution branch networks
Electronic Science and Technology (《电子科技》), 2024(004)
1-7 / 7
National Defense Basic Scientific Research Program (JCKY2019413D001); Medical and Engineering Cross Project of University of Shanghai for Science and Technology (10-21-302-413)