Computer Engineering, 2026, Vol. 52, Issue (1): 105-115, 11. DOI: 10.19678/j.issn.1000-3428.0070055
Chinese Speech Recognition Using Conformer Fused with Max Pooling
Abstract
Speech recognition technology enables machines to understand human speech using advanced algorithms and signal processing, thereby making communication between humans and machines more convenient. Most existing studies on end-to-end speech recognition focus on optimizing the Conformer model. However, the Conformer encoder extracts fine-grained local speech features insufficiently. To address this issue, this study proposes a Chinese speech recognition method based on Max Pooling (MP). First, the output of the gated linear unit in the convolutional module of the encoder is max-pooled along the time dimension to extract fine-grained local features corresponding to the characteristics of multiple speech signal frames. Second, these pooled features are fused, by element-wise summation, with the coarse-grained local features extracted via Depthwise Convolution (DWC), which increases the amount of information carried by the local speech features and improves the recognition accuracy of the Conformer model. Experimental results on the public Chinese dataset Aishell-1 show that the improved model reduces the Character Error Rate (CER) of the baseline model from 5.58% to 5.32% with greedy search decoding and from 5.06% to 4.92% with attention rescoring.
Keywords
speech recognition / fine-grained local feature / Conformer model / Max Pooling (MP) / Depthwise Convolution (DWC)
Classification
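Information Technology and Security Science
Cite this article
Hu Conggang, Yang Lipeng, Sun Yongqi, Chen Hualong, Han Keke. Chinese Speech Recognition Using Conformer Fused with Max Pooling[J]. Computer Engineering, 2026, 52(1): 105-115, 11.
Funding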
Fundamental Research Funds for the Central Universities (2024JBGP008)
National Science and Technology Major Project for New Generation Artificial Intelligence (2021ZD0113002)
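
Illustrative sketch
The fusion step described in the abstract (max pooling of the gated-linear-unit output along time, summed element-wise with the depthwise-convolution output) can be illustrated with a minimal, framework-free sketch. This is not the authors' implementation. The function names, the stride of 1, the edge handling, the kernel sizes, and the choice to feed the same GLU output to both branches are all assumptions made here so that the two branches have equal length and can be summed. The abstract does not specify these details, nor where the sum sits relative to normalization and activation.

```python
from typing import List

Matrix = List[List[float]]  # features laid out as [time][channel]


def max_pool_time(x: Matrix, kernel: int) -> Matrix:
    """Per-channel max pooling along time, stride 1 (assumed).

    The window is clipped at the sequence edges, so the output keeps
    the same number of frames as the input.
    """
    half = kernel // 2
    n_frames, n_channels = len(x), len(x[0])
    out = []
    for t in range(n_frames):
        lo, hi = max(0, t - half), min(n_frames, t + half + 1)
        out.append([max(x[s][c] for s in range(lo, hi))
                    for c in range(n_channels)])
    return out


def depthwise_conv_time(x: Matrix, weights: Matrix) -> Matrix:
    """Depthwise 1-D convolution along time with zero padding (assumed).

    weights[c] is the odd-length kernel applied to channel c only;
    channels are never mixed, which is what makes it depthwise.
    """
    n_frames, n_channels = len(x), len(x[0])
    k = len(weights[0])
    half = k // 2
    out = [[0.0] * n_channels for _ in range(n_frames)]
    for t in range(n_frames):
        for c in range(n_channels):
            acc = 0.0
            for j in range(k):
                s = t + j - half
                if 0 <= s < n_frames:  # frames outside the sequence count as zero
                    acc += weights[c][j] * x[s][c]
            out[t][c] = acc
    return out


def fuse_max_pool_and_dwc(glu_out: Matrix, weights: Matrix,
                          pool_kernel: int) -> Matrix:
    """Element-wise sum of the pooled branch and the depthwise-conv branch."""
    pooled = max_pool_time(glu_out, pool_kernel)
    conv = depthwise_conv_time(glu_out, weights)
    return [[p + q for p, q in zip(p_row, q_row)]
            for p_row, q_row in zip(pooled, conv)]


# Toy input: 3 frames, 2 channels (hypothetical values).
glu_out = [[1.0, 0.0], [3.0, -1.0], [2.0, 4.0]]
# Channel 0 uses an identity kernel; channel 1 sums its neighbours.
weights = [[0.0, 1.0, 0.0], [1.0, 1.0, 1.0]]
fused = fuse_max_pool_and_dwc(glu_out, weights, pool_kernel=3)
print(fused)  # [[4.0, -1.0], [6.0, 7.0], [5.0, 7.0]]
```

In a real model both branches would operate on batched tensors, and the depthwise kernels would be learned parameters. The plain lists here only make the per-channel, per-frame arithmetic explicit.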