
Minimizing Speech Datasets Based on Word Coverage Rate (Open Access)

Abstract

To reduce the high data collection and training costs of high-performance automatic speech recognition models, a word-coverage-based method is proposed that minimizes the amount of data the training set requires. The method borrows the concept of the vector space model: every corpus text is mapped into a high-dimensional space, and the cosine distance between vectors is used to select the texts with the lowest similarity. Audio is then collected only for the selected texts, so that the best possible recognition performance is reached with as little audio data as possible. Finally, a Hamming overlap computation counts the vocabulary that each text newly adds, which measures its contribution and refines the cosine-distance selection. Experiments show that, compared with random selection of the speech training set, the proposed method reaches the same word coverage with 21.31% less training data, and that the word coverage of a training set is very strongly positively correlated with the inference performance of the model trained on it. This demonstrates that collection and training costs for speech training sets can be reduced effectively while inference performance stays close, which supports the further development of automatic speech recognition technology.
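The abstract does not spell out the exact procedure, so the following is only a minimal sketch of how the described pieces could fit together, not the authors' implementation. It assumes whitespace tokenization (Chinese text would need a word segmenter), raw term counts as the vector space representation, and a greedy loop that picks the text least similar to everything selected so far. The "Hamming overlap" step is interpreted here as the bit count of a text's vocabulary mask after removing already covered words. The names `select_texts`, `word_coverage`, and `popcount` are hypothetical.

```python
import math
from collections import Counter


def tokenize(text):
    # Assumption: whitespace tokens; a Chinese corpus would need a word segmenter.
    return text.lower().split()


def cosine_similarity(a, b):
    """Cosine similarity of two sparse term-count vectors (Counters)."""
    dot = sum(count * b[term] for term, count in a.items() if term in b)
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def popcount(mask):
    """Hamming weight: number of set bits in a non-negative integer."""
    return bin(mask).count("1")


def word_coverage(selected_texts, corpus):
    """Fraction of the corpus vocabulary that appears in the selected texts."""
    vocab = {word for text in corpus for word in tokenize(text)}
    covered = {word for text in selected_texts for word in tokenize(text)}
    return len(covered & vocab) / len(vocab) if vocab else 0.0


def select_texts(corpus, target_coverage):
    """Greedily pick text indices until the target word coverage is reached.

    Hypothetical combination rule: among texts that add at least one new
    word, take the one with the lowest cosine similarity to the texts
    selected so far; ties go to the text adding more new words.
    """
    token_lists = [tokenize(text) for text in corpus]
    vocab = sorted({word for tokens in token_lists for word in tokens})
    if not vocab:
        return []
    bit = {word: i for i, word in enumerate(vocab)}
    vectors = [Counter(tokens) for tokens in token_lists]
    # One bit per vocabulary word, set if the text contains that word.
    masks = [sum(1 << bit[w] for w in set(tokens)) for tokens in token_lists]

    selected, covered_mask, covered_vector = [], 0, Counter()
    remaining = list(range(len(corpus)))
    while popcount(covered_mask) / len(vocab) < target_coverage:
        # Keep only texts whose mask still contains uncovered words.
        candidates = [i for i in remaining if masks[i] & ~covered_mask]
        if not candidates:
            break
        best = min(
            candidates,
            key=lambda i: (
                cosine_similarity(vectors[i], covered_vector),
                -popcount(masks[i] & ~covered_mask),
                i,
            ),
        )
        selected.append(best)
        remaining.remove(best)
        covered_mask |= masks[best]
        covered_vector.update(vectors[best])
    return selected
```

Calling `select_texts(corpus, 0.8)` on a list of transcripts returns the indices of the texts to record, in selection order. Comparing texts against the aggregate vector of everything already selected is a simplification chosen to keep the sketch short; pairwise comparison against each selected text would be another reasonable reading of the abstract.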

Authors: Zhu Zhijun (朱治军); Fu Lei (付磊)

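Affiliations: Qingshan District Branch (Gangcheng Branch), Wuhan Public Security Bureau, Wuhan, Hubei 430080; Shenzhen Huawei Cloud Computing Technologies Co., Ltd., Shenzhen, Guangdong 518129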

Category: Computer Science and Automation

Keywords: automatic speech recognition; vector space model; cosine distance; Hamming weight; training set minimization

Journal: Software Guide (《软件导刊》), 2024, No. 5

Pages: 33-37 (5 pages)

DOI: 10.11907/rjdk.241136
