Minimizing Speech Datasets Based on Word Coverage Rate

Zhu Zhijun 1, Fu Lei 2

Software Guide (软件导刊), 2024, Vol. 23, Issue 5: 33-37, 5. DOI: 10.11907/rjdk.241136

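Author information

  • 1. Qingshan District Branch (Gangcheng Branch), Wuhan Public Security Bureau, Wuhan, Hubei 430080
  • 2. Shenzhen Huawei Cloud Computing Technologies Co., Ltd., Shenzhen, Guangdong 518129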

Abstract

Training a high-performance automatic speech recognition model incurs high data collection and training costs. To address this, a method based on word coverage is proposed to minimize the amount of data required for the training set. The method draws on the vector space model: all corpus texts are mapped into a high-dimensional space, and the texts with the lowest similarity are selected by computing the cosine distance between their vectors. Audio is then collected only for the selected texts, so that the best recognition performance is achieved with as little audio data as possible. Finally, the Hamming weight is used to count the newly added vocabulary and thereby evaluate contribution, which refines the cosine-distance selection. Experiments show that, compared with random filtering of the speech training set, the proposed method reaches the same word coverage while saving 21.31% of the training data, and that the word coverage of a training set is strongly positively correlated with the inference performance of the model trained on it. The method can therefore reduce the collection and training costs of a speech training set while maintaining similar inference performance, which helps advance automatic speech recognition technology.
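
The selection idea in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes whitespace-tokenized text, binary bag-of-words vectors stored as Python integers (one bit per vocabulary word), and a simple greedy loop. The names `build_vocab`, `to_bitmask`, `hamming_weight`, `cosine_similarity`, and `select_texts` are hypothetical. The way the Hamming-weight step is combined with the cosine step (candidates that add no new words are dropped before the least similar one is picked) is likewise an assumption made for illustration.

```python
from math import sqrt


def build_vocab(texts):
    """Assign each distinct word a bit position (assumes whitespace tokens)."""
    vocab = {}
    for text in texts:
        for word in text.split():
            vocab.setdefault(word, len(vocab))
    return vocab


def to_bitmask(text, vocab):
    """Binary bag-of-words vector, stored as an integer bitmask."""
    mask = 0
    for word in text.split():
        mask |= 1 << vocab[word]
    return mask


def hamming_weight(mask):
    """Number of set bits, i.e. number of distinct words in the vector."""
    return bin(mask).count("1")


def cosine_similarity(a, b):
    """Cosine similarity of two binary vectors given as bitmasks."""
    na, nb = hamming_weight(a), hamming_weight(b)
    if na == 0 or nb == 0:
        return 0.0
    return hamming_weight(a & b) / sqrt(na * nb)


def select_texts(texts, target_coverage=1.0):
    """Greedily pick texts until the target word coverage is reached.

    Returns (indices of chosen texts, achieved word coverage).
    """
    vocab = build_vocab(texts)
    masks = [to_bitmask(t, vocab) for t in texts]
    covered, chosen = 0, []
    remaining = list(range(len(texts)))
    while remaining and hamming_weight(covered) < target_coverage * len(vocab):
        # Hamming-weight step (assumed): keep only texts that add new words.
        remaining = [i for i in remaining
                     if hamming_weight(masks[i] & ~covered) > 0]
        if not remaining:
            break
        # Cosine step: pick the text least similar to what is already covered;
        # ties go to the text contributing more new words.
        best = min(remaining,
                   key=lambda i: (cosine_similarity(masks[i], covered),
                                  -hamming_weight(masks[i] & ~covered)))
        chosen.append(best)
        covered |= masks[best]
        remaining.remove(best)
    coverage = hamming_weight(covered) / len(vocab) if vocab else 0.0
    return chosen, coverage


corpus = [
    "turn on the light",
    "turn off the light",
    "play some music",
    "turn on the music",
]
print(select_texts(corpus))  # ([0, 2, 1], 1.0): three of four texts suffice
```

In this toy corpus the fourth sentence is never selected because every word in it is already covered, which is the kind of saving the word-coverage criterion is meant to capture. A real pipeline would need a proper Chinese word segmenter in place of `str.split`.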

Keywords

automatic speech recognition / vector space model / cosine distance / Hamming weight / training set minimization

Classification

Information Technology and Security Science

Cite this article

Zhu Zhijun, Fu Lei. Minimizing Speech Datasets Based on Word Coverage Rate [J]. Software Guide, 2024, 23(5): 33-37, 5.

Software Guide (软件导刊)

ISSN 1672-7800
