
Research on Evaluation of Token Imbalance Degree in NMT Corpus

Wang Haibo (王海波) ¹, Yu Lili (余丽丽) ², Wang Hongwei (王宏伟) ³

Acta Electronica Sinica (电子学报), 2023, Vol. 51, Issue 10: 2884-2893 (10 pages). DOI: 10.12263/DZXB.20211369

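Author information

1. College of Biomedical Engineering and Instrument Science, Zhejiang University, Hangzhou, Zhejiang 310027
2. College of Teacher Education, Zhejiang Normal University, Jinhua, Zhejiang 321004
3. Zhejiang University-University of Illinois Urbana-Champaign Institute, Zhejiang University, Haining, Zhejiang 314499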

Abstract

Token imbalance is a common phenomenon in neural machine translation (NMT) corpora. Evaluating the degree of token imbalance in an NMT corpus is important for improving both corpus quality and translation quality. To address the shortcomings of existing measures of token imbalance, both in the algorithms they use and in the range of word segmentation they cover, this paper proposes the dispersion of token distribution (DTD) algorithm for calculating the token imbalance degree, expands the word segmentation scope, and evaluates corpora at three granularities: character, subword, and word. The experimental results show that the proposed algorithm is substantially more accurate, valid, and robust than those of previous studies. The token imbalance degree of a corpus differs greatly across segmentation granularities: it is highest at character granularity, followed by subword granularity and then word granularity.

Key words

neural machine translation / corpus / word segmentation / granularity / token imbalance degree

Classification

Information Technology and Security Science

Cite this article

Wang Haibo, Yu Lili, Wang Hongwei. Research on Evaluation of Token Imbalance Degree in NMT Corpus [J]. Acta Electronica Sinica, 2023, 51(10): 2884-2893.

Funding

National Key Research and Development Program of China (No. 2020YFB1707803)

Zhejiang University Research Project (No. XY2021018)

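Acta Electronica Sinica (电子学报)

Open access (OA); indexed in the Peking University Core Journals list (北大核心), CSCD, and CSTPCD

ISSN: 0372-2112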
