Acta Electronica Sinica, 2023, Vol. 51, Issue 10: 2884-2893 (10 pages). DOI: 10.12263/DZXB.20211369
NMT语料库中语符不平衡度的测评研究
Research on Evaluation of Token Imbalance Degree in NMT Corpus
Abstract
Token imbalance is a common phenomenon in neural machine translation (NMT) corpora. Evaluating the token imbalance degree of an NMT corpus is important for improving both corpus quality and translation performance. To address the shortcomings of existing studies in both the measurement algorithm and the scope of word segmentation, this paper proposes the dispersion of token distribution (DTD) algorithm to calculate the token imbalance degree, expands the word segmentation scope, and evaluates the corpus at three granularities: character, subword, and word. The experimental results show that the accuracy, validity, and robustness of the proposed algorithm are greatly improved compared with previous studies. The token imbalance degree of a corpus differs greatly across word segmentation granularities: character granularity has the highest token imbalance degree, followed by subword granularity and then word granularity.
Keywords
neural machine translation / corpus / word segmentation / granularity / token imbalance degree
Classification
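Information Technology and Security Science
Cite this article
WANG Haibo, YU Lili, WANG Hongwei. Research on Evaluation of Token Imbalance Degree in NMT Corpus [J]. Acta Electronica Sinica, 2023, 51(10): 2884-2893.
Funding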
National Key Research and Development Program of China (No. 2020YFB1707803)
Zhejiang University Research Project (No. XY2021018)