Research on Evaluation of Token Imbalance Degree in NMT Corpus
Token imbalance is pervasive in neural machine translation (NMT) corpora, and measuring its degree matters for improving both corpus quality and translation performance. To address shortcomings in the algorithms and segmentation scope of existing work on measuring token imbalance, this paper proposes the Dispersion of Token Distribution (DTD) algorithm for computing the token imbalance degree, and widens the segmentation scope by evaluating corpora at three granularities: character, subword, and word. Experimental results show that the proposed algorithm substantially improves on previous studies in accuracy, validity, and robustness. The token imbalance degree of a corpus also varies greatly with segmentation granularity: it is highest at character granularity, lower at subword granularity, and lowest at word granularity.
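The abstract does not give the DTD formula, so the following is only a minimal sketch of the general idea of scoring token imbalance at more than one granularity. It uses a generic stand-in measure (one minus normalized Shannon entropy of the token frequencies), which is an assumption and not the paper's algorithm. The names `imbalance` and `tokenize` and the toy corpus are hypothetical. Subword granularity is left out because it would need a trained segmenter such as BPE.

```python
import math
from collections import Counter


def imbalance(tokens):
    """Return 1 - normalized Shannon entropy of the token frequencies.

    0.0 means all token types are equally frequent; values near 1.0 mean
    a few types dominate. Generic stand-in, NOT the paper's DTD formula.
    """
    counts = Counter(tokens)
    if len(counts) < 2:
        return 0.0  # a single token type has no spread to measure
    total = sum(counts.values())
    entropy = -sum((c / total) * math.log(c / total) for c in counts.values())
    return 1.0 - entropy / math.log(len(counts))


def tokenize(sentences, granularity):
    """Split sentences into tokens at 'char' or 'word' granularity."""
    if granularity == "char":
        return [ch for s in sentences for ch in s if not ch.isspace()]
    if granularity == "word":
        return [w for s in sentences for w in s.split()]
    raise ValueError(f"unsupported granularity: {granularity}")


# Toy data for illustration only; a real evaluation would use an NMT corpus.
corpus = ["the cat sat on the mat", "the dog sat on the log"]
scores = {g: imbalance(tokenize(corpus, g)) for g in ("char", "word")}
```

Here `scores` maps each granularity to a value between 0 and 1. Comparing those values across granularities mirrors the kind of comparison the abstract reports, although the numbers from this stand-in measure need not agree with DTD.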
Wang Haibo (王海波); Yu Lili (余丽丽); Wang Hongwei (王宏伟)
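College of Biomedical Engineering and Instrument Science, Zhejiang University, Hangzhou, Zhejiang 310027; College of Teacher Education, Zhejiang Normal University, Jinhua, Zhejiang 321004; Zhejiang University-University of Illinois Urbana-Champaign Institute, Haining, Zhejiang 314499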
Computer Science and Automation
neural machine translation; corpus; word segmentation granularity; token imbalance degree
Acta Electronica Sinica, 2023 (10)
pp. 2884-2893 (10 pages)
National Key Research and Development Program of China (No. 2020YFB1707803); Zhejiang University Research Project (No. XY2021018)