Research on self-training neural machine translation based on monolingual priority sampling
To enhance the performance of neural machine translation (NMT) and ameliorate the harm that highly uncertain monolingual data causes to the NMT model during self-training, a self-training NMT model based on priority sampling was proposed. First, syntactic dependency trees were constructed via dependency parsing and the importance of each monolingual word was computed. Then, a monolingual lexicon was built, and a priority was defined from the importance and uncertainty of each monolingual word. Finally, monolingual priorities were computed and sentences were sampled according to these priorities, producing a synthetic parallel dataset used as training input for the student NMT model. Experimental results on a large-scale subset of the WMT English-German dataset demonstrate that the proposed model effectively improves NMT translation quality and mitigates the damage caused by excessive uncertainty.
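The pipeline described in the abstract (word importance derived from a dependency tree, a priority combining importance and uncertainty, then priority-weighted sampling of monolingual sentences for the student model) can be sketched as follows. This is a minimal illustration, not the paper's method: the depth-based importance score, the linear weight `lam`, and the without-replacement sampling routine are all assumptions made for the sketch.

```python
import random


def token_importance(depths):
    """Hypothetical importance score: tokens nearer the dependency-tree
    root (smaller depth) are treated as more important."""
    return [1.0 / (1 + d) for d in depths]


def sentence_priority(depths, uncertainties, lam=0.5):
    """Combine per-token importance with per-token uncertainty into a
    sentence-level priority; the weight lam is an assumed hyperparameter.
    Lower uncertainty raises the priority."""
    imps = token_importance(depths)
    score = sum(lam * i + (1 - lam) * (1 - u)
                for i, u in zip(imps, uncertainties))
    return score / len(depths)


def priority_sample(sentences, priorities, k, seed=0):
    """Sample k monolingual sentences without replacement, with
    probability proportional to priority (a stand-in for the paper's
    sampling step). The sampled sentences would then be forward-translated
    by the teacher model to form the synthetic parallel dataset."""
    rng = random.Random(seed)
    pool = list(zip(sentences, priorities))
    chosen = []
    for _ in range(min(k, len(pool))):
        total = sum(p for _, p in pool)
        r = rng.uniform(0, total)
        acc = 0.0
        for idx, (sent, p) in enumerate(pool):
            acc += p
            if acc >= r:
                chosen.append(sent)
                pool.pop(idx)
                break
    return chosen
```

In practice the depths would come from a dependency parser and the uncertainties from the teacher model's output distribution; here they are plain lists so the sampling logic stays self-contained.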
张笑燕;逄磊;杜晓峰;陆天波;夏亚梅
School of Computer Science (National Pilot Software Engineering School), Beijing University of Posts and Telecommunications, Beijing 100876, China
Computer Science and Automation
machine translation; data augmentation; self-training; uncertainty; syntactic dependency
《通信学报》 (Journal on Communications), 2024, No. 004
Pages 65-72 (8 pages)
Supported by The National Natural Science Foundation of China (No. 62162060)