
Research on self-training neural machine translation based on monolingual priority sampling
OA | Peking University Core Journal | CSTPCD

Abstract

To enhance the performance of neural machine translation (NMT) and mitigate the damage that highly uncertain monolingual data causes to the NMT model during self-training, a self-training NMT model based on priority sampling was proposed. First, syntactic dependency trees were constructed through dependency parsing and the importance of each word in the monolingual data was computed. Next, a monolingual lexicon was built, and priority was defined from word importance and uncertainty. Finally, monolingual priorities were computed and sentences were sampled according to these priorities, yielding a synthetic parallel dataset used as training input for the student NMT model. Experimental results on a large-scale subset of the WMT English-German dataset show that the proposed model effectively improves NMT translation quality and mitigates the harm caused by excessive uncertainty.
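The sampling step described above can be sketched as follows. The abstract does not give the exact priority formula, so this is a minimal illustration under stated assumptions: `priority` is a hypothetical linear combination that rewards importance and penalizes uncertainty (both assumed normalized to [0, 1]), and sentences are then drawn with probability proportional to their priority.

```python
import random

def priority(importance, uncertainty, lam=0.5):
    # Hypothetical priority score: higher importance raises priority,
    # higher uncertainty lowers it. `lam` balances the two terms;
    # the paper's actual definition may differ.
    return lam * importance + (1 - lam) * (1 - uncertainty)

def sample_monolingual(sentences, importances, uncertainties, k, seed=0):
    # Draw k monolingual sentences with probability proportional to
    # their priority, to be translated into synthetic parallel data.
    rng = random.Random(seed)
    prios = [priority(i, u) for i, u in zip(importances, uncertainties)]
    total = sum(prios)
    weights = [p / total for p in prios]
    return rng.choices(sentences, weights=weights, k=k)
```

A sentence whose words are important (e.g. near the root of the dependency tree) and whose model uncertainty is low would receive a high priority and thus be sampled more often for the student model's training set.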

张笑燕;逄磊;杜晓峰;陆天波;夏亚梅

School of Computer Science (National Pilot Software Engineering School), Beijing University of Posts and Telecommunications, Beijing 100876, China

Computer and Automation

machine translation; data augmentation; self-training; uncertainty; syntactic dependency

《通信学报》 (Journal on Communications), 2024, No. 4

Pages 65-72 (8 pages)

Supported by The National Natural Science Foundation of China (No. 62162060)

DOI: 10.11959/j.issn.1000-436x.2024066
