计算机与现代化 (Computer and Modernization), Issue (6): 19-24, 6. DOI: 10.3969/j.issn.1006-2475.2024.06.004
面向藏汉神经机器翻译的数据筛选方法
Data Filtering Strategies for Tibetan-Chinese Neural Machine Translation
Abstract
Traditional data augmentation methods introduce syntactic and semantic losses into the data used for Tibetan-Chinese machine translation. To address this issue, this paper proposes a pseudo-data filtering method that combines sentence perplexity with semantic similarity, built on top of traditional data augmentation methods. This strategy addresses the inadequate quality and scarcity of parallel data, particularly in low-resource settings. The results show that pseudo-data filtering significantly improves bidirectional translation on both the Tibetan-Chinese and English-Chinese tasks. The proposed method reduces the grammatical and semantic defects of the translation model, improving both the performance of the translation system and the generalization ability of the model, which verifies its effectiveness.
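The two-criterion filter named in the abstract can be illustrated with a minimal sketch. It assumes each pseudo pair consists of a real sentence and a synthetic (for example back-translated) sentence, that some language model supplies per-token log-probabilities for the synthetic side, and that some cross-lingual encoder maps both sides into a shared vector space. The names `filter_pseudo_pairs`, `log_probs_fn`, `embed_fn`, `max_ppl` and `min_sim` are hypothetical; the scoring models and thresholds actually used in the paper are not stated in this abstract.

```python
import math
from typing import Callable, Iterable, List, Sequence, Tuple

# (real sentence, synthetic sentence); this pairing is an assumption of the sketch.
Pair = Tuple[str, str]


def perplexity(token_log_probs: Sequence[float]) -> float:
    """Perplexity from per-token natural-log probabilities."""
    if not token_log_probs:
        return float("inf")  # treat an empty sentence as unusable
    return math.exp(-sum(token_log_probs) / len(token_log_probs))


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity between two vectors; 0.0 if either has zero norm."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def filter_pseudo_pairs(
    pairs: Iterable[Pair],
    log_probs_fn: Callable[[str], Sequence[float]],  # hypothetical LM scorer
    embed_fn: Callable[[str], Sequence[float]],      # hypothetical sentence encoder
    max_ppl: float,                                  # hypothetical threshold
    min_sim: float,                                  # hypothetical threshold
) -> List[Pair]:
    """Keep pairs whose synthetic side is fluent and close in meaning to the real side."""
    kept: List[Pair] = []
    for real, synthetic in pairs:
        # Fluency check: drop synthetic sentences the language model finds unlikely.
        if perplexity(log_probs_fn(synthetic)) > max_ppl:
            continue
        # Adequacy check: drop pairs whose embeddings point in different directions.
        if cosine(embed_fn(real), embed_fn(synthetic)) < min_sim:
            continue
        kept.append((real, synthetic))
    return kept
```

In this sketch the perplexity test targets grammatical defects and the similarity test targets semantic drift, so a pair has to pass both to be kept. In practice the two callables would wrap a trained language model and a multilingual sentence encoder, and the thresholds would be tuned on held-out data.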
Key words
back translation/data selection/Tibetan-Chinese neural machine translation/perplexity/semantic similarity
Classification
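Information Technology and Security Science
Cite this article
仁青卓玛, 拥措, 唐超超. Data Filtering Strategies for Tibetan-Chinese Neural Machine Translation [J]. 计算机与现代化 (Computer and Modernization), 2024, (6): 19-24, 6.
Funding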
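Science and Technology Innovation 2030 "New Generation Artificial Intelligence" Major Project (2022ZD0116100)
Independent Research Project of the Science and Technology Innovation Base of the Tibet Autonomous Region (XZ2021JR0002G)
Tibet University Discipline Construction Capacity Enhancement Program (藏财预指[2023]1号)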