Cross-Lingual Summarization Method Based on Joint Training and Self-Training in Low-Resource Scenarios
As globalization deepens, cross-lingual summarization has become an important task in natural language processing. In low-resource scenarios, existing methods suffer from limited representation transfer and insufficient data utilization. To address these issues, this paper proposes a cross-lingual summarization method based on joint training and self-training. The method uses two models to handle the machine translation task and the cross-lingual summarization task respectively, unifying the language vector space on the output side and thereby avoiding the limited representation transfer between models. In addition, the two models are jointly trained by aligning the output features and output probabilities of parallel training pairs, which strengthens semantic sharing between them. On top of joint training, a self-training technique is introduced that generates synthetic data from additional monolingual summarization data, effectively alleviating data scarcity in low-resource scenarios. Experimental results show that the proposed method outperforms existing baselines in multiple low-resource scenarios, achieving significant improvements in ROUGE scores.
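To make the abstract's two central ideas concrete, below is a minimal PyTorch sketch of (1) a joint-training alignment loss over the output features and probabilities of a parallel training pair, and (2) a self-training step that turns additional monolingual (document, summary) data into synthetic cross-lingual pairs. All function names, the specific loss choices (symmetric KL for probabilities, MSE for features), and the Hugging Face-style generate interface are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn.functional as F

def joint_alignment_loss(mt_logits, cls_logits, mt_features, cls_features,
                         prob_weight=1.0, feature_weight=1.0):
    """Align the MT and CLS models on a parallel training pair that
    shares the same target-language reference text.

    mt_logits / cls_logits:     (batch, seq_len, vocab) decoder logits
    mt_features / cls_features: (batch, seq_len, hidden) decoder states
    """
    # Probability alignment: symmetric KL between the two output distributions.
    mt_logp = F.log_softmax(mt_logits, dim=-1)
    cls_logp = F.log_softmax(cls_logits, dim=-1)
    kl = 0.5 * (F.kl_div(cls_logp, mt_logp.exp(), reduction="batchmean")
                + F.kl_div(mt_logp, cls_logp.exp(), reduction="batchmean"))
    # Feature alignment: pull the decoder hidden states toward each other.
    mse = F.mse_loss(cls_features, mt_features)
    return prob_weight * kl + feature_weight * mse

@torch.no_grad()
def make_synthetic_pairs(mt_model, tokenizer, mono_pairs, max_new_tokens=128):
    """One plausible self-training step: translate the summaries of
    monolingual (document, summary) pairs into the target language,
    yielding synthetic cross-lingual (document, target summary) pairs.
    A Hugging Face-style seq2seq model/tokenizer interface is assumed."""
    synthetic = []
    for doc, summary in mono_pairs:
        inputs = tokenizer(summary, return_tensors="pt", truncation=True)
        out_ids = mt_model.generate(**inputs, max_new_tokens=max_new_tokens)
        target_summary = tokenizer.decode(out_ids[0], skip_special_tokens=True)
        synthetic.append((doc, target_summary))  # used to further train CLS
    return synthetic
```

In this reading, both models decode into the same target language, so the alignment loss operates in a shared output space; the synthetic pairs from the second function would then be mixed with the scarce gold cross-lingual data in subsequent training rounds.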
程绍欢;唐煜佳;刘峤;陈文宇
School of Computer Science and Engineering, University of Electronic Science and Technology of China, Chengdu 611731
Computer Science and Automation
cross-lingual summarization; joint training; low-resource scenarios; machine translation; self-training
Journal of University of Electronic Science and Technology of China, 2024 (005)
Pages 762-770 (9 pages)
Supported by the Key Program of the Enterprise Joint Funds of the National Natural Science Foundation of China (U22B2061)