
XBMU-MC: A Multilingual Parallel Corpus


China Scientific Data (Chinese and English Online Edition), 2026, Vol. 11, Issue 1: 68-81, 14. DOI: 10.11922/11-6035.csd.2025.0090.zh


严琦栋 1, 马宁 1, 巴桑珠扎 2, 白玛曲扎 2, 艾科拜·依米提 2, 麦迪那木·吾斯曼 2, 木巴热克·阿布力克木 2, 苏日娜 3, 阿力娅 3

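Author information

  • 1. Key Laboratory of Linguistic and Cultural Computing, Ministry of Education, Northwest Minzu University, Lanzhou 730030; Key Laboratory of Ethnic Language and Cultural Intelligent Information Processing of Gansu Province, Northwest Minzu University, Lanzhou 730030
  • 2. School of Mathematics and Computer Science, Northwest Minzu University, Lanzhou 730030
  • 3. Faculty of Chinese Language and Literature, Northwest Minzu University, Lanzhou 730030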


Abstract

Multilingual parallel corpora are a basic resource that helps natural language processing overcome language barriers and improve the efficiency and quality of cross-language tasks. To address the relative scarcity of parallel corpora for low-resource language pairs (Chinese-Tibetan, Chinese-Uyghur, Chinese-Mongolian), this study constructs a parallel corpus, XBMU-MC (Northwest Minzu University-Multilingual Corpus), for multilingual machine translation and cross-language information retrieval tasks. The original corpus consists of manually constructed bilingual texts and web-crawled, publicly available multilingual data. The manually constructed part consists of Chinese texts written by researchers in the fields of culture, science and technology, and society, with target-language translations obtained through machine translation followed by manual proofreading; the web-crawled part collects original texts from mainstream Tibetan, Mongolian, and Uyghur news websites to expand the scale of the corpus and the range of domains covered. Data augmentation techniques are then used to expand the existing corpus, and the data are strictly screened to ensure alignment consistency and translation accuracy between the source and target languages for each parallel pair. In the end, 21,579 high-quality samples were selected and stored in JSON format, each with three attributes: instruction, input, and output. In comparative multilingual translation experiments between the XBMU-MC dataset and the open-source MMDS dataset, based on the GLM4-9B, Qwen3-8B, and Deepseek-R1-8B models, BLEU scores increase on average by 21.16, 20.51, and 15.37 percentage points. This dataset can help mitigate, to a certain extent, the shortage of parallel corpora for low-resource languages and support model training and multilingual translation tasks. It can also serve as a corpus basis for cross-lingual large language models in cross-language alignment, instruction fine-tuning, and intelligent information processing applications for ethnic languages, helping to improve performance on related tasks.
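
The abstract states that each sample is stored in JSON with three attributes: instruction, input, and output. A minimal sketch of what such a record might look like, and how it could be checked, is given below. The helper names (`make_record`, `is_valid_record`), the instruction wording, and the placeholder translation are assumptions for illustration only; they are not taken from the dataset.

```python
import json

# The three attribute names come from the abstract; everything else is illustrative.
REQUIRED_FIELDS = ("instruction", "input", "output")


def make_record(instruction: str, source_text: str, target_text: str) -> dict:
    """Build one sample with the three attributes named in the abstract."""
    return {"instruction": instruction, "input": source_text, "output": target_text}


def is_valid_record(record: dict) -> bool:
    """Return True if the record has exactly the three fields, each a non-empty string."""
    return set(record) == set(REQUIRED_FIELDS) and all(
        isinstance(record[key], str) and bool(record[key].strip())
        for key in REQUIRED_FIELDS
    )


# Hypothetical sample: the wording and the placeholder are not real dataset content.
sample = make_record(
    instruction="Translate the following Chinese sentence into Tibetan.",
    source_text="今天天气很好。",
    target_text="<Tibetan translation placeholder>",
)

# ensure_ascii=False keeps non-Latin scripts readable in the serialized text.
serialized = json.dumps([sample], ensure_ascii=False, indent=2)
loaded = json.loads(serialized)
```

Serializing with `ensure_ascii=False` is one reasonable choice for a corpus of Chinese, Tibetan, Mongolian, and Uyghur text, since it avoids `\uXXXX` escapes in the stored file; the actual dataset may use a different convention.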

Key words

Chinese-Tibetan / Chinese-Mongolian / Chinese-Uyghur / parallel corpus / data augmentation

Cite this article

严琦栋, 马宁, 巴桑珠扎, 白玛曲扎, 艾科拜·依米提, 麦迪那木·吾斯曼, 木巴热克·阿布力克木, 苏日娜, 阿力娅. XBMU-MC: A Multilingual Parallel Corpus[J]. China Scientific Data (Chinese and English Online Edition), 2026, 11(1): 68-81, 14.

Funding

National Natural Science Foundation of China Project (62466052)

Gansu Province Central Guidance Local Science and Technology Development Fund Project (25ZYJA034)

Fundamental Research Funds for the Central Universities of Northwest Minzu University (31920250008)

China Scientific Data (Chinese and English Online Edition), ISSN 2096-2223
