China Scientific Data (Chinese-English online edition), 2026, Vol. 11, Issue 1: 68-81, 14. DOI: 10.11922/11-6035.csd.2025.0090.zh
XBMU-MC: A Multilingual Parallel Corpus
Abstract
Multilingual parallel corpora are a basic resource for natural language processing: they help break through language barriers and improve the efficiency and quality of cross-language tasks. To address the relative scarcity of parallel corpora for low-resource languages (Chinese-Tibetan, Chinese-Uyghur, Chinese-Mongolian), this study constructs a parallel corpus, XBMU-MC (Northwest Minzu University-Multilingual Corpus), for multilingual machine translation and cross-language information retrieval tasks. The original corpus consists of manually constructed bilingual texts and web-crawled, publicly available multilingual data. For the manually constructed part, researchers wrote Chinese texts in the fields of culture, science and technology, and society, and the target-language translations were obtained through machine translation followed by manual proofreading. The web-crawled part captures original texts from mainstream Tibetan, Mongolian, and Uyghur news websites to expand the scale of the corpus and the range of domains covered. Data enhancement (augmentation) techniques are then used to expand the existing corpus, and the data are strictly screened to ensure alignment consistency and translation accuracy between the source and target languages for each parallel pair. In the end, 21,579 high-quality samples were selected and stored in JSON format, each with three attributes: instruction, input, and output. Multilingual translation comparison experiments between the XBMU-MC dataset and the open-source MMDS dataset, based on the GLM4-9B, Qwen3-8B, and Deepseek-R1-8B models, show an average increase in BLEU scores of 21.16, 20.51, and 15.37 percentage points. This dataset can help mitigate the shortage of parallel corpora for low-resource languages to a certain extent and effectively support model training and multilingual translation tasks. It can also serve as a corpus basis for cross-lingual large language models in cross-language alignment, instruction fine-tuning, and intelligent information processing applications in ethnic languages, helping to improve the performance of related tasks.
Key words
Chinese-Tibetan / Chinese-Mongolian / Chinese-Uyghur / parallel corpus / data enhancement
严琦栋, 马宁, 巴桑珠扎, 白玛曲扎, 艾科拜·依米提, 麦迪那木·吾斯曼, 木巴热克·阿布力克木, 苏日娜, 阿力娅. XBMU-MC: A Multilingual Parallel Corpus [J]. China Scientific Data (Chinese-English online edition), 2026, 11(1): 68-81, 14.
Funding
National Natural Science Foundation of China Project (62466052)
Gansu Province Central Guidance Local Science and Technology Development Fund Project (25ZYJA034)
Fundamental Research Funds for the Central Universities of Northwest Minzu University (31920250008)
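The abstract states that each sample is stored in JSON with three attributes: instruction, input, and output. A minimal sketch of what such a record might look like, and how it could be checked, follows. The instruction wording, the placeholder texts, and the helper names `make_record` and `is_valid_record` are assumptions for illustration only; the source does not specify them or the actual file layout.

```python
import json

# The three attribute names come from the abstract; everything else is assumed.
REQUIRED_KEYS = ("instruction", "input", "output")


def make_record(instruction, source_text, target_text):
    """Build one sample with the three attributes named in the abstract."""
    return {"instruction": instruction, "input": source_text, "output": target_text}


def is_valid_record(record):
    """Return True if the record has exactly the three keys, each a non-empty string."""
    if not isinstance(record, dict) or set(record) != set(REQUIRED_KEYS):
        return False
    return all(isinstance(record[k], str) and record[k].strip() for k in REQUIRED_KEYS)


# Hypothetical sample: placeholder strings stand in for real corpus text.
sample = make_record(
    "Translate the following Chinese text into Tibetan.",
    "<Chinese source sentence>",
    "<Tibetan target sentence>",
)

# Round-trip through JSON; ensure_ascii=False keeps non-Latin scripts readable.
encoded = json.dumps([sample], ensure_ascii=False, indent=2)
decoded = json.loads(encoded)
```

Writing with `ensure_ascii=False` is a design choice rather than a requirement: it keeps Tibetan, Mongolian, and Uyghur text human-readable in the file instead of escaping it as `\uXXXX` sequences.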