China Scientific Data (Chinese and English online edition), 2026, Vol. 11, Issue 1: 54-67, 14. DOI: 10.11922/11-6035.csd.2025.0007.zh
BodSUM-6000: A dataset of abstractive Tibetan text summarization
Abstract
Datasets are the foundation of automatic text summarization, and their quality directly determines the performance of summarization systems. Research on Tibetan text summarization currently lacks publicly available, high-quality open-source datasets, which has constrained progress in the field. To fill this gap, news texts were collected from various Tibetan websites, and a human-written summary was created for each text. Both subjective and objective evaluation methods were then used to assess how well each Tibetan news text matches its human-written summary, ensuring data quality. The result is BodSUM-6000, a high-quality abstractive Tibetan text summarization dataset consisting of 6,000 text-summary pairs. The dataset provides data resources for training and evaluating abstractive Tibetan text summarization models, helps address the limited generalizability and accuracy of current Tibetan text summarization techniques, and promotes further research on Tibetan automatic summarization.
Key words
Tibetan/abstractive text summarization/dataset construction/quality evaluation
Cite this article
夏吾吉, 黄鹤鸣, 贡保杰布. BodSUM-6000: A dataset of abstractive Tibetan text summarization [J]. China Scientific Data (Chinese and English online edition), 2026, 11(1): 54-67, 14.
Funding
National Natural Science Foundation of China (62066039, 62166034)
Qinghai Provincial Natural Science Foundation (2022-ZJ-925)