BodSUM-6000: A dataset of abstractive Tibetan text summarization

China Scientific Data (中国科学数据, Chinese and English online edition), 2026, Vol. 11, Issue 1: 54-67, 14. DOI: 10.11922/11-6035.csd.2025.0007.zh

夏吾吉¹, 黄鹤鸣¹, 贡保杰布¹

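Author information

1. School of Computer Science, Qinghai Normal University, Xining 810008; State Key Laboratory of Tibetan Intelligence, Qinghai Normal University, Xining 810008; Key Laboratory of Tibetan Information Processing, Ministry of Education, Qinghai Normal University, Xining 810008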

Abstract

Datasets are the foundation of automatic text summarization, and their quality directly determines the performance of summarization systems. Research on Tibetan text summarization currently lacks publicly available, high-quality open-source datasets, which has constrained progress in the field. To fill this gap, news texts were collected from various Tibetan websites, and a human-written summary was created for each text. Both subjective and objective evaluation methods were then used to assess how well each Tibetan news text matches its human-written summary, ensuring the quality of the data. The result is BodSUM-6000, a high-quality dataset for abstractive Tibetan text summarization consisting of 6,000 text-summary pairs. The dataset provides a resource for training and evaluating abstractive Tibetan text summarization models, helps address the limited generalizability and accuracy of current Tibetan text summarization techniques, and promotes further research on Tibetan automatic summarization.

Keywords

Tibetan; abstractive text summarization; dataset construction; quality evaluation

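Cite this article

夏吾吉, 黄鹤鸣, 贡保杰布. BodSUM-6000: A dataset of abstractive Tibetan text summarization [J]. China Scientific Data (Chinese and English online edition), 2026, 11(1): 54-67, 14.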

Funding

National Natural Science Foundation of China (62066039, 62166034)

Qinghai Provincial Natural Science Foundation (2022-ZJ-925)

China Scientific Data (Chinese and English online edition)

ISSN: 2096-2223
