
Research on Word Segmentation of Ancient Books Based on Domain Large Language Model (基于领域大语言模型的古籍分词研究)

朱丹浩, 赵志枭, 吴娜, 王希羽, 孙光耀, 王东波

科技情报研究 (Scientific Information Research), 2024, Vol. 6, Issue 2: 11-20, 10.

DOI: 10.19809/j.cnki.kjqbyj.2024.02.002

Research on Word Segmentation of Ancient Books Based on Domain Large Language Model

朱丹浩 ¹, 赵志枭 ², 吴娜 ², 王希羽 ², 孙光耀 ², 王东波 ²

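Author information

1. Department of Criminal Science and Technology, Jiangsu Police Institute, Nanjing 210031
2. College of Information Management, Nanjing Agricultural University, Nanjing 210095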

Abstract

[Purpose/significance] Taking the automatic word segmentation of ancient books as an entry point, this paper introduces the "Xunzi" series of large language models and explores the performance of large language models on the task of segmenting ancient texts. [Method/process] An instruction dataset is constructed from the Zuozhuan, then cleaned and organised. From this dataset, 1,000 items are extracted as test data, and 500, 1,000, 2,000, and 5,000 items are used in turn as training data for instruction fine-tuning, with performance tested at each size. [Result/conclusion] The experimental results show that a large language model needs only a relatively small amount of data to reach satisfactory performance; the Xunzi-Qwen-7B model performs best, with an F1 value of 84.54% when the amount of fine-tuning data reaches 5,000 items.
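The abstract reports results as an F1 value over segmented text built from instruction data. As a minimal sketch of how such a setup is commonly arranged, the Python below builds one instruction record and scores a prediction with word-level span F1. The abstract does not give the prompt wording, the output separator, or the exact F1 definition, so the names `build_record` and `segmentation_f1`, the "/" separator, the instruction text, and the sample sentence segmentation are all assumptions made for illustration.

```python
from typing import Dict, List, Set, Tuple


def build_record(words: List[str], sep: str = "/") -> Dict[str, str]:
    """Build one instruction-tuning record from a gold word list.

    The instruction wording and the separator are hypothetical.
    """
    return {
        "instruction": "Segment the following classical Chinese sentence into words.",
        "input": "".join(words),
        "output": sep.join(words),
    }


def to_spans(words: List[str]) -> Set[Tuple[int, int]]:
    """Convert a word list into a set of (start, end) character offsets."""
    spans = set()
    start = 0
    for word in words:
        end = start + len(word)
        spans.add((start, end))
        start = end
    return spans


def segmentation_f1(
    gold: List[List[str]], pred: List[List[str]]
) -> Tuple[float, float, float]:
    """Micro-averaged precision, recall and F1 over word spans.

    A predicted word counts as correct only if both of its
    boundaries match a gold word exactly.
    """
    correct = n_gold = n_pred = 0
    for gold_words, pred_words in zip(gold, pred):
        gold_spans = to_spans(gold_words)
        pred_spans = to_spans(pred_words)
        correct += len(gold_spans & pred_spans)
        n_gold += len(gold_spans)
        n_pred += len(pred_spans)
    precision = correct / n_pred if n_pred else 0.0
    recall = correct / n_gold if n_gold else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


# Illustrative segmentation only; not taken from the paper's dataset.
gold_words = ["鄭伯", "克", "段", "于", "鄢"]
pred_words = ["鄭伯", "克", "段", "于鄢"]

record = build_record(gold_words)
precision, recall, f1 = segmentation_f1([gold_words], [pred_words])
print(record["output"])
print(f"P={precision:.4f} R={recall:.4f} F1={f1:.4f}")
```

In this example the prediction merges the last two gold words, so three of its four words match gold spans (precision 0.75) and three of the five gold words are recovered (recall 0.60). Micro-averaging over spans is one common convention; a per-sentence average would give different numbers on a full test set.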

Keywords

"Xunzi" large language model / Zuozhuan / word segmentation / instruction tuning

Classification

Social sciences

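Cite this article

朱丹浩, 赵志枭, 吴娜, 王希羽, 孙光耀, 王东波. Research on Word Segmentation of Ancient Books Based on Domain Large Language Model [J]. 科技情报研究 (Scientific Information Research), 2024, 6(2): 11-20, 10.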

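Funding

Major Project of the National Social Science Fund of China, "Construction and Application of a Cross-lingual Knowledge Base of Ancient Chinese Classics" (Grant No. 21&ZD331)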

科技情报研究 (Scientific Information Research)

OA, CSSCI

ISSN: 2096-7144
