Construction of a Traditional Chinese Medicine Classics Corpus for Jaundice Based on Large Language Models

Zhang Yiran, Li Haiyan, Nie Ying

Journal of Traditional Chinese Medicine (中医杂志), 2026, Vol. 67, Issue 8: 896-903, 8. DOI: 10.13288/j.11-2166/r.2026.08.015

Construction of a Traditional Chinese Medicine Classics Corpus for Jaundice Based on Large Language Models

Zhang Yiran¹, Li Haiyan¹, Nie Ying¹

Author information

1. China Academy of Chinese Medical Sciences, No. 16 Nanxiaojie, Dongzhimennei, Dongcheng District, Beijing 100700

Abstract

Objective: To construct a structured corpus of traditional Chinese medicine (TCM) classics on jaundice, providing data support for modern clinical practice and scientific research on jaundice and facilitating the deep mining and use of knowledge from TCM classics.

Methods: Open-source digitized resources of TCM classics available on the internet were systematically integrated. Bibliometric methods were used to build a terminology database for jaundice, and Python regular expressions were used for keyword matching and corpus extraction. A five-element information extraction model of "disease-syndrome-symptom-treatment-efficacy" was designed. Annotation guidelines were developed, and a standard test set was formed through back-to-back annotation by two annotators, followed by arbitration by a third party. To select the best model for entity extraction, two large language models, DeepSeek-R1 and ChatGPT-o3, were compared by calling their application programming interfaces (APIs) to perform entity extraction.

Results: A TCM classics corpus for jaundice was constructed, comprising 10,243 pieces of text from 561 ancient books spanning the pre-Qin period to the Qing Dynasty. In the comparative evaluation, DeepSeek-R1 outperformed ChatGPT-o3 in entity recognition across all 17 entity types. Using the selected model, DeepSeek-R1, 41,407 entities covering 17 entity types were extracted from the corpus, including disease names (6,238), syndrome types (1,622), formulas (2,631), and Chinese medicinal herbs (2,706).

Conclusion: The corpus constructed with the DeepSeek-R1 large language model covers the essential elements of syndrome differentiation and treatment for jaundice, laying a solid foundation for subsequent research on the historical evolution of jaundice in TCM and on its syndrome differentiation and treatment rules. It also provides a replicable technical paradigm for AI-driven knowledge mining from TCM classics.
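The keyword-matching step mentioned in the Methods can be illustrated with a minimal Python sketch. It is not the authors' code: the term list `JAUNDICE_TERMS`, the function names `build_pattern` and `extract_passages`, the record layout, and the choice to treat each line as one passage are all assumptions made for illustration, since the abstract does not give the actual terminology database or segmentation rules.

```python
import re

# Hypothetical jaundice-related terms. The paper's real terminology database
# is not listed in the abstract, so this list is only a placeholder.
JAUNDICE_TERMS = ["黄疸", "黄瘅", "谷疸", "酒疸", "女劳疸", "阳黄", "阴黄"]


def build_pattern(terms):
    """Compile one alternation pattern from a list of terms.

    Longer terms are placed first so that a long term is preferred over a
    shorter term that happens to be its prefix.
    """
    ordered = sorted(set(terms), key=lambda t: (-len(t), t))
    return re.compile("|".join(re.escape(t) for t in ordered))


def extract_passages(book_title, text, pattern):
    """Return one record per passage that contains at least one term.

    Assumption: each non-empty line of `text` is treated as one passage.
    """
    records = []
    for line in text.splitlines():
        passage = line.strip()
        if not passage:
            continue
        hits = pattern.findall(passage)
        if hits:
            records.append(
                {
                    "book": book_title,
                    "text": passage,
                    "terms": sorted(set(hits)),
                }
            )
    return records


if __name__ == "__main__":
    # Illustrative sample text only; it is not drawn from the paper's corpus.
    sample = "黄疸之病，当以十八日为期。\n咳嗽上气，喘鸣迫塞。\n酒疸，心中懊憹而热。\n"
    pattern = build_pattern(JAUNDICE_TERMS)
    for record in extract_passages("示例书", sample, pattern):
        print(record["book"], record["terms"], record["text"])
```

Compiling all terms into a single pattern keeps the scan to one pass per passage, and storing the matched terms alongside the passage makes it easy to audit why a piece of text entered the corpus.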

Key words

traditional Chinese medicine classics / jaundice / corpus / large language models / information extraction

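Cite this article

Zhang Yiran, Li Haiyan, Nie Ying. Construction of a Traditional Chinese Medicine Classics Corpus for Jaundice Based on Large Language Models [J]. Journal of Traditional Chinese Medicine (中医杂志), 2026, 67(8): 896-903, 8.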

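Funding

Science and Technology Innovation Project of the China Academy of Chinese Medical Sciences (CI2021B002)

Jiangsu Province Frontier Technology Research and Development Program (BF2025076)

Basic Scientific Research Funds of the China Academy of Chinese Medical Sciences (ZZ170307)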

Journal of Traditional Chinese Medicine (中医杂志)

ISSN: 1001-1668
