Journal of Traditional Chinese Medicine (中医杂志), 2026, Vol. 67, Issue 8: 896-903 (8 pages). DOI: 10.13288/j.11-2166/r.2026.08.015
Construction of a Traditional Chinese Medicine Classics Corpus for Jaundice Based on Large Language Models
Abstract
Objective To construct a structured corpus of traditional Chinese medicine (TCM) classics on jaundice, providing data support for modern clinical practice and scientific research on jaundice and facilitating the deep mining and use of knowledge from TCM classics. Methods Open-source digitized TCM classics available on the internet were systematically integrated. Bibliometric methods were used to build a terminology database for jaundice, and Python regular expressions were used for keyword matching and corpus extraction. A five-element information extraction model of "disease-syndrome-symptom-treatment-efficacy" was designed. Annotation guidelines were developed, and a standard test set was formed through back-to-back (independent) annotation by two annotators followed by third-party arbitration. To select the optimal model for entity extraction, two large language models, DeepSeek-R1 and ChatGPT-o3, were called through their application programming interfaces (APIs) and their entity extraction performance was compared. Results A TCM classics corpus for jaundice was constructed, comprising 10,243 text passages from 561 ancient books dating from the pre-Qin period to the Qing Dynasty. In the comparative evaluation, DeepSeek-R1 outperformed ChatGPT-o3 in entity recognition across all 17 entity types. Using DeepSeek-R1 as the selected model, 41,407 entities of 17 types were extracted from the corpus, including disease names (6,238), syndrome types (1,622), formulas (2,631), and Chinese medicinal herbs (2,706). Conclusion The corpus constructed with the DeepSeek-R1 large language model covers the essential elements of syndrome differentiation and treatment for jaundice, laying a foundation for subsequent research on the historical evolution of jaundice and its syndrome differentiation and treatment rules in TCM. It also provides a replicable technical approach for AI-driven knowledge mining from TCM classics.
Key words
traditional Chinese medicine classics/jaundice/corpus/large language models/information extraction
Cite this article
ZHANG Yiran, LI Haiyan, NIE Ying. Construction of a Traditional Chinese Medicine Classics Corpus for Jaundice Based on Large Language Models[J]. Journal of Traditional Chinese Medicine, 2026, 67(8): 896-903 (8 pages).
Funding
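Scientific and Technological Innovation Project of the China Academy of Chinese Medical Sciences (CI2021B002)
Jiangsu Provincial Frontier Technology Research and Development Program (BF2025076)
Fundamental Research Funds of the China Academy of Chinese Medical Sciences (ZZ170307)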
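Illustrative sketch
The abstract mentions two steps that lend themselves to a short example: keyword matching with Python regular expressions, and a per-entity-type comparison of model output against an annotated test set. The following is a minimal sketch under stated assumptions, not the authors' implementation. The term list, the blank-line passage splitting, the exact-match scoring rule, the sample text, and the names `JAUNDICE_TERMS`, `extract_passages`, and `prf_by_type` are all hypothetical; the abstract does not describe the actual terminology database, segmentation, or evaluation metric.

```python
import re
from collections import Counter

# Hypothetical term list for illustration only; the paper's terminology
# database is not reproduced in the abstract.
JAUNDICE_TERMS = ["黄疸", "黄瘅", "谷疸", "酒疸", "女劳疸", "黑疸", "阳黄", "阴黄"]

# Put longer terms first so the alternation prefers the most specific match.
TERM_PATTERN = re.compile(
    "|".join(sorted(map(re.escape, JAUNDICE_TERMS), key=len, reverse=True))
)


def extract_passages(book_title, text):
    """Split text on blank lines and keep passages that mention any term."""
    hits = []
    for passage in re.split(r"\n\s*\n", text):
        passage = passage.strip()
        matched = sorted(set(TERM_PATTERN.findall(passage)))
        if matched:
            hits.append({"book": book_title, "passage": passage, "terms": matched})
    return hits


def prf_by_type(gold, predicted):
    """Exact-match precision, recall, and F1 per entity type.

    gold and predicted are sets of (entity_type, entity_text) pairs.
    Returns {entity_type: (precision, recall, f1)}.
    """
    true_pos = Counter(etype for etype, _ in gold & predicted)
    n_gold = Counter(etype for etype, _ in gold)
    n_pred = Counter(etype for etype, _ in predicted)
    scores = {}
    for etype in set(n_gold) | set(n_pred):
        p = true_pos[etype] / n_pred[etype] if n_pred[etype] else 0.0
        r = true_pos[etype] / n_gold[etype] if n_gold[etype] else 0.0
        f1 = 2 * p * r / (p + r) if p + r else 0.0
        scores[etype] = (p, r, f1)
    return scores


# Invented sample text (not a quotation from any classic).
sample_text = "黄疸之病，身目俱黄。\n\n咳嗽不止，痰多。\n\n阴黄者，色晦暗。"
hits = extract_passages("示例书", sample_text)

# Invented gold and predicted entity sets for one passage.
gold = {("disease", "黄疸"), ("formula", "茵陈蒿汤"), ("herb", "茵陈")}
pred = {("disease", "黄疸"), ("formula", "茵陈五苓散"), ("herb", "茵陈"), ("herb", "大黄")}
scores = prf_by_type(gold, pred)
```

Here `hits` keeps only the passages containing a listed term, and `scores` can be computed once per model so that the two models are compared type by type. Exact string matching is the strictest scoring choice; a real evaluation might also credit partial overlaps.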