农业图书情报学报 (Journal of Library and Information Science in Agriculture), 2025, Vol. 37, Issue 12: 81-94, 14. DOI: 10.13998/j.cnki.issn1002-1248.25-0422
融合思维链的中医药古籍多任务知识抽取方法研究
A Multi-Task Knowledge Extraction Method for Traditional Chinese Medicine Ancient Books Integrating Chain-of-Thought
Abstract
[Purpose/Significance] Although traditional Chinese medicine (TCM) classics contain valuable knowledge, they remain difficult to process automatically because of their complex page layouts, the coexistence of traditional and simplified variant characters, alias-rich terminology, and strong cross-paragraph semantic dependencies. Existing pipelines often treat optical character recognition (OCR), normalization, entity recognition, relation extraction, and entity alignment as separate stages, which leads to error propagation. In addition, many studies focus on modern clinical texts rather than historical sources. This paper addresses these gaps by presenting an end-to-end pipeline that transforms images of ancient book pages into a structured knowledge graph. The central contribution is CoTCMKE, a chain-of-thought (CoT) and ontology-constrained joint model that performs named entity recognition (NER), relation extraction (RE), and entity alignment (EA) simultaneously. By making intermediate reasoning explicit and binding predictions to a TCM ontology, the framework improves batch digitization efficiency, extraction accuracy, and interpretability for digital humanities and library and information science (LIS) applications.

[Method/Process] We built a unified pipeline with three steps. 1) Text recognition: a multimodal large language model (MLLM) recognizes text directly from complex pages with mixed vertical and horizontal layouts and performs context-aware traditional-to-simplified conversion. 2) Ontology construction: following the principles of semantic completeness, multimodal friendliness, evolvability, and interoperability, experts curate an ontology of core TCM concepts (e.g., diseases, symptoms, formulae, herbs) with aliases and constraints to guide decoding and ensure consistency. 3) Knowledge extraction: CoTCMKE integrates CoT with ontology constraints for multi-task extraction, namely entity localization and normalization, ontology-consistent relation generation, and cross-passage/cross-volume entity alignment. Constraint-aware decoding uses immediate checks and backtracking when a generated entity or relation violates ontology rules or alias mappings. For data, we used Shang Han Lun. Qwen2.5-VL-32B assisted with OCR, conversion, and initial auto-labeling; two TCM-trained annotators independently reviewed and reconciled the results. The final datasets contain 2,340 NER items, 1,880 RE items, and 450 EA pairs, and were evaluated with 10-fold cross-validation. The MLLM was adapted via LoRA with early stopping. The comparisons include traditional deep models, a unified IE framework, prompt-only inference, and a LoRA-SFT baseline.

[Results/Conclusions] On Shang Han Lun, CoTCMKE outperformed LoRA-SFT by +3.1 F1 for NER, +1.6 for RE, and +1.3 for EA. In cross-book transfer to Jin Kui Yao Lue, the model maintained stable performance without retraining, indicating robustness and scalability. Ablation results showed that CoT reduced boundary and ambiguity errors, while ontology constraints curbed illegal triples and alias fragmentation. Combining both yielded the best overall results. The analysis yielded the following observations: 1) explicit medical relation templates act as semantic guardrails; 2) proactive alias consolidation before decoding reduces entity scattering and improves alignment; 3) explicit type-path guidance helps disambiguate fine-grained categories (e.g., pulse findings vs. general symptoms). The framework supports the automatic construction of "formula-symptom-herb" triples, as well as alias and variant normalization. It also supports evidence-linked semantic search and navigation, which benefit LIS workflows, education, and research. Current limitations include the scope of the curated ontology and the study's focus on two classics. Future work will extend to additional TCM classics and broader historical corpora, support continual incremental learning, and deliver knowledge services based on the constructed graphs.

Keywords
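traditional Chinese medicine ancient-book knowledge graph / multimodal large language model / chain-of-thought reasoning / digitization of ancient books

Classification
General industrial technology

Cite this article
An Bo. A Multi-Task Knowledge Extraction Method for Traditional Chinese Medicine Ancient Books Integrating Chain-of-Thought [J]. 农业图书情报学报 (Journal of Library and Information Science in Agriculture), 2025, 37(12): 81-94, 14.

Funding
National Social Science Fund of China, General Project "Research on the Construction of a Tibetan-Chinese Bilingual Knowledge Graph of Tibetan Ancient Books" (22BTQ010)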
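
Illustrative sketch. The abstract describes ontology constraints that curb illegal triples and alias fragmentation. The following minimal Python sketch shows one way such checks could look: aliases are mapped to canonical names, and a triple is kept only if its relation is permitted between the head and tail entity types. The ontology entries, alias table, relation names, and function names are all hypothetical and are not taken from the paper. The paper applies its checks during decoding with backtracking, whereas this sketch only filters candidate triples after generation.

```python
from typing import NamedTuple

# Hypothetical miniature ontology: canonical entity name -> entity type.
ENTITY_TYPES = {
    "Guizhi Tang": "formula",
    "Guizhi": "herb",
    "fever": "symptom",
}

# Hypothetical alias table: surface form -> canonical name.
ALIASES = {
    "Cinnamon Twig Decoction": "Guizhi Tang",
    "cinnamon twig": "Guizhi",
}

# Hypothetical relation templates: relation -> (head type, tail type).
RELATION_TEMPLATES = {
    "treats": ("formula", "symptom"),
    "contains": ("formula", "herb"),
}


class Triple(NamedTuple):
    head: str
    relation: str
    tail: str


def normalize(name: str) -> str:
    """Map an alias to its canonical name; unknown names pass through."""
    return ALIASES.get(name, name)


def is_valid(triple: Triple) -> bool:
    """Check that the relation is allowed between the two entity types."""
    template = RELATION_TEMPLATES.get(triple.relation)
    if template is None:
        return False
    head_type = ENTITY_TYPES.get(triple.head)
    tail_type = ENTITY_TYPES.get(triple.tail)
    return (head_type, tail_type) == template


def filter_triples(candidates):
    """Normalize aliases, drop ontology-violating triples, and deduplicate."""
    kept = []
    for head, relation, tail in candidates:
        triple = Triple(normalize(head), relation, normalize(tail))
        if is_valid(triple) and triple not in kept:
            kept.append(triple)
    return kept
```

In this sketch, two candidate triples that differ only by an alias of the same formula collapse into one canonical triple, and a triple whose head is a symptom is rejected for the `contains` relation because the template requires a formula as head.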