中医古文相似度计算研究:一种以生成式AI融合领域知识的SimCSE方法

Zhang Jundong, Liu Jiangfeng, Deng Jingpeng, Liu Yanhua, Huang Qi

现代情报 (Journal of Modern Information), 2025, Vol. 45, Issue 4: 49-59, 11. DOI: 10.3969/j.issn.1008-0821.2025.04.005

Research on Similarity Calculation of Traditional Chinese Medicine Classical Prose: A SimCSE Approach to Fusing Domain Knowledge With Generative AI

Zhang Jundong ¹, Liu Jiangfeng ², Deng Jingpeng ³, Liu Yanhua ⁴, Huang Qi ¹

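Author information

1. School of Information Management, Nanjing University, Nanjing, Jiangsu 210023
2. School of Information Management, Nanjing University, Nanjing, Jiangsu 210023; Laboratory of Data Intelligence and Interdisciplinary Innovation, Nanjing University, Nanjing, Jiangsu 210023
3. School of Information Management, Wuhan University, Wuhan, Hubei 430064
4. School of Health Economics and Management, Nanjing University of Chinese Medicine, Nanjing, Jiangsu 210023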
Abstract

[Purpose/Significance] This study aims to construct a similarity computation model tailored to ancient traditional Chinese medicine (TCM) texts, addressing two problems that BERT faces on such texts: the difficulty of semantic representation and the high cost of data annotation. [Method/Process] Building on incremental pre-training of multiple models, the study used generative AI to generate all task data and combined it with the SimCSE method to compare the effects of different training methods, pre-trained models, positive and negative sample construction methods, and positive sample mixing strategies. [Result/Conclusion] The results show that unsupervised models generally perform poorly, and that introducing AI-generated positive and negative samples significantly improves performance. The TCM Gujiroberta model performed best, reaching 90.9%, when trained on a set that combined AI-constructed negative samples (semantically different and of low similarity) with positive samples constructed by AI-assisted synonym replacement. In addition, selecting negative samples with low similarity and randomly mixing datasets that contain different types of positive samples can further improve model performance. The study designs a SimCSE similarity calculation model that integrates knowledge from ancient TCM books in a zero-shot scenario, which can support research on and application of ancient books. Future work will consider further optimizing the dataset construction strategies.
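The abstract does not give the training objective itself. As a rough illustration of how positive and negative samples enter SimCSE-style training, the sketch below computes the standard supervised contrastive (InfoNCE) loss over (anchor, positive, hard negative) triplets. It assumes sentence embeddings are already available as plain lists of floats. The function names `cosine` and `simcse_loss` and the temperature of 0.05 are illustrative assumptions, not details taken from the paper.

```python
import math


def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def simcse_loss(anchors, positives, negatives, temperature=0.05):
    """Mean contrastive loss over a batch of triplets.

    For anchor i, the target is positives[i]. Every other positive and
    every negative in the batch acts as a distractor. The temperature
    value is an assumed default, not one reported in the abstract.
    """
    total = 0.0
    for i, h in enumerate(anchors):
        # Scaled similarities to all in-batch positives, then all negatives.
        logits = [cosine(h, p) / temperature for p in positives]
        logits += [cosine(h, n) / temperature for n in negatives]
        # Numerically stable log-sum-exp for the denominator.
        m = max(logits)
        log_denom = m + math.log(sum(math.exp(x - m) for x in logits))
        # Negative log-probability of picking the matching positive.
        total += log_denom - logits[i]
    return total / len(anchors)
```

Under this objective, the loss falls as each anchor moves closer to its own positive and further from the negatives. How a negative's similarity to its anchor affects training is what the study's comparison of negative sample construction methods examines.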

Key words

traditional Chinese medicine ancient books/similarity calculation/pre-trained language model/SimCSE/AIGC

Classification

Information Technology and Security Science

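Cite this article

Zhang Jundong, Liu Jiangfeng, Deng Jingpeng, Liu Yanhua, Huang Qi. Research on Similarity Calculation of Traditional Chinese Medicine Classical Prose: A SimCSE Approach to Fusing Domain Knowledge With Generative AI [J]. 现代情报 (Journal of Modern Information), 2025, 45(4): 49-59, 11.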

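Funding

Jiangsu Province Postgraduate Research and Practice Innovation Program project "Research on Graph-Model-Driven Online Medical and Health Intelligent Question-Answering Services" (Project No. KYCX24_0107)

Major Project of Philosophy and Social Science Research in Jiangsu Universities "Construction and Application of Pre-trained Models for Ancient TCM Literature" (Project No. 2023SJZD084)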

现代情报 (Journal of Modern Information)

Open access; Peking University Core Journal

ISSN: 1008-0821
