现代情报 (Journal of Modern Information), 2025, Vol. 45, Issue 4: 49-59, 11. DOI: 10.3969/j.issn.1008-0821.2025.04.005
Research on Similarity Calculation of Traditional Chinese Medicine Classical Prose: A SimCSE Approach to Fusing Domain Knowledge With Generative AI
Abstract
[Purpose/Significance] This study aims to construct a similarity computation model tailored to ancient texts of traditional Chinese medicine (TCM), addressing two problems that BERT faces on such texts: difficulty in semantic representation and the high cost of data annotation. [Method/Process] Building on incremental pre-training of multiple models, the study used generative AI to generate all task data and applied the SimCSE method to compare the effects of different training methods, pre-trained models, positive and negative sample construction methods, and positive sample mixing strategies. [Result/Conclusion] The results show that unsupervised models generally perform poorly, and that introducing AI-generated positive and negative samples significantly improves performance. Trained on a set that combines AI-constructed negative samples that are semantically different and of low similarity with positive samples constructed by AI-assisted synonym replacement, the TCM GujiRoBERTa model performed best, reaching 90.9%. In addition, selecting negative samples with low similarity and randomly mixing datasets with different types of positive samples can further improve model performance. The study designs a SimCSE similarity calculation model that integrates knowledge of TCM ancient books in a zero-shot scenario, which can support research on and application of ancient books. Future work will consider further optimization of dataset construction strategies.
Keywords
traditional Chinese medicine ancient books / similarity calculation / pre-trained language model / SimCSE / AIGC
Classification
Information technology and security science
Cite this article
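ZHANG Jundong, LIU Jiangfeng, DENG Jingpeng, LIU Yanhua, HUANG Qi. Research on Similarity Calculation of Traditional Chinese Medicine Classical Prose: A SimCSE Approach to Fusing Domain Knowledge With Generative AI [J]. 现代情报 (Journal of Modern Information), 2025, 45(4): 49-59, 11.
Funding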
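Jiangsu Province Postgraduate Research and Practice Innovation Program project "Research on Graph- and Model-Driven Online Medical and Health Intelligent Question-Answering Services" (Project No. KYCX24_0107)
Major Project of Philosophy and Social Science Research in Jiangsu Universities "Construction and Application of Pre-trained Models for Traditional Chinese Medicine Ancient Literature" (Project No. 2023SJZD084)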
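Illustrative note: the abstract describes training SimCSE on AI-generated positive and negative samples but does not give the loss function. The sketch below shows the standard supervised SimCSE objective (an InfoNCE loss over each anchor's positive, with in-batch positives and hard negatives as competitors), which is one plausible reading of that setup and not necessarily the authors' configuration. The function names, the toy two-dimensional vectors, and the temperature of 0.05 are assumptions made for illustration; in practice the embeddings would come from a pre-trained encoder.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def simcse_supervised_loss(anchors, positives, negatives, temperature=0.05):
    """Mean InfoNCE loss over (anchor, positive, hard negative) triplets.

    For anchor i, the target is positives[i]; every other positive and
    every hard negative in the batch acts as a competitor.
    The temperature value is an assumed default, not taken from the paper.
    """
    total = 0.0
    for i, anchor in enumerate(anchors):
        logits = [cosine(anchor, p) / temperature for p in positives]
        logits += [cosine(anchor, n) / temperature for n in negatives]
        # Numerically stable log-sum-exp over all candidates.
        peak = max(logits)
        log_denominator = peak + math.log(sum(math.exp(x - peak) for x in logits))
        total += log_denominator - logits[i]
    return total / len(anchors)


# Toy embeddings (hypothetical): each anchor is close to its own positive
# and far from the hard negatives.
anchors = [[1.0, 0.0], [0.0, 1.0]]
positives = [[0.9, 0.1], [0.1, 0.9]]
negatives = [[-1.0, 0.0], [0.0, -1.0]]

aligned_loss = simcse_supervised_loss(anchors, positives, negatives)
# Swapping the positives pairs each anchor with the wrong sentence.
mismatched_loss = simcse_supervised_loss(anchors, positives[::-1], negatives)
print(aligned_loss, mismatched_loss)
```

The loss is small when each anchor is most similar to its own positive and grows when the pairing is wrong, which is the signal that pulls paraphrases (for example, synonym-replaced sentences) together and pushes semantically different sentences apart.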