Analysis of Progress in Data Mining of Scientific Literature Using Large Language Models

Journal of Library and Information Science in Agriculture, 2025, Vol. 37, Issue 2: 4-22, 19. DOI: 10.13998/j.cnki.issn1002-1248.25-0116

Cai Yiran (蔡祎然)¹, Hu Zhengyin (胡正银)¹, Liu Chunjiang (刘春江)¹

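Author information

1. Chengdu Library and Information Center, Chinese Academy of Sciences, Chengdu 610299; Department of Information Resources Management, School of Economics and Management, University of Chinese Academy of Sciences, Beijing 100190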

Abstract

[Purpose/Significance] Scientific literature contains rich domain knowledge and scientific data, which can provide high-quality data support for AI-driven scientific research (AI4S). This paper systematically reviews the methods, tools, and applications of large language models (LLMs) in scientific literature data mining, and discusses their research directions and development trends. It addresses critical shortcomings in interdisciplinary knowledge extraction and provides practical insights to enhance AI4S workflows, thereby aligning AI capabilities with domain-specific scientific needs.

[Method/Process] This study uses a systematic literature review and case analysis to build a three-part framework: 1) Methodological dimension: Textual knowledge mining uses dynamic prompts, few-shot learning, and domain-adaptive pre-training (such as MagBERT and MatSciBERT) to improve entity recognition. Scientific data extraction uses chain-of-thought prompting and knowledge graphs (such as ChatExtract and SynAsk) to parse experimental datasets. Chart decoding uses neural networks to extract numerical values and semantic patterns from visual elements. 2) Tool dimension: The review examines the core functionalities of notable AI tools, including data mining platforms (such as LitU and SciAIEngine) and knowledge generation systems (such as Agent Laboratory and VirSci), with a focus on multimodal processing and automation. 3) Application dimension: LLMs produce high-quality datasets to tackle data scarcity and support tasks such as predicting material properties and diagnosing medical conditions. The scientific credibility of these datasets is ensured through a process of "LLMs + expert validation".

[Results/Conclusions] The findings indicate that LLMs significantly improve the automation of scientific literature mining. In terms of methods, dynamic prompt learning frameworks and domain-adaptive fine-tuning address the shortcomings of traditional rule-driven approaches. In terms of tools, cross-modal parsing tools and interactive analysis platforms have been developed to support end-to-end data mining and knowledge generation. In terms of applications, scientific literature mining has moved from single-modal to multimodal formats, supporting the creation of high-quality scientific datasets, vertical domain-specific models, and knowledge service platforms. However, significant challenges remain, including insufficient depth of domain knowledge embedding, low efficiency of multimodal data collaboration, and a lack of model interpretability. Future research should focus on developing interpretable LLMs with knowledge graph integration, improving cross-modal alignment techniques, and integrating "human-in-the-loop" systems to enhance reliability. Standardized data governance and intellectual property frameworks are also needed to promote the ethical use of scientific literature data. Such advances will facilitate a shift from efficiency optimization to knowledge generation in AI4S.

Keywords

scientific literature data mining / large language models / AI for Science / data driven / knowledge discovery

Classification

Social sciences

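Cite this article

Cai Yiran, Hu Zhengyin, Liu Chunjiang. Analysis of Progress in Data Mining of Scientific Literature Using Large Language Models[J]. Journal of Library and Information Science in Agriculture, 2025, 37(2): 4-22, 19.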

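Funding

Key Program of the Major Research Plan of the National Natural Science Foundation of China, "General-Purpose High-Quality Scientific Databases Supporting Next-Generation Artificial Intelligence" (92470204)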

Journal of Library and Information Science in Agriculture

ISSN: 1002-1248