Journal of Library and Information Science in Agriculture (农业图书情报学报), 2025, Vol. 37, Issue (2): 4-22, 19. DOI: 10.13998/j.cnki.issn1002-1248.25-0116
大语言模型赋能科技文献数据挖掘进展分析
Analysis of Progress in Data Mining of Scientific Literature Using Large Language Models
Abstract
[Purpose/Significance] Scientific literature contains rich domain knowledge and scientific data, which can provide high-quality data support for AI-driven scientific research (AI4S). This paper systematically reviews the methods, tools, and applications of large language models (LLMs) in scientific literature data mining, and discusses their research directions and development trends. It addresses critical shortcomings in interdisciplinary knowledge extraction and provides practical insights to enhance AI4S workflows, thereby aligning AI capabilities with domain-specific scientific needs. [Method/Process] This study employs a systematic literature review and case analysis to formulate a tripartite framework: 1) Methodological dimension: Textual knowledge mining uses dynamic prompts, few-shot learning, and domain-adaptive pre-training (such as MagBERT and MatSciBERT) to improve entity recognition. Scientific data extraction uses chain-of-thought prompting and knowledge graphs (such as ChatExtract and SynAsk) to parse experimental datasets. Chart decoding uses neural networks to extract numerical values and semantic patterns from visual elements. 2) Tool dimension: This dimension examines the core functionalities of notable AI tools, including data mining platforms (such as LitU and SciAIEngine) and knowledge generation systems (such as Agent Laboratory and VirSci), with a focus on multimodal processing and automation. 3) Application dimension: LLMs produce high-quality datasets to tackle the issue of data scarcity. They facilitate tasks such as predicting material properties and diagnosing medical conditions. The scientific credibility of these datasets is ensured through a process of "LLMs + expert validation". [Results/Conclusions] The findings indicate that LLMs significantly improve the automation of scientific literature mining. Methodologically, this research introduces dynamic prompt learning frameworks and domain adaptation fine-tuning technologies to address the shortcomings of traditional rule-driven approaches. In terms of tools, cross-modal parsing tools and interactive analysis platforms have been developed to facilitate end-to-end data mining and knowledge generation. In terms of applications, the study has accelerated the transition of scientific literature from single-modal to multimodal formats, thereby supporting the creation of high-quality scientific datasets, vertical domain-specific models, and knowledge service platforms. However, significant challenges remain, including insufficient depth of domain knowledge embedding, low efficiency of multimodal data collaboration, and a lack of model interpretability. Future research should focus on developing interpretable LLMs with knowledge graph integration, improving cross-modal alignment techniques, and integrating "human-in-the-loop" systems to enhance reliability. It is also imperative to establish standardized data governance and intellectual property frameworks to promote the ethical utilization of scientific literature data. Such advances will facilitate a shift from efficiency optimization to knowledge generation in AI4S.
Keywords
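scientific literature data mining / large language models / AI for Science (AI4S) / data-driven / knowledge discovery
Classification
Social Sciences
Cite this article
Cai Yiran, Hu Zhengyin, Liu Chunjiang. Analysis of Progress in Data Mining of Scientific Literature Using Large Language Models[J]. Journal of Library and Information Science in Agriculture, 2025, 37(2): 4-22, 19.
Funding
Key Project of the Major Research Plan of the National Natural Science Foundation of China, "General-Purpose High-Quality Scientific Databases Supporting Next-Generation Artificial Intelligence" (92470204)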