Construction of Scholar Profile Based on Generative Pre-Trained Language Model

Liu Tao (柳涛) 1, Ding Chenjun (丁陈君) 2, Jiang Enbo (姜恩波) 1, Xu Rui (许睿) 2, Chen Fang (陈方) 1

Digital Library Forum (数字图书馆论坛), 2024, Vol. 20, Issue 3: 1-11. DOI: 10.3772/j.issn.1673-2286.2024.03.001

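Author information

  • 1. Chengdu Library and Information Center, Chinese Academy of Sciences, Chengdu 610299; Department of Information Resources Management, University of Chinese Academy of Sciences, Beijing 100190
  • 2. Chengdu Library and Information Center, Chinese Academy of Sciences, Chengdu 610299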

Abstract

In the era of big data, scholar information on the Internet exists in multi-source, heterogeneous, and unstructured forms. Entity extraction from such data suffers from problems such as attribute confusion and long entities, which seriously reduce the accuracy of scholar profile construction. Meanwhile, the scholar attribute entity extraction model, a key component of scholar profile construction, still presents significant technical barriers in practice, which hinders the widespread application of scholar profiles. Therefore, based on open resources, we construct an attribute entity extraction method built on generative pre-trained language models through guided sentence modelling, an autoregressive generation approach, and fine-tuning on a training corpus. We validate the method from four aspects: overall model effect, extraction effect by entity category, instance analysis of the main influencing factors, and analysis of the impact of sample fine-tuning. Compared with the comparison models, the proposed method achieves the best performance across 12 categories of scholar attribute entities, with an overall F1 score of 99.34%. It not only effectively identifies and differentiates mutually confusing attribute entities, but also improves the extraction precision of typical long attribute entities such as "research interests" by 6.11%. The method provides more convenient and effective methodological support for the engineering application of scholar profiles.
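The abstract names guided sentence modelling, autoregressive generation, and F1-based evaluation but gives no implementation details. The following is only a minimal sketch of how such a pipeline could be framed, not the authors' method: the prompt wording, the `build_sample` and `entity_prf1` helpers, and the `"email"` attribute are all hypothetical, and exact-match scoring is an assumption.

```python
from collections import Counter

# Hypothetical attribute labels. The paper covers 12 categories, but the
# abstract names only "research interests"; "email" is an assumed example.
ATTRIBUTES = ["research interests", "email"]


def build_sample(text, attribute, value=None):
    """Wrap a passage in a guiding sentence so a generative model can
    complete the attribute value autoregressively.

    With `value` given, the result is a prompt/completion pair that could
    serve as a fine-tuning sample; without it, only the prompt is returned.
    The template wording is an assumption, not taken from the paper.
    """
    prompt = f"Text: {text}\nThe scholar's {attribute} is:"
    if value is None:
        return {"prompt": prompt}
    return {"prompt": prompt, "completion": " " + value}


def entity_prf1(gold, pred):
    """Micro precision, recall, and F1 over (attribute, value) pairs,
    counting a prediction as correct only on exact match (an assumption)."""
    gold_counts, pred_counts = Counter(gold), Counter(pred)
    true_pos = sum((gold_counts & pred_counts).values())
    precision = true_pos / sum(pred_counts.values()) if pred_counts else 0.0
    recall = true_pos / sum(gold_counts.values()) if gold_counts else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1
```

In this sketch a long attribute such as "research interests" is generated as free text after the guiding sentence rather than tagged token by token, which is one plausible reading of why a generative formulation might help with long entities.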

Key words

Generative Pre-Trained Language Model/Sample Fine-Tuning/Scholar Profile/GPT-3

Classification

Social sciences

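Cite this article

Liu Tao, Ding Chenjun, Jiang Enbo, Xu Rui, Chen Fang. Construction of Scholar Profile Based on Generative Pre-Trained Language Model[J]. Digital Library Forum, 2024, 20(3): 1-11.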

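Funding

This research was supported by the "Light of West China" (西部之光) talent training program project "Research, Development and Application Demonstration of a Science and Technology Service System for the Pharmaceutical and Biological Industry Based on Model Innovation" (No. E1C0000401) and by the Innovation Fund project of the Chengdu Library and Information Center, Chinese Academy of Sciences, "Construction of a Smart Data System for the Field of Bio-Information Science and Technology Intelligence" (No. E1Z0000101).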

Digital Library Forum (数字图书馆论坛)

OA, CSSCI, CSTPCD

ISSN 1673-2286
