Journal of Intelligence (情报杂志), 2025, Vol. 44, Issue 6: 177-184 (8 pages). DOI: 10.3969/j.issn.1002-1965.2025.06.022
News Media Web Page Information Extraction Method Based on Large Language Model (基于大模型的新闻媒体网页信息提取方法)
Abstract
[Research purpose] Existing web page information extraction technologies based on non-visual features suffer from low extraction accuracy and struggle to meet complex requirements. This paper therefore studies efficient, intelligent web page information extraction technology to achieve fast and accurate extraction of information from news media web pages. [Research method] This paper proposes a web page information extraction scheme for news media based on a large language model. Through comparison and selection of base models, data set construction, supervised fine-tuning, and prompt engineering, it builds a highly versatile large language model for web page information extraction that improves both accuracy and efficiency. [Research result/conclusion] Experimental comparison with other intelligent extraction schemes for news web page data shows that the dedicated news media large model, built on an open-source large language model with supervised fine-tuning, achieves an average accuracy and an average F1 score of more than 90% in information extraction, making it more applicable than existing web page information extraction schemes.
Keywords
large language model / news web page / text information extraction / intelligent parsing of HTML code / intelligent extraction of web page elements / multi-language recognition / chain of thought
Classification
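Computer and Automation
Cite this article
Liu Jianwen, Wan Zijian, Chen Ting, Liu Wangyang, Shen Yi. News Media Web Page Information Extraction Method Based on Large Language Model [J]. Journal of Intelligence (情报杂志), 2025, 44(6): 177-184.
Funding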
Research output of the National Key Research and Development Program of China (Grant No. 2022YFB3105400).