News Media Web Page Information Extraction Method Based on Large Language Model

刘建文 (Liu Jianwen) ¹, 万子建 (Wan Zijian) ¹, 陈婷 (Chen Ting) ¹, 刘汪洋 (Liu Wangyang) ¹, 沈宜 (Shen Yi) ¹

情报杂志 (Journal of Intelligence), 2025, Vol. 44, Issue 6: 177-184, 8. DOI: 10.3969/j.issn.1002-1965.2025.06.022

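Author information

1. 深圳市网联安瑞网络科技有限公司, Shenzhen 518038; 广东省网络空间认知域工程技术研究中心 (Guangdong Engineering Technology Research Center for the Cyberspace Cognitive Domain), Shenzhen 518038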

Abstract

[Research purpose] Existing web page information extraction technologies based on non-visual features suffer from low collection accuracy and struggle to meet complex requirements. This paper therefore studies efficient, intelligent web page information extraction technology to achieve fast and accurate extraction of information from news media web pages. [Research method] This paper proposes a web page information extraction scheme for news media based on a large language model. Through comparison and selection of base models, dataset construction, supervised fine-tuning, and prompt engineering, it builds a highly versatile large language model for web page information extraction that improves accuracy and efficiency. [Research result/conclusion] Experimental comparison with other intelligent extraction schemes for news web page data shows that the dedicated news media large model, built on an open-source large language model with supervised fine-tuning, achieves an average accuracy and an average F1 score above 90% in information extraction, making it more applicable than existing web page information extraction schemes.

Keywords

large language model / news web page / text information extraction / intelligent parsing of HTML code / intelligent extraction of web page elements / multi-language recognition / chain of thought

Classification

Information Technology and Security Science

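Cite this article

Liu Jianwen, Wan Zijian, Chen Ting, Liu Wangyang, Shen Yi. News Media Web Page Information Extraction Method Based on Large Language Model [J]. 情报杂志 (Journal of Intelligence), 2025, 44(6): 177-184, 8.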

Funding

Research outcome of the National Key Research and Development Program of China (Grant No. 2022YFB3105400).

情报杂志 (Journal of Intelligence)

Open access; Peking University Core Journal (北大核心)

ISSN: 1002-1965
