Automatic Web News Content Extraction Based on CRFs

Zhang Chunyuan¹

Journal of Guangxi Normal University (Natural Science Edition), 2011, Vol. 29, Issue 1: 138-142, 5.

Author information

1. College of Information Science and Technology, Hainan University, Haikou, Hainan 570228

Abstract

Previous work on Web information extraction seldom uses associations among Web page blocks. To address this problem, this paper proposes an automatic Web news content extraction approach based on conditional random fields (CRFs). First, it parses a target news page into a DOM tree. After eliminating invalid nodes, pruning subtrees, and deleting single nodes in the tree, it uses heuristic rules to segment the DOM tree into blocks and converts these blocks into a data sequence. Then, it defines feature functions to extract each block's own state features and the category transition features of neighboring blocks. Finally, by labeling the data sequence with CRFs, it identifies each block's category and extracts the page's content. Experimental results indicate that this approach is precise and adaptable for Web news content extraction, and that incorporating associations among page blocks improves extraction.
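The pipeline described in the abstract (segment the page into a block sequence, score each block with state features, score neighboring labels with transition features, then decode the best label sequence) can be sketched roughly as follows. This is only an illustration under stated assumptions, not the paper's implementation: the abstract does not give the heuristic rules, the feature set, or the learned weights, so the tag sets, the two features (`text_len`, `link_density`), the hand-set weights, and all function names here are hypothetical. A real CRF would learn its weights from labeled pages; here only Viterbi decoding over a linear chain is shown.

```python
from html.parser import HTMLParser

# Hypothetical choices: which tags start a block and which subtrees are "invalid".
BLOCK_TAGS = {"p", "div", "li", "h1", "h2", "h3", "td"}
SKIP_TAGS = {"script", "style"}
LABELS = ("content", "noise")


class BlockCollector(HTMLParser):
    """Flatten a page into a sequence of text blocks (crude stand-in for DOM segmentation)."""

    def __init__(self):
        super().__init__()
        self.blocks = []    # finished blocks, in document order
        self._text = 0      # characters of text in the current block
        self._link = 0      # characters of that text inside <a> elements
        self._in_link = 0
        self._skip = 0

    def flush(self):
        if self._text:      # empty blocks are dropped
            self.blocks.append({"text_len": self._text,
                                "link_density": self._link / self._text})
        self._text = self._link = 0

    def handle_starttag(self, tag, attrs):
        if tag in SKIP_TAGS:
            self._skip += 1
        elif tag in BLOCK_TAGS:
            self.flush()
        elif tag == "a":
            self._in_link += 1

    def handle_endtag(self, tag):
        if tag in SKIP_TAGS:
            self._skip = max(0, self._skip - 1)
        elif tag in BLOCK_TAGS:
            self.flush()
        elif tag == "a":
            self._in_link = max(0, self._in_link - 1)

    def handle_data(self, data):
        n = len(data.strip())
        if self._skip or not n:
            return
        self._text += n
        if self._in_link:
            self._link += n


def segment(html):
    """Convert an HTML string into a list of block feature dicts."""
    parser = BlockCollector()
    parser.feed(html)
    parser.close()
    parser.flush()
    return parser.blocks


def state_score(block, label):
    """State feature: long, link-poor blocks look like content (made-up weights)."""
    s = min(block["text_len"], 200) / 100.0 - 2.0 * block["link_density"] - 0.5
    return s if label == "content" else -s


def transition_score(prev, cur):
    """Transition feature: neighboring blocks tend to share a category (made-up weight)."""
    return 0.5 if prev == cur else -0.5


def viterbi(blocks):
    """Return the highest-scoring label sequence for the block sequence."""
    if not blocks:
        return []
    best = {y: state_score(blocks[0], y) for y in LABELS}
    back = []
    for block in blocks[1:]:
        step, ptr = {}, {}
        for y in LABELS:
            prev = max(LABELS, key=lambda p: best[p] + transition_score(p, y))
            step[y] = best[prev] + transition_score(prev, y) + state_score(block, y)
            ptr[y] = prev
        best = step
        back.append(ptr)
    y = max(LABELS, key=best.get)
    path = [y]
    for ptr in reversed(back):   # follow back-pointers from the last block
        y = ptr[y]
        path.append(y)
    return path[::-1]
```

Calling `viterbi(segment(html))` yields one label per block. With these toy weights, a short paragraph sitting between two long paragraphs is labeled `content` because of the transition term, even though its own state score slightly favors `noise`; this is the kind of effect the abstract attributes to using associations among blocks.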

Key words

Web information extraction / conditional random fields / Web page segmentation

Classification

Information technology and security science

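Cite this article

Zhang Chunyuan. Automatic Web News Content Extraction Based on CRFs [J]. Journal of Guangxi Normal University (Natural Science Edition), 2011, 29(1): 138-142, 5.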

Funding

Supported by the National Natural Science Foundation of China (60863001)

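Journal of Guangxi Normal University (Natural Science Edition)

Open access; indexed in the Peking University Core Journals list and CSTPCD

ISSN: 1001-6600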
