Journal of Guangxi Normal University (Natural Science Edition), 2011, Vol. 29, Issue 1: 138-142, 5.
Automatic Web News Content Extraction Based on CRFs
Abstract
Previous work on Web information extraction has seldom used the associations among Web page blocks. To address this problem, this paper proposes an automatic Web news content extraction approach based on conditional random fields (CRFs). First, it parses a target news page into a DOM tree. After eliminating invalid nodes, pruning subtrees, and deleting single nodes in the tree, it uses heuristic rules to segment the DOM tree into blocks and converts these blocks into a data sequence. Then, it defines feature functions to extract each block's own state features and the category transition features of neighboring blocks. Finally, by labeling the data sequence with CRFs, it identifies each block's category and extracts the page's content. Experimental results indicate that this approach is precise and adaptable for Web news content extraction, and that incorporating associations among page blocks can improve extraction.
Keywords
Web information extraction / conditional random fields / Web page segmentation
Classification
Information Technology and Security Science
Cite this article
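Zhang Chunyuan (张春元). Automatic Web News Content Extraction Based on CRFs [J]. Journal of Guangxi Normal University (Natural Science Edition), 2011, 29(1): 138-142, 5.
Funding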
Supported by the National Natural Science Foundation of China (60863001)
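
Illustrative sketch
The pipeline outlined in the abstract (segment the page into blocks, then label the block sequence using each block's own state features together with neighbor-to-neighbor transition features) can be sketched roughly as follows. This is not the paper's implementation. The block tags, the two features (text length and link ratio), the two labels, and all weights are assumptions chosen by hand for illustration; a real CRF would learn its weights from labeled pages. The sketch uses only Viterbi decoding over a linear chain, which is the inference step of a linear-chain CRF.

```python
from html.parser import HTMLParser

LABELS = ("content", "noise")  # hypothetical two-category scheme

# Hand-set transition weights (assumption): neighbors tend to share a category.
TRANSITION = {
    ("content", "content"): 1.5, ("noise", "noise"): 1.5,
    ("content", "noise"): -0.5, ("noise", "content"): -0.5,
}


class BlockParser(HTMLParser):
    """Flat segmentation: each <p>/<div>/<li> boundary closes a block."""
    BLOCK_TAGS = {"p", "div", "li"}
    SKIP_TAGS = {"script", "style"}  # treated as invalid nodes and dropped

    def __init__(self):
        super().__init__()
        self.blocks, self._text = [], []
        self._link_chars = self._in_link = self._skip = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP_TAGS:
            self._skip += 1
        elif tag == "a":
            self._in_link += 1
        elif tag in self.BLOCK_TAGS:
            self.flush()

    def handle_endtag(self, tag):
        if tag in self.SKIP_TAGS:
            self._skip = max(0, self._skip - 1)
        elif tag == "a":
            self._in_link = max(0, self._in_link - 1)
        elif tag in self.BLOCK_TAGS:
            self.flush()

    def handle_data(self, data):
        text = data.strip()
        if text and not self._skip:
            self._text.append(text)
            if self._in_link:
                self._link_chars += len(text)

    def flush(self):
        total = sum(len(t) for t in self._text)
        if total:
            self.blocks.append({"text": " ".join(self._text),
                                "link_ratio": self._link_chars / total})
        self._text, self._link_chars = [], 0


def segment(html):
    """Convert a page into a sequence of blocks."""
    parser = BlockParser()
    parser.feed(html)
    parser.close()
    parser.flush()
    return parser.blocks


def state_score(block, label):
    """State features (assumed): long text and few links suggest content."""
    score = (1.0 if len(block["text"]) >= 40 else -1.0)
    score += (1.0 if block["link_ratio"] < 0.5 else -1.0)
    return score if label == "content" else -score


def viterbi(blocks):
    """Find the highest-scoring label sequence over the block chain."""
    if not blocks:
        return []
    best = [{lab: state_score(blocks[0], lab) for lab in LABELS}]
    back = []
    for block in blocks[1:]:
        scores, pointers = {}, {}
        for lab in LABELS:
            prev = max(LABELS, key=lambda p: best[-1][p] + TRANSITION[(p, lab)])
            scores[lab] = (best[-1][prev] + TRANSITION[(prev, lab)]
                           + state_score(block, lab))
            pointers[lab] = prev
        best.append(scores)
        back.append(pointers)
    label = max(LABELS, key=lambda lab: best[-1][lab])
    path = [label]
    for pointers in reversed(back):
        label = pointers[label]
        path.append(label)
    return path[::-1]


def extract_content(html):
    """Return the text of every block labeled as content."""
    blocks = segment(html)
    return [b["text"] for b, lab in zip(blocks, viterbi(blocks)) if lab == "content"]


SAMPLE_HTML = """
<div><a href="/">Home</a> <a href="/news">News</a></div>
<p>The city council approved the new transit plan on Monday after a long debate.</p>
<p>Short quote.</p>
<p>Officials said construction would begin next spring and finish within two years.</p>
<div><a href="/about">About us</a></div>
<script>var x = 1;</script>
"""
```

In this toy page the short middle paragraph has a neutral state score on its own, so the transition weights from its two content neighbors decide its label. That is the kind of block association the abstract argues for, shown here only under the assumed weights.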