
Content extraction of theme pages based on body feature and page structure

Computer Engineering and Applications, 2012, Vol. 48, Issue 30: 151-156, 6. DOI: 10.3778/j.issn.1002-8331.2012.30.032

Duan Xiaoli¹, Wang Yu¹, Gu Jing², Liu Weinan¹

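Author information

1. School of Management Science and Engineering, Dalian University of Technology, Dalian, Liaoning 116024
2. Department of Economics, Environmental Management College of China, Qinhuangdao, Hebei 066004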
Abstract

Web text extraction underpins Web information processing tasks such as information retrieval and text mining. Based on a statistical analysis of the body features and structural characteristics of theme pages, this paper proposes a text extraction method for theme pages that combines Web page text features with HTML tag characteristics. The method first locates the text content block in the DOM tree parsed from the Web page, then analyses the characteristics of noise information within that block in order to remove it. Experiments show that the method achieves higher accuracy and recall.
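
The abstract names two steps (locating the text content block in the DOM tree, then removing noise inside it) but does not state which features or thresholds are used. The Python sketch below only illustrates that two-step shape under assumptions of its own: the block tag set `BLOCK_TAGS`, the scoring function `own_text_score`, and the link-density threshold of 0.5 are hypothetical choices made for illustration, not the features reported in the paper.

```python
from html.parser import HTMLParser

SKIP_TAGS = {"script", "style"}                  # content never shown as text
VOID_TAGS = {"br", "img", "hr", "input", "meta", "link"}
BLOCK_TAGS = {"body", "div", "td", "article", "section"}  # assumed candidates


class Node:
    """Minimal DOM node: children are Node objects or text strings."""

    def __init__(self, tag, parent=None):
        self.tag, self.parent, self.children = tag, parent, []

    def text(self, links_only=False, stop=(), _in_link=False):
        """Join descendant text; child elements whose tag is in `stop` are skipped."""
        _in_link = _in_link or self.tag == "a"
        parts = []
        for child in self.children:
            if isinstance(child, Node):
                if child.tag not in stop:
                    parts.append(child.text(links_only, stop, _in_link))
            elif _in_link or not links_only:
                parts.append(child)
        return " ".join(p for p in parts if p)

    def link_density(self):
        """Share of the node's text that sits inside <a> elements."""
        total = len(self.text())
        return len(self.text(links_only=True)) / total if total else 1.0


class TreeBuilder(HTMLParser):
    """Builds a simplified DOM tree, ignoring script and style content."""

    def __init__(self):
        super().__init__()
        self.root = Node("root")
        self.current = self.root
        self.skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in SKIP_TAGS:
            self.skip_depth += 1
        elif tag not in VOID_TAGS and not self.skip_depth:
            node = Node(tag, self.current)
            self.current.children.append(node)
            self.current = node

    def handle_endtag(self, tag):
        if tag in SKIP_TAGS:
            self.skip_depth = max(0, self.skip_depth - 1)
            return
        node = self.current          # climb to the nearest open element with this tag
        while node is not self.root and node.tag != tag:
            node = node.parent
        if node is not self.root:
            self.current = node.parent

    def handle_data(self, data):
        if not self.skip_depth and data.strip():
            self.current.children.append(data.strip())


def iter_nodes(node):
    yield node
    for child in node.children:
        if isinstance(child, Node):
            yield from iter_nodes(child)


def own_text_score(node):
    """Approximate length of non-link text held by a block, nested blocks excluded."""
    own = node.text(stop=BLOCK_TAGS)
    own_links = node.text(links_only=True, stop=BLOCK_TAGS)
    return len(own) - len(own_links)


def clean_text(node, max_link_density):
    """Drop child elements that are mostly link text (treated here as noise)."""
    parts = []
    for child in node.children:
        if not isinstance(child, Node):
            parts.append(child)
        elif child.tag == "a" or child.link_density() <= max_link_density:
            parts.append(clean_text(child, max_link_density))
    return " ".join(p for p in parts if p)


def extract_body_text(html, max_link_density=0.5):
    builder = TreeBuilder()
    builder.feed(html)
    builder.close()
    # Step 1: choose the candidate block holding the most non-link text of its own.
    blocks = [n for n in iter_nodes(builder.root) if n.tag in BLOCK_TAGS]
    if not blocks:
        return builder.root.text()
    best = max(blocks, key=own_text_score)
    # Step 2: remove link-heavy sub-elements from the chosen block.
    return clean_text(best, max_link_density)
```

Calling `extract_body_text(page_html)` on a page whose article sits in one `div` next to a navigation `div` should return the article paragraphs and omit link lists such as "related stories". Link density is a common stand-in for noise in this kind of heuristic; a faithful implementation would substitute the body and tag features that the paper derives from its statistical analysis.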

Key words

body feature / tag information / content extraction

Classification

Information technology and security science

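Cite this article

Duan Xiaoli, Wang Yu, Gu Jing, Liu Weinan. Content extraction of theme pages based on body feature and page structure[J]. Computer Engineering and Applications, 2012, 48(30): 151-156, 6.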

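Funding

Major Program of the National Natural Science Foundation of China (No. 70890080), sub-project (No. 70890083)

Humanities and Social Sciences Research Project of the Ministry of Education (No. 09YJA870005)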

Computer Engineering and Applications

OA / CSCD / CSTPCD

ISSN: 1002-8331
