Computer Engineering and Applications, 2012, Vol. 48, Issue (30): 151-156, 6. DOI: 10.3778/j.issn.1002-8331.2012.30.032
Content extraction of theme pages based on body feature and page structure
Abstract
Web text extraction underlies Web information processing tasks such as information retrieval and text mining. Based on a statistical analysis of the body features and structural characteristics of theme pages, this paper proposes a text extraction method for theme pages that combines Web page text features with HTML tag characteristics. The method first locates the text content block in the DOM tree parsed from the Web page, then analyzes the characteristics of noise information within that block in order to remove it. Experiments show that the method achieves higher accuracy and recall.
Key words
body feature / tag information / content extraction
Classification
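Information Technology and Security Science
Cite this article
DUAN Xiaoli, WANG Yu, GU Jing, LIU Weinan. Content extraction of theme pages based on body feature and page structure [J]. Computer Engineering and Applications, 2012, 48(30): 151-156, 6.
Funding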
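Major Program of the National Natural Science Foundation of China (No. 70890080), sub-project (No. 70890083)
Humanities and Social Sciences Research Project of the Ministry of Education (No. 09YJA870005)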
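
The two-stage idea summarized in the abstract (locate a text content block in the DOM tree, then remove noise inside it) can be illustrated with a minimal sketch. This is not the paper's algorithm: the record above does not give the actual features or thresholds, so everything here is an assumption made for illustration. Link density stands in for the "noise characteristics", the block score and the 0.5 threshold are arbitrary, and names such as `extract_main_text`, `block_score`, and `BLOCK_TAGS` are hypothetical.

```python
from html.parser import HTMLParser

BLOCK_TAGS = {"div", "td", "article", "section"}  # assumed candidate containers
SKIP_TAGS = {"script", "style"}                   # never counted as text
VOID_TAGS = {"br", "img", "hr", "input", "meta", "link"}  # no end tag


class Node:
    """Minimal DOM node; children mixes Node and str in document order."""

    def __init__(self, tag, parent=None):
        self.tag = tag
        self.parent = parent
        self.children = []


class TreeBuilder(HTMLParser):
    """Builds a simplified DOM tree from HTML using the standard library."""

    def __init__(self):
        super().__init__()
        self.root = Node("root")
        self.current = self.root

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        node = Node(tag, self.current)
        self.current.children.append(node)
        self.current = node

    def handle_endtag(self, tag):
        # Close the nearest open element with this tag; ignore stray end tags.
        node = self.current
        while node is not self.root and node.tag != tag:
            node = node.parent
        if node is not self.root:
            self.current = node.parent

    def handle_data(self, data):
        text = data.strip()
        if text:
            self.current.children.append(text)


def text_stats(node, in_link=False):
    """Return (total_chars, link_chars) for the subtree rooted at node."""
    if node.tag in SKIP_TAGS:
        return 0, 0
    in_link = in_link or node.tag == "a"
    total = link = 0
    for child in node.children:
        if isinstance(child, str):
            total += len(child)
            link += len(child) if in_link else 0
        else:
            t, l = text_stats(child, in_link)
            total += t
            link += l
    return total, link


def block_score(node):
    """Assumed heuristic: favor much non-link text and low link density."""
    total, link = text_stats(node)
    if total == 0:
        return 0.0
    non_link = total - link
    return non_link * non_link / total


def walk(node):
    yield node
    for child in node.children:
        if isinstance(child, Node):
            yield from walk(child)


def extract_text(node, max_link_density=0.5):
    """Collect text, dropping empty or link-heavy child subtrees as noise."""
    parts = []
    for child in node.children:
        if isinstance(child, str):
            parts.append(child)
            continue
        if child.tag != "a":  # inline links inside running text are kept
            total, link = text_stats(child)
            if total == 0 or link / total > max_link_density:
                continue
        parts.append(extract_text(child, max_link_density))
    return " ".join(p for p in parts if p)


def extract_main_text(html):
    builder = TreeBuilder()
    builder.feed(html)
    builder.close()
    candidates = [n for n in walk(builder.root) if n.tag in BLOCK_TAGS]
    block = max(candidates, key=block_score, default=builder.root)
    return extract_text(block)
```

Calling `extract_main_text(page_source)` on a page with a navigation bar, a main container, and a list of related links would be expected to return only the running text of the main container. Real pages need more robust features than link density alone, which is presumably where the paper's statistical analysis of body and tag characteristics comes in.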