Computer Applications and Software (计算机应用与软件), 2013, Vol. 30, Issue 4: 101-106, 152, 7. DOI: 10.3969/j.issn.1000-386x.2013.04.028
一种基于SVM和AdaBoost的Web实体信息抽取方法
A WEB ENTITY INFORMATION EXTRACTION METHOD BASED ON SVM AND ADABOOST
Abstract
In this paper, a Web entity information extraction method based on SVM and AdaBoost is proposed. First, an SVM-based method for identifying the main data region of a Web page is proposed. It segments the page into data regions according to the display characteristics of Web entity instances and identifies the main data region in which those instances are located. Second, based on the characteristics of Web entity attribute labels, a method based on AdaBoost ensemble learning is proposed, which automatically extracts Web entity information from the main data region of the page. Experiments on two real data sets, together with comparisons against related work, show that the method achieves fairly good extraction performance.
Key words
Web information extraction / Page segmentation / Ensemble learning
Classification
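Information technology and security science
Cite this article
Sun Ming, Lu Chunsheng, Xu Xiuxing, Li Qingzhong, Peng Zhaohui (孙明, 陆春生, 徐秀星, 李庆忠, 彭朝晖). A Web entity information extraction method based on SVM and AdaBoost [J]. Computer Applications and Software, 2013, 30(4): 101-106, 152, 7.
Funding
National Key Technology R&D Program of China (2008BAH32B01).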
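Illustrative sketch
The abstract does not give the features or weak learners used in the second, AdaBoost-based stage, so the following is only a minimal sketch of generic discrete AdaBoost with decision stumps, not the authors' implementation. The three features (whether a text node ends with a colon, its length, whether it is bold), the toy data, and all function names are assumptions made purely for illustration. Label +1 stands for "attribute label" and -1 for "attribute value".

```python
import math

def stump_predict(stump, row):
    """A stump votes `sign` when feature j is >= thr, otherwise -sign."""
    j, thr, sign = stump
    return sign if row[j] >= thr else -sign

def train_stump(X, y, w):
    """Exhaustively pick the stump with the lowest weighted error."""
    best_err, best_stump = None, None
    for j in range(len(X[0])):
        for thr in sorted({row[j] for row in X}):
            for sign in (1, -1):
                stump = (j, thr, sign)
                err = sum(wi for wi, row, yi in zip(w, X, y)
                          if stump_predict(stump, row) != yi)
                if best_err is None or err < best_err:
                    best_err, best_stump = err, stump
    return best_err, best_stump

def adaboost_fit(X, y, rounds):
    """Discrete AdaBoost: reweight samples so later stumps focus on mistakes."""
    n = len(X)
    w = [1.0 / n] * n                      # start from uniform sample weights
    model = []
    for _ in range(rounds):
        err, stump = train_stump(X, y, w)
        if err >= 0.5:                     # weak learner no better than chance
            break
        err = max(err, 1e-10)              # avoid division by zero
        alpha = 0.5 * math.log((1 - err) / err)
        model.append((alpha, stump))
        # Increase the weight of misclassified samples, decrease the rest.
        w = [wi * math.exp(-alpha * yi * stump_predict(stump, row))
             for wi, row, yi in zip(w, X, y)]
        total = sum(w)
        w = [wi / total for wi in w]
    return model

def adaboost_predict(model, row):
    """Weighted vote of all stumps; returns +1 or -1."""
    score = sum(alpha * stump_predict(stump, row) for alpha, stump in model)
    return 1 if score >= 0 else -1

# Hypothetical features per text node: [ends_with_colon, text_length, is_bold]
X = [
    [1, 5, 0],    # e.g. "Name:"
    [1, 4, 1],    # e.g. bold "ISBN:"
    [0, 6, 1],    # bold label without a colon
    [0, 30, 0],   # long plain-text value
    [0, 45, 0],   # long plain-text value
    [0, 6, 0],    # short plain-text value
]
y = [1, 1, 1, -1, -1, -1]

model = adaboost_fit(X, y, rounds=3)
predictions = [adaboost_predict(model, row) for row in X]
```

No single stump separates this toy data, because the third and sixth nodes have the same length and neither ends with a colon. The weighted vote of three stumps does separate it, which is the basic idea behind using an ensemble for attribute-label classification.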