Information extraction from massive Web pages based on node property and text content

WANG Haiyan, CAO Pan

Journal on Communications, 2016, Vol. 37, Issue (10): 9-17, 9.

DOI: 10.11959/j.issn.1000-436x.2016190

Information extraction from massive Web pages based on node property and text content

WANG Haiyan¹, CAO Pan²

Author information

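1. School of Computer Science, Nanjing University of Posts and Telecommunications, Nanjing, Jiangsu 210023
2. Jiangsu High Technology Research Key Laboratory for Wireless Sensor Networks, Nanjing, Jiangsu 210003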

Abstract

To address the problem of extracting valuable information from massive Web pages in big data environments, a novel information extraction method based on node properties and text content is proposed. Each Web page is converted into a document object model (DOM) tree, and a pruning and fusion algorithm is introduced to simplify the tree. For each node in the DOM tree, a density property and a vision property are defined, and Web pages are preprocessed based on these property values. A MapReduce framework is employed to extract information from massive Web pages in parallel. Simulation and experimental results demonstrate that the proposed method not only achieves better performance but also scales better than other methods.
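
The pipeline summarized above (build a DOM tree, prune it, score nodes by a density property) can be illustrated with a minimal Python sketch. This is not the authors' algorithm: the abstract does not give the paper's definitions, so the density measure used here (text characters per tag in a subtree), the pruned-tag list, and all names such as `TreeBuilder` and `densest_node` are assumptions made for illustration. The vision property, the fusion step, and the MapReduce parallelization are not modeled, and the input is assumed to be well-formed HTML.

```python
from html.parser import HTMLParser

# Tags pruned before scoring (an assumed list, not taken from the paper).
PRUNED_TAGS = {"script", "style", "noscript"}
# Void elements have no closing tag, so they are never pushed onto the tree.
VOID_TAGS = {"br", "img", "hr", "meta", "link", "input"}


class Node:
    def __init__(self, tag, parent=None):
        self.tag = tag
        self.parent = parent
        self.children = []
        self.own_text = ""  # text directly inside this node


class TreeBuilder(HTMLParser):
    """Builds a simplified DOM tree, skipping everything inside pruned tags."""

    def __init__(self):
        super().__init__()
        self.root = Node("#root")
        self.current = self.root
        self.skip_depth = 0  # > 0 while inside a pruned subtree

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        if self.skip_depth or tag in PRUNED_TAGS:
            self.skip_depth += 1
            return
        node = Node(tag, self.current)
        self.current.children.append(node)
        self.current = node

    def handle_endtag(self, tag):
        if tag in VOID_TAGS:
            return
        if self.skip_depth:
            self.skip_depth -= 1
            return
        if self.current.parent is not None:
            self.current = self.current.parent

    def handle_data(self, data):
        if not self.skip_depth:
            self.current.own_text += data.strip()


def iter_nodes(node):
    """Yields a node and all of its descendants in document order."""
    yield node
    for child in node.children:
        yield from iter_nodes(child)


def text_length(node):
    return sum(len(n.own_text) for n in iter_nodes(node))


def tag_count(node):
    return sum(1 for _ in iter_nodes(node))


def text_density(node):
    """Assumed density measure: text characters per tag in the subtree."""
    return text_length(node) / tag_count(node)


def densest_node(html):
    """Returns the node with the highest assumed text density."""
    builder = TreeBuilder()
    builder.feed(html)
    builder.close()
    candidates = [n for n in iter_nodes(builder.root) if n is not builder.root]
    return max(candidates, key=text_density)
```

In a MapReduce setting, a function like `densest_node` would plausibly run inside the map step, one page per record, since each page is scored independently of the others; how the paper actually divides the work is not stated in this abstract.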

Key words

Web information/extraction/MapReduce/DOM tree

Classification

Information technology and security science

Cite this article

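WANG Haiyan, CAO Pan. Information extraction from massive Web pages based on node property and text content[J]. Journal on Communications, 2016, 37(10): 9-17, 9.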

Foundation items

The National Natural Science Foundation of China (No. 61201163, No. 61672297); Six Talent Peaks Project in Jiangsu Province (No. 2013-JY-022); 333 High Level Personnel Training Project in Jiangsu Province

Journal on Communications

OA, Peking University Core Journals, CSCD, CSTPCD

ISSN: 1000-436X
