Web Information Automatic Extraction Based on DOM Tree and Visual Feature

Huang Wuguan (黄武冠) ¹, Zhu Ming (朱明) ¹, Yin Wenke (尹文科) ¹

Computer Engineering, 2013, Issue (10): 309-312, 4. DOI: 10.3969/j.issn.1000-3428.2013.10.068

Author information

1. Department of Automation, University of Science and Technology of China, Hefei 230027

Abstract

This paper proposes an automatic Web information extraction method based on the Document Object Model (DOM) tree and visual features, aimed at extracting business information from the list pages of life information websites. The method first analyzes the DOM tree and the visual features of data regions in a list page to obtain candidate target data regions. It then uses visual features to identify the target data region among the candidates, and finally extracts the data records. Tested on ten life information websites, the method achieves 100% recall and 100% precision on eight of them. The results show that the proposed method extracts data records effectively.
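
The three stages named in the abstract (candidate regions, target region, data records) can be illustrated with a minimal sketch. This is not the authors' algorithm: the abstract does not specify the visual features or the matching rules, so the sketch assumes that a candidate region is any node with at least three structurally identical children, and it uses text volume as a crude stand-in for rendered visual features such as area or position. All names (`TreeBuilder`, `candidate_regions`, `pick_target`, `extract_records`) and the sample page are hypothetical.

```python
from collections import Counter
from html.parser import HTMLParser

VOID_TAGS = {"br", "img", "hr", "input", "meta", "link"}


class Node:
    def __init__(self, tag, parent=None):
        self.tag = tag
        self.parent = parent
        self.children = []
        self.text = ""


class TreeBuilder(HTMLParser):
    """Builds a very small DOM-like tree (tags and text only)."""

    def __init__(self):
        super().__init__()
        self.root = Node("root")
        self.current = self.root

    def handle_starttag(self, tag, attrs):
        node = Node(tag, self.current)
        self.current.children.append(node)
        if tag not in VOID_TAGS:
            self.current = node

    def handle_endtag(self, tag):
        # Climb to the nearest open element with this tag, if there is one.
        node = self.current
        while node is not self.root and node.tag != tag:
            node = node.parent
        if node is not self.root:
            self.current = node.parent

    def handle_data(self, data):
        self.current.text += data.strip()


def signature(node):
    """Shape of a subtree, used to compare sibling nodes."""
    return (node.tag, tuple(signature(c) for c in node.children))


def text_size(node):
    return len(node.text) + sum(text_size(c) for c in node.children)


def candidate_regions(root, min_records=3):
    """Assumed rule: a region has >= min_records children of one shape."""
    found, stack = [], [root]
    while stack:
        node = stack.pop()
        stack.extend(node.children)
        if node.children:
            counts = Counter(signature(c) for c in node.children)
            sig, count = counts.most_common(1)[0]
            if count >= min_records:
                records = [c for c in node.children if signature(c) == sig]
                found.append((node, records))
    return found


def pick_target(regions):
    """Stand-in for visual features: prefer the region with the most text."""
    return max(regions, key=lambda r: sum(text_size(c) for c in r[1]))


def extract_records(html):
    builder = TreeBuilder()
    builder.feed(html)
    builder.close()
    regions = candidate_regions(builder.root)
    if not regions:
        return []
    _, records = pick_target(regions)
    return [[c.text for c in rec.children if c.text] for rec in records]


PAGE = """
<div id="nav"><a>Home</a><a>Food</a><a>Hotels</a></div>
<ul id="list">
  <li><h3>Shop A</h3><p>1 Main Road, 555-0101</p></li>
  <li><h3>Shop B</h3><p>2 Lake Street, 555-0102</p></li>
  <li><h3>Shop C</h3><p>3 Hill Avenue, 555-0103</p></li>
</ul>
"""

for record in extract_records(PAGE):
    print(record)
```

In the sample page both the navigation bar and the business list qualify as candidate regions, which is why a second, feature-based step is needed to single out the target region. A real implementation would presumably read layout information from a rendering engine rather than rely on text length.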

Keywords

Document Object Model (DOM) tree / visual feature / automatic extraction / data record / data region / mining algorithm

Classification

Information Technology and Security Science

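Cite this article

Huang Wuguan, Zhu Ming, Yin Wenke. Web Information Automatic Extraction Based on DOM Tree and Visual Feature[J]. Computer Engineering, 2013, (10): 309-312, 4.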

Funding

National Key Technology R&D Program of China (2011BAH11B01); Key Deployment Project of the Chinese Academy of Sciences (KGZD-EW-103-(5))

Computer Engineering

OA / CSCD / CSTPCD

ISSN: 1000-3428
