
Deep Web Query Result Page Content Extraction Using Visual and Tag Information

FENG Yong, TANG Li

Journal of Chongqing University (Natural Science Edition), 2012, Vol. 35, Issue 6: 117-124, 8.

Combining vision information and tag information to extract Deep Web result pages content

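FENG Yong 1, TANG Li 1

Author information

  • 1. College of Computer Science, Chongqing University, Chongqing 400044 / Key Laboratory of Dependable Service Computing in Cyber Physical Society, Ministry of Education, Chongqing University, Chongqing 400044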

Abstract

Extracting content from Deep Web pages is challenging because of the intricate underlying structure of such pages. A vision- and tag-based approach (DVS) is proposed. It primarily uses the vision information and tag information on Deep Web result pages to extract the content structure of the pages. The approach has two steps. First, vision information and tag information are obtained by analyzing the Cascading Style Sheets and the DOM tree, and an initial visual tree of the Deep Web result page is generated. Then the Path Shingle (PS) algorithm, which considers both the vision and the tag information, is applied: the blocks in the visual tree are clustered according to their computed similarity to produce the final visual tree, i.e., the content structure of the page. The innovations of DVS are that it uses both the vision information and the tag information on Deep Web pages to extract the content structure, and that it stores the vision information as a tree, which transforms the analysis of vision information into the analysis of a vision-attribute tree. Experiments are conducted on a large set of Web databases, UIUC’s TEL. The experimental results show that the vision- and tag-based approach achieves higher precision than the WTS algorithm and the VIPS algorithm.
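
The abstract does not spell out the Path Shingle (PS) algorithm, so the following is only an illustrative sketch of the general idea of the second step: score pairs of blocks by combining a tag-based similarity with a vision-based similarity, then group similar blocks. Everything here is an assumption made for illustration and is not taken from the paper: the `Block` record, the use of Jaccard similarity over contiguous tag-path shingles, the style comparison, the weight `alpha`, the threshold, and the greedy grouping.

```python
from dataclasses import dataclass
from typing import Dict, List, Sequence, Set, Tuple


@dataclass
class Block:
    """Hypothetical node of a visual tree (not the paper's data structure)."""
    path: List[str]        # tag path from the DOM root to the block
    style: Dict[str, str]  # resolved CSS properties of the block


def shingles(path: Sequence[str], w: int = 2) -> Set[Tuple[str, ...]]:
    """Return the set of contiguous w-grams (shingles) of a tag path."""
    if len(path) < w:
        return {tuple(path)}
    return {tuple(path[i:i + w]) for i in range(len(path) - w + 1)}


def jaccard(a: Set, b: Set) -> float:
    """Jaccard similarity of two sets; two empty sets count as identical."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def style_similarity(s1: Dict[str, str], s2: Dict[str, str]) -> float:
    """Fraction of CSS properties (over the union of keys) with equal values."""
    keys = set(s1) | set(s2)
    if not keys:
        return 1.0
    return sum(1 for k in keys if s1.get(k) == s2.get(k)) / len(keys)


def block_similarity(b1: Block, b2: Block, alpha: float = 0.5, w: int = 2) -> float:
    """Weighted mix of tag-path similarity and visual similarity (assumed form)."""
    tag_sim = jaccard(shingles(b1.path, w), shingles(b2.path, w))
    vis_sim = style_similarity(b1.style, b2.style)
    return alpha * tag_sim + (1 - alpha) * vis_sim


def cluster_blocks(blocks: List[Block], threshold: float = 0.8) -> List[List[int]]:
    """Greedy grouping: a block joins the first cluster whose first member
    is at least `threshold` similar to it, otherwise it starts a new cluster."""
    clusters: List[List[int]] = []
    for i, block in enumerate(blocks):
        for cluster in clusters:
            if block_similarity(block, blocks[cluster[0]]) >= threshold:
                cluster.append(i)
                break
        else:
            clusters.append([i])
    return clusters


if __name__ == "__main__":
    record_style = {"font-size": "12px", "color": "#000"}
    blocks = [
        Block(["html", "body", "div", "ul", "li"], dict(record_style)),
        Block(["html", "body", "div", "ul", "li"], dict(record_style)),
        Block(["html", "body", "div", "p"], {"font-size": "16px", "color": "#333"}),
    ]
    print(cluster_blocks(blocks))
```

In this toy input the two list items share both tag path and style and fall into one group, while the paragraph block differs on both and forms its own group. A real implementation would need the paper's actual similarity definition and parameter choices.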

Key words

deep web/content extraction/DOM tree/Cascading Style Sheets/visual tree

Classification

Information Technology and Security Science

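Cite this article

FENG Yong, TANG Li. Combining vision information and tag information to extract Deep Web result pages content[J]. Journal of Chongqing University (Natural Science Edition), 2012, 35(6): 117-124, 8.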

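Funding

National Natural Science Foundation of China

Chongqing Higher Education Teaching Reform Research Key Project

Fundamental Research Funds for the Central Universities

Project 211, Phase III Construction Program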

Journal of Chongqing University (Natural Science Edition)

OA / PKU Core / CSCD / CSTPCD

ISSN 1000-582X
