Study of Web pages content extraction based on layout similarity

Application Research of Computers, Issue (9): 2581-2586, 6. DOI: 10.3969/j.issn.1001-3695.2015.09.005

Yang Liuqing¹, Li Xiaodong², Geng Guanggang³

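Author information

1. Computer Network Information Center, Chinese Academy of Sciences, Beijing 100190
2. China Internet Network Information Center, Beijing 100190
3. China Internet Network Information Center, Beijing 100190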
Abstract

An appropriate Web content extraction technique removes redundant, repetitive, and useless data from massive collections of Web pages while extracting the more meaningful and useful data. Based on the observation that pages under the same Web site are similar in content layout and style structure, this paper proposes and implements a Web content extraction method based on layout similarity. The method extracts the main content by comparing the similarity of DOM node structure data across Web pages that belong to the same topic on the same site. The paper also reports tentative research and implementation on other topics relevant to this extraction method. Experiments show that the method is simple, practical, and universal, and that it not only meets the requirement of high accuracy but also provides support for further Internet applications of content analysis.
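
The abstract does not spell out the algorithm, so the following is only a minimal sketch of one plausible reading of the idea: two pages from the same site are reduced to (tag path, text) pairs, and any pair that appears in both pages is treated as shared template and dropped. The names `PathTextParser`, `path_texts`, `extract_main_content`, and the sample pages are hypothetical, and exact path-and-text matching is an assumption made here for illustration, not the similarity measure used in the paper.

```python
from html.parser import HTMLParser

VOID_TAGS = {"br", "img", "hr", "meta", "link", "input"}  # tags with no closing tag
SKIP_TAGS = {"script", "style"}                            # never part of visible text


class PathTextParser(HTMLParser):
    """Collect a (tag path, text) pair for every non-empty text node."""

    def __init__(self):
        super().__init__()
        self.stack = []   # currently open tags, outermost first
        self.items = []   # collected (path, text) pairs in document order

    def handle_starttag(self, tag, attrs):
        if tag not in VOID_TAGS:
            self.stack.append(tag)

    def handle_endtag(self, tag):
        # Pop back to the matching open tag; unmatched end tags are ignored.
        if tag in self.stack:
            while self.stack.pop() != tag:
                pass

    def handle_data(self, data):
        text = data.strip()
        if text and not SKIP_TAGS.intersection(self.stack):
            self.items.append(("/".join(self.stack), text))


def path_texts(html):
    """Return the (tag path, text) pairs of an HTML string."""
    parser = PathTextParser()
    parser.feed(html)
    parser.close()
    return parser.items


def extract_main_content(page, sibling):
    """Keep the text of `page` that does not recur at the same path in `sibling`."""
    shared = set(path_texts(sibling))
    return [text for path, text in path_texts(page) if (path, text) not in shared]


# Two hypothetical pages that share a navigation bar and a footer.
PAGE_A = ("<html><body><div id='nav'>Home | News</div>"
          "<div id='main'><h1>Title A</h1><p>Story A text.</p></div>"
          "<div id='foot'>Copyright</div></body></html>")
PAGE_B = ("<html><body><div id='nav'>Home | News</div>"
          "<div id='main'><h1>Title B</h1><p>Story B text.</p></div>"
          "<div id='foot'>Copyright</div></body></html>")

if __name__ == "__main__":
    print(extract_main_content(PAGE_A, PAGE_B))
```

In this toy case the navigation bar and footer occur at the same path with the same text in both pages, so only the heading and the story paragraph of `PAGE_A` survive. A practical version would likely compare against several sibling pages and use a tolerant similarity score rather than exact equality.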

Key words

layout similarity / Web page content extraction / information retrieval

Classification

Information technology and security science

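Cite this article

Yang Liuqing, Li Xiaodong, Geng Guanggang. Study of Web pages content extraction based on layout similarity [J]. Application Research of Computers, 2015, (9): 2581-2586, 6.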

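Funding

National Natural Science Foundation of China, General Program (61375039); National Natural Science Foundation of China, Young Scientists Fund (61005029); Special Fund for Key Cultivation Directions under the "One-Three-Five" Plan of the Computer Network Information Center, Chinese Academy of Sciences ()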

Application Research of Computers

OA / PKU Core / CSCD / CSTPCD

ISSN: 1001-3695
