Text Extraction Algorithm Based on Clustering of Structurally Similar Web Pages (基于结构相似网页聚类的正文提取算法研究)

Wang Haiyong¹, Feng Zhaoxu¹, Yang Haibo¹, Zhang Jindong¹

Computer Engineering and Applications, 2018, Vol. 54, Issue (11): 122-127, 139, 7. DOI: 10.3778/j.issn.1002-8331.1701-0161

Research on text extraction algorithm based on structure similarity page clustering

Author information

1. School of Electronic and Information Engineering, Lanzhou Jiaotong University, Lanzhou 730070

Abstract

Web pages are becoming increasingly diverse and complex, which makes information extraction more difficult. This paper proposes a text extraction algorithm based on clustering structurally similar Web pages. First, each "block" is assigned a weight according to its contribution to the template, based on the front-end composition of the Web page. Second, the similarity of the corresponding blocks in two Web pages is calculated; the products of each block's similarity and weight are summed to give the similarity of the two pages. The algorithm thus takes into account the influence of structural differences between Web pages on text extraction. Web pages are clustered according to the computed page similarity, and text extraction is more accurate for Web pages within the same cluster. Experimental results show that the method achieves higher accuracy and that the evaluation indexes are improved.
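The weighted block similarity and the clustering step described in the abstract can be sketched as follows. This is a minimal illustration, not the authors' implementation: the block names, the weight values, the use of Jaccard similarity over tag paths, the average-linkage merge rule, and the 0.6 threshold are all assumptions, since the abstract does not specify them.

```python
from itertools import combinations

# Hypothetical block weights (assumed values, chosen to sum to 1).
BLOCK_WEIGHTS = {"header": 0.2, "nav": 0.2, "main": 0.5, "footer": 0.1}


def block_similarity(a, b):
    """Jaccard similarity of two sets of tag paths (an assumed measure)."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def page_similarity(p, q, weights=BLOCK_WEIGHTS):
    """Sum over blocks of (block weight * block similarity)."""
    return sum(
        w * block_similarity(p.get(name, set()), q.get(name, set()))
        for name, w in weights.items()
    )


def cluster_pages(pages, threshold=0.6):
    """Agglomerative clustering with average linkage (assumed rule).

    Repeatedly merges the most similar pair of clusters and stops once
    the best average similarity falls below `threshold`.
    """
    clusters = [[i] for i in range(len(pages))]

    def linkage(c1, c2):
        sims = [page_similarity(pages[i], pages[j]) for i in c1 for j in c2]
        return sum(sims) / len(sims)

    while len(clusters) > 1:
        (x, y), best = max(
            (((x, y), linkage(clusters[x], clusters[y]))
             for x, y in combinations(range(len(clusters)), 2)),
            key=lambda item: item[1],
        )
        if best < threshold:
            break
        clusters[x] = clusters[x] + clusters[y]  # x < y, so deleting y is safe
        del clusters[y]
    return clusters


# Toy pages: each block is a set of tag paths (hypothetical data).
news_a = {"header": {"div/h1"}, "nav": {"ul/li/a"},
          "main": {"div/p", "div/h2"}, "footer": {"div/span"}}
news_b = {"header": {"div/h1"}, "nav": {"ul/li/a"},
          "main": {"div/p", "div/h2", "div/img"}, "footer": {"div/span"}}
forum = {"header": {"div/img"}, "nav": {"ol/li"},
         "main": {"table/tr/td"}, "footer": {"p/a"}}

clusters = cluster_pages([news_a, news_b, forum])
print(clusters)
```

In this toy data the two news pages share most of their structure and end up in one cluster, while the forum page stays on its own. A per-cluster extraction template could then be applied to each group, which is the step the abstract credits for the accuracy gain.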

Key words

information extraction / similarity / Document Object Model (DOM) tree / hierarchical clustering

Classification

Information Technology and Security Science

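Cite this article

Wang Haiyong, Feng Zhaoxu, Yang Haibo, Zhang Jindong. Research on text extraction algorithm based on structure similarity page clustering[J]. Computer Engineering and Applications, 2018, 54(11): 122-127, 139, 7.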

Funding

Natural Science Foundation of Gansu Province (No. 145RJZA086)

Science and Technology Support Fund of Lanzhou Jiaotong University (No. ZC2014003)

Lanzhou Science and Technology Plan Project (No. 2013-3-79)

Computer Engineering and Applications

Open access; indexed in Peking University Core Journals (北大核心), CSCD, CSTPCD

ISSN: 1002-8331
