Computer Engineering and Applications, 2018, Vol. 54, Issue 11: 122-127, 139, 7. DOI: 10.3778/j.issn.1002-8331.1701-0161
Research on a text extraction algorithm based on structure similarity page clustering
Abstract
Web pages are becoming increasingly diverse and complex, which makes information extraction more difficult. This paper proposes a text extraction algorithm based on clustering structurally similar Web pages. First, each "block" is assigned a weight according to its contribution to the template, based on the composition of the Web page's front end. Second, the similarity of the corresponding blocks in two Web pages is calculated, and the similarity of the two pages is taken as the sum, over all blocks, of each block's similarity multiplied by its weight. The algorithm therefore takes into account the influence of structural differences between Web pages on text extraction. Web pages are clustered according to the computed similarity between them, and text extraction is more accurate for pages within the same cluster. The experimental results show that the method achieves higher accuracy and improves the evaluation indexes.
Keywords
information extraction / similarity / Document Object Model (DOM) tree / hierarchical clustering
Classification
Information technology and security science
Cite this article
Wang Haiyong (王海涌), Feng Zhaoxu (冯兆旭), Yang Haibo (杨海波), Zhang Jindong (张津栋). Research on text extraction algorithm based on structure similarity page clustering [J]. Computer Engineering and Applications, 2018, 54(11): 122-127, 139, 7.
Funding
Natural Science Foundation of Gansu Province (No. 145RJZA086)
Science and Technology Support Fund of Lanzhou Jiaotong University (No. ZC2014003)
Lanzhou Science and Technology Plan Project (No. 2013-3-79)
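
Illustrative sketch
The abstract defines page similarity as a weighted sum of per-block similarities and then clusters pages on that similarity. The Python sketch below is one possible reading of that description, not the authors' implementation. The block names, the weight values, the use of Jaccard similarity over sets of DOM tag paths, the average-linkage rule, and the 0.7 threshold are all assumptions made for illustration; the abstract does not specify any of them.

```python
from itertools import combinations

# Hypothetical block weights (assumption); the paper derives its own weights
# from each block's contribution to the page template.
BLOCK_WEIGHTS = {"header": 0.2, "nav": 0.2, "content": 0.5, "footer": 0.1}


def block_similarity(a, b):
    """Jaccard similarity of two sets of DOM tag paths (an assumed measure)."""
    if not a and not b:
        return 1.0  # two empty blocks are treated as identical
    return len(a & b) / len(a | b)


def page_similarity(p, q, weights=BLOCK_WEIGHTS):
    """Sum over blocks of (block weight * block similarity)."""
    return sum(
        w * block_similarity(p.get(name, set()), q.get(name, set()))
        for name, w in weights.items()
    )


def cluster_pages(pages, threshold=0.7):
    """Agglomerative (average-linkage) clustering of pages by index.

    Repeatedly merges the two most similar clusters and stops once the best
    available merge falls below `threshold`.
    """
    clusters = [[i] for i in range(len(pages))]

    def linkage(c1, c2):
        total = sum(page_similarity(pages[i], pages[j]) for i in c1 for j in c2)
        return total / (len(c1) * len(c2))

    while len(clusters) > 1:
        x, y = max(
            combinations(range(len(clusters)), 2),
            key=lambda ij: linkage(clusters[ij[0]], clusters[ij[1]]),
        )
        if linkage(clusters[x], clusters[y]) < threshold:
            break
        clusters[x] = clusters[x] + clusters[y]
        del clusters[y]  # safe: combinations guarantees x < y
    return clusters


# Toy pages: each block is a set of made-up DOM tag paths.
page_a = {"header": {"div/h1"}, "nav": {"ul/li"},
          "content": {"div/p", "div/h2"}, "footer": {"div/span"}}
page_b = {"header": {"div/h1"}, "nav": {"ul/li"},
          "content": {"div/p", "div/h2", "div/img"}, "footer": {"div/span"}}
page_c = {"header": {"table/tr"}, "nav": {"table/td"},
          "content": {"table/td/p"}, "footer": set()}

clusters = cluster_pages([page_a, page_b, page_c])
print(clusters)  # [[0, 1], [2]]
```

In this toy input, pages a and b share the same template and differ only in one content path, so they merge into one cluster, while page c has a different structure and stays alone. Under the abstract's premise, an extraction rule tuned on one page of a cluster would then be applied to the other pages of that cluster.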