计算机应用研究 (Application Research of Computers), Issue (9): 2581-2586, 6. DOI: 10.3969/j.issn.1001-3695.2015.09.005
基于布局相似性的网页正文内容提取研究
Study of Web pages content extraction based on layout similarity
Abstract
An appropriate Web content extraction technique can remove redundant, repetitive, and useless data from massive numbers of Web pages while extracting more meaningful and useful data. Based on the observation that pages under the same Web site are similar in content layout and style structure, this paper proposes and implements a Web content extraction method based on layout similarity. The method extracts the main content by comparing the similarity of the DOM node structure data of Web pages that belong to the same topic on the same site. The paper also reports some tentative research and implementation on other topics relevant to this content extraction method. Experiments show that the method is simple, practical, and universal, and that it not only meets the requirement of high accuracy but also provides support for more Internet applications of content analysis.
Keywords (Chinese)
布局相似性/网页正文提取/信息检索
Keywords
layout similarity/Web page content extract/information retrieval
Classification
Information technology and security science
Cite this article
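杨柳青 (Yang Liuqing), 李晓东 (Li Xiaodong), 耿光刚 (Geng Guanggang). 基于布局相似性的网页正文内容提取研究 (Study of Web pages content extraction based on layout similarity) [J]. 计算机应用研究 (Application Research of Computers), 2015, (9): 2581-2586, 6.
Funding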
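General Program of the National Natural Science Foundation of China (61375039); Young Scientists Fund of the National Natural Science Foundation of China (61005029); special fund for key cultivation directions under the "One-Three-Five" plan of the Computer Network Information Center, Chinese Academy of Sciences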
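
Illustrative sketch
The abstract describes extracting main content by comparing the DOM node structure of pages from the same site. The following is a minimal sketch of that general idea, not the authors' algorithm: it assumes that a text block appearing under the same tag path with the same text in two sibling pages is shared template material (navigation, footer), and that whatever remains is main content. The names `PathTextCollector`, `collect_blocks`, `layout_similarity`, and `extract_main_content` are hypothetical, and the Jaccard measure over tag paths is an assumption chosen for simplicity.

```python
from html.parser import HTMLParser

# Tags that never have a closing tag, so they must not be pushed on the stack.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
             "link", "meta", "source", "track", "wbr"}
# Tags whose text is code or styling, not page content.
SKIP_TAGS = {"script", "style"}


class PathTextCollector(HTMLParser):
    """Collect (tag path, text) pairs for every non-empty text node."""

    def __init__(self):
        super().__init__()
        self.stack = []   # currently open tags, outermost first
        self.blocks = []  # list of (path, text) in document order

    def handle_starttag(self, tag, attrs):
        if tag not in VOID_TAGS:
            self.stack.append(tag)

    def handle_endtag(self, tag):
        # Pop back to the matching open tag, tolerating unclosed children.
        if tag in self.stack:
            while self.stack and self.stack.pop() != tag:
                pass

    def handle_data(self, data):
        text = " ".join(data.split())  # normalize whitespace
        if text and not any(t in SKIP_TAGS for t in self.stack):
            self.blocks.append(("/".join(self.stack), text))


def collect_blocks(html):
    """Parse an HTML string and return its (tag path, text) pairs."""
    parser = PathTextCollector()
    parser.feed(html)
    parser.close()
    return parser.blocks


def layout_similarity(html_a, html_b):
    """Jaccard similarity of the tag-path sets of two pages (assumed measure)."""
    paths_a = {path for path, _ in collect_blocks(html_a)}
    paths_b = {path for path, _ in collect_blocks(html_b)}
    union = paths_a | paths_b
    return len(paths_a & paths_b) / len(union) if union else 1.0


def extract_main_content(target_html, sibling_html):
    """Return the text blocks of target_html that sibling_html does not share."""
    shared = set(collect_blocks(sibling_html))
    return [text for path, text in collect_blocks(target_html)
            if (path, text) not in shared]


if __name__ == "__main__":
    # Two made-up pages with the same layout but different articles.
    template = ("<html><body><div><a>Home</a><a>News</a></div>"
                "<div><h1>{title}</h1><p>{body}</p></div>"
                "<div><span>Copyright 2015</span></div></body></html>")
    page_a = template.format(title="Title A", body="Body of article A.")
    page_b = template.format(title="Title B", body="Body of article B.")
    print(layout_similarity(page_a, page_b))
    print(extract_main_content(page_a, page_b))
```

A practical system would likely compare more than two pages and check `layout_similarity` first, so that only pages with a sufficiently similar layout are used as siblings; the threshold for "sufficiently similar" is left open here because the abstract does not state one.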