Journal of Shenyang University, 2012, Vol. 24, Issue (3): 52-55, 4.
Method of Web Content Mining Based on XML
Zheng Xia 1, Chen Jianguo 2
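Author information
- 1. Department of Computer Science, Minjiang University, Fuzhou, Fujian 350001
- 2. School of Software, Fujian University of Technology, Fuzhou, Fujian 350003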
Abstract
The characteristics of Web content mining are analyzed, and an XML-based model of Web content mining is proposed. The HITS algorithm is used to determine the authority of Web pages; the HTML Tidy tool is used to clean non-XML documents and convert them into well-formed XML documents; and text clustering techniques are used to classify the data in the XML documents during mining. Applied to an example system that automatically extracts traditional scientific papers from the Internet, the model is shown to work well and to extract Web page content automatically and effectively.
Key words
Web mining / data mining / text clustering / non-XML documents
Classification
Computer science and automation
Cite this article
Zheng Xia, Chen Jianguo. Method of Web Content Mining Based on XML [J]. Journal of Shenyang University, 2012, 24(3): 52-55, 4.