Journal of Chaohu College (巢湖学院学报), Issue (6): 34-38, 5.
Design and Implementation of a Web Document Conversion Algorithm in Data Mining
Abstract
Web text mining is one of the important applications of data mining technology to information analysis and processing. Transforming web documents into the format that data mining requires, i.e. web document preprocessing, has therefore become a significant research task. The method presented in this paper is as follows. A large number of webpage files are downloaded from the Internet and converted into text files. An algorithm then computes word frequency statistics over the text files: it deletes non-content words, removes high-frequency words, reduces content words to their roots by extracting stems, eliminates redundant words, and establishes a word list. From the extracted word list, an alphabetical index is built to generate a word frequency index, which is compared against a dictionary file to obtain word IDs. Finally, the data are generated in the Reuters-21578 database format. In this way the web document data are converted into standard data sets, ready for classification and clustering in data mining.
Keywords
web documents / data mining / preprocessing
Classification
Information technology and safety science
Cite this article
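Zhao Xiaolong (赵小龙), She Dong (佘东). Design and Implementation of a Web Document Conversion Algorithm in Data Mining [J]. Journal of Chaohu College (巢湖学院学报), 2011, (6): 34-38, 5.
Funding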
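Anhui Province Outstanding Talent Fund for Higher Education Institutions ()
Chaohu College General Project ()
Supported by the natural science fund project "Development of a College Research Management Information System" of Anhui Institute of Industry and Economics (安徽工业经济学院) ()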
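
The preprocessing steps listed in the abstract (webpage to text, word frequency statistics, word filtering, stemming, alphabetical index, dictionary lookup for word IDs) can be illustrated with a minimal Python sketch. This is not the authors' implementation: the stop-word list, the crude suffix-stripping stemmer, the `max_freq` threshold, and all function names are assumptions made for illustration, and the final serialization into the Reuters-21578 format is left out.

```python
import re
from collections import Counter
from html.parser import HTMLParser

# Hypothetical stop-word list; a real run would load a much larger one.
STOP_WORDS = {"a", "an", "and", "the", "of", "to", "in", "is", "are", "for", "on"}


class TextExtractor(HTMLParser):
    """Collects visible text, skipping <script> and <style> content."""

    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth == 0:
            self.parts.append(data)


def html_to_text(html):
    """Convert one webpage (as an HTML string) to plain text."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(parser.parts)


def naive_stem(word):
    """Crude suffix stripping; a stand-in for a real stemmer (assumption)."""
    for suffix in ("ing", "ed", "es", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def word_frequencies(text, max_freq=None):
    """Count stems, dropping stop words and, optionally, high-frequency words."""
    tokens = re.findall(r"[a-z]+", text.lower())
    stems = [naive_stem(t) for t in tokens if t not in STOP_WORDS]
    counts = Counter(stems)
    if max_freq is not None:
        counts = Counter({w: c for w, c in counts.items() if c <= max_freq})
    return counts


def to_id_vector(counts, dictionary):
    """Walk the words alphabetically and pair each dictionary ID with its count.

    Words missing from the dictionary are assigned the next free ID.
    """
    vector = []
    for word in sorted(counts):
        word_id = dictionary.setdefault(word, len(dictionary) + 1)
        vector.append((word_id, counts[word]))
    return vector
```

A typical call chain would be `to_id_vector(word_frequencies(html_to_text(page)), dictionary)`, where `dictionary` is a word-to-ID mapping shared across all documents so that the same word receives the same ID everywhere.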