Modern Electronics Technique (现代电子技术), 2016, Vol. 39, Issue (1): 108-112, 117, 6. DOI: 10.16652/j.issn.1004-373x.2016.01.029
Study on Web page content classification technology based on semi-supervised learning
Abstract
For the key problem of how to use both labeled and unlabeled data for Web page classification, a classifier that combines a generative model with a discriminative model is explored. Maximum likelihood estimation is applied on the unlabeled training set to construct a semi-supervised classifier with high classification performance. A Dirichlet-multinomial mixture distribution is used to model the text, and a hybrid model suited to semi-supervised learning is then proposed. Because the EM algorithm for semi-supervised learning converges quickly but easily falls into a local optimum, two intelligent optimization methods, the simulated annealing algorithm and the genetic algorithm, are introduced and analyzed. A new intelligent semi-supervised classification algorithm is obtained by combining the two algorithms, and its feasibility is verified.
Key words
Web page content classification/semi-supervised learning/semi-supervised classification/intelligent optimization/Dirichlet distribution
Classification
Information technology and security science
Cite this article
ZHAO Fuqun (赵夫群). Study on Web page content classification technology based on semi-supervised learning[J]. Modern Electronics Technique, 2016, 39(1): 108-112, 117, 6.
Funding
Special Scientific Research Program of Xianyang Normal University: Research on three-dimensional reservoir data processing based on artificial intelligence (07XSYK224)
Special Scientific Research Program of the Education Department of Shaanxi Province: Protection and inheritance of the Guanzhong dialect in an information-based environment (12JK0212)