Journal of Nanjing University (Natural Sciences), 2012, Vol. 48, Issue (4): 405-411, 7.
A semi-supervised clustering algorithm based on seeds and pair-wise constraints
Abstract
Semi-supervised learning, an application-driven machine learning approach, has become a hot topic in artificial intelligence and pattern recognition. As a main branch of semi-supervised learning, semi-supervised clustering introduces a small amount of supervision information into the search for an optimal clustering. Various semi-supervised clustering algorithms have been proposed recently, including search-based methods, similarity-based methods, and methods that combine search and similarity. However, most current semi-supervised clustering algorithms do not use seeds and pair-wise constraints at the same time. Therefore, a semi-supervised clustering algorithm based on seeds and pair-wise constraints is introduced in order to make full use of the given supervision information. Tri-training is a representative method based on the Co-training mechanism. Because Tri-training uses three classifiers to label unlabeled samples, the proposed algorithm employs it to obtain more labeled samples. First, some unlabeled samples are selected and annotated with Tri-training to enlarge the initial set of labeled samples. Second, pair-wise constraints are used to refine the enlarged labeled set and improve its quality. Third, the initial cluster centers are computed from the refined labeled samples. Finally, K-Means is run, and during the search, pair-wise constraints are used to adjust the partitioning result at each iteration. The proposed algorithm is compared with the K-Means, Seeded-K-Means and COP-K-Means algorithms. Experimental results on three UCI data sets under the same settings demonstrate that the method takes full advantage of the given supervision information and obtains better clustering results.
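The last two steps (seed-based initialization, then K-Means with constraint checks) can be sketched as follows. This is a minimal illustration and not the authors' implementation: the abstract does not say how constraints adjust the partition, so a COP-K-Means-style check stands in for that step, the Tri-training and seed-refinement steps are omitted, and all names (`seeded_constrained_kmeans`, `violates`, `seeds`, `must_link`, `cannot_link`) are hypothetical. The fallback to the nearest cluster when no cluster satisfies the constraints is also an assumption.

```python
from math import dist


def mean(vectors):
    """Component-wise mean of a non-empty list of equal-length vectors."""
    n = len(vectors)
    return tuple(sum(col) / n for col in zip(*vectors))


def violates(i, cluster, labels, must_link, cannot_link):
    """True if putting point i into `cluster` breaks a constraint
    with a point that has already been assigned in this pass."""
    for a, b in must_link:
        if i in (a, b):
            other = b if i == a else a
            if labels[other] is not None and labels[other] != cluster:
                return True
    for a, b in cannot_link:
        if i in (a, b):
            other = b if i == a else a
            if labels[other] == cluster:
                return True
    return False


def seeded_constrained_kmeans(points, seeds, must_link=(), cannot_link=(), n_iter=20):
    """points: list of coordinate tuples.
    seeds: dict mapping cluster id -> indices of labeled points (hypothetical format).
    must_link / cannot_link: iterables of index pairs."""
    # Initial centers: mean of the labeled seed points of each cluster.
    centers = {c: mean([points[i] for i in idx]) for c, idx in seeds.items()}
    labels = [None] * len(points)
    for _ in range(n_iter):
        new_labels = [None] * len(points)
        for i, p in enumerate(points):
            # Clusters ordered from nearest to farthest center.
            ranked = sorted(centers, key=lambda c: dist(p, centers[c]))
            # Take the nearest cluster that violates no constraint;
            # if none qualifies, fall back to the nearest one (assumption).
            new_labels[i] = next(
                (c for c in ranked
                 if not violates(i, c, new_labels, must_link, cannot_link)),
                ranked[0],
            )
        if new_labels == labels:  # partition is stable
            break
        labels = new_labels
        # Recompute each center from its current members.
        for c in centers:
            members = [points[i] for i, lab in enumerate(labels) if lab == c]
            if members:
                centers[c] = mean(members)
    return labels, centers


if __name__ == "__main__":
    pts = [(0.0, 0.0), (0.2, 0.1), (1.0, 1.0), (1.1, 0.9), (0.4, 0.4)]
    labels, _ = seeded_constrained_kmeans(
        pts, seeds={0: [0], 1: [2]}, cannot_link=[(4, 1)]
    )
    print(labels)
```

In this toy input, point 4 lies closer to the first seed, but the cannot-link pair with point 1 pushes it into the other cluster. Because constraints are checked only against points already assigned in the current pass, the outcome depends on point order.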
Moreover, an experiment on the Haberman data set is conducted to analyze the relative impact of the number of pair-wise constraints and the number of labeled samples on the algorithm's performance. The results show that the more pair-wise constraints or labeled samples are available, the better the algorithm performs.
Key words
semi-supervised clustering/seeds/pair-wise constraints
Category
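Computer Science and Automation
Cite this article
Chang Yu, Liang Jiye, Gao Jiawei, Yang Jing. A semi-supervised clustering algorithm based on seeds and pair-wise constraints [J]. Journal of Nanjing University (Natural Sciences), 2012, 48(4): 405-411, 7.
Funding
National Natural Science Foundation of China (71031006, 70971080); National "973" Program Preliminary Research Special Project (2011CB311805); Specialized Research Fund for the Doctoral Program of Higher Education ()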