Journal of Nanjing University (Natural Sciences), 2017, Vol. 53, Issue (2): 368-377, 10. DOI: 10.13232/j.cnki.jnju.2017.02.019
A fast density-peak search clustering algorithm based on cluster boundaries
An improved clustering algorithm by fast search and find of density peaks based on boundary samples
Abstract
In the data mining community, clustering is one of the most important research topics because of the complexity and unsupervised nature of data, and a great many techniques have been devoted to data clustering algorithms. The paper "Clustering by fast search and find of density peaks" (DPC), published in Science, focuses on density-based clustering. Compared with other clustering algorithms, DPC uses fewer parameters yet can obtain better clustering results. However, when a single cluster contains multiple density peaks, the clustering results are unsatisfactory. For this reason, a boundary-partition-based DPC algorithm, B-DPC, is proposed. B-DPC improves the standard DPC in two respects: a criterion for cleaning noisy data and a two-round clustering process. A new criterion for judging whether a data instance is noise is defined by calculating the distances among all data instances. A data instance is regarded as noise if the distances between it and all noisy data instances in the noisy dataset are less than a predetermined threshold. Such noisy instances are first removed from the dataset, and B-DPC then runs its two-round process. The first round applies the standard DPC to choose candidate cluster centers, from which initial clusters are obtained and the decision graph is built. The second round merges similar clusters so that the number of clusters is closer to the actual number; this is done by finding the boundary data instances, counting them, and computing the ratio of boundary instances to the neighboring clusters. To test B-DPC, several classical artificial datasets and real-world datasets are used in the experiments, and several well-performing clustering algorithms, such as DPC, DBSCAN, and K-means, are used for comparison. Experimental results show that B-DPC can solve the multiple-density-peaks problem effectively and can also discover clusters with arbitrary shapes.
Keywords
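density peaks/cluster centers/noise cleaning/clustering
Classification
Information Technology and Security Science
Cite this article
Jia Peiling, Fan Jiancong, Peng Yanjun. A fast density-peak search clustering algorithm based on cluster boundaries [J]. Journal of Nanjing University (Natural Sciences), 2017, 53(2): 368-377, 10.
Funding
National Natural Science Foundation of China (61203305, 61433012); Shandong Province Key Research and Development Program (2016GSF120012); Natural Science Foundation of Shandong Province (ZR2015FM013); "Taishan Scholar" Climbing Program of Shandong Province (61203305, 61433012)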