
基于改进K-means的局部离群点检测方法


工程科学与技术 2024, Vol. 56, Issue 4: 66-77, 12. DOI: 10.12454/j.jsuese.202201398


Local Outlier Detection Method Based on Improved K-means

周玉 1, 夏浩 1, 岳学震 1, 王培崇 2

Author Information

  • 1. College of Electrical Engineering, North China University of Water Resources and Electric Power, Zhengzhou, Henan 450045, China
  • 2. College of Information Engineering, Hebei GEO University, Shijiazhuang, Hebei 050031, China


Abstract

Objective: Outliers are data points generated by various special causes. They deviate from normal data points, are often regarded as noise, occupy a small proportion of the dataset, and are considered points of research value. The task of outlier detection is to identify these points and, by analyzing data attribute features, uncover the abnormal information they may carry. This process aims to reveal unusual patterns or behaviors within the dataset that provide insight into unique phenomena or anomalies. Most clustering-based outlier detection methods detect outliers primarily from a global perspective and perform worse on local outliers. Hence, an improved K-means clustering algorithm is proposed by introducing clustering by fast search and find of density peaks, and a local outlier detection method named KLOD (local outlier detection based on improved K-means and least squares methods) is developed to achieve precise detection of local outliers.
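The gap between global and local detection described above can be seen in a small numeric sketch (not from the paper; the dataset, k = 3 neighbors, and both scoring rules are illustrative assumptions): a plain kNN-distance score ranks points of a sparse cluster above a point that is anomalous only relative to a nearby dense cluster, while a simple LOF-style density ratio isolates it.

```python
import numpy as np

# Dense 7x7 grid (spacing 0.1), a sparse line of 7 points (spacing 3),
# and one planted local outlier at (2, 2): far from the dense cluster
# only relative to that cluster's own spacing.
dense = np.array([(i, j) for i in range(7) for j in range(7)], float) * 0.1
sparse = np.array([(20.0 + 3 * t, 20.0) for t in range(7)])
X = np.vstack([dense, sparse, [[2.0, 2.0]]])        # outlier has index 56

D = np.linalg.norm(X[:, None] - X[None, :], axis=-1)
knn = np.sort(D, axis=1)[:, 1:4].mean(axis=1)       # global score: mean distance to 3 NN

nbrs = np.argsort(D, axis=1)[:, 1:4]                # indices of each point's 3 NN
local = knn / knn[nbrs].mean(axis=1)                # local score: ratio to neighbours' kNN distances

# The global score peaks at an end point of the sparse line (index 49 or 55),
# missing the local outlier; the local density ratio peaks at index 56.
print(np.argmax(knn), np.argmax(local))
```

The local score used here is only a crude stand-in for LOF, but it captures the mechanism the abstract appeals to: a point is judged against the density of its own neighborhood rather than against a single global scale.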
Methods: The K-means clustering algorithm performs hard clustering: after the dataset is clustered, each data point belongs unambiguously to a single cluster. This property makes it suitable for outlier detection, because outliers significantly affect the clustering process. However, the choice of initial cluster centers and of the number of clusters is crucial, since both directly determine clustering effectiveness. To select accurate cluster centers, clustering by fast search and find of density peaks is used to compute the local density and relative distance of each data point and to construct a decision graph from these two metrics. The difficulty lies in accurately determining the cutoff distance dc, which makes it hard to identify the number of cluster centers precisely from a decision graph obtained with a single dc value. To address this, the elbow method is employed to determine the optimal number of clusters for an unknown dataset. When the data are clustered into different numbers of clusters, the cost function value changes accordingly: the number of clusters is plotted on the x-axis, the cost function value on the y-axis, and the changes are recorded as a line graph. When the cost function value no longer decreases significantly as the number of cluster centers grows, the position of the "elbow" indicates the optimal number of clusters. After the initial cluster centers and the number of clusters k are determined, the dataset is clustered with the K-means algorithm to obtain k clusters and their corresponding centers. The objective function value of each data point in each dimension within each cluster is then computed, and these values are sorted in ascending order for each cluster and dimension. The objective
function values, sorted in ascending order, are fitted with the least squares method to obtain a curve, and the derivative of this curve gives the slope, which reflects the rate of change of the objective function values within each cluster. Because the degree of dispersion and the information content of each dimension can vary across the dataset, different weights are assigned to each dimension: information entropy is used to measure the degree of dispersion, and dimensions with higher outlier degrees receive larger weights to reflect their impact on the overall dataset. With information entropy incorporated, each dimension's objective function value for each data point is weighted by the corresponding rate of change. This yields the final anomaly score, and the top-n data points with the highest anomaly scores are reported as outliers.

Results and Discussions: Experiments on the artificial datasets indicate that KLOD, KNN, and LOF all detect sparse local outliers effectively. However, the LOF algorithm struggles to detect outliers within outlier clusters, and the KNN method cannot detect local outliers within densely distributed clusters when normal data points lie far apart. In contrast, the KLOD method analyzes each cluster individually, addressing the problem of uneven cluster densities, and examines each dimension of the data points within each cluster separately, achieving accurate detection. On the UCI datasets, KLOD achieves the best detection accuracy on 10 datasets and matches KNN and LOF on 2 datasets; compared with the KNN and LOF algorithms, KLOD thus demonstrates high outlier detection accuracy. The fast-search density peak method is applied to compute the local density and relative distance of the data points, and the γ value of each data point is determined from these two metrics to improve the K-means clustering algorithm. However, the
size of γ is influenced by the cutoff distance dc, making it difficult to choose k initial cluster centers intuitively; hence, with the number of clusters fixed by the elbow method, the k data points with the largest γ values are selected as initial cluster centers for the K-means clustering algorithm. Least squares fitting is applied to the objective function values of each dimension sorted in ascending order; this highlights the degree of outlierness of the outliers and incorporates more outlier information into the final anomaly score.

Conclusions: Experimental results on artificial and UCI real datasets demonstrate that the KLOD method can effectively detect local outliers. Compared with the KNN and LOF methods, it significantly improves detection accuracy. However, owing to limitations of the K-means algorithm itself, its clustering performance is poor on datasets containing arbitrarily shaped clusters, which degrades detection performance. Future studies can therefore focus on improving the performance of outlier detection on datasets with arbitrary cluster shapes.
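The pipeline described in the Methods (density-peaks γ for initial centers, K-means, per-dimension objective values, least-squares slope, entropy weighting) can be sketched end to end. This is a minimal reconstruction under stated assumptions, not the authors' implementation: the Gaussian density kernel, the 2% cutoff-distance quantile, a linear rather than curved least-squares fit, the particular entropy-weight formula, and the toy data are all choices made here, and k is fixed at 2 instead of being chosen by the elbow method.

```python
import numpy as np

# Toy data: two tight 5x5 grid clusters plus one planted local outlier (index 50).
g = np.array([(i, j) for i in range(-2, 3) for j in range(-2, 3)], float) * 0.1
X = np.vstack([g, g + 5.0, [[1.0, 1.0]]])
n, d = X.shape
k = 2  # fixed here for the sketch; the paper chooses k with the elbow method

D = np.linalg.norm(X[:, None] - X[None, :], axis=-1)

# Density-peaks step: local density rho, relative distance delta, gamma = rho * delta.
dc = np.quantile(D[np.triu_indices(n, 1)], 0.02)      # cutoff distance (2% heuristic, an assumption)
rho = np.exp(-(D / dc) ** 2).sum(axis=1) - 1.0        # Gaussian-kernel density, excluding self
delta = np.empty(n)
for i in range(n):
    higher = rho > rho[i]
    delta[i] = D[i, higher].min() if higher.any() else D[i].max()
gamma = rho * delta
centers = X[np.argsort(gamma)[-k:]].copy()            # top-k gamma points as initial centers

# Standard Lloyd iterations (K-means) from those centers.
for _ in range(50):
    labels = np.argmin(np.linalg.norm(X[:, None] - centers[None], axis=-1), axis=1)
    new = np.array([X[labels == c].mean(axis=0) for c in range(k)])
    if np.allclose(new, centers):
        break
    centers = new

# Per-dimension objective values; least-squares slope of the sorted values per cluster/dimension.
obj = (X - centers[labels]) ** 2                      # squared deviation in each dimension
slope = np.zeros((k, d))
for c in range(k):
    vals = obj[labels == c]
    for j in range(d):
        s = np.sort(vals[:, j])                       # ascending objective values
        slope[c, j] = np.polyfit(np.arange(len(s)), s, 1)[0]  # linear fit as the rate of change

# Entropy weights: dimensions with more dispersed objective values get more weight.
p = obj / obj.sum(axis=0)
p_safe = np.where(p > 0, p, 1.0)                      # avoid log(0); zero terms contribute 0
e = -(p * np.log(p_safe)).sum(axis=0) / np.log(n)
w = (1 - e) / (1 - e).sum()

score = (w * slope[labels] * obj).sum(axis=1)         # final anomaly score
top = np.argsort(score)[::-1]                         # top-n points are reported as outliers
```

On this toy data the two grid centers receive the largest γ values, K-means converges immediately, and the planted point at (1, 1) receives by far the largest anomaly score, mirroring the behavior the abstract describes.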


Key words

outlier detection / K-means / least squares method / density peak / objective function value

Classification

Information Technology and Security Science

Cite This Article

周玉, 夏浩, 岳学震, 王培崇. 基于改进K-means的局部离群点检测方法[J]. 工程科学与技术, 2024, 56(4): 66-77, 12.

Funding

National Natural Science Foundation of China (U1504622, 31671580)

Training Plan for Young Backbone Teachers in Universities of Henan Province (2018GGJS079)

Science and Technology Research Project of Universities in Hebei Province (ZD2020344)

工程科学与技术 · OA · 北大核心 · CSTPCD · ISSN 2096-3246
