[KSCI] Korea Science Citation Index Service

http://dx.doi.org/10.5351/CSAM.2015.22.1.055

Variable Selection and Outlier Detection for Automated K-means Clustering

Kim, Sung-Soo (Department of Information Statistics, Korea National Open University)

Publication Information

Communications for Statistical Applications and Methods / v.22, no.1, 2015, pp. 55-67

Abstract

An important problem in cluster analysis is selecting the variables that define cluster structure while eliminating noisy variables that mask it; outlier detection is a further fundamental task. Here we provide an automated K-means clustering process combined with variable selection and outlier identification. The automated K-means clustering procedure consists of three processes: (i) automatically calculating the number of clusters and the initial cluster centers whenever a new variable is added, (ii) identifying outliers for each cluster depending on the variables used, and (iii) selecting the variables that define cluster structure in a forward manner. To select variables, we applied the VS-KM (variable-selection heuristic for K-means clustering) procedure (Brusco and Cradit, 2001). To identify outliers, we used a hybrid approach combining a clustering-based approach and a distance-based approach. Simulation results indicate that the proposed automated K-means clustering procedure is effective at selecting variables and identifying outliers. The implemented R program can be obtained at http://www.knou.ac.kr/~sskim/SVOKmeans.r.
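As a rough illustration of the hybrid outlier step described in the abstract, the Python sketch below clusters two-dimensional points with plain K-means and then flags, within each cluster, the points whose squared Mahalanobis distance to the cluster mean exceeds a chi-square cutoff. It is not the author's R program. The function names (`kmeans`, `flag_outliers`), the fixed number of clusters, the hand-picked initial centers, the classical (non-robust) covariance estimate, and the 0.975 cutoff are all assumptions made for illustration. The automatic choice of cluster number and the VS-KM variable selection are not shown.

```python
import math
import random


def sq_dist(p, q):
    """Squared Euclidean distance between two 2-D points."""
    return (p[0] - q[0]) ** 2 + (p[1] - q[1]) ** 2


def kmeans(points, centers, n_iter=20):
    """Plain Lloyd's K-means on 2-D points; returns (labels, centers)."""
    centers = list(centers)
    labels = [0] * len(points)
    for _ in range(n_iter):
        # Assignment step: each point goes to its nearest center.
        labels = [min(range(len(centers)), key=lambda k: sq_dist(p, centers[k]))
                  for p in points]
        # Update step: each center becomes the mean of its members.
        for k in range(len(centers)):
            members = [p for p, lab in zip(points, labels) if lab == k]
            if members:  # keep the old center if a cluster empties out
                centers[k] = (sum(m[0] for m in members) / len(members),
                              sum(m[1] for m in members) / len(members))
    return labels, centers


def flag_outliers(points, labels, n_clusters, p=0.975):
    """Flag points far from their own cluster in Mahalanobis distance.

    Assumption: classical mean and covariance per cluster, 2-D data only.
    For 2 degrees of freedom the chi-square quantile is -2 * ln(1 - p).
    """
    cutoff = -2.0 * math.log(1.0 - p)
    flagged = []
    for k in range(n_clusters):
        idx = [i for i, lab in enumerate(labels) if lab == k]
        n = len(idx)
        if n < 3:
            continue  # too few points to estimate a covariance matrix
        mx = sum(points[i][0] for i in idx) / n
        my = sum(points[i][1] for i in idx) / n
        sxx = sum((points[i][0] - mx) ** 2 for i in idx) / (n - 1)
        syy = sum((points[i][1] - my) ** 2 for i in idx) / (n - 1)
        sxy = sum((points[i][0] - mx) * (points[i][1] - my) for i in idx) / (n - 1)
        det = sxx * syy - sxy * sxy
        for i in idx:
            dx, dy = points[i][0] - mx, points[i][1] - my
            # Quadratic form with the inverse of the 2x2 covariance matrix.
            d2 = (syy * dx * dx - 2.0 * sxy * dx * dy + sxx * dy * dy) / det
            if d2 > cutoff:
                flagged.append(i)
    return sorted(flagged)


# Hypothetical demo data: two well-separated clusters plus one planted outlier.
rng = random.Random(0)
points = [(rng.gauss(0, 1), rng.gauss(0, 1)) for _ in range(40)]
points += [(rng.gauss(10, 1), rng.gauss(10, 1)) for _ in range(40)]
outlier_index = len(points)
points.append((5.0, -5.0))

labels, centers = kmeans(points, [points[0], points[40]])
flagged = flag_outliers(points, labels, n_clusters=2)
print("flagged indices:", flagged)
```

Because the cutoff is a 97.5% quantile, a few regular points may be flagged alongside the planted one. A robust covariance estimate would reduce the masking effect that a classical estimate suffers from when a cluster contains several outliers.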

Keywords

Automated K-means clustering; variable selection; outlier detection; VS-KM; adjusted Rand index; Mahalanobis distance

Citations & Related Records

Times Cited By KSCI: 2

Reference

1	Arai, K. and Barakbah, A. R. (2007). Hierarchical K-means: an algorithm for centroids initialization for K-means, Reports of the Faculty of Science and Engineering, Saga University, 36, 25-31.
2	Banfield, J. D. and Raftery, A. E. (1993). Model-based Gaussian and non-Gaussian clustering, Biometrics, 49, 803-821.
3	Bartkowiak, A. (2005). Robust Mahalanobis distances obtained using the 'Multout' and 'Fast-mcd' methods, Biocybernetics and Biomedical Engineering, 25, 7-21.
4	Brusco, M. J. and Cradit, J. D. (2001). A variable-selection heuristic for K-means clustering, Psychometrika, 66, 249-270.
5	Carmone, F. J., Kara, A. and Maxwell, S. (1999). HINoV: A new model to improve market segmentation by identifying noisy variables, Journal of Marketing Research, 36, 501-509.
6	Everitt, B. S., Landau, S. and Leese, M. (2001). Cluster Analysis, Arnold.
7	Filzmoser, P. and Varmuza, K. (2013). Package Chemometrics. Documentation available at: http://cran.r-project.org/web/packages/chemometrics/index.html.
8	Fowlkes, E. B., Gnanadesikan, R. and Kettenring, J. R. (1988). Variable selection in clustering, Journal of Classification, 5, 205-228.
9	Fowlkes, E. B. and Mallows, C. L. (1983). A method for comparing two hierarchical clusterings (with comments and rejoinder), Journal of the American Statistical Association, 78, 553-584.
10	Fraley, C. and Raftery, A. E. (1998). How many clusters? Which clustering methods? Answers via model-based cluster analysis, Computer Journal, 41, 578-588.
11	Gnanadesikan, R., Kettenring, J. R. and Tsao, S. L. (1995). Weighting and selection of variables for cluster analysis, Journal of Classification, 7, 271-285.
12	Hautamaki, V., Cherednichenko, S., Karkkainen, I., Kinnunen, T. and Franti, P. (2005). Improving K-Means by Outlier Removal, LNCS Springer, Berlin / Heidelberg, May 2005, 978-987.
13	Hawkins, D. (1980). Identification of Outliers, Chapman and Hall, London.
14	Hubert, L. and Arabie, P. (1985). Comparing partitions, Journal of Classification, 2, 193-218.
15	Jayakumar, G. S. and Thomas, B. J. (2013). A new procedure of clustering based on multivariate outlier detection, Journal of Data Science, 11, 69-84.
16	Jiang, M. F., Tseng, S. S. and Su, C. M. (2001). Two-phase clustering process for outliers detection, Pattern Recognition Letters, 22, 691-700.
17	Kim, S. (2009). Automated K-means clustering and R implementation, The Korean Journal of Applied Statistics, 22, 723-733.
18	Kim, S. (2012). A variable selection procedure for K-means clustering, The Korean Journal of Applied Statistics, 25, 471-483.
19	Milligan, G. W. (1980). An examination of six types of the effects of error perturbation on fifteen clustering algorithms, Psychometrika, 45, 325-342.
20	Kriegel, H.-P., Kroger, P. and Zimek, A. (2010). Outlier detection techniques, The 2010 SIAM International Conference on Data Mining, Available from: https://www.siam.org/meetings/sdm10/tutorial3.pdf.
21	Milligan, G. W. (1985). An algorithm for generating artificial test clusters, Psychometrika, 50, 123-127.
22	Milligan, G. W. (1989). A validation study of a variable-weighting algorithm for cluster analysis, Journal of Classification, 6, 53-71.
23	Milligan, G. and Cooper, M. C. (1985). An examination of procedures for determining the number of clusters in a data set, Psychometrika, 50, 159-179.
24	Mojena, R. (1977). Hierarchical grouping method and stopping rules: An evaluation, The Computer Journal, 20, 359-363.
25	Mojena, R., Wishart, D. and Andrews, G. B. (1980). Stopping rules for Ward's clustering method, COMPSTAT, 426-432.
26	Pachgade, S. D. and Dhande, S. S. (2012). Outlier detection over data set using cluster-based and distance-based approach, International Journal of Advanced Research in Computer Science and Software Engineering, 2, 12-16.
27	Pamula, R., Deka, J. K. and Nandi, S. (2011). An outlier detection method based on clustering, Second International Conference on Emerging Applications of Information Technology, 253-256.
28	Qiu, W.-L. and Joe, H. (2006a). Generation of random clusters with specified degree of separation, Journal of Classification, 23, 315-334.
29	Rand, W. M. (1971). Objective criteria for the evaluation of clustering methods, Journal of the American Statistical Association, 66, 846-850.
30	Qiu, W.-L. and Joe, H. (2006b). Separation index and partial membership for clustering, Computational Statistics and Data Analysis, 50, 585-603.
31	Rocke, D. M. and Woodruff, D. L. (1996). Identification of outliers in multivariate data, Journal of the American Statistical Association, 91, 1047-1061.
32	Rousseeuw, P. J. and Leroy, A. M. (1987). Robust Regression and Outlier Detection, John Wiley and Sons, New York.
33	Rousseeuw, P. J. and van Zomeren, B. C. (1990). Unmasking multivariate outliers and leverage points, Journal of the American Statistical Association, 85, 633-651.
34	Tibshirani, R., Walther, G. and Hastie, T. (2001). Estimating the Number of Clusters in a Dataset via the Gap Statistic, Technical report, Dept of Biostatistics, Stanford University, Available from: http://www-stat.stanford.edu/-tibs/research.html.
35	Ward, J. H. (1963). Hierarchical grouping to optimize an objective function, Journal of the American Statistical Association, 58, 236-244.
36	Wehrens, R., Buydens, L., Fraley, C. and Raftery, A. (2004). Model-based clustering for image segmentation and large datasets via sampling, Journal of Classification, 21, 231-253.

Cited By KSCI

1	Gan, G. (2017). k-means clustering with outlier removal, Pattern Recognition Letters, 90, 8.
2	(2019). Joint selection of variables and clusters: recovering the underlying structure of marketing data, Journal of Marketing Analytics (2050-3326).