http://dx.doi.org/10.5351/CSAM.2015.22.1.055

Variable Selection and Outlier Detection for Automated K-means Clustering  

Kim, Sung-Soo (Department of Information Statistics, Korea National Open University)
Publication Information
Communications for Statistical Applications and Methods / v.22, no.1, 2015, pp. 55-67
Abstract
An important problem in cluster analysis is selecting the variables that define cluster structure while eliminating noisy variables that mask it; outlier detection is an equally fundamental task. Here we provide an automated K-means clustering process combined with variable selection and outlier identification. The automated K-means clustering procedure consists of three processes: (i) automatically calculating the number of clusters and the initial cluster centers whenever a new variable is added, (ii) identifying outliers in each cluster based on the variables currently in use, and (iii) selecting the variables that define cluster structure in a forward manner. To select variables, we applied the VS-KM (variable-selection heuristic for K-means clustering) procedure (Brusco and Cradit, 2001). To identify outliers, we used a hybrid approach that combines a clustering-based approach with a distance-based approach. Simulation results indicate that the proposed automated K-means clustering procedure is effective in selecting variables and identifying outliers. The implemented R program can be obtained at http://www.knou.ac.kr/~sskim/SVOKmeans.r.
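The outlier step can be illustrated with a minimal sketch. It is not the author's R program: it assumes cluster labels are already available (for example from a K-means run), uses classical rather than robust mean and covariance estimates, and flags a point when its squared Mahalanobis distance to its own cluster exceeds a chi-square cutoff. The function names, the toy data, and the 0.975 cutoff are all hypothetical choices made for illustration.

```python
import math

def mean_vector(rows):
    """Column-wise mean of a list of equal-length rows."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]

def covariance(rows, mu):
    """Sample covariance matrix (divisor n - 1)."""
    n, p = len(rows), len(mu)
    cov = [[0.0] * p for _ in range(p)]
    for r in rows:
        d = [x - m for x, m in zip(r, mu)]
        for i in range(p):
            for j in range(p):
                cov[i][j] += d[i] * d[j] / (n - 1)
    return cov

def solve(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for c in range(n):
        piv = max(range(c, n), key=lambda r: abs(m[r][c]))
        m[c], m[piv] = m[piv], m[c]
        for r in range(c + 1, n):
            f = m[r][c] / m[c][c]
            for k in range(c, n + 1):
                m[r][k] -= f * m[c][k]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        s = sum(m[r][k] * x[k] for k in range(r + 1, n))
        x[r] = (m[r][n] - s) / m[r][r]
    return x

def flag_outliers(points, labels, cutoff):
    """Indices whose squared Mahalanobis distance to their own cluster exceeds cutoff."""
    flagged = []
    for g in sorted(set(labels)):
        idx = [i for i, lab in enumerate(labels) if lab == g]
        rows = [points[i] for i in idx]
        mu = mean_vector(rows)
        cov = covariance(rows, mu)
        for i in idx:
            d = [x - m for x, m in zip(points[i], mu)]
            d2 = sum(di * si for di, si in zip(d, solve(cov, d)))
            if d2 > cutoff:
                flagged.append(i)
    return flagged

# Toy data (assumption): two grid-shaped clusters plus one far point in cluster 0.
grid = [(i * 0.5, j * 0.5) for i in range(5) for j in range(4)]
points = grid + [(6.0, -5.0)] + [(x + 10.0, y + 10.0) for x, y in grid]
labels = [0] * 21 + [1] * 20

# 0.975 chi-square quantile; the closed form -2*ln(1 - q) holds for 2 variables only.
cutoff = -2.0 * math.log(1.0 - 0.975)
flagged = flag_outliers(points, labels, cutoff)
print(flagged)
```

Because the classical covariance is itself inflated by the outliers it is meant to detect, small clusters can mask them entirely; robust estimates of the kind discussed by Rousseeuw and van Zomeren (1990) are a common remedy.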
Keywords
Automated K-means clustering; variable selection; outlier detection; VS-KM; adjusted Rand index; Mahalanobis distance
Citations & Related Records
Times Cited By KSCI: 2
1 Arai, K. and Barakbah, A. R. (2007). Hierarchical K-means: an algorithm for centroids initialization for K-means, Reports of the Faculty of Science and Engineering, Saga University, 36, 25-31.
2 Banfield, J. D. and Raftery, A. E. (1993). Model-based Gaussian and non-Gaussian clustering, Biometrics, 49, 803-821.
3 Bartkowiak, A. (2005). Robust Mahalanobis distances obtained using the 'Multout' and 'Fast-mcd' methods, Biocybernetics and Biomedical Engineering, 25, 7-21.
4 Brusco, M. J. and Cradit, J. D. (2001). A variable-selection heuristic for K-means clustering, Psychometrika, 66, 249-270.
5 Carmone, F. J., Kara, A. and Maxwell, S. (1999). HINoV: A new model to improve market segmentation by identifying noisy variables, Journal of Marketing Research, 36, 501-509.
6 Everitt, B. S., Landau, S. and Leese, M. (2001). Cluster Analysis, Arnold.
7 Filzmoser, P. and Varmuza, K. (2013). Package Chemometrics. Documentation available at: http://cran.r-project.org/web/packages/chemometrics/index.html.
8 Fowlkes, E. B., Gnanadesikan, R. and Kettenring, J. R. (1988). Variable selection in clustering, Journal of Classification, 5, 205-228.
9 Fowlkes, E. B. and Mallows, C. L. (1983). A method for comparing two hierarchical clusterings (with comments and rejoinder), Journal of the American Statistical Association, 78, 553-584.
10 Fraley, C. and Raftery, A. E. (1998). How many clusters? Which clustering methods? Answers via model-based cluster analysis, Computer Journal, 41, 578-588.
11 Gnanadesikan, R., Kettenring, J. R. and Tsao, S. L. (1995). Weighting and selection of variables for cluster analysis, Journal of Classification, 7, 271-285.
12 Hautamaki, V., Cherednichenko, S., Karkkainen, I., Kinnunen, T. and Franti, P. (2005). Improving K-Means by Outlier Removal, LNCS Springer, Berlin / Heidelberg, May 2005, 978-987.
13 Hawkins, D. (1980). Identification of Outliers, Chapman and Hall, London.
14 Hubert, L. and Arabie, P. (1985). Comparing partitions, Journal of Classification, 2, 193-218.
15 Jayakumar, G. S. and Thomas, B. J. (2013). A new procedure of clustering based on multivariate outlier detection, Journal of Data Science, 11, 69-84.
16 Jiang, M. F., Tseng, S. S. and Su, C. M. (2001). Two-phase clustering process for outliers detection, Pattern Recognition Letters, 22, 691-700.
17 Kim, S. (2009). Automated K-means clustering and R implementation, The Korean Journal of Applied Statistics, 22, 723-733.
18 Kim, S. (2012). A variable selection procedure for K-means clustering, The Korean Journal of Applied Statistics, 25, 471-483.
19 Milligan, G. W. (1980). An examination of six types of the effects of error perturbation on fifteen clustering algorithms, Psychometrika, 45, 325-342.
20 Kriegel, H.-P., Kroger, P. and Zimek, A. (2010). Outlier detection techniques, The 2010 SIAM International Conference on Data Mining, Available from: https://www.siam.org/meetings/sdm10/tutorial3.pdf.
21 Milligan, G. W. (1985). An algorithm for generating artificial test clusters, Psychometrika, 50, 123-127.
22 Milligan, G. W. (1989). A validation study of a variable-weighting algorithm for cluster analysis, Journal of Classification, 6, 53-71.
23 Milligan, G. and Cooper, M. C. (1985). An examination of procedures for determining the number of clusters in a data set, Psychometrika, 50, 159-179.
24 Mojena, R. (1977). Hierarchical grouping method and stopping rules: An evaluation, The Computer Journal, 20, 359-363.
25 Mojena, R., Wishart, D. and Andrews, G. B. (1980). Stopping rules for Ward's clustering method, COMPSTAT, 426-432.
26 Pachgade, S. D. and Dhande, S. S. (2012). Outlier detection over data set using cluster-based and distance-based approach, International Journal of Advanced Research in Computer Science and Software Engineering, 2, 12-16.
27 Pamula, R., Deka, J. K. and Nandi, S. (2011). An outlier detection method based on clustering, Second International Conference on Emerging Applications of Information Technology, 253-256.
28 Qiu, W.-L. and Joe, H. (2006a). Generation of random clusters with specified degree of separation, Journal of Classification, 23, 315-334.
29 Rand, W. M. (1971). Objective criteria for the evaluation of clustering methods, Journal of the American Statistical Association, 66, 846-850.
30 Qiu, W.-L. and Joe, H. (2006b). Separation index and partial membership for clustering, Computational Statistics and Data Analysis, 50, 585-603.
31 Rocke, D. M. and Woodruff, D. L. (1996). Identification of outliers in multivariate data, Journal of the American Statistical Association, 91, 1047-1061.
32 Rousseeuw, P. J. and Leroy, A. M. (1987). Robust Regression and Outlier Detection, John Wiley and Sons, New York.
33 Rousseeuw, P. J. and van Zomeren, B. C. (1990). Unmasking multivariate outliers and leverage points, Journal of the American Statistical Association, 85, 633-651.
34 Tibshirani, R., Walther, G. and Hastie, T. (2001). Estimating the Number of Clusters in a Dataset via the Gap Statistic, Technical report, Dept of Biostatistics, Stanford University, Available from: http://www-stat.stanford.edu/~tibs/research.html.
35 Ward, J. H. (1963). Hierarchical grouping to optimize an objective function, Journal of the American Statistical Association, 58, 236-244.
36 Wehrens, R., Buydens, L., Fraley, C. and Raftery, A. (2004). Model-based clustering for image segmentation and large datasets via sampling, Journal of Classification, 21, 231-253.