Variable Selection and Outlier Detection for Automated K-means Clustering

  • Kim, Sung-Soo (Department of Information Statistics, Korea National Open University)
  • Received : 2014.10.28
  • Accepted : 2015.01.13
  • Published : 2015.01.31

Abstract

An important problem in cluster analysis is selecting the variables that define cluster structure while eliminating noisy variables that mask it; outlier detection is likewise a fundamental task in cluster analysis. Here we provide an automated K-means clustering process combined with variable selection and outlier identification. The automated K-means clustering procedure consists of three processes: (i) automatically calculating the number of clusters and the initial cluster centers whenever a new variable is added, (ii) identifying outliers for each cluster based on the variables currently in use, and (iii) selecting the variables that define cluster structure in a forward manner. To select variables, we applied the VS-KM (variable-selection heuristic for K-means clustering) procedure (Brusco and Cradit, 2001). To identify outliers, we used a hybrid approach combining a clustering-based and a distance-based approach. Simulation results indicate that the proposed automated K-means clustering procedure is effective in selecting variables and identifying outliers. The implemented R program can be obtained at http://www.knou.ac.kr/~sskim/SVOKmeans.r.
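For readers who want a feel for the overall flow before downloading SVOKmeans.r, the following R sketch is illustrative only and is not the author's implementation. It seeds kmeans() from Ward's hierarchical clustering, flags outliers by their distance to the assigned cluster center, and adds variables in a forward manner. Two simplifications are assumed: the number of clusters k is supplied by the user rather than recalculated automatically at each step, and the forward search is scored by the ratio of between-cluster to total sum of squares instead of the adjusted-Rand-based VS-KM screening of Brusco and Cradit (2001); the 2.5-standard-deviation outlier cutoff is likewise an illustrative choice.

## Illustrative sketch only; not the SVOKmeans.r program distributed by the author.
seeded_kmeans <- function(x, k) {
  # Seed kmeans() with Ward cluster means so the result does not depend on a random start
  hc  <- hclust(dist(x), method = "ward.D2")
  grp <- cutree(hc, k = k)
  centers <- apply(x, 2, function(v) tapply(v, grp, mean))
  if (is.null(dim(centers))) centers <- matrix(centers, nrow = k)
  kmeans(x, centers = centers)
}

flag_outliers <- function(x, km, c_sd = 2.5) {
  # Distance of each observation to its own cluster center; points beyond
  # mean + c_sd * sd within their cluster are flagged (illustrative cutoff)
  d   <- sqrt(rowSums((x - km$centers[km$cluster, , drop = FALSE])^2))
  thr <- tapply(d, km$cluster, function(z) mean(z) + c_sd * sd(z))
  d > thr[km$cluster]
}

forward_select <- function(x, k, tol = 0.01) {
  # Forward variable selection: add the variable that most improves the
  # between-to-total sum-of-squares ratio; stop when no candidate helps
  x <- scale(as.matrix(x))
  selected   <- integer(0)
  remaining  <- seq_len(ncol(x))
  best_score <- -Inf
  repeat {
    scores <- sapply(remaining, function(j) {
      km <- seeded_kmeans(x[, c(selected, j), drop = FALSE], k)
      km$betweenss / km$totss
    })
    if (max(scores) <= best_score + tol) break
    best_score <- max(scores)
    pick       <- remaining[which.max(scores)]
    selected   <- c(selected, pick)
    remaining  <- setdiff(remaining, pick)
    if (length(remaining) == 0) break
  }
  km <- seeded_kmeans(x[, selected, drop = FALSE], k)
  list(variables = colnames(x)[selected],
       cluster   = km$cluster,
       outlier   = flag_outliers(x[, selected, drop = FALSE], km))
}

## Example with Fisher's iris data, taking k = 3 as given
res <- forward_select(iris[, 1:4], k = 3)
res$variables                      # variables retained by the forward search
table(res$cluster, iris$Species)   # cluster vs. species cross-tabulation
which(res$outlier)                 # observations flagged as outliers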

References

  1. Arai, K. and Barakbah, A. R. (2007). Hierarchical K-means: an algorithm for centroids initialization for K-means, Reports of the Faculty of Science and Engineering, Saga University, 36, 25-31.
  2. Banfield, J. D. and Raftery, A. E. (1993). Model-based Gaussian and non-Gaussian clustering, Biometrics, 49, 803-821. https://doi.org/10.2307/2532201
  3. Bartkowiak, A. (2005). Robust Mahalanobis distances obtained using the 'Multout' and 'Fast-mcd' methods, Biocybernetics and Biomedical Engineering, 25, 7-21.
  4. Brusco, M. J. and Cradit, J. D. (2001). A variable-selection heuristic for K-means clustering, Psychometrika, 66, 249-270. https://doi.org/10.1007/BF02294838
  5. Carmone, F. J., Kara, A. and Maxwell, S. (1999). HINoV: A new model to improve market segmentation by identifying noisy variables, Journal of Marketing Research, 36, 501-509. https://doi.org/10.2307/3152003
  6. Everitt, B. S., Landau, S. and Leese, M. (2001). Cluster Analysis, Arnold.
  7. Filzmoser, P. and Varmuza, K. (2013). Package chemometrics. Documentation available at: http://cran.r-project.org/web/packages/chemometrics/index.html.
  8. Fowlkes, E. B., Gnanadesikan, R. and Kettenring, J. R. (1988). Variable selection in clustering, Journal of Classification, 5, 205-228. https://doi.org/10.1007/BF01897164
  9. Fowlkes, E. B. and Mallows, C. L. (1983). A method for comparing two hierarchical clusterings (with comments and rejoinder), Journal of the American Statistical Association, 78, 553-584. https://doi.org/10.1080/01621459.1983.10478008
  10. Fraley, C. and Raftery, A. E. (1998). How many clusters? Which clustering methods? Answers via model-based cluster analysis, Computer Journal, 41, 578-588. https://doi.org/10.1093/comjnl/41.8.578
  11. Gnanadesikan, R., Kettenring, J. R. and Tsao, S. L. (1995). Weighting and selection of variables for cluster analysis, Journal of Classification, 7, 271-285.
  12. Hautamaki, V., Cherednichenko, S., Karkkainen, I., Kinnunen, T. and Franti, P. (2005). Improving K-means by outlier removal, LNCS, Springer, Berlin/Heidelberg, 978-987.
  13. Hawkins, D. (1980). Identification of Outliers, Chapman and Hall, London.
  14. Hubert, L. and Arabie, P. (1985). Comparing partitions, Journal of Classification, 2, 193-218.
  15. Jayakumar, G. S. and Thomas, B. J. (2013). A new procedure of clustering based on multivariate outlier detection, Journal of Data Science, 11, 69-84.
  16. Jiang, M. F., Tseng, S. S. and Su, C. M. (2001). Two-phase clustering process for outliers detection, Pattern Recognition Letters, 22, 691-700. https://doi.org/10.1016/S0167-8655(00)00131-8
  17. Kim, S. (2009). Automated K-means clustering and R implementation, The Korean Journal of Applied Statistics, 22, 723-733. https://doi.org/10.5351/KJAS.2009.22.4.723
  18. Kim, S. (2012). A variable selection procedure for K-means clustering, The Korean Journal of Applied Statistics, 25, 471-483. https://doi.org/10.5351/KJAS.2012.25.3.471
  19. Kriegel, H.-P., Kroger, P. and Zimek, A. (2010). Outlier detection techniques, The 2010 SIAM International Conference on Data Mining, Available from: https://www.siam.org/meetings/sdm10/tutorial3.pdf.
  20. Milligan, G. W. (1980). An examination of the effect of six types of error perturbation on fifteen clustering algorithms, Psychometrika, 45, 325-342. https://doi.org/10.1007/BF02293907
  21. Milligan, G. W. (1985). An algorithm for generating artificial test clusters, Psychometrika, 50, 123-127. https://doi.org/10.1007/BF02294153
  22. Milligan, G. W. (1989). A validation study of a variable-weighting algorithm for cluster analysis, Journal of Classification, 6, 53-71. https://doi.org/10.1007/BF01908588
  23. Milligan, G. and Cooper, M. C. (1985). An examination of procedures for determining the number of clusters in a data set, Psychometrika, 50, 159-179. https://doi.org/10.1007/BF02294245
  24. Mojena, R. (1977). Hierarchical grouping method and stopping rules: An evaluation, The Computer Journal, 20, 359-363. https://doi.org/10.1093/comjnl/20.4.359
  25. Mojena, R., Wishart, D. and Andrews, G. B. (1980). Stopping rules for Ward's clustering method, COMPSTAT, 426-432.
  26. Pachgade, S. D. and Dhande, S. S. (2012). Outlier detection over data set using cluster-based and distance-based approach, International Journal of Advanced Research in Computer Science and Software Engineering, 2, 12-16.
  27. Pamula, R., Deka, J. K. and Nandi, S. (2011). An outlier detection method based on clustering, Second International Conference on Emerging Applications of Information Technology, 253-256.
  28. Qiu, W.-L. and Joe, H. (2006a). Generation of random clusters with specified degree of separation, Journal of Classification, 23, 315-334. https://doi.org/10.1007/s00357-006-0018-y
  29. Qiu, W.-L. and Joe, H. (2006b). Separation index and partial membership for clustering, Computational Statistics and Data Analysis, 50, 585-603. https://doi.org/10.1016/j.csda.2004.09.009
  30. Rand, W. M. (1971). Objective criteria for the evaluation of clustering methods, Journal of the American Statistical Association, 66, 846-850. https://doi.org/10.1080/01621459.1971.10482356
  31. Rocke, D. M. and Woodruff, D. L. (1996). Identification of outliers in multivariate data, Journal of the American Statistical Association, 91, 1047-1061. https://doi.org/10.1080/01621459.1996.10476975
  32. Rousseeuw, P. J. and Leroy, A. M. (1987). Robust Regression and Outlier Detection, John Wiley and Sons, New York.
  33. Rousseeuw, P. J. and van Zomeren, B. C. (1990). Unmasking multivariate outliers and leverage points, Journal of the American Statistical Association, 85, 633-651. https://doi.org/10.1080/01621459.1990.10474920
  34. Tibshirani, R., Walther, G. and Hastie, T. (2001). Estimating the Number of Clusters in a Dataset via the Gap Statistic, Technical report, Dept. of Biostatistics, Stanford University, Available from: http://www-stat.stanford.edu/~tibs/research.html.
  35. Ward, J. H. (1963). Hierarchical grouping to optimize an objective function, Journal of the American Statistical Association, 58, 236-244. https://doi.org/10.1080/01621459.1963.10500845
  36. Wehrens, R., Buydens, L., Fraley, C. and Raftery, A. (2004). Model-based clustering for image segmentation and large datasets via sampling, Journal of Classification, 21, 231-253. https://doi.org/10.1007/s00357-004-0018-8

Cited by

  1. k-means clustering with outlier removal, vol. 90, 2017, https://doi.org/10.1016/j.patrec.2017.03.008
  2. Joint selection of variables and clusters: recovering the underlying structure of marketing data pp.2050-3326, 2019, https://doi.org/10.1057/s41270-018-0045-7