A Variable Selection Procedure for K-Means Clustering

  • Kim, Sung-Soo (Department of Information Statistics, Korea National Open University)
  • Received : 2012.02.23
  • Accepted : 2012.04.18
  • Published : 2012.06.30

Abstract

One of the most important problems in cluster analysis is selecting the variables that truly define the cluster structure while eliminating noisy variables that mask it. Brusco and Cradit (2001) presented the VS-KM (variable-selection heuristic for K-means clustering) procedure, which selects true variables for K-means clustering based on the adjusted Rand index. That procedure starts with a fixed number of clusters in K-means and adds variables sequentially based on the adjusted Rand index. This paper presents an updated procedure that combines VS-KM with the automated K-means procedure of Kim (2009). The resulting automated variable selection procedure for K-means clustering recalculates the number of clusters and the initial cluster centers whenever a new variable is added, and adds each variable based on the adjusted Rand index. Simulation results indicate that the proposed procedure is very effective at selecting true variables and at eliminating noisy variables. The R implementation can be obtained from the website "http://faculty.knou.ac.kr/sskim/nvarkm.r and vnvarkm.r".
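
Because the selection criterion throughout is the adjusted Rand index, a minimal sketch of that index may help, following the pair-counting form of Hubert and Arabie (1985). The function name `adjusted_rand_index` and the handling of degenerate partitions are assumptions made for illustration. This is not the author's R implementation, and it does not reproduce the variable selection steps themselves.

```python
from collections import Counter
from math import comb


def adjusted_rand_index(labels_a, labels_b):
    """Adjusted Rand index between two partitions of the same objects.

    Illustrative helper (hypothetical name), not taken from the paper's code.
    """
    if len(labels_a) != len(labels_b):
        raise ValueError("partitions must label the same objects")
    n = len(labels_a)
    total_pairs = comb(n, 2)
    if total_pairs == 0:
        return 1.0  # assumption: fewer than two objects counts as agreement

    # Contingency counts: objects in cluster i of A and cluster j of B.
    cell_counts = Counter(zip(labels_a, labels_b))
    sum_cells = sum(comb(c, 2) for c in cell_counts.values())
    sum_rows = sum(comb(c, 2) for c in Counter(labels_a).values())
    sum_cols = sum(comb(c, 2) for c in Counter(labels_b).values())

    expected = sum_rows * sum_cols / total_pairs  # agreement expected by chance
    max_index = (sum_rows + sum_cols) / 2         # largest attainable agreement
    if max_index == expected:
        return 1.0  # assumption: degenerate case, e.g. one cluster in both
    return (sum_cells - expected) / (max_index - expected)


if __name__ == "__main__":
    # Same grouping with swapped labels: perfect agreement.
    print(adjusted_rand_index([0, 0, 1, 1], [1, 1, 0, 0]))
    # Each cluster of A split evenly across B: worse than chance.
    print(adjusted_rand_index([0, 0, 1, 1], [0, 1, 0, 1]))
```

The index is 1 for identical partitions regardless of how clusters are labeled, is near 0 for partitions that agree only by chance, and can be negative when agreement is below the chance level.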

References

  1. Banfield, J. D. and Raftery, A. E. (1993). Model-based Gaussian and non-Gaussian clustering, Biometrics, 49, 803-821. https://doi.org/10.2307/2532201
  2. Brusco, M. J. and Cradit, J. D. (2001). A variable-selection heuristic for K-means clustering, Psychometrika, 66, 249-270. https://doi.org/10.1007/BF02294838
  3. Carmone, F. J., Kara, A. and Maxwell, S. (1999). HINoV: A new model to improve market segmentation by identifying noisy variables, Journal of Marketing Research, 36, 501-509. https://doi.org/10.2307/3152003
  4. De Sarbo, W. S., Carroll, J. D., Clark, L. A. and Green, P. E. (1984). Synthesized clustering: A method for amalgamating alternative clustering bases with different weighting of variables, Psychometrika, 49, 57-78. https://doi.org/10.1007/BF02294206
  5. De Soete, G. (1986). Optimal variable weighting for ultrametric and additive tree clustering, Quality and Quantity, 20, 169-180. https://doi.org/10.1007/BF00227423
  6. Everitt, B. S., Landau, S. and Leese, M. (2001). Cluster Analysis, Arnold.
  7. Fowlkes, E. B., Gnanadesikan, R. and Kettenring, J. R. (1987). Variable selection in clustering and other contexts, In C. L. Mallows (Ed.), Design, Data and Analysis, 13-34.
  8. Fowlkes, E. B., Gnanadesikan, R. and Kettenring, J. R. (1988). Variable selection in clustering, Journal of Classification, 5, 205-228.
  9. Fowlkes, E. B. and Mallows, C. L. (1983). A method for comparing two hierarchical clusterings (with comments and rejoinder), Journal of the American Statistical Association, 78, 553-584. https://doi.org/10.1080/01621459.1983.10478008
  10. Fraley, C. and Raftery, A. E. (1998). How many clusters? Which clustering methods? Answers via model-based cluster analysis, Computer Journal, 41, 578-588. https://doi.org/10.1093/comjnl/41.8.578
  11. Gnanadesikan, R., Kettenring, J. R. and Tsao, S. L. (1995). Weighting and selection of variables for cluster analysis, Journal of Classification, 7, 271-285.
  12. Hubert, L. and Arabie, P. (1985). Comparing partitions, Journal of Classification, 2, 193-218.
  13. Kim, S. (1999). Interactive visualization of K-means and Hierarchical clusters, The Journal of Data Science and Classification, 3, 13-27.
  14. Kim, S. (2009). Automated K-means clustering and R implementation, The Korean Journal of Applied Statistics, 22, 723-733. https://doi.org/10.5351/KJAS.2009.22.4.723
  15. Kim, S.-G. (2011). Variable selection in normal mixture model based clustering under heteroscedasticity, The Korean Journal of Applied Statistics, 24, 1213-1224. https://doi.org/10.5351/KJAS.2011.24.6.1213
  16. Kim, S., Kwon, S. and Cook, D. (2000). Interactive visualization of hierarchical clusters using MDS and MST, Metrika, 51, 39-51. https://doi.org/10.1007/s001840000043
  17. Milligan, G. W. (1980a). An examination of six types of the effects of error perturbation on fifteen clustering algorithms, Psychometrika, 45, 325-342. https://doi.org/10.1007/BF02293907
  18. Milligan, G. W. (1980b). An algorithm for generating artificial test clusters, Psychometrika, 50, 123-127.
  19. Milligan, G. W. (1989). A validation study of a variable-weighting algorithm for cluster analysis, Journal of Classification, 6, 53-71.
  20. Milligan, G. and Cooper, M. C. (1985). An examination of procedures for determining the number of clusters in a data set, Psychometrika, 50, 159-179. https://doi.org/10.1007/BF02294245
  21. Mojena, R. (1977). Hierarchical grouping methods and stopping rules: An evaluation, The Computer Journal, 20, 259-363.
  22. Mojena, R., Wishart, D. and Andrews, G. B. (1980). Stopping rules for Ward's clustering method, COMPSTAT, 426-432.
  23. Raftery, A. E. and Dean, N. (2006). Variable selection for model-based clustering, Journal of the American Statistical Association, 101, 168-178. https://doi.org/10.1198/016214506000000113
  24. Rand, W. M. (1971). Objective criteria for the evaluation of clustering methods, Journal of the American Statistical Association, 66, 846-850. https://doi.org/10.1080/01621459.1971.10482356
  25. Qiu, W.-L. and Joe, H. (2006). Generation of random clusters with specified degree of separation, Journal of Classification, 23, 315-334.
  26. Steinley, D. and Brusco, M. J. (2008). A new variable weighting and selection procedure for K-means cluster analysis, Multivariate Behavioral Research, 43, 77-108. https://doi.org/10.1080/00273170701836695
  27. Waller, N. G., Underhill, J. M. and Kaiser, H. (1999). A method for generating simulated plasmodes and artificial test clusters with user-defined shape, size, and orientation, Multivariate Behavioral Research, 34, 123-142. https://doi.org/10.1207/S15327906Mb340201
  28. Ward, J. H. (1963). Hierarchical grouping to optimise an objective function, Journal of the American Statistical Association, 58, 236-244. https://doi.org/10.1080/01621459.1963.10500845

Cited by

  1. Variable Selection and Outlier Detection for Automated K-means Clustering vol.22, pp.1, 2015, https://doi.org/10.5351/CSAM.2015.22.1.055
  2. Operational Management System and Characteristics Analysis on the Rural Experience Programs: the Case of Comprehensive Rural Village Development Projects vol.21, pp.2, 2015, https://doi.org/10.7851/ksrp.2015.21.2.103