A Variable Selection Procedure for K-Means Clustering

Kim, Sung-Soo

doi:10.5351/KJAS.2012.25.3.471

The Korean Journal of Applied Statistics (응용통계연구)

Volume 25, Issue 3, Pages 471-483, 2012

pISSN 1225-066X, eISSN 2383-5818

The Korean Statistical Society (한국통계학회)

Kim, Sung-Soo (Department of Information Statistics, Korea National Open University)

Received : 2012.02.23
Accepted : 2012.04.18
Published : 2012.06.30

https://doi.org/10.5351/KJAS.2012.25.3.471

Abstract

One of the most important problems in cluster analysis is selecting the variables that truly define the cluster structure while eliminating noisy variables that mask it. Brusco and Cradit (2001) presented the VS-KM (variable-selection heuristic for K-means clustering) procedure, which selects true variables for K-means clustering based on the adjusted Rand index. That procedure starts with a fixed number of clusters in K-means and adds variables sequentially based on the adjusted Rand index. This paper presents an updated procedure that combines VS-KM with the automated K-means procedure of Kim (2009). The resulting automated variable selection procedure recalculates the number of clusters and the initial cluster centers whenever a new variable is added, and adds each variable based on the adjusted Rand index. Simulation results indicate that the proposed procedure is very effective at selecting true variables and eliminating noisy variables. The program, implemented in R, can be obtained from the website "http://faculty.knou.ac.kr/sskim/nvarkm.r and vnvarkm.r".
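
The adjusted Rand index that the abstract relies on (Hubert and Arabie, 1985) measures agreement between two partitions of the same objects, corrected for chance. The following is a minimal sketch in Python, not the author's R program; the function name `adjusted_rand_index` is hypothetical, and the abstract does not spell out how the index enters the selection rule, so only the index itself is shown.

```python
from collections import Counter
from math import comb


def adjusted_rand_index(labels_a, labels_b):
    """Adjusted Rand index (Hubert and Arabie, 1985) between two partitions."""
    if len(labels_a) != len(labels_b):
        raise ValueError("label sequences must have equal length")
    n = len(labels_a)
    total_pairs = comb(n, 2)
    if total_pairs == 0:
        # Fewer than two objects: the partitions agree trivially.
        return 1.0
    # Number of object pairs placed together in both partitions
    # (cells of the contingency table) and in each partition alone (margins).
    sum_cells = sum(comb(c, 2) for c in Counter(zip(labels_a, labels_b)).values())
    sum_rows = sum(comb(c, 2) for c in Counter(labels_a).values())
    sum_cols = sum(comb(c, 2) for c in Counter(labels_b).values())
    expected = sum_rows * sum_cols / total_pairs  # value expected by chance
    max_index = 0.5 * (sum_rows + sum_cols)       # upper bound of the index
    if max_index == expected:
        # Degenerate case, e.g. both partitions put every object in one cluster.
        return 1.0
    return (sum_cells - expected) / (max_index - expected)


# Identical partitions up to relabeling give perfect agreement.
same = adjusted_rand_index([0, 0, 1, 1], [1, 1, 0, 0])
# Partitions that cut across each other score below the chance level of 0.
crossed = adjusted_rand_index([0, 0, 1, 1], [0, 1, 0, 1])
```

In a variable selection setting, an index of this kind would typically be used to compare the K-means partition obtained from one subset of variables with the partition obtained from another; that usage is an assumption here rather than a description of the paper's procedure.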

References

Banfield, J. D. and Raftery, A. E. (1993). Model-based Gaussian and non-Gaussian clustering, Biometrics, 49, 803-821. https://doi.org/10.2307/2532201
Brusco, M. J. and Cradit, J. D. (2001). A variable-selection heuristic for K-means clustering, Psychometrika, 66, 249-270. https://doi.org/10.1007/BF02294838
Carmone, F. J., Kara, A. and Maxwell, S. (1999). HINoV: A new model to improve market segmentation by identifying noisy variables, Journal of Marketing Research, 36, 501-509. https://doi.org/10.2307/3152003
De Sarbo, W. S., Carroll, J. D., Clark, L. A. and Green, P. E. (1984). Synthesized clustering: A method for amalgamating alternative clustering bases with different weighting of variables, Psychometrika, 49, 57-78. https://doi.org/10.1007/BF02294206
De Soete, G. (1986). Optimal variable weighting for ultrametric and additive tree clustering, Quality and Quantity, 20, 169-180. https://doi.org/10.1007/BF00227423
Everitt, B. S., Landau, S. and Leese, M. (2001). Cluster Analysis, Arnold.
Fowlkes, E. B., Gnanadesikan, R. and Kettenring, J. R. (1987). Variable selection in clustering and other contexts, In C. L. Mallows (Ed.), Design, Data and Analysis, 13-34.
Fowlkes, E. B., Gnanadesikan, R. and Kettenring, J. R. (1988). Variable selection in clustering, Journal of Classification, 5, 205-228.
Fowlkes, E. B. and Mallows, C. L. (1983). A method for comparing two hierarchical clusterings (with comments and rejoinder), Journal of the American Statistical Association, 78, 553-584. https://doi.org/10.1080/01621459.1983.10478008
Fraley, C. and Raftery, A. E. (1998). How many clusters? Which clustering methods? Answers via model-based cluster analysis, Computer Journal, 41, 578-588. https://doi.org/10.1093/comjnl/41.8.578
Gnanadesikan, R., Kettenring, J. R. and Tsao, S. L. (1995). Weighting and selection of variables for cluster analysis, Journal of Classification, 7, 271-285.
Hubert, L. and Arabie, P. (1985). Comparing partitions, Journal of Classification, 2, 193-218.
Kim, S. (1999). Interactive visualization of K-means and Hierarchical clusters, The Journal of Data Science and Classification, 3, 13-27.
Kim, S. (2009). Automated K-means clustering and R implementation, The Korean Journal of Applied Statistics, 22, 723-733. https://doi.org/10.5351/KJAS.2009.22.4.723
Kim, S.-G. (2011). Variable selection in normal mixture model based clustering under heteroscedasticity, The Korean Journal of Applied Statistics, 24, 1213-1224. https://doi.org/10.5351/KJAS.2011.24.6.1213
Kim, S., Kwon, S. and Cook, D. (2000). Interactive visualization of hierarchical clusters using MDS and MST, Metrika, 51, 39-51. https://doi.org/10.1007/s001840000043
Milligan, G. W. (1980a). An examination of six types of the effects of error perturbation on fifteen clustering algorithms, Psychometrika, 45, 325-342. https://doi.org/10.1007/BF02293907
Milligan, G. W. (1980b). An algorithm for generating artificial test clusters, Psychometrika, 50, 123-127.
Milligan, G. W. (1989). A validation study of a variable-weighting algorithm for cluster analysis, Journal of Classification, 6, 53-71.
Milligan, G. and Cooper, M. C. (1985). An examination of procedures for determining the number of clusters in a data set, Psychometrika, 50, 159-179. https://doi.org/10.1007/BF02294245
Mojena, R. (1977). Hierarchical grouping methods and stopping rules: An evaluation, The Computer Journal, 20, 359-363.
Mojena, R., Wishart, D. and Andrews, G. B. (1980). Stopping rules for Ward's clustering method, COMPSTAT, 426-432.
Raftery, A. E. and Dean, N. (2006). Variable selection for model-based clustering, Journal of the American Statistical Association, 101, 168-178. https://doi.org/10.1198/016214506000000113
Rand, W. M. (1971). Objective criteria for the evaluation of clustering methods, Journal of the American Statistical Association, 66, 846-850. https://doi.org/10.1080/01621459.1971.10482356
Qiu, W.-L. and Joe, H. (2006). Generation of random clusters with specified degree of separation, Journal of Classification, 23, 315-334.
Steinley, D. and Brusco, M. J. (2008). A new variable weighting and selection procedure for K-means cluster analysis, Multivariate Behavioral Research, 43, 77-108. https://doi.org/10.1080/00273170701836695
Waller, N. G., Underhill, J. M. and Kaiser, H. (1999). A method for generating simulated plasmodes and artificial test clusters with user-defined shape, size, and orientation, Multivariate Behavioral Research, 34, 123-142. https://doi.org/10.1207/S15327906Mb340201
Ward, J. H. (1963). Hierarchical grouping to optimize an objective function, Journal of the American Statistical Association, 58, 236-244. https://doi.org/10.1080/01621459.1963.10500845

Cited by

Variable Selection and Outlier Detection for Automated K-means Clustering vol.22, pp.1, 2015, https://doi.org/10.5351/CSAM.2015.22.1.055
Operational Management System and Characteristics Analysis on the Rural Experience Programs: the Case of Comprehensive Rural Village Development Projects vol.21, pp.2, 2015, https://doi.org/10.7851/ksrp.2015.21.2.103
