Cluster Analysis with Balancing Weight on Mixed-type Data

Chae, Seong-San; Kim, Jong-Min; Yang, Wan-Youn

doi:10.5351/CKSS.2006.13.3.719

Communications for Statistical Applications and Methods

Volume 13 Issue 3 / Pages 719-732 / 2006 / 2287-7843 (pISSN) / 2383-4757 (eISSN)

The Korean Statistical Society (한국통계학회)

Chae, Seong-San (Department of Applied Statistics, Daejeon University) ;
Kim, Jong-Min (Division of Science and Mathematics, University of Minnesota) ;
Yang, Wan-Youn (Department of Applied Statistics, Kyungwon University)

Published : 2006.12.31

https://doi.org/10.5351/CKSS.2006.13.3.719

Abstract

We present a set of clustering algorithms that apply a balancing weight in the formulation of distance, extending them to data with mixed numeric and multiple binary attributes. The simple matching and Jaccard coefficients are used to measure similarity between objects on the binary attributes, and each similarity is converted to a dissimilarity between the i-th and j-th objects. We demonstrate the performance of the clustering algorithms with a balancing weight under the different similarity measures. Our experiments show that clustering algorithms with a properly chosen weight achieve a competitive recovery level when data with mixed numeric and multiple binary attributes are clustered.
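The abstract names the two binary similarity coefficients but does not state how the balancing weight enters the distance. The following Python sketch illustrates one possible reading only: the function names, the Euclidean distance on the numeric part, the conversion d = 1 - s, and the convex combination controlled by `weight` are all assumptions made for illustration, not the authors' formulation.

```python
from math import sqrt


def simple_matching(a, b):
    """Share of binary attributes on which a and b agree (0/0 and 1/1 both count)."""
    matches = sum(1 for x, y in zip(a, b) if x == y)
    return matches / len(a)


def jaccard(a, b):
    """Share of 1/1 matches among attributes where at least one object has a 1."""
    both = sum(1 for x, y in zip(a, b) if x == 1 and y == 1)
    either = sum(1 for x, y in zip(a, b) if x == 1 or y == 1)
    # Assumed convention: two all-zero vectors are treated as identical.
    return both / either if either else 1.0


def mixed_dissimilarity(num_i, num_j, bin_i, bin_j,
                        weight=0.5, similarity=simple_matching):
    """Hypothetical weighted dissimilarity between objects i and j.

    Assumption: numeric part is Euclidean distance, binary part is
    1 - similarity, and the two are combined as a convex combination.
    """
    d_num = sqrt(sum((x - y) ** 2 for x, y in zip(num_i, num_j)))
    d_bin = 1.0 - similarity(bin_i, bin_j)
    return weight * d_num + (1.0 - weight) * d_bin


# Two objects, each with two numeric and four binary attributes.
num_i, bin_i = [0.0, 0.0], [1, 0, 0, 1]
num_j, bin_j = [3.0, 4.0], [1, 0, 1, 1]

d_sm = mixed_dissimilarity(num_i, num_j, bin_i, bin_j)
d_jc = mixed_dissimilarity(num_i, num_j, bin_i, bin_j, similarity=jaccard)
```

For this pair the simple matching coefficient is 3/4 and the Jaccard coefficient is 2/3, because the shared 0 on the second attribute counts as agreement only for simple matching. With an unscaled numeric distance of 5, the numeric part dominates both results, which is the kind of imbalance a balancing weight is meant to correct.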

References

Afifi, A.A. and Clark, V. (1990). Computer-Aided Multivariate Analysis. Van Nostrand Reinhold Company, New York
Asparoukhov, O.K. and Krzanowski, W.J. (2001). A comparison of discriminant procedures for binary variables. Computational Statistics & Data Analysis, Vol. 38, 139-160 https://doi.org/10.1016/S0167-9473(01)00032-9
Chae, S.S., DuBien, J.L. and Warde, W.D. (2006). A method of predicting the number of clusters using Rand's statistic. Computational Statistics & Data Analysis, Vol. 50, 3531-3546 https://doi.org/10.1016/j.csda.2005.08.006
Chae, S.S. and Kim, J.I. (2005). Cluster analysis using principal coordinates for binary data. The Korean Communications in Statistics, Vol. 12, 683-696 https://doi.org/10.5351/CKSS.2005.12.3.683
DuBien, J.L. and Warde, W.D. (1987). A comparison of agglomerative clustering methods with respect to noise. Communications in Statistics, Theory and Methods, Vol. 16, 1433-1460 https://doi.org/10.1080/03610928708829447
Everitt, B. (1993). Cluster Analysis. 3rd edition, John Wiley & Sons
Gowda, K.C. and Diday, E. (1991). Symbolic clustering using a new dissimilarity measure. Pattern Recognition, Vol. 24, 567-578 https://doi.org/10.1016/0031-3203(91)90022-W
Gower, J.C. (1966). Some distance properties of latent root and vector methods used in multivariate analysis. Biometrika, Vol. 53, 325-338 https://doi.org/10.1093/biomet/53.3-4.325
Gower, J.C. (1967). A comparison of some methods of cluster analysis. Biometrics, Vol. 23, 623-637 https://doi.org/10.2307/2528417
Gower, J.C. (1971). A general coefficient of similarity and some of its properties. Biometrics, Vol. 27, 857-871 https://doi.org/10.2307/2528823
Gower, J.C. and Legendre, P. (1986). Metric and Euclidean properties of dissimilarity coefficients. Journal of Classification, Vol. 3, 5-48 https://doi.org/10.1007/BF01896809
Huang, Z. (1998). Extensions to the k-means algorithms for clustering large data sets with categorical values. Data Mining and Knowledge Discovery, Vol. 2, 283-304 https://doi.org/10.1023/A:1009769707641
Jain, A.K. and Dubes, R.C. (1988). Algorithms for Clustering Data. Prentice Hall
Lee, J.J. (2005). Discriminant analysis of binary data with multinomial distribution by using the iterative cross entropy minimization estimation. The Korean Communications in Statistics, Vol. 12, 125-137 https://doi.org/10.5351/CKSS.2005.12.1.125
Ordonez, C. (2003). Clustering binary data streams with K-means. In 8th ACM SIGMOD Workshop on Research Issues in Data Mining and Knowledge Discovery
Rand, W.M. (1971). Objective criteria for the evaluation of clustering methods. Journal of the American Statistical Association, Vol. 66, 846-850 https://doi.org/10.2307/2284239
