On the clustering of huge categorical data

  • Kim, Dae-Hak (School of Liberal Arts, Catholic University of Daegu)
  • Received : 2010.10.23
  • Accepted : 2010.11.23
  • Published : 2010.11.30

Abstract

The basic objective of cluster analysis is to discover natural groupings of items. In general, clustering is based on a similarity (or dissimilarity) matrix or on the original input data, and various measures of similarity between objects have been developed. In this paper, we consider the clustering of a huge real categorical data set that describes the time, location, and activity patterns of Korean people. Some useful similarity measures for the categorical variables in this data set are developed and adopted. Hierarchical and nonhierarchical clustering methods are applied to the data set, which is huge and consists of many categorical variables.
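The abstract does not state which similarity measures the paper develops, so the following is only an illustrative sketch and not the author's method. It assumes a simple matching dissimilarity (the fraction of attributes on which two records disagree) and shows one assignment step of a mode-based nonhierarchical procedure in the spirit of k-modes. The toy records, the attribute values, and all function names are hypothetical.

```python
from collections import Counter


def simple_matching_dissimilarity(a, b):
    """Fraction of categorical attributes on which two records disagree."""
    if len(a) != len(b):
        raise ValueError("records must have the same number of attributes")
    return sum(x != y for x, y in zip(a, b)) / len(a)


def mode_of_cluster(records):
    """Most frequent category of each attribute across the given records."""
    return tuple(Counter(column).most_common(1)[0][0] for column in zip(*records))


def assign_to_nearest_mode(records, modes):
    """Label each record with the index of its least dissimilar mode."""
    return [
        min(range(len(modes)),
            key=lambda k: simple_matching_dissimilarity(record, modes[k]))
        for record in records
    ]


# Hypothetical toy records: (time slot, location, activity).
records = [
    ("morning", "home", "sleep"),
    ("morning", "home", "meal"),
    ("afternoon", "office", "work"),
    ("afternoon", "office", "meeting"),
]

# Assumption: two initial modes picked by hand from the data.
modes = [records[0], records[2]]
labels = assign_to_nearest_mode(records, modes)
```

Repeating the assignment step and recomputing each cluster's mode with `mode_of_cluster` until the labels stop changing gives a basic nonhierarchical procedure. The same dissimilarity function could also fill the pairwise matrix that a hierarchical method takes as input.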
