• Title/Summary/Keyword: statistical clustering method

Search Result 231, Processing Time 0.025 seconds

Results of Discriminant Analysis with Respect to Cluster Analyses Under Dimensional Reduction

  • Chae, Seong-San
    • Communications for Statistical Applications and Methods
    • /
    • v.9 no.2
    • /
    • pp.543-553
    • /
    • 2002
  • Principal component analysis is applied to reduce p-dimensions into q-dimensions ( $q {\leq} p$). Any partition of a collection of data points with p and q variables generated by the application of six hierarchical clustering methods is re-classified by discriminant analysis. From the application of discriminant analysis through each hierarchical clustering method, correct classification ratios are obtained. The results illustrate which method is more reasonable in exploratory data analysis.

Geodesic Clustering for Covariance Matrices

  • Lee, Haesung;Ahn, Hyun-Jung;Kim, Kwang-Rae;Kim, Peter T.;Koo, Ja-Yong
    • Communications for Statistical Applications and Methods
    • /
    • v.22 no.4
    • /
    • pp.321-331
    • /
    • 2015
  • The K-means clustering algorithm is a popular and widely used method for clustering. For covariance matrices, we consider a geodesic clustering algorithm based on the K-means clustering framework in consideration of symmetric positive definite matrices as a Riemannian (non-Euclidean) manifold. This paper considers a geodesic clustering algorithm for data consisting of symmetric positive definite (SPD) matrices, utilizing the Riemannian geometric structure for SPD matrices and the idea of a K-means clustering algorithm. A K-means clustering algorithm is divided into two main steps for which we need a dissimilarity measure between two matrix data points and a way of computing centroids for observations in clusters. In order to use the Riemannian structure, we adopt the geodesic distance and the intrinsic mean for symmetric positive definite matrices. We demonstrate our proposed method through simulations as well as application to real financial data.

Descriptive and Systematic Comparison of Clustering Methods in Microarray Data Analysis

  • Kim, Seo-Young
    • The Korean Journal of Applied Statistics
    • /
    • v.22 no.1
    • /
    • pp.89-106
    • /
    • 2009
  • There have been many new advances in the development of improved clustering methods for microarray data analysis, but traditional clustering methods are still often used in genomic data analysis, which maY be more due to their conceptual simplicity and their broad usability in commercial software packages than to their intrinsic merits. Thus, it is crucial to assess the performance of each existing method through a comprehensive comparative analysis so as to provide informed guidelines on choosing clustering methods. In this study, we investigated existing clustering methods applied to microarray data in various real scenarios. To this end, we focused on how the various methods differ, and why a particular method does not perform well. We applied both internal and external validation methods to the following eight clustering methods using various simulated data sets and real microarray data sets.

Nonparametric analysis of income distributions among different regions based on energy distance with applications to China Health and Nutrition Survey data

  • Ma, Zhihua;Xue, Yishu;Hu, Guanyu
    • Communications for Statistical Applications and Methods
    • /
    • v.26 no.1
    • /
    • pp.57-67
    • /
    • 2019
  • Income distribution is a major concern in economic theory. In regional economics, it is often of interest to compare income distributions in different regions. Traditional methods often compare the income inequality of different regions by assuming parametric forms of the income distributions, or using summary statistics like the Gini coefficient. In this paper, we propose a nonparametric procedure to test for heterogeneity in income distributions among different regions, and a K-means clustering procedure for clustering income distributions based on energy distance. In simulation studies, it is shown that the energy distance based method has competitive results with other common methods in hypothesis testing, and the energy distance based clustering method performs well in the clustering problem. The proposed approaches are applied in analyzing data from China Health and Nutrition Survey 2011. The results indicate that there are significant differences among income distributions of the 12 provinces in the dataset. After applying a 4-means clustering algorithm, we obtained the clustering results of the income distributions in the 12 provinces.

Bootstrap Method for k-Spatial Medians

  • Jhun, Myoung-Shic
    • Journal of the Korean Statistical Society
    • /
    • v.15 no.1
    • /
    • pp.1-8
    • /
    • 1986
  • The k-medians clustering method is considered to partition observations into k clusters. Consistency and advantage of bootstrap confidence sets of k optimal cluster centers are discussed. The k-medians and k-means clustering methods are compared by using actual data sets.

  • PDF

A Density-based Clustering Method

  • Ahn, Sung Mahn;Baik, Sung Wook
    • Communications for Statistical Applications and Methods
    • /
    • v.9 no.3
    • /
    • pp.715-723
    • /
    • 2002
  • This paper is to show a clustering application of a density estimation method that utilizes the Gaussian mixture model. We define "closeness measure" as a clustering criterion to see how close given two Gaussian components are. Closeness measure is defined as the ratio of log likelihood between two Gaussian components. According to simulations using artificial data, the clustering algorithm turned out to be very powerful in that it can correctly determine clusters in complex situations, and very flexible in that it can produce different sizes of clusters based on different threshold valuesold values

Comparison of graph clustering methods for analyzing the mathematical subject classification codes

  • Choi, Kwangju;Lee, June-Yub;Kim, Younjin;Lee, Donghwan
    • Communications for Statistical Applications and Methods
    • /
    • v.27 no.5
    • /
    • pp.569-578
    • /
    • 2020
  • Various graph clustering methods have been introduced to identify communities in social or biological networks. This paper studies the entropy-based and the Markov chain-based methods in clustering the undirected graph. We examine the performance of two clustering methods with conventional methods based on quality measures of clustering. For the real applications, we collect the mathematical subject classification (MSC) codes of research papers from published mathematical databases and construct the weighted code-to-document matrix for applying graph clustering methods. We pursue to group MSC codes into the same cluster if the corresponding MSC codes appear in many papers simultaneously. We compare the MSC clustering results based on the several assessment measures and conclude that the Markov chain-based method is suitable for clustering the MSC codes.

Projection Pursuit K-Means Visual Clustering

  • Kim, Mi-Kyung;Huh, Myung-Hoe
    • Journal of the Korean Statistical Society
    • /
    • v.31 no.4
    • /
    • pp.519-532
    • /
    • 2002
  • K-means clustering is a well-known partitioning method of multivariate observations. Recently, the method is implemented broadly in data mining softwares due to its computational efficiency in handling large data sets. However, it does not yield a suitable visual display of multivariate observations that is important especially in exploratory stage of data analysis. The aim of this study is to develop a K-means clustering method that enables visual display of multivariate observations in a low-dimensional space, for which the projection pursuit method is adopted. We propose a computationally inexpensive and reliable algorithm and provide two numerical examples.

A Study on Performance Evaluation of Clustering Algorithms using Neural and Statistical Method (신경망 및 통계적 방법에 의한 클러스터링 성능평가)

  • 윤석환;민준영;신용백
    • Journal of Korean Society of Industrial and Systems Engineering
    • /
    • v.19 no.37
    • /
    • pp.41-51
    • /
    • 1996
  • This paper evaluates the clustering performance of a neural network and a statistical method. Algorithms which are used in this paper are the GLVQ(Generalized Learning vector Quantization) for a neural method and the k-means algorithm fer a statistical clustering method. For comparison of two methods, we calculate the Rand's c statistics. As a result, the mean of c value obtained with the GLVQ is higher than that obtained with the k-means algorithm, while standard deviation of c value is lower. Experimental data sets were the Fisher's IRIS data and patterns extracted from handwritten numerals.

  • PDF

Microblog User Geolocation by Extracting Local Words Based on Word Clustering and Wrapper Feature Selection

  • Tian, Hechan;Liu, Fenlin;Luo, Xiangyang;Zhang, Fan;Qiao, Yaqiong
    • KSII Transactions on Internet and Information Systems (TIIS)
    • /
    • v.14 no.10
    • /
    • pp.3972-3988
    • /
    • 2020
  • Existing methods always rely on statistical features to extract local words for microblog user geolocation. There are many non-local words in extracted words, which makes geolocation accuracy lower. Considering the statistical and semantic features of local words, this paper proposes a microblog user geolocation method by extracting local words based on word clustering and wrapper feature selection. First, ordinary words without positional indications are initially filtered based on statistical features. Second, a word clustering algorithm based on word vectors is proposed. The remaining semantically similar words are clustered together based on the distance of word vectors with semantic meanings. Next, a wrapper feature selection algorithm based on sequential backward subset search is proposed. The cluster subset with the best geolocation effect is selected. Words in selected cluster subset are extracted as local words. Finally, the Naive Bayes classifier is trained based on local words to geolocate the microblog user. The proposed method is validated based on two different types of microblog data - Twitter and Weibo. The results show that the proposed method outperforms existing two typical methods based on statistical features in terms of accuracy, precision, recall, and F1-score.