• Title/Summary/Keyword: k-means clustering analysis

Search Result 462, Processing Time 0.026 seconds

Clustering Approaches to Identifying Gene Expression Patterns from DNA Microarray Data

  • Do, Jin Hwan;Choi, Dong-Kug
    • Molecules and Cells
    • /
    • v.25 no.2
    • /
    • pp.279-288
    • /
    • 2008
  • The analysis of microarray data is essential for large amounts of gene expression data. In this review we focus on clustering techniques. The biological rationale for this approach is the fact that many co-expressed genes are co-regulated, and identifying co-expressed genes could aid in functional annotation of novel genes, de novo identification of transcription factor binding sites and elucidation of complex biological pathways. Co-expressed genes are usually identified in microarray experiments by clustering techniques. There are many such methods, and the results obtained even for the same datasets may vary considerably depending on the algorithms and metrics for dissimilarity measures used, as well as on user-selectable parameters such as desired number of clusters and initial values. Therefore, biologists who want to interpret microarray data should be aware of the weakness and strengths of the clustering methods used. In this review, we survey the basic principles of clustering of DNA microarray data from crisp clustering algorithms such as hierarchical clustering, K-means and self-organizing maps, to complex clustering algorithms like fuzzy clustering.

Nonparametric analysis of income distributions among different regions based on energy distance with applications to China Health and Nutrition Survey data

  • Ma, Zhihua;Xue, Yishu;Hu, Guanyu
    • Communications for Statistical Applications and Methods
    • /
    • v.26 no.1
    • /
    • pp.57-67
    • /
    • 2019
  • Income distribution is a major concern in economic theory. In regional economics, it is often of interest to compare income distributions in different regions. Traditional methods often compare the income inequality of different regions by assuming parametric forms of the income distributions, or using summary statistics like the Gini coefficient. In this paper, we propose a nonparametric procedure to test for heterogeneity in income distributions among different regions, and a K-means clustering procedure for clustering income distributions based on energy distance. In simulation studies, it is shown that the energy distance based method has competitive results with other common methods in hypothesis testing, and the energy distance based clustering method performs well in the clustering problem. The proposed approaches are applied in analyzing data from China Health and Nutrition Survey 2011. The results indicate that there are significant differences among income distributions of the 12 provinces in the dataset. After applying a 4-means clustering algorithm, we obtained the clustering results of the income distributions in the 12 provinces.

Colorectal Cancer Staging Using Three Clustering Methods Based on Preoperative Clinical Findings

  • Pourahmad, Saeedeh;Pourhashemi, Soudabeh;Mohammadianpanah, Mohammad
    • Asian Pacific Journal of Cancer Prevention
    • /
    • v.17 no.2
    • /
    • pp.823-827
    • /
    • 2016
  • Determination of the colorectal cancer stage is possible only after surgery based on pathology results. However, sometimes this may prove impossible. The aim of the present study was to determine colorectal cancer stage using three clustering methods based on preoperative clinical findings. All patients referred to the Colorectal Research Center of Shiraz University of Medical Sciences for colorectal cancer surgery during 2006 to 2014 were enrolled in the study. Accordingly, 117 cases participated. Three clustering algorithms were utilized including k-means, hierarchical and fuzzy c-means clustering methods. External validity measures such as sensitivity, specificity and accuracy were used for evaluation of the methods. The results revealed maximum accuracy and sensitivity values for the hierarchical and a maximum specificity value for the fuzzy c-means clustering methods. Furthermore, according to the internal validity measures for the present data set, the optimal number of clusters was two (silhouette coefficient) and the fuzzy c-means algorithm was more appropriate than the k-means clustering approach by increasing the number of clusters.

Projection Pursuit K-Means Visual Clustering

  • Kim, Mi-Kyung;Huh, Myung-Hoe
    • Journal of the Korean Statistical Society
    • /
    • v.31 no.4
    • /
    • pp.519-532
    • /
    • 2002
  • K-means clustering is a well-known partitioning method of multivariate observations. Recently, the method is implemented broadly in data mining softwares due to its computational efficiency in handling large data sets. However, it does not yield a suitable visual display of multivariate observations that is important especially in exploratory stage of data analysis. The aim of this study is to develop a K-means clustering method that enables visual display of multivariate observations in a low-dimensional space, for which the projection pursuit method is adopted. We propose a computationally inexpensive and reliable algorithm and provide two numerical examples.

Cluster analysis by month for meteorological stations using a gridded data of numerical model with temperatures and precipitation (기온과 강수량의 수치모델 격자자료를 이용한 기상관측지점의 월별 군집화)

  • Kim, Hee-Kyung;Kim, Kwang-Sub;Lee, Jae-Won;Lee, Yung-Seop
    • Journal of the Korean Data and Information Science Society
    • /
    • v.28 no.5
    • /
    • pp.1133-1144
    • /
    • 2017
  • Cluster analysis with meteorological data allows to segment meteorological region based on meteorological characteristics. By the way, meteorological observed data are not adequate for cluster analysis because meteorological stations which observe the data are located not uniformly. Therefore the clustering of meteorological observed data cannot reflect the climate characteristic of South Korea properly. The clustering of $5km{\times}5km$ gridded data derived from a numerical model, on the other hand, reflect it evenly. In this study, we analyzed long-term grid data for temperatures and precipitation using cluster analysis. Due to the monthly difference of climate characteristics, clustering was performed by month. As the result of K-Means cluster analysis is so sensitive to initial values, we used initial values with Ward method which is hierarchical cluster analysis method. Based on clustering of gridded data, cluster of meteorological stations were determined. As a result, clustering of meteorological stations in South Korea has been made spatio-temporal segmentation.

COUNTING OF FLOWERS BASED ON K-MEANS CLUSTERING AND WATERSHED SEGMENTATION

  • PAN ZHAO;BYEONG-CHUN SHIN
    • Journal of the Korean Society for Industrial and Applied Mathematics
    • /
    • v.27 no.2
    • /
    • pp.146-159
    • /
    • 2023
  • This paper proposes a hybrid algorithm combining K-means clustering and watershed algorithms for flower segmentation and counting. We use the K-means clustering algorithm to obtain the main colors in a complex background according to the cluster centers and then take a color space transformation to extract pixel values for the hue, saturation, and value of flower color. Next, we apply the threshold segmentation technique to segment flowers precisely and obtain the binary image of flowers. Based on this, we take the Euclidean distance transformation to obtain the distance map and apply it to find the local maxima of the connected components. Afterward, the proposed algorithm adaptively determines a minimum distance between each peak and apply it to label connected components using the watershed segmentation with eight-connectivity. On a dataset of 30 images, the test results reveal that the proposed method is more efficient and precise for the counting of overlapped flowers ignoring the degree of overlap, number of overlap, and relatively irregular shape.

Design of Pattern Classification Rule based on Local Linear Discriminant Analysis Classifier by using Differential Evolutionary Algorithm (차분진화 알고리즘을 이용한 지역 Linear Discriminant Analysis Classifier 기반 패턴 분류 규칙 설계)

  • Roh, Seok-Beom;Hwang, Eun-Jin;Ahn, Tae-Chon
    • Journal of the Korean Institute of Intelligent Systems
    • /
    • v.22 no.1
    • /
    • pp.81-86
    • /
    • 2012
  • In this paper, we proposed a new design methodology of a pattern classification rule based on the local linear discriminant analysis expanded from the generic linear discriminant analysis which is used in the local area divided from the whole input space. There are two ways such as k-Means clustering method and the differential evolutionary algorithm to partition the whole input space into the several local areas. K-Means clustering method is the one of the unsupervised clustering methods and the differential evolutionary algorithm is the one of the optimization algorithms. In addition, the experimental application covers a comparative analysis including several previously commonly encountered methods.

A Performance Comparison of Cluster Validity Indices based on K-means Algorithm (K-means 알고리즘 기반 클러스터링 인덱스 비교 연구)

  • Shim, Yo-Sung;Chung, Ji-Won;Choi, In-Chan
    • Asia pacific journal of information systems
    • /
    • v.16 no.1
    • /
    • pp.127-144
    • /
    • 2006
  • The K-means algorithm is widely used at the initial stage of data analysis in data mining process, partly because of its low time complexity and the simplicity of practical implementation. Cluster validity indices are used along with the algorithm in order to determine the number of clusters as well as the clustering results of datasets. In this paper, we present a performance comparison of sixteen indices, which are selected from forty indices in literature, while considering their applicability to nonhierarchical clustering algorithms. Data sets used in the experiment are generated based on multivariate normal distribution. In particular, four error types including standardization, outlier generation, error perturbation, and noise dimension addition are considered in the comparison. Through the experiment the effects of varying number of points, attributes, and clusters on the performance are analyzed. The result of the simulation experiment shows that Calinski and Harabasz index performs the best through the all datasets and that Davis and Bouldin index becomes a strong competitor as the number of points increases in dataset.

Clustering of Decision Making Units using DEA (DEA를 이용한 의사결정단위의 클러스터링)

  • Kim, Kyeongtaek
    • Journal of Korean Society of Industrial and Systems Engineering
    • /
    • v.37 no.4
    • /
    • pp.239-244
    • /
    • 2014
  • The conventional clustering approaches are mostly based on minimizing total dissimilarity of input and output. However, the clustering approach may not be helpful in some cases of clustering decision making units (DMUs) with production feature converting multiple inputs into multiple outputs because it does not care converting functions. Data envelopment analysis (DEA) has been widely applied for efficiency estimation of such DMUs since it has non-parametric characteristics. We propose a new clustering method to identify groups of DMUs that are similar in terms of their input-output profiles. A real world example is given to explain the use and effectiveness of the proposed method. And we calculate similarity value between its result and the result of a conventional clustering method applied to the example. After the efficiency value was added to input of K-means algorithm, we calculate new similarity value and compare it with the previous one.

Prediction and visualization of CYP2D6 genotype-based phenotype using clustering algorithms

  • Kim, Eun-Young;Shin, Sang-Goo;Shin, Jae-Gook
    • Translational and Clinical Pharmacology
    • /
    • v.25 no.3
    • /
    • pp.147-152
    • /
    • 2017
  • This study focused on the role of cytochrome P450 2D6 (CYP2D6) genotypes to predict phenotypes in the metabolism of dextromethorphan. CYP2D6 genotypes and metabolic ratios (MRs) of dextromethorphan were determined in 201 Koreans. Unsupervised clustering algorithms, hierarchical and k-means clustering analysis, and color visualizations of CYP2D6 activity were performed on a subset of 130 subjects. A total of 23 different genotypes were identified, five of which were observed in one subject. Phenotype classifications were based on the means, medians, and standard deviations of the log MR values for each genotype. Color visualization was used to display the mean and median of each genotype as different color intensities. Cutoff values were determined using receiver operating characteristic curves from the k-means analysis, and the data were validated in the remaining subset of 71 subjects. Using the two highest silhouette values, the selected numbers of clusters were three (the best) and four. The findings from the two clustering algorithms were similar to those of other studies, classifying $^*5/^*5$ as a lowest activity group and genotypes containing duplicated alleles (i.e., $CYP2D6^*1/^*2N$) as a highest activity group. The validation of the k-means clustering results with data from the 71 subjects revealed relatively high concordance rates: 92.8% and 73.9% in three and four clusters, respectively. Additionally, color visualization allowed for rapid interpretation of results. Although the clustering approach to predict CYP2D6 phenotype from CYP2D6 genotype is not fully complete, it provides general information about the genotype to phenotype relationship, including rare genotypes with only one subject.