http://dx.doi.org/10.5351/KJAS.2009.22.1.089

Descriptive and Systematic Comparison of Clustering Methods in Microarray Data Analysis  

Kim, Seo-Young (Statistics Research Institute, Korea National Statistical Office)
Publication Information
The Korean Journal of Applied Statistics / v.22, no.1, 2009, pp. 89-106
Abstract
Many improved clustering methods have been developed for microarray data analysis, yet traditional clustering methods are still often used in genomic data analysis, which may be due more to their conceptual simplicity and their broad usability in commercial software packages than to their intrinsic merits. Thus, it is crucial to assess the performance of each existing method through a comprehensive comparative analysis so as to provide informed guidelines for choosing clustering methods. In this study, we investigated existing clustering methods applied to microarray data in various real scenarios. To this end, we focused on how the various methods differ and why a particular method does not perform well. We applied both internal and external validation methods to eight clustering methods using various simulated data sets and real microarray data sets.
Keywords
Microarray; gene expression data; clustering