http://dx.doi.org/10.5351/CSAM.2015.22.6.575

Multivariate Procedure for Variable Selection and Classification of High Dimensional Heterogeneous Data  

Mehmood, Tahir (Statistics, Department of Basic Sciences, Riphah International University)
Rasheed, Zahid (Statistics, Department of Basic Sciences, Riphah International University)
Publication Information
Communications for Statistical Applications and Methods, v.22, no.6, 2015, pp. 575-587
Abstract
Developments in data collection techniques have produced high dimensional data sets, in which discrimination is an important and commonly encountered problem that is crucial to resolve when the data are heterogeneous (the classes do not share a common variance-covariance structure). An example is the classification of microbial habitat preferences based on codon/bi-codon usage. Habitat preference is important for studying evolutionary genetic relationships and may help industry produce specific enzymes. Most classification procedures assume homogeneity (a common variance-covariance structure for all classes), which is not guaranteed in most high dimensional data sets. We introduce regularized elimination in partial least squares coupled with quadratic discriminant analysis (rePLS-QDA) for parsimonious variable selection and classification of high dimensional heterogeneous data sets. The procedure combines the recently introduced regularized elimination for variable selection in partial least squares (rePLS) with quadratic discriminant analysis (QDA), a classification procedure for heterogeneous data. The proposed and existing methods are compared on a simulated data set; in addition, the proposed procedure is applied to classify microbial habitat preferences by their codon/bi-codon usage. Five bacterial habitats (Aquatic, Host Associated, Multiple, Specialized and Terrestrial) are modeled. The classification accuracy for each habitat is satisfactory and ranges from 89.1% to 100% on test data. Interesting codon/bi-codon usages, and their mutual interactions that influence the respective habitat preferences, are identified. The proposed method also produced results that concur with known biological characteristics, which will help researchers better understand the divergence of species.
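The contrast between homogeneous and heterogeneous class structure can be made concrete with the standard textbook discriminant functions. The following is a sketch in assumed notation, not taken from the paper: x is an observation (or, in a PLS-based procedure, presumably its vector of component scores), and μ_k, Σ_k and π_k are the mean, variance-covariance matrix and prior probability of class k. An observation is assigned to the class with the largest discriminant score.

```latex
% Homogeneous case (LDA): one pooled covariance matrix \Sigma for all classes,
% so the discriminant is linear in x.
\delta_k^{\mathrm{LDA}}(x) = x^{\top}\Sigma^{-1}\mu_k
  - \tfrac{1}{2}\,\mu_k^{\top}\Sigma^{-1}\mu_k + \log \pi_k

% Heterogeneous case (QDA): a separate covariance matrix \Sigma_k per class,
% so the discriminant is quadratic in x.
\delta_k^{\mathrm{QDA}}(x) = -\tfrac{1}{2}\log\lvert\Sigma_k\rvert
  - \tfrac{1}{2}\,(x-\mu_k)^{\top}\Sigma_k^{-1}(x-\mu_k) + \log \pi_k

% Classification rule in both cases.
\hat{k}(x) = \arg\max_{k}\ \delta_k(x)
```

Each Σ_k has p(p+1)/2 free parameters for p variables and becomes singular when p exceeds the class sample size, so QDA cannot be estimated directly on raw high dimensional data; applying it after a reduction to a few components is one plausible way around this.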
Keywords
partial least squares; classification; variable selection; parsimonious model; high dimensional data sets; identification; multicollinearity; microbial