Multivariate Procedure for Variable Selection and Classification of High Dimensional Heterogeneous Data

  • Mehmood, Tahir (Statistics, Department of Basic Sciences, Riphah International University)
  • Rasheed, Zahid (Statistics, Department of Basic Sciences, Riphah International University)
  • Received : 2015.05.23
  • Accepted : 2015.09.23
  • Published : 2015.11.30

Abstract

Advances in data collection techniques produce high dimensional data sets, in which discrimination is an important and commonly encountered problem that is crucial to resolve when the data are heterogeneous (the classes do not share a common variance-covariance structure). An example is the classification of microbial habitat preferences based on codon/bi-codon usage. Habitat preference is important for the study of evolutionary genetic relationships and may help industry produce specific enzymes. Most classification procedures assume homogeneity (a common variance-covariance structure for all classes), which is not guaranteed in most high dimensional data sets. We introduce regularized elimination in partial least squares coupled with QDA (rePLS-QDA) for parsimonious variable selection and classification of high dimensional heterogeneous data sets. The procedure combines the recently introduced regularized elimination for variable selection in partial least squares (rePLS) with quadratic discriminant analysis (QDA), a classification procedure for heterogeneous data. The proposed and existing methods are compared on a simulated data set; in addition, the proposed procedure is applied to classify microbial habitat preferences by their codon/bi-codon usage. Five bacterial habitats (Aquatic, Host Associated, Multiple, Specialized and Terrestrial) are modeled. The classification accuracy for each habitat is satisfactory and ranges from 89.1% to 100% on test data. Codon/bi-codon usages of interest, and the mutual interactions among them that influence the respective habitat preferences, are identified. The proposed method also produces results that concur with known biological characteristics and that will help researchers better understand the divergence of species.
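To illustrate why a heterogeneous classifier matters, the following is a minimal sketch of quadratic discriminant analysis on two-dimensional toy data, assuming two classes that share the same mean but differ in their variance-covariance structure. It is not the authors' rePLS-QDA procedure: the rePLS variable selection and PLS score extraction steps are omitted, and the data, class labels ("tight", "wide") and function names are hypothetical, chosen only for illustration. Each class gets its own covariance estimate, and a point is assigned to the class with the largest quadratic discriminant score.

```python
import math

def mean_and_cov(points):
    """Return the mean and the 2x2 sample covariance (sxx, sxy, syy) of 2-D points."""
    n = len(points)
    mx = sum(x for x, _ in points) / n
    my = sum(y for _, y in points) / n
    sxx = sum((x - mx) ** 2 for x, _ in points) / (n - 1)
    syy = sum((y - my) ** 2 for _, y in points) / (n - 1)
    sxy = sum((x - mx) * (y - my) for x, y in points) / (n - 1)
    return (mx, my), (sxx, sxy, syy)

def fit_qda(classes):
    """Estimate a separate mean, covariance and prior for every class."""
    total = sum(len(pts) for pts in classes.values())
    model = {}
    for label, pts in classes.items():
        mean, cov = mean_and_cov(pts)
        model[label] = (mean, cov, len(pts) / total)
    return model

def qda_score(point, mean, cov, prior):
    """Quadratic discriminant score: -0.5*log|S| - 0.5*d'S^{-1}d + log(prior)."""
    sxx, sxy, syy = cov
    det = sxx * syy - sxy ** 2
    dx, dy = point[0] - mean[0], point[1] - mean[1]
    # Squared Mahalanobis distance using the closed-form inverse of a 2x2 matrix.
    maha = (syy * dx * dx - 2 * sxy * dx * dy + sxx * dy * dy) / det
    return -0.5 * math.log(det) - 0.5 * maha + math.log(prior)

def classify(point, model):
    """Assign the point to the class with the largest discriminant score."""
    return max(model, key=lambda label: qda_score(point, *model[label]))

# Hypothetical toy data: both classes are centred at the origin,
# so only their covariance structure distinguishes them.
toy_classes = {
    "tight": [(0.5, 0.0), (-0.5, 0.0), (0.0, 0.5), (0.0, -0.5), (0.3, 0.3), (-0.3, -0.3)],
    "wide": [(3.0, 0.0), (-3.0, 0.0), (0.0, 3.0), (0.0, -3.0), (2.0, -2.0), (-2.0, 2.0)],
}
model = fit_qda(toy_classes)

print(classify((0.1, 0.1), model))    # tight
print(classify((2.5, -2.5), model))   # wide
```

Because both toy classes have the same mean, a rule that assumes a common covariance for all classes has no information with which to separate them, whereas the class-specific covariances used here do. In a high dimensional setting such as codon usage, the covariance matrices could not be estimated directly from the raw variables, which is presumably why the abstract couples QDA with a PLS-based reduction first.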
