Dimensionality Reduction of RNA-Seq Data

  • Al-Turaiki, Isra (College of Computer and Information Sciences, Information Technology Department, King Saud University)
  • Received: 2021.03.05
  • Published: 2021.03.30

Abstract

RNA sequencing (RNA-Seq) is a technology that facilitates transcriptome analysis using next-generation sequencing (NGS) tools. Information on the quantity and sequences of RNA is vital for relating our genomes to functional protein expression. RNA-Seq data are high-dimensional: the number of variables (i.e., transcripts) far exceeds the number of observations (e.g., experiments). Given the wide range of dimensionality reduction techniques, it is not clear which is best suited to RNA-Seq data analysis. In this paper, we study the effect of three dimensionality reduction techniques on the classification of an RNA-Seq dataset. In particular, we use principal component analysis (PCA), singular value decomposition (SVD), and self-organizing maps (SOM) to obtain a reduced feature space. We built nine classification models for a cancer dataset and compared their performance. Our experimental results indicate that better classification performance is obtained with PCA and SOM. Overall, the combinations PCA+KNN (k-nearest neighbors), SOM+RF (random forest), and SOM+KNN produce the best results.
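
To make the reduce-then-classify idea concrete, the following is a minimal sketch of a PCA+KNN pipeline in the high-dimensional setting the abstract describes (many more features than samples). It is an illustration only, not the pipeline, tooling, or dataset used in the paper: the data are synthetic, the sample sizes, class shift, number of components, and `k` are arbitrary assumptions, and the helper names `pca_fit`, `pca_transform`, and `knn_predict` are hypothetical. PCA is computed here through the SVD of the centered training matrix, which also shows how the two techniques are related.

```python
import numpy as np

def pca_fit(X, n_components):
    """Return (mean, components) from the SVD of the centered matrix X."""
    mean = X.mean(axis=0)
    # Rows of Vt are the principal axes, ordered by decreasing singular value.
    _, _, Vt = np.linalg.svd(X - mean, full_matrices=False)
    return mean, Vt[:n_components]

def pca_transform(X, mean, components):
    """Project samples onto the retained principal axes."""
    return (X - mean) @ components.T

def knn_predict(X_train, y_train, X_test, k=3):
    """Classify each test sample by majority vote among its k nearest neighbors."""
    # Squared Euclidean distance between every test and training sample.
    d2 = ((X_test[:, None, :] - X_train[None, :, :]) ** 2).sum(axis=2)
    nearest = np.argsort(d2, axis=1)[:, :k]
    votes = y_train[nearest]
    return np.array([np.bincount(row).argmax() for row in votes])

# Synthetic stand-in for an expression matrix (assumption): 60 samples,
# 500 features, 3 classes, each class shifted in its own block of 20 features.
rng = np.random.default_rng(0)
n_per_class, n_features, n_classes = 20, 500, 3
X = rng.normal(size=(n_per_class * n_classes, n_features))
y = np.repeat(np.arange(n_classes), n_per_class)
for c in range(n_classes):
    X[y == c, 20 * c : 20 * (c + 1)] += 4.0

# Hold out 15 samples; fit PCA on the training samples only.
idx = rng.permutation(len(y))
train, test = idx[:45], idx[45:]
mean, comps = pca_fit(X[train], n_components=2)
Z_train = pca_transform(X[train], mean, comps)
Z_test = pca_transform(X[test], mean, comps)

y_pred = knn_predict(Z_train, y[train], Z_test, k=3)
accuracy = float((y_pred == y[test]).mean())
print(f"reduced shape: {Z_train.shape}, test accuracy: {accuracy:.2f}")
```

Fitting the projection on the training samples alone and then applying it to the held-out samples keeps test information out of the reduced feature space. Swapping `knn_predict` for another classifier, or `pca_fit` for another reduction method, gives the other combinations of the kind compared in the paper.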
