High-performance computing for SARS-CoV-2 RNAs clustering: a data science-based genomics approach

  • Oujja, Anas (School of Science and Engineering, Al Akhawayn University in Ifrane) ;
  • Abid, Mohamed Riduan (School of Science and Engineering, Al Akhawayn University in Ifrane) ;
  • Boumhidi, Jaouad (Computer Science, Signals, Automation and Cognitivism Laboratory (LISAC), Computer Science Department, Faculty of Science Dhar El Mahraz, Sidi Mohamed Ben Abdellah University) ;
  • Bourhnane, Safae (School of Science and Engineering, Al Akhawayn University in Ifrane) ;
  • Mourhir, Asmaa (School of Science and Engineering, Al Akhawayn University in Ifrane) ;
  • Merchant, Fatima (Computer Engineering Technology Faculty, University of Houston) ;
  • Benhaddou, Driss (Computer Engineering Technology Faculty, University of Houston)
  • Received : 2021.09.14
  • Accepted : 2021.12.08
  • Published : 2021.12.31

Abstract

Genomic data is now one of the fastest-growing datasets in the world. By 2025, it is expected to become the fourth largest source of Big Data, which calls for adequate high-performance computing (HPC) platforms to process it. Given the recent unprecedented and unpredictable mutations in severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2), the research community urgently needs ICT tools to process SARS-CoV-2 RNA data, e.g., by classifying (i.e., clustering) it, and thus to assist in tracking virus mutations and predicting future ones. In this paper, we present an HPC-based SARS-CoV-2 RNA clustering tool. We adopt a data science approach, from data collection, through analysis, to visualization. In the analysis step, we show how our clustering approach leverages HPC and the longest common subsequence (LCS) algorithm. The approach uses the Hadoop MapReduce programming paradigm and adapts the LCS algorithm to efficiently compute the length of the LCS for each pair of SARS-CoV-2 RNA sequences. The sequences are extracted from the U.S. National Center for Biotechnology Information (NCBI) Virus repository. The computed LCS lengths are used to measure the dissimilarities between RNA sequences in order to identify existing clusters. In addition, we present a comparative study of the LCS algorithm's performance under variable workloads and different numbers of Hadoop worker nodes.
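The pairwise LCS step described above can be illustrated with a minimal single-machine sketch in Python. It is not the authors' Hadoop implementation: the function names, the toy sequences, and in particular the normalization `1 - LCS / max(len)` used as a dissimilarity are assumptions made for illustration only, since the abstract does not state the exact formula.

```python
from itertools import combinations
from typing import Dict, Tuple


def lcs_length(a: str, b: str) -> int:
    """Length of the longest common subsequence of a and b.

    Classic dynamic programming, keeping only two rows of the table,
    so memory is O(min(len(a), len(b))) and time is O(len(a) * len(b)).
    """
    if len(a) < len(b):
        a, b = b, a  # keep the shorter sequence as the row dimension
    prev = [0] * (len(b) + 1)
    for ch_a in a:
        curr = [0]
        for j, ch_b in enumerate(b, start=1):
            if ch_a == ch_b:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def dissimilarity(a: str, b: str) -> float:
    """Assumed measure (not from the paper): 1 - LCS length / longer length."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 0.0
    return 1.0 - lcs_length(a, b) / longest


def pairwise_dissimilarities(seqs: Dict[str, str]) -> Dict[Tuple[str, str], float]:
    """Dissimilarity for every unordered pair of named sequences."""
    return {
        (x, y): dissimilarity(seqs[x], seqs[y])
        for x, y in combinations(sorted(seqs), 2)
    }


if __name__ == "__main__":
    # Hypothetical toy fragments, not real SARS-CoV-2 records.
    toy = {"s1": "ACGT", "s2": "AGT", "s3": "ACGT"}
    for pair, d in pairwise_dissimilarities(toy).items():
        print(pair, d)
```

Each pair is scored independently of every other pair, which is the property that lets the computation be split across worker nodes in a map phase; the resulting dissimilarity values can then be passed to a clustering method.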

Acknowledgement

This work is sponsored by US-NAS/USAID under the PEER Cycle 5 project grant #5-398, entitled "Towards Smart Microgrid: Renewable Energy Integration into Smart Buildings".
