Design of Distributed Cloud System for Managing Large-Scale Genomic Data

  • Seine Jang (Graduate School of Smart Convergence, Kwangwoon University)
  • Seok-Jae Moon (Graduate School of Smart Convergence, Kwangwoon University)
  • Received : 2024.03.28
  • Accepted : 2024.04.12
  • Published : 2024.05.31

Abstract

The volume of genomic data is growing steadily across industry and research, and this growth brings both challenges and opportunities in handling the quantity and diversity of genetic data. In this paper, we propose a distributed cloud system for integrating and managing large-scale gene databases. A distributed storage and processing layer based on the Hadoop Distributed File System (HDFS) integrates genomic data of varying formats and sizes. Spark on YARN manages the distributed computing tasks and allocates cluster resources efficiently, which provides a foundation for rapid processing and analysis of large-scale genomic data. In addition, machine learning models built with BigQuery ML support genetic search and prediction, helping researchers make more effective use of the data. We expect the system to contribute to advances in genetic research and its applications.
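
As an illustration only, the two processing stages named in the abstract (a Spark job submitted to YARN over data in HDFS, followed by a BigQuery ML model) might be wired together roughly as follows. This is a minimal sketch and not the authors' implementation: the script name `ingest_variants.py`, the HDFS path, the dataset, table, and model names, the label column, and the choice of `logistic_reg` are all hypothetical placeholders. The code only assembles the command and the SQL text; it does not contact a cluster or BigQuery.

```python
import shlex


def spark_submit_command(app_path, input_uri, num_executors=4, executor_memory="4g"):
    """Build a spark-submit invocation that runs a job on YARN in cluster mode."""
    return [
        "spark-submit",
        "--master", "yarn",            # let YARN schedule the executors
        "--deploy-mode", "cluster",    # run the driver inside the cluster
        "--num-executors", str(num_executors),
        "--executor-memory", executor_memory,
        app_path,                      # hypothetical PySpark script
        input_uri,                     # hypothetical HDFS input directory
    ]


def bqml_create_model_sql(model, source_table, label_col, model_type="logistic_reg"):
    """Build a BigQuery ML CREATE MODEL statement over a feature table."""
    return (
        f"CREATE OR REPLACE MODEL `{model}`\n"
        f"OPTIONS (model_type = '{model_type}', input_label_cols = ['{label_col}'])\n"
        f"AS SELECT * FROM `{source_table}`"
    )


# All names below are assumptions chosen for illustration.
cmd = spark_submit_command("ingest_variants.py", "hdfs:///genomics/raw/")
sql = bqml_create_model_sql(
    "genomics.variant_model", "genomics.variant_features", "label"
)

print(shlex.join(cmd))
print(sql)
```

Keeping the command and the SQL as plain values makes it possible to review or log them before anything is submitted; executor counts and memory would need tuning for a real cluster.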

Acknowledgement

This paper was supported by the Kwangwoon University Research Grant of 2024.
