Bioinformatics services for analyzing massive genomic datasets

  • Ko, Gunhwan (Korea Bioinformation Center (KOBIC), KRIBB) ;
  • Kim, Pan-Gyu (Korea Bioinformation Center (KOBIC), KRIBB) ;
  • Cho, Youngbum (Genome Editing Research Center, KRIBB) ;
  • Jeong, Seongmun (Genome Editing Research Center, KRIBB) ;
  • Kim, Jae-Yoon (Genome Editing Research Center, KRIBB) ;
  • Kim, Kyoung Hyoun (Genome Editing Research Center, KRIBB) ;
  • Lee, Ho-Yeon (Genome Editing Research Center, KRIBB) ;
  • Han, Jiyeon (Department of BioInformation Science, Ewha Womans University) ;
  • Yu, Namhee (Department of BioInformation Science, Ewha Womans University) ;
  • Ham, Seokjin (Department of Life Sciences and Division of Integrative Biosciences & Biotechnology, Pohang University of Science & Technology (POSTECH)) ;
  • Jang, Insoon (Department of Life Sciences and Division of Integrative Biosciences & Biotechnology, Pohang University of Science & Technology (POSTECH)) ;
  • Kang, Byunghee (Department of Life Sciences and Division of Integrative Biosciences & Biotechnology, Pohang University of Science & Technology (POSTECH)) ;
  • Shin, Sunguk (Department of Systems Biology, Division of Life Sciences, and Institute for Life Science and Biotechnology, Yonsei University) ;
  • Kim, Lian (Bioposh Inc.) ;
  • Lee, Seung-Won (SeqGenesis) ;
  • Nam, Dougu (School of Life Sciences, Ulsan National Institute of Science and Technology) ;
  • Kim, Jihyun F. (Department of Systems Biology, Division of Life Sciences, and Institute for Life Science and Biotechnology, Yonsei University) ;
  • Kim, Namshin (Genome Editing Research Center, KRIBB) ;
  • Kim, Seon-Young (Genome Structure Research Center, KRIBB) ;
  • Lee, Sanghyuk (Department of BioInformation Science, Ewha Womans University) ;
  • Roh, Tae-Young (Department of Life Sciences and Division of Integrative Biosciences & Biotechnology, Pohang University of Science & Technology (POSTECH)) ;
  • Lee, Byungwook (Korea Bioinformation Center (KOBIC), KRIBB)
  • Received : 2020.01.09
  • Accepted : 2020.03.11
  • Published : 2020.03.31

Abstract

The explosive growth of next-generation sequencing data has produced ultra-large-scale datasets and the computational problems that come with them. In Korea, the amount of genomic data has been increasing rapidly in recent years. Making use of these data requires researchers to have access to large-scale computational resources and analysis pipelines. A promising solution to this computational challenge is cloud computing, in which CPUs, memory, storage, and programs are accessible in the form of virtual machines. Here, we present a cloud computing-based system, Bio-Express, that provides user-friendly, cost-effective analysis of massive genomic datasets. Bio-Express is loaded with predefined multi-omics data analysis pipelines, which are divided into genome, transcriptome, epigenome, and metagenome pipelines. Users can employ the predefined pipelines or create a new pipeline to analyze their own omics data. We also developed several web-based services to facilitate downstream analysis of genome data. The Bio-Express web service is freely available at https://www.bioexpress.re.kr/.
