Bioinformatics services for analyzing massive genomic datasets

Ko, Gunhwan; Kim, Pan-Gyu; Cho, Youngbum; Jeong, Seongmun; Kim, Jae-Yoon; Kim, Kyoung Hyoun; Lee, Ho-Yeon; Han, Jiyeon; Yu, Namhee; Ham, Seokjin; Jang, Insoon; Kang, Byunghee; Shin, Sunguk; Kim, Lian; Lee, Seung-Won; Nam, Dougu; Kim, Jihyun F.; Kim, Namshin; Kim, Seon-Young; Lee, Sanghyuk; Roh, Tae-Young; Lee, Byungwook

doi:10.5808/GI.2020.18.1.e8

Genomics & Informatics

Volume 18 Issue 1 / Pages 8.1-8.10 / 2020 / 1598-866X (pISSN) / 2234-0742 (eISSN)

Korea Genome Organization (한국유전체학회)



Ko, Gunhwan (Korea Bioinformation Center (KOBIC), KRIBB) ;
Kim, Pan-Gyu (Korea Bioinformation Center (KOBIC), KRIBB) ;
Cho, Youngbum (Genome Editing Research Center, KRIBB) ;
Jeong, Seongmun (Genome Editing Research Center, KRIBB) ;
Kim, Jae-Yoon (Genome Editing Research Center, KRIBB) ;
Kim, Kyoung Hyoun (Genome Editing Research Center, KRIBB) ;
Lee, Ho-Yeon (Genome Editing Research Center, KRIBB) ;
Han, Jiyeon (Department of BioInformation Science, Ewha Womans University) ;
Yu, Namhee (Department of BioInformation Science, Ewha Womans University) ;
Ham, Seokjin (Department of Life Sciences and Division of Integrative Biosciences & Biotechnology, Pohang University of Science & Technology (POSTECH)) ;
Jang, Insoon (Department of Life Sciences and Division of Integrative Biosciences & Biotechnology, Pohang University of Science & Technology (POSTECH)) ;
Kang, Byunghee (Department of Life Sciences and Division of Integrative Biosciences & Biotechnology, Pohang University of Science & Technology (POSTECH)) ;
Shin, Sunguk (Department of Systems Biology, Division of Life Sciences, and Institute for Life Science and Biotechnology, Yonsei University) ;
Kim, Lian (Bioposh Inc.) ;
Lee, Seung-Won (SeqGenesis) ;
Nam, Dougu (School of Life Sciences, Ulsan National Institute of Science and Technology) ;
Kim, Jihyun F. (Department of Systems Biology, Division of Life Sciences, and Institute for Life Science and Biotechnology, Yonsei University) ;
Kim, Namshin (Genome Editing Research Center, KRIBB) ;
Kim, Seon-Young (Genome Structure Research Center, KRIBB) ;
Lee, Sanghyuk (Department of BioInformation Science, Ewha Womans University) ;
Roh, Tae-Young (Department of Life Sciences and Division of Integrative Biosciences & Biotechnology, Pohang University of Science & Technology (POSTECH)) ;
Lee, Byungwook (Korea Bioinformation Center (KOBIC), KRIBB)

Received : 2020.01.09
Accepted : 2020.03.11
Published : 2020.03.31


Abstract

The explosive growth of next-generation sequencing data has resulted in ultra-large-scale datasets and ensuing computational problems. In Korea, the amount of genomic data has been increasing rapidly in recent years. Leveraging these big data requires researchers to use large-scale computational resources and analysis pipelines. A promising solution to this computational challenge is cloud computing, in which CPUs, memory, storage, and programs are accessible in the form of virtual machines. Here, we present a cloud computing-based system, Bio-Express, that provides user-friendly, cost-effective analysis of massive genomic datasets. Bio-Express is loaded with predefined multi-omics data analysis pipelines, which are divided into genome, transcriptome, epigenome, and metagenome pipelines. Users can employ the predefined pipelines or create new pipelines for analyzing their own omics data. We also developed several web-based services to facilitate downstream analysis of genome data. The Bio-Express web service is freely available at https://www.bioexpress.re.kr/.

