Use of Graph Database for the Integration of Heterogeneous Biological Data

Yoon, Byoung-Ha; Kim, Seon-Kyu; Kim, Seon-Young

doi:10.5808/GI.2017.15.1.19

Genomics & Informatics

Volume 15 Issue 1 / Pages 19-27 / 2017 / 1598-866X (pISSN) / 2234-0742 (eISSN)

Korea Genome Organization (한국유전체학회)


Yoon, Byoung-Ha (Personalized Genomic Medicine Research Center, Korea Research Institute of Bioscience and Biotechnology (KRIBB)) ;
Kim, Seon-Kyu (Personalized Genomic Medicine Research Center, Korea Research Institute of Bioscience and Biotechnology (KRIBB)) ;
Kim, Seon-Young (Personalized Genomic Medicine Research Center, Korea Research Institute of Bioscience and Biotechnology (KRIBB))

Received : 2016.11.29
Accepted : 2017.02.02
Published : 2017.03.31


Abstract

Understanding complex relationships among heterogeneous biological data is one of the fundamental goals in biology. In most cases, diverse biological data are stored in relational databases, such as MySQL and Oracle, which store data in multiple tables and infer relationships through queries with multiple join statements. Recently, a new type of database, the graph database, was developed to natively represent various kinds of complex relationships, and it is now widely used in the computer science community and the IT industry. Here, we demonstrate the feasibility of using a graph database for complex biological relationships by comparing the performance of MySQL and Neo4j, one of the most widely used graph databases. We collected various biological data (protein-protein interaction, drug-target, gene-disease, etc.) from several existing sources, removed duplicate and redundant data, and constructed a graph database containing 114,550 nodes and 82,674,321 relationships. When we tested the query execution performance of MySQL versus Neo4j, Neo4j outperformed MySQL in all cases. While Neo4j responded very quickly to a variety of queries, MySQL responded slowly or failed to finish for complex queries with multiple join statements. These results show that graph databases, such as Neo4j, are an efficient way to store complex biological relationships. Moreover, querying a graph database in diverse ways has the potential to reveal novel relationships among heterogeneous biological data.
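The contrast between multiple-join queries and graph traversal described in the abstract can be illustrated with a minimal sketch. It is not the authors' benchmark: Python's built-in sqlite3 module stands in for MySQL, a plain adjacency dictionary stands in for Neo4j, and every node name, table name, and relationship type below is a hypothetical toy value rather than data or schema from the paper. Both functions answer the same question: which diseases are reachable from a drug via its target gene and one protein-protein interaction.

```python
import sqlite3
from collections import defaultdict

# Toy, made-up edges (hypothetical names, not data from the paper).
DRUG_TARGET = [("drugA", "GENE1"), ("drugB", "GENE2")]
PPI = [("GENE1", "GENE3"), ("GENE2", "GENE3")]
GENE_DISEASE = [("GENE3", "diseaseX"), ("GENE1", "diseaseY")]


def diseases_via_sql(drug):
    """Relational style: one table per relationship type, one JOIN per hop."""
    con = sqlite3.connect(":memory:")  # sqlite3 is an assumed stand-in for MySQL
    con.execute("CREATE TABLE drug_target (drug TEXT, gene TEXT)")
    con.execute("CREATE TABLE ppi (gene_a TEXT, gene_b TEXT)")
    con.execute("CREATE TABLE gene_disease (gene TEXT, disease TEXT)")
    con.executemany("INSERT INTO drug_target VALUES (?, ?)", DRUG_TARGET)
    con.executemany("INSERT INTO ppi VALUES (?, ?)", PPI)
    con.executemany("INSERT INTO gene_disease VALUES (?, ?)", GENE_DISEASE)
    rows = con.execute(
        """
        SELECT DISTINCT gd.disease
        FROM drug_target AS dt
        JOIN ppi AS p ON p.gene_a = dt.gene
        JOIN gene_disease AS gd ON gd.gene = p.gene_b
        WHERE dt.drug = ?
        """,
        (drug,),
    ).fetchall()
    con.close()
    return sorted(row[0] for row in rows)


def build_graph():
    """Graph style: each node keeps its own outgoing edges, keyed by type."""
    graph = defaultdict(lambda: defaultdict(set))
    typed_edges = (
        ("TARGETS", DRUG_TARGET),
        ("INTERACTS", PPI),
        ("ASSOCIATED", GENE_DISEASE),
    )
    for rel, edges in typed_edges:
        for src, dst in edges:
            graph[src][rel].add(dst)
    return graph


def diseases_via_graph(graph, drug):
    """Follow typed edges hop by hop from the start node, with no joins.

    In Cypher, under the same hypothetical schema, this might be written as:
    MATCH (:Drug {name: $drug})-[:TARGETS]->(:Gene)-[:INTERACTS]->(:Gene)
          -[:ASSOCIATED]->(x:Disease) RETURN DISTINCT x.name
    """
    frontier = {drug}
    for rel in ("TARGETS", "INTERACTS", "ASSOCIATED"):
        frontier = {
            nbr
            for node in frontier
            for nbr in graph.get(node, {}).get(rel, set())
        }
    return sorted(frontier)


if __name__ == "__main__":
    g = build_graph()
    print(diseases_via_sql("drugA"), diseases_via_graph(g, "drugA"))
```

In the relational version each extra hop adds another JOIN across whole tables, whereas the traversal version only touches the neighbors of nodes already reached. This toy example says nothing about actual timings, which depend on indexing, data size, and engine configuration.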


