Use of Graph Database for the Integration of Heterogeneous Biological Data

  • Yoon, Byoung-Ha (Personalized Genomic Medicine Research Center, Korea Research Institute of Bioscience and Biotechnology (KRIBB))
  • Kim, Seon-Kyu (Personalized Genomic Medicine Research Center, Korea Research Institute of Bioscience and Biotechnology (KRIBB))
  • Kim, Seon-Young (Personalized Genomic Medicine Research Center, Korea Research Institute of Bioscience and Biotechnology (KRIBB))
  • Received : 2016.11.29
  • Accepted : 2017.02.02
  • Published : 2017.03.31

Abstract

Understanding the complex relationships among heterogeneous biological data is one of the fundamental goals of biology. In most cases, diverse biological data are stored in relational databases, such as MySQL and Oracle, which hold data in multiple tables and recover relationships through multi-join statements. Recently, a new type of database, the graph database, was developed to represent various kinds of complex relationships natively, and it is now widely used in the computer science community and the IT industry. Here, we demonstrate the feasibility of using a graph database for complex biological relationships by comparing the performance of MySQL with that of Neo4j, one of the most widely used graph databases. We collected various biological data (protein-protein interaction, drug-target, gene-disease, etc.) from several existing sources, removed duplicate and redundant data, and constructed a graph database containing 114,550 nodes and 82,674,321 relationships. When we compared the query execution performance of MySQL and Neo4j, Neo4j outperformed MySQL in all cases. While Neo4j responded very quickly to a variety of queries, MySQL gave delayed or unfinished responses for complex queries with multi-join statements. These results show that graph databases such as Neo4j are an efficient way to store complex biological relationships. Moreover, querying a graph database in diverse ways has the potential to reveal novel relationships among heterogeneous biological data.
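
The contrast between multi-join lookups and native relationship traversal can be illustrated with a minimal Python sketch. It is not the authors' schema or benchmark: the node names, relation names, and both helper functions are hypothetical, the standard-library `sqlite3` module stands in for MySQL, and a plain adjacency dictionary stands in for a graph store such as Neo4j. It only shows the shape of the two query styles for a three-hop question (drug, target gene, interacting gene, disease), and says nothing about performance.

```python
import sqlite3
from collections import defaultdict

# Hypothetical toy data as (source, relation, target) triples; not from the paper.
EDGES = [
    ("drugA", "TARGETS", "gene1"),
    ("drugB", "TARGETS", "gene2"),
    ("gene1", "INTERACTS_WITH", "gene2"),
    ("gene2", "ASSOCIATED_WITH", "diseaseX"),
    ("gene1", "ASSOCIATED_WITH", "diseaseY"),
]


def diseases_via_joins(drug):
    """Relational style: one self-join on the edge table per extra hop."""
    con = sqlite3.connect(":memory:")  # sqlite3 is a stand-in for MySQL here
    con.execute("CREATE TABLE edge (src TEXT, rel TEXT, dst TEXT)")
    con.executemany("INSERT INTO edge VALUES (?, ?, ?)", EDGES)
    rows = con.execute(
        """
        SELECT DISTINCT e3.dst
        FROM edge e1
        JOIN edge e2 ON e2.src = e1.dst AND e2.rel = 'INTERACTS_WITH'
        JOIN edge e3 ON e3.src = e2.dst AND e3.rel = 'ASSOCIATED_WITH'
        WHERE e1.src = ? AND e1.rel = 'TARGETS'
        """,
        (drug,),
    ).fetchall()
    con.close()
    return sorted(row[0] for row in rows)


def diseases_via_traversal(drug):
    """Graph style: follow adjacency lists hop by hop from the start node."""
    adjacency = defaultdict(list)
    for src, rel, dst in EDGES:
        adjacency[(src, rel)].append(dst)
    frontier = [drug]
    for rel in ("TARGETS", "INTERACTS_WITH", "ASSOCIATED_WITH"):
        # Each hop only touches the neighbours of the current frontier.
        frontier = [nxt for node in frontier for nxt in adjacency[(node, rel)]]
    return sorted(set(frontier))


if __name__ == "__main__":
    print(diseases_via_joins("drugA"))
    print(diseases_via_traversal("drugA"))
```

In a graph database the same question would typically be written as a single path pattern. In Cypher, for instance, it might look like `MATCH (d:Drug {name: 'drugA'})-[:TARGETS]->(:Gene)-[:INTERACTS_WITH]->(:Gene)-[:ASSOCIATED_WITH]->(x:Disease) RETURN DISTINCT x.name`, where the labels and property names are again assumptions made for illustration.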
