Integration of Heterogeneous Protein Databases Based on RDF(S) Models

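  • Kangpyo Lee (School of Computer Science and Engineering, Seoul National University)
  • Sangwon Yoo (School of Computer Science and Engineering, Seoul National University)
  • Hyoung-Joo Kim (School of Computer Science and Engineering, Seoul National University)
  • Published: 2008.04.15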

Abstract

In the biological domain, a variety of protein analysis databases exist, each attaching its own meaning to the same target: the protein. Integrating these scattered, heterogeneous data efficiently yields useful information that cannot be obtained from any single source. Because of the nature of biological data, each source has its own syntax and semantics. Describing the data with RDF(S) models, a Semantic Web standard, makes semantic as well as syntactic integration possible. In this paper, we propose a new integration layer based on an RDF unified schema. At the conceptual-model level, we construct a unified schema centered on protein information; at the representational-model level, we propose a technique by which wrappers gather the necessary information from the relevant databases and dynamically generate RDF instances. Two example integrated queries of the kind researchers need show that the integration layer processes such requests and displays the appropriate results.
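The wrapper idea in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the namespace, the property names (`interactsWith`, `locatedIn`), the two in-memory "sources", and every function name are hypothetical, chosen only to show how per-source wrappers might emit RDF triples on demand under one shared schema and how a query could then span both sources.

```python
from typing import Callable, Iterator, List, Tuple

Triple = Tuple[str, str, str]  # (subject URI, predicate URI, object URI)

# Hypothetical unified-schema vocabulary (not taken from the paper).
NS = "http://example.org/protein#"
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"
PROTEIN = NS + "Protein"
INTERACTS_WITH = NS + "interactsWith"
LOCATED_IN = NS + "locatedIn"

# Hypothetical stand-ins for two heterogeneous sources with different layouts.
INTERACTION_DB = [("P1", "P2"), ("P1", "P3")]  # rows of interacting pairs
LOCALIZATION_DB = {"P1": "nucleus", "P2": "cytoplasm", "P3": "nucleus"}


def interaction_wrapper(protein_id: str) -> Iterator[Triple]:
    """Map rows of the pair-based source to interactsWith triples."""
    for a, b in INTERACTION_DB:
        if protein_id in (a, b):
            partner = b if a == protein_id else a
            yield (NS + protein_id, INTERACTS_WITH, NS + partner)


def localization_wrapper(protein_id: str) -> Iterator[Triple]:
    """Map the key/value source to a locatedIn triple, if one is known."""
    compartment = LOCALIZATION_DB.get(protein_id)
    if compartment is not None:
        yield (NS + protein_id, LOCATED_IN, NS + compartment)


WRAPPERS: List[Callable[[str], Iterator[Triple]]] = [
    interaction_wrapper,
    localization_wrapper,
]


def describe(protein_id: str) -> List[Triple]:
    """Generate RDF instances for one protein on demand from every wrapper."""
    triples: List[Triple] = [(NS + protein_id, RDF_TYPE, PROTEIN)]
    for wrapper in WRAPPERS:
        triples.extend(wrapper(protein_id))
    return triples


def partners_in(protein_id: str, compartment: str) -> List[str]:
    """Integrated query: interaction partners located in a given compartment."""
    hits = []
    for _, predicate, obj in describe(protein_id):
        if predicate != INTERACTS_WITH:
            continue
        partner_id = obj[len(NS):]
        if (obj, LOCATED_IN, NS + compartment) in describe(partner_id):
            hits.append(partner_id)
    return hits


def to_ntriples(triples: List[Triple]) -> str:
    """Serialize URI-only triples in N-Triples syntax."""
    return "\n".join(f"<{s}> <{p}> <{o}> ." for s, p, o in triples)


if __name__ == "__main__":
    print(to_ntriples(describe("P1")))
    print(partners_in("P1", "nucleus"))
```

In this sketch neither source can answer `partners_in("P1", "nucleus")` alone; the answer only emerges once both are viewed through the same schema, which is the kind of integrated request the abstract describes. A real system would more likely store or stream the triples through an RDF framework and query them with an RDF query language rather than Python loops.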
