Clustered Segment Index for Efficient Approximate Searching on the Secondary Structure of Protein Sequences

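  • 서민구 (Department of Computer Science, Yonsei University)
  • 박상현 (Department of Computer Science, Yonsei University)
  • 원정임 (Department of Computer Science, Yonsei University)
  • Published: 2006.06.01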

Abstract

Homology searching on the primary structure (i.e., the amino acid sequence) of proteins is an essential operation for inferring the functions and evolutionary histories of genes and proteins. However, proteins that are distant in evolutionary history do not conserve their amino acid arrangements even though they preserve their structures, so distant homology can be inferred only through similarity searching on their spatial structures. In this paper, we therefore target the secondary structure of proteins, which represents their spatial structure, and propose an indexing scheme for efficient approximate searching that can be easily implemented on an RDBMS. By exploiting clustering and the concept of lookahead, the proposed scheme, the Clustered Segment Index (CSI), quickly processes three types of secondary structure queries: exact match, range match, and wildcard match. To evaluate the proposed method, we conducted extensive experiments on a set of actual protein sequences. In these experiments, CSI was up to 6.3 times faster than existing indexing methods for exact match, up to 3.3 times faster for range match, and up to 1.5 times faster for wildcard match.
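To make the three query types concrete, the following is a minimal sketch of how they might be evaluated over a run-length "segment" view of a secondary structure string. It is a naive linear scan for illustration only, not the CSI index itself, and everything in it is an assumption rather than something stated in the abstract: the one-letter codes `H`, `E`, and `L`, the `(type, lo, hi)` predicate form, the function names, and the simplification that a wildcard `?` matches exactly one segment of any type.

```python
from itertools import groupby

def to_segments(ss):
    """Run-length encode a secondary structure string into (type, length) segments."""
    return [(kind, len(list(run))) for kind, run in groupby(ss)]

def segment_matches(segment, predicate):
    """Check one segment against one hypothetical (type, lo, hi) predicate.

    A type of '?' is treated as a wildcard that accepts any segment type.
    """
    kind, length = segment
    q_kind, lo, hi = predicate
    return (q_kind == "?" or q_kind == kind) and lo <= length <= hi

def search(segments, query):
    """Return the segment offsets at which consecutive segments satisfy the query."""
    hits = []
    for start in range(len(segments) - len(query) + 1):
        window = segments[start:start + len(query)]
        if all(segment_matches(s, p) for s, p in zip(window, query)):
            hits.append(start)
    return hits

# Hypothetical sequence: 4 helix, 2 strand, 3 loop, 3 helix residues.
segments = to_segments("HHHHEELLLHHH")

# Exact match: every predicate fixes the length (lo == hi).
exact_hits = search(segments, [("H", 4, 4), ("E", 2, 2)])

# Range match: lengths may fall anywhere in [lo, hi].
range_hits = search(segments, [("H", 3, 5)])

# Wildcard match: '?' accepts any segment type within the length range.
wildcard_hits = search(segments, [("E", 1, 3), ("?", 1, 5), ("H", 3, 3)])
```

A scan like this examines every segment offset for every query. An index such as the one proposed in the paper aims to avoid that cost, which is why the sketch should be read only as a statement of the query semantics under the assumptions above.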
