Protein Disorder/Order Region Classification Using EPs-TFP Mining Method

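  • 이헌규 (Heon Gyu Lee), Convergence Technology Research Division, Electronics and Telecommunications Research Institute (ETRI)
  • 신용호 (Yong Ho Shin), School of Business, Yeungnam University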
  • Received : 2012.09.25
  • Accepted : 2012.11.30
  • Published : 2012.12.31

Abstract

A protein exhibits its specific function when a disorder region of its sequence transitions to an ordered state through a biological reaction, so separating disorder regions from order regions in sequence data is essential for predicting the three-dimensional structure and characteristics of the protein. To classify disorder and order regions efficiently, this paper proposes a sequence-based classification/prediction method that yields results not biased toward any specific protein feature while improving classification speed. The emerging-pattern-based EPs-TFP method uses only essential emerging patterns, that is, the emerging patterns that remain after redundant ones are removed. It finds the sequence patterns of the disorder region: patterns that occur frequently in disorder regions but relatively infrequently in order regions. To improve the performance of the proposed algorithm, we extend the TFP method, which is built on the P-tree and T-tree concepts, into a classification/prediction method. We evaluated the EPs-TFP technique on DisProt 4.9 and CASP 7 data; the disorder/order classification results show a sensitivity of 73.6, a specificity of 69.51, and an accuracy of 74.2.
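The idea of an emerging pattern described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' EPs-TFP implementation: it counts substrings by brute force instead of using the P-tree and T-tree structures of TFP. The function names, the thresholds (`min_support`, `min_growth`), the substring-based redundancy rule, and the toy segments are all assumptions made for illustration only.

```python
from collections import Counter


def kmers(seq, max_len):
    """Return the distinct substrings of seq with length 1..max_len."""
    return {seq[i:i + k]
            for k in range(1, max_len + 1)
            for i in range(len(seq) - k + 1)}


def support(segments, max_len):
    """Fraction of segments that contain each substring at least once."""
    counts = Counter()
    for seg in segments:
        counts.update(kmers(seg, max_len))
    n = len(segments)
    return {p: c / n for p, c in counts.items()}


def emerging_patterns(disorder, order, max_len=3, min_support=0.5, min_growth=2.0):
    """Patterns frequent in `disorder` whose support is at least
    `min_growth` times their support in `order` (growth rate)."""
    sup_d = support(disorder, max_len)
    sup_o = support(order, max_len)
    eps = {}
    for p, sd in sup_d.items():
        if sd < min_support:
            continue
        so = sup_o.get(p, 0.0)
        growth = float("inf") if so == 0 else sd / so
        if growth >= min_growth:
            eps[p] = growth
    return eps


def essential(eps):
    """Assumed redundancy rule: drop a pattern if a shorter emerging
    pattern already occurs inside it, keeping only minimal patterns."""
    return {p: g for p, g in eps.items()
            if not any(q != p and q in p for q in eps)}


# Toy, hand-made segments (not taken from DisProt or CASP).
disorder_segments = ["SPEEKS", "PEESGS", "GSPEEK"]
order_segments = ["VLIAVL", "ILVPAF", "LVEAIL"]

eps = emerging_patterns(disorder_segments, order_segments, min_growth=4.0)
minimal = essential(eps)
print(sorted(minimal))
```

On this toy input the single residues P and E fall below the growth threshold because they also appear in the order segments, while the pairs PE and EE survive. Longer patterns such as PEE or SPE are discarded as redundant because a shorter emerging pattern already occurs inside them.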
