Prediction of Implicit Protein-Protein Interaction Using Optimal Associative Feature Rule

Prediction of Implicit Protein Interactions Using Optimal Associative Feature Rules

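  • Jae-Hong Eom (School of Electrical Engineering and Computer Science, Seoul National University) ;
  • Byoung-Tak Zhang (School of Electrical Engineering and Computer Science, Seoul National University)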
  • Published : 2006.04.01

Abstract

Proteins are known to perform their biological functions by interacting with other proteins or compounds. Because protein interaction is intrinsic to most cellular processes, predicting protein interactions is an important issue in post-genomic biology, where many research groups have produced abundant interaction data. In this paper, we present an associative feature mining method for predicting implicit protein-protein interactions of Saccharomyces cerevisiae from public protein interaction data. We discretized continuous-valued features using a maximal interdependence-based discretization approach. We also employed a feature dimension reduction filter (FDRF) method, based on information theory, to select optimal informative features, to boost prediction accuracy and overall mining speed, and to overcome the dimensionality problem of conventional data mining approaches. We used an association rule discovery algorithm for associative feature and rule mining to predict protein interactions. Using the discovered associative features, we predicted implicit protein interactions that had not been observed in the training data. In our experiments, the proposed method achieved about 96.5% prediction accuracy while reducing computation time: it ran about 29.4% faster than the conventional association rule mining method without a feature filter.
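The association rule discovery step mentioned in the abstract can be illustrated with a minimal sketch of the standard support/confidence framework. This is not the authors' implementation: the function name `mine_rules`, the thresholds, and the feature labels such as `A:kinase` are hypothetical, and the brute-force enumeration stands in for whatever mining algorithm the paper actually uses.

```python
from itertools import combinations


def mine_rules(transactions, min_support=0.5, min_confidence=0.8, max_len=3):
    """Brute-force association rule mining (support/confidence framework)."""
    n = len(transactions)
    sets = [frozenset(t) for t in transactions]
    items = sorted(set().union(*sets))

    def support(itemset):
        # Fraction of transactions that contain every item of the itemset.
        return sum(itemset <= t for t in sets) / n

    # Enumerate all frequent itemsets of up to max_len items.
    frequent = {}
    for k in range(1, max_len + 1):
        for combo in combinations(items, k):
            s = support(frozenset(combo))
            if s >= min_support:
                frequent[frozenset(combo)] = s

    # Split each frequent itemset into antecedent -> consequent.
    rules = []
    for itemset, s in frequent.items():
        for r in range(1, len(itemset)):
            for lhs in combinations(sorted(itemset), r):
                lhs = frozenset(lhs)
                # Every subset of a frequent itemset is itself frequent,
                # so frequent[lhs] is always present.
                conf = s / frequent[lhs]
                if conf >= min_confidence:
                    rules.append((lhs, itemset - lhs, s, conf))
    return rules


# Hypothetical toy data: each transaction lists the discretized features
# of a protein pair, plus "interacts" if the pair is known to interact.
transactions = [
    {"A:kinase", "B:nuclear", "interacts"},
    {"A:kinase", "B:nuclear", "interacts"},
    {"A:kinase", "B:cytoplasm"},
    {"A:transporter", "B:nuclear", "interacts"},
]

rules = mine_rules(transactions, min_support=0.5, min_confidence=0.8)
for lhs, rhs, s, c in rules:
    print(sorted(lhs), "->", sorted(rhs), f"support={s:.2f} confidence={c:.2f}")
```

A rule whose consequent is `interacts` can then be applied to a protein pair that is absent from the training data: if the pair's features match the antecedent, the rule predicts an implicit interaction.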

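Proteins are known to carry out biologically important functions by interacting with other proteins or by forming complexes. The analysis and prediction of protein interactions, which play a key role in most cellular processes, has therefore become another important issue in the current genomic era, in which many research groups are producing abundant data. In this paper, we present an associative feature mining method for predicting latent protein interactions that can be inferred through associations among features in the publicly available protein interaction data for yeast (Saccharomyces cerevisiae). Continuous-valued protein features were discretized on the basis of maximal interdependence, and an information-theoretic feature selection algorithm was used to overcome the feature dimensionality problem that arises as the number of protein attributes considered for interaction prediction grows. Associations among features were discovered with the association rule discovery method used in data mining. The proposed method achieved a prediction accuracy of up to about 96.5% when predicting protein interactions from the discovered association rules, and feature filtering sped up association rule discovery by up to about 29.4% compared with the conventional method without feature filtering.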
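The information-theoretic feature filter can be sketched in a similar spirit. The abstract does not specify the scoring function, so the example below assumes symmetrical uncertainty, a common normalized form of mutual information used by correlation-based filters. The names `symmetrical_uncertainty` and `filter_features`, the threshold, and the toy columns are all hypothetical.

```python
from collections import Counter
from math import log2


def entropy(values):
    """Shannon entropy (in bits) of a sequence of discrete values."""
    n = len(values)
    return -sum((c / n) * log2(c / n) for c in Counter(values).values())


def symmetrical_uncertainty(x, y):
    """SU(X, Y) = 2 * I(X; Y) / (H(X) + H(Y)), ranging from 0 to 1."""
    hx, hy = entropy(x), entropy(y)
    if hx + hy == 0:
        return 0.0  # both variables are constant
    hxy = entropy(list(zip(x, y)))  # joint entropy H(X, Y)
    mutual_info = hx + hy - hxy
    return 2.0 * mutual_info / (hx + hy)


def filter_features(columns, labels, threshold=0.1):
    """Keep features whose SU with the class label reaches the threshold."""
    scores = {name: symmetrical_uncertainty(col, labels)
              for name, col in columns.items()}
    ranked = sorted(scores.items(), key=lambda kv: -kv[1])
    kept = [name for name, s in ranked if s >= threshold]
    return kept, scores


# Hypothetical discretized features for four protein pairs.
labels = [1, 1, 0, 0]  # 1 = interacting pair, 0 = non-interacting pair
columns = {
    "informative": ["a", "a", "b", "b"],  # perfectly tracks the label
    "noise": ["x", "y", "x", "y"],        # independent of the label
}

kept, scores = filter_features(columns, labels)
print(kept, scores)
```

Dropping low-scoring features before rule mining shrinks the set of candidate itemsets, which is consistent with the speed-up the abstract attributes to feature filtering.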
