PubMiner: Machine Learning-based Text Mining for Biomedical Information Analysis

  • Eom, Jae-Hong (Biointelligence Laboratory, School of Computer Science and Engineering, Seoul National University)
  • Zhang, Byoung-Tak (Biointelligence Laboratory, School of Computer Science and Engineering, Seoul National University)
  • Published : 2004.06.01

Abstract

In this paper we introduce PubMiner, an intelligent machine learning-based text mining system for extracting biological information from the literature. PubMiner employs natural language processing and machine learning-based data mining techniques to mine useful biological information, such as protein-protein interactions, from the massive body of literature. The system recognizes biological terms such as genes, proteins, and enzymes and extracts the interactions among them described in a document through natural language processing. The extracted interactions are further analyzed with a set of features of each entity, collected from related public databases, to infer additional interactions from the original ones. Both the inferred interactions and the originally extracted interactions are provided to the user with links to the literature sources. The performance of entity and interaction extraction was tested on selected MEDLINE abstracts. The inference step was evaluated using protein interaction data for S. cerevisiae (baker's yeast) from MIPS and SGD.
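
As a rough illustration of the entity recognition and interaction extraction steps described in the abstract, the following is a minimal sketch only. It is not PubMiner's implementation: it swaps in a plain lexicon lookup and verb-stem matching where the paper uses machine learning-based recognition, and the lexicon `PROTEIN_LEXICON`, the verb list `INTERACTION_VERBS`, and all function names are hypothetical.

```python
import re
from itertools import combinations

# Hypothetical toy lexicon (assumption); a real system would use a
# trained named-entity recognizer rather than a fixed name list.
PROTEIN_LEXICON = {"Cdc28", "Cln2", "Sic1", "Clb5"}

# Hypothetical interaction verb stems (assumption).
INTERACTION_VERBS = ("bind", "phosphorylate", "interact", "inhibit", "activate")


def split_sentences(text):
    """Naive sentence splitter: break after '.', '!' or '?' followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def find_entities(sentence):
    """Return lexicon entries found in the sentence, in order of first appearance."""
    found = []
    for token in re.findall(r"[A-Za-z0-9]+", sentence):
        if token in PROTEIN_LEXICON and token not in found:
            found.append(token)
    return found


def find_verb(sentence):
    """Return the first interaction verb stem that starts a word in the sentence."""
    lowered = sentence.lower()
    for stem in INTERACTION_VERBS:
        if re.search(r"\b" + stem, lowered):
            return stem
    return None


def extract_interactions(text):
    """Emit (entity, verb_stem, entity) triples for sentences that contain
    an interaction verb and at least two recognized entities."""
    triples = []
    for sentence in split_sentences(text):
        verb = find_verb(sentence)
        if verb is None:
            continue
        for a, b in combinations(find_entities(sentence), 2):
            triples.append((a, verb, b))
    return triples


if __name__ == "__main__":
    abstract = (
        "Cdc28 binds Cln2 in late G1. "
        "Sic1 inhibits the Clb5 complex. "
        "Cdc28 is a kinase."
    )
    for triple in extract_interactions(abstract):
        print(triple)
```

The third example sentence yields no triple because it contains neither an interaction verb nor a second entity. The sketch does not cover the inference of additional interactions from database features.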

References

  1. Agrawal, R., Imieliński, T., and Swami, A. (1993). Mining association rules between sets of items in large databases. In Proceedings of the 1993 ACM SIGMOD international conference on Management of data, 207-216
  2. Andrade, M.A. and Valencia, A. (1998). Automatic extraction of keywords from scientific text: application to the knowledge domain of protein families. Bioinformatics 14(7), 600-607 https://doi.org/10.1093/bioinformatics/14.7.600
  3. Andrade, M.A. and Bork, P. (2000). Automated extraction of information in molecular biology. FEBS Letters 476, 12-17 https://doi.org/10.1016/S0014-5793(00)01661-6
  4. Blaschke, C., Andrade, M.A., Ouzounis, C., and Valencia, A. (1999). Automatic extraction of biological information from scientific text: protein-protein interactions. In Proc. Int. Conf. Intell. Syst. Mol. Biol., 60-67
  5. Blaschke, C. and Valencia, A. (2002). The frame-based module of the SUISEKI information extraction system. IEEE Intelligent Systems 17(2), 14-20
  6. Chiang, J.H., Yu, H.C., and Hsu, H.J. (2004). GIS: a biomedical text-mining system for gene information discovery. Bioinformatics 20(1), 120-121 https://doi.org/10.1093/bioinformatics/btg369
  7. Christie, K.R., Weng, S., Balakrishnan, R., Costanzo, M.C., Dolinski, K., Dwight, S.S., Engel, S.R., Feierbach, B., Fisk, D.G., Hirschman, J.E., Hong, E.L., Issel-Tarver, L., Nash, R., Sethuraman, A., Starr, B., Theesfeld, C.L., Andrada, R., Binkley, G., Dong, Q., Lane, C., Schroeder, M., Botstein, D., and Cherry, J.M. (2004). Saccharomyces Genome Database (SGD) provides tools to identify and analyze sequences from Saccharomyces cerevisiae and related sequences from other organisms. Nucleic Acids Res. 32(1), D311-D314 https://doi.org/10.1093/nar/gkh033
  8. Daraselia, N., Yuryev, A., Egorov, S., Novichkova, S., Nikitin, A., and Mazo, I. (2004). Extracting human protein interactions from MEDLINE using a full-sentence parser. Bioinformatics 20(5), 604-611 https://doi.org/10.1093/bioinformatics/btg452
  9. Friedman, C., Kra, P., Yu, H., Krauthammer, M., and Rzhetsky, A. (2001). GENIES: a natural-language processing system for the extraction of molecular pathways from journal articles. Bioinformatics 17(Suppl. 1), S74-S82 https://doi.org/10.1093/bioinformatics/17.suppl_1.S74
  10. Humphreys, B.L., Lindberg, D.A., Schoolman, H.M., and Barnett, G.O. (1998). The Unified Medical Language System: an informatics research collaboration. J. Am. Med. Inform. Assoc. 5(1), 1-11 https://doi.org/10.1136/jamia.1998.0050001
  11. Hwang, Y.S., Chung, H.J., and Rim, H.C. (2003). Weighted Probabilistic Sum Model based on Decision Tree Decomposition for Text Chunking. Journal of Computer Processing of Oriental Languages 16(1), 1-20 https://doi.org/10.1142/S0219427903000796
  12. Kim, J.D., Ohta, T., Tateisi, Y., and Tsujii, J. (2003). GENIA corpus - semantically annotated corpus for bio-textmining. Bioinformatics 19(Suppl. 1), i180-i182 https://doi.org/10.1093/bioinformatics/btg1023
  13. Lee, K.J., Hwang, Y.S., and Rim, H.C. (2003). Two-Phase Biomedical NE Recognition based on SVMs. In Proc. of ACL 2003 Workshop on Natural Language Processing in Biomedicine, 33-40
  14. Mewes, H.W., Amid, C., Arnold, R., Frishman, D., Guldener, U., Mannhaupt, G., Munsterkotter, M., Pagel, P., Strack, N., Stumpflen, V., Warfsmann, J., and Ruepp, A. (2004). MIPS: analysis and annotation of proteins from whole genomes. Nucleic Acids Res. 32(1), D41-D44 https://doi.org/10.1093/nar/gkh092
  15. Oyama, T., Kitano, K., Satou, K., and Ito, T. (2002). Extraction of knowledge on protein-protein interaction by association rule discovery. Bioinformatics 18, 705-714 https://doi.org/10.1093/bioinformatics/18.5.705
  16. Perez-Iratxeta, C., Bork, P., and Andrade, M.A. (2000). XplorMed: a tool for exploring MEDLINE abstracts. Trends Biochem Sci. 26, 573-575 https://doi.org/10.1016/S0968-0004(01)01926-0
  17. Press, W.H., Flannery, B.P., Teukolsky, S.A., and Vetterling, W.T. (1998). Numerical recipes in C (Cambridge: Cambridge University Press)
  18. Quinlan, J.R. (1993). C4.5: Programs for machine learning (San Francisco: Morgan Kaufmann Publishers Inc.).
  19. Safran, M., Chalifa-Caspi, V., Shmueli, O., Olender, T., Lapidot, M., Rosen, N., Shmoish, M., Peter, Y., Glusman, G., Feldmesser, E., Adato, A., Peter, I., Khen, M., Atarot, T., Groner, Y., and Lancet, D. (2003). Human gene-centric databases at the Weizmann institute of science: GeneCards, UDB, CroW 21 and HORDE. Nucleic Acids Res. 31(1), 142-146 https://doi.org/10.1093/nar/gkg050
  20. Slonim, N. and Tishby, N. (2000). Document clustering using word clusters via the information bottleneck method. In Proceedings of the 23rd annual international ACM SIGIR conference on Research and development in information retrieval, 208-215
  21. Stapley, B.J. and Benoit, G. (2000). Biobibliometrics: information retrieval and visualization from co-occurrences of gene names in Medline abstracts. In Proc. of Pac. Symp. Biocomput., 529-540
  22. Tanabe, L., Scherf, U., Smith, L.H., Lee, J.K., Hunter, L., and Weinstein, J.N. (1999). MedMiner: an internet text-mining tool for biomedical information, with application to gene expression profiling. BioTechniques 27, 1210-1217
  23. Yu, L. and Liu, H. (2003). Feature selection for high dimensional data: a fast correlation-based filter solution. In Proceedings of the 20th International Conference on Machine Learning, 856-863