
PubMiner: Machine Learning-based Text Mining for Biomedical Information Analysis

Eom, Jae-Hong (Biointelligence Laboratory, School of Computer Science and Engineering, Seoul National University)
Zhang, Byoung-Tak (Biointelligence Laboratory, School of Computer Science and Engineering, Seoul National University)

Abstract

This paper introduces PubMiner, a machine learning-based text mining system for extracting biological information from the literature. PubMiner combines natural language processing with machine learning-based data mining to extract useful biological information, such as protein-protein interactions, from the large body of biomedical literature. The system recognizes biological terms such as genes, proteins, and enzymes, and uses natural language processing to extract the interactions between them that are described in a document. The extracted interactions are then analyzed together with a set of features for each entity, collected from related public databases, to infer additional interactions from the original ones. Both the inferred interactions and the directly extracted interactions are presented to the user with links to their literature sources. The performance of entity and interaction extraction was tested on selected MEDLINE abstracts. Inference was evaluated using protein interaction data for S. cerevisiae (baker's yeast) from MIPS and SGD.
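
The term recognition and interaction extraction steps described above can be illustrated with a minimal Python sketch. It is not PubMiner's implementation: it assumes a simple dictionary lookup for entity recognition and a verb-between-entities pattern for interactions, and the lexicon, the verb list, and the function names are hypothetical and chosen only for illustration.

```python
import re
from itertools import combinations

# Hypothetical toy resources for illustration; not PubMiner's actual lexicon.
ENTITY_LEXICON = {"Cdc28", "Cln2", "Sic1", "Clb5"}
INTERACTION_VERBS = {"binds", "phosphorylates", "inhibits", "activates"}


def split_sentences(text):
    """Naive sentence splitter: break after '.', '!' or '?' followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def extract_interactions(text):
    """Return (entity_a, verb, entity_b) triples found within single sentences."""
    triples = []
    for sentence in split_sentences(text):
        tokens = re.findall(r"[A-Za-z0-9]+", sentence)
        # Dictionary-based entity recognition (assumed, simplified).
        entities = [(i, t) for i, t in enumerate(tokens) if t in ENTITY_LEXICON]
        verbs = [(i, t.lower()) for i, t in enumerate(tokens)
                 if t.lower() in INTERACTION_VERBS]
        for (i, a), (j, b) in combinations(entities, 2):
            # Keep a pair only if an interaction verb lies between the two mentions.
            for k, verb in verbs:
                if i < k < j:
                    triples.append((a, verb, b))
    return triples


if __name__ == "__main__":
    sample = ("Cdc28 binds Cln2 in G1. "
              "Sic1 inhibits Clb5 until it is degraded. "
              "Cln2 is a cyclin.")
    for triple in extract_interactions(sample):
        print(triple)
```

A real system would replace the lexicon lookup with a trained named-entity recognizer and the verb pattern with syntactic analysis; the sketch only shows how entity mentions and interaction keywords combine into candidate interaction triples.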

Keywords

Biomedical Text Mining; Data Mining; Machine Learning; Software Application
