
PubMiner: Machine Learning-based Text Mining for Biomedical Information Analysis  

Eom, Jae-Hong (Biointelligence Laboratory, School of Computer Science and Engineering, Seoul National University)
Zhang, Byoung-Tak (Biointelligence Laboratory, School of Computer Science and Engineering, Seoul National University)
Abstract
In this paper we introduce PubMiner, a machine learning-based text mining system for extracting biological information from the literature. PubMiner combines natural language processing with machine learning-based data mining to extract useful biological information, such as protein-protein interactions, from the large body of biomedical literature. The system recognizes biological terms such as genes, proteins, and enzymes, and extracts the interactions among them described in a document through natural language processing. The extracted interactions are then analyzed together with a set of features for each entity, collected from related public databases, to infer additional interactions from the original ones. Both the inferred interactions and the directly extracted interactions are presented to the user with links to their literature sources. The performance of entity and interaction extraction was tested on selected MEDLINE abstracts. The inference step was evaluated using protein interaction data for S. cerevisiae (baker's yeast) from MIPS and SGD.
Keywords
Biomedical Text Mining; Data Mining; Machine Learning; Software Application