PubMiner: Machine Learning-based Text Mining for Biomedical Information Analysis

Eom, Jae-Hong; Zhang, Byoung-Tak

Genomics & Informatics

Volume 2 Issue 2 / Pages 99-106 / 2004 / 1598-866X (pISSN) / 2234-0742 (eISSN)

Korea Genome Organization (한국유전체학회)

Eom, Jae-Hong (Biointelligence Laboratory, School of Computer Science and Engineering, Seoul National University) ;
Zhang, Byoung-Tak (Biointelligence Laboratory, School of Computer Science and Engineering, Seoul National University)

Published : 2004.06.01

Abstract

In this paper we introduce PubMiner, an intelligent machine learning-based text mining system for mining biological information from the literature. PubMiner employs natural language processing and machine learning-based data mining techniques to mine useful biological information, such as protein-protein interactions, from the large body of literature. The system recognizes biological terms such as genes, proteins, and enzymes, and extracts the interactions between them that are described in a document through natural language processing. The extracted interactions are further analyzed with a set of features of each entity, collected from related public databases, to infer additional interactions from the original ones. Both the inferred interactions and the original (native) interactions are provided to the user with links to their literature sources. The performance of entity and interaction extraction was tested on selected MEDLINE abstracts. The inference step was evaluated using protein interaction data for S. cerevisiae (baker's yeast) from MIPS and SGD.
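
The entity recognition and interaction extraction steps described above can be illustrated with a minimal sketch. This is not PubMiner's implementation: the abstract describes learned recognizers and natural language processing, whereas the sketch below swaps in a simple dictionary lookup and a "verb between two entities" pattern. The lexicons `PROTEINS` and `INTERACTION_VERBS`, the function names, and the sample text are all hypothetical and chosen only for illustration.

```python
import re
from itertools import combinations

# Hypothetical toy lexicons (assumptions, not from the paper); the system
# described in the abstract recognizes terms with machine learning instead.
PROTEINS = {"Cdc28", "Cln2", "Sic1"}
INTERACTION_VERBS = {"binds", "phosphorylates", "inhibits", "activates"}


def split_sentences(text):
    """Naive splitter: break after '.', '!' or '?' followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def extract_interactions(text):
    """Return (entity_a, verb, entity_b) triples found within each sentence."""
    triples = []
    for sentence in split_sentences(text):
        tokens = re.findall(r"[A-Za-z0-9]+", sentence)
        # Dictionary-based entity recognition: keep tokens listed in PROTEINS.
        mentions = [(i, t) for i, t in enumerate(tokens) if t in PROTEINS]
        for (i, a), (j, b) in combinations(mentions, 2):
            # Accept the pair only if an interaction verb lies between them.
            for token in tokens[i + 1 : j]:
                if token in INTERACTION_VERBS:
                    triples.append((a, token, b))
                    break
    return triples


sample = "Cdc28 binds Cln2 in late G1. Sic1 inhibits Cdc28 until it is degraded."
print(extract_interactions(sample))
# [('Cdc28', 'binds', 'Cln2'), ('Sic1', 'inhibits', 'Cdc28')]
```

A pattern this simple misses passive constructions and negation, and its sentence splitter would break on abbreviations such as "S. cerevisiae", which is one reason a system of this kind relies on fuller language processing.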

Keywords

References

Agrawal, R., Imieliński, T., and Swami, A. (1993). Mining association rules between sets of items in large databases. In Proceedings of the 1993 ACM SIGMOD International Conference on Management of Data, 207-216
Andrade, M.A. and Valencia, A. (1998). Automatic extraction of keywords from scientific text: application to the knowledge domain of protein families. Bioinformatics 14(7), 600-607 https://doi.org/10.1093/bioinformatics/14.7.600
Andrade, M.A. and Bork, P. (2000). Automated extraction of information in molecular biology. FEBS Letters 476, 12-17 https://doi.org/10.1016/S0014-5793(00)01661-6
Blaschke, C., Andrade, M.A., Ouzounis, C., and Valencia, A. (1999). Automatic extraction of biological information from scientific text: protein-protein interactions. In Proc. Int. Conf. Intell. Syst. Mol. Biol., 60-67
Blaschke, C. and Valencia, A. (2002). The frame-based module of the SUISEKI information extraction system. IEEE Intelligent Systems 17(2), 14-20
Chiang, J.H., Yu, H.C., and Hsu, H.J. (2004). GIS: a biomedical text-mining system for gene information discovery. Bioinformatics 20(1), 120-121 https://doi.org/10.1093/bioinformatics/btg369
Christie, K.R., Weng, S., Balakrishnan, R., Costanzo, M.C., Dolinski, K., Dwight, S.S., Engel, S.R., Feierbach, B., Fisk, D.G., Hirschman, J.E., Hong, E.L., Issel-Tarver, L., Nash, R., Sethuraman, A., Starr, B., Theesfeld, C.L., Andrada, R., Binkley, G., Dong, Q., Lane, C., Schroeder, M., Botstein, D., and Cherry, J.M. (2004). Saccharomyces Genome Database (SGD) provides tools to identify and analyze sequences from Saccharomyces cerevisiae and related sequences from other organisms. Nucleic Acids Res. 32(1), D311-D314 https://doi.org/10.1093/nar/gkh033
Daraselia, N., Yuryev, A., Egorov, S., Novichkova, S., Nikitin, A., and Mazo, I. (2004). Extracting human protein interactions from MEDLINE using a full-sentence parser. Bioinformatics 20(5), 604-611 https://doi.org/10.1093/bioinformatics/btg452
Friedman, C., Kra, P., Yu, H., Krauthammer, M., and Rzhetsky, A. (2001). GENIES: a natural-language processing system for the extraction of molecular pathways from journal articles. Bioinformatics 17(Suppl. 1), S74-S82 https://doi.org/10.1093/bioinformatics/17.suppl_1.S74
Humphreys, B.L., Lindberg, D.A., Schoolman, H.M., and Barnett, G.O. (1998). The Unified Medical Language System: an informatics research collaboration. J. Am. Med. Inform. Assoc. 5(1), 1-11 https://doi.org/10.1136/jamia.1998.0050001
Hwang, Y.S., Chung, H.J., and Rim, H.C. (2003). Weighted Probabilistic Sum Model based on Decision Tree Decomposition for Text Chunking. Journal of Computer Processing of Oriental Languages 16(1), 1-20 https://doi.org/10.1142/S0219427903000796
Kim, J.D., Ohta, T., Tateisi, Y., and Tsujii, J. (2003). GENIA corpus - semantically annotated corpus for bio-textmining. Bioinformatics 19(Suppl. 1), i180-i182 https://doi.org/10.1093/bioinformatics/btg1023
Lee, K.J., Hwang, Y.S., and Rim, H.C. (2003). Two-Phase Biomedical NE Recognition based on SVMs. In Proc. of ACL 2003 Workshop on Natural Language Processing in Biomedicine, 33-40
Mewes, H.W., Amid, C., Arnold, R., Frishman, D., Guldener, U., Mannhaupt, G., Munsterkotter, M., Pagel, P., Strack, N., Stumpflen, V., Warfsmann, J., and Ruepp, A. (2004). MIPS: analysis and annotation of proteins from whole genomes. Nucleic Acids Res. 32(1), D41-D44 https://doi.org/10.1093/nar/gkh092
Oyama, T., Kitano, K., Satou, K., and Ito, T. (2002). Extraction of knowledge on protein-protein interaction by association rule discovery. Bioinformatics 18, 705-714 https://doi.org/10.1093/bioinformatics/18.5.705
Perez-Iratxeta, C., Bork, P., and Andrade, M.A. (2000). XplorMed: a tool for exploring MEDLINE abstracts. Trends Biochem Sci. 26, 573-575 https://doi.org/10.1016/S0968-0004(01)01926-0
Press, W.H., Flannery, B.P., Teukolsky, S.A., and Vetterling, W.T. (1998). Numerical recipes in C (Cambridge: Cambridge University Press)
Quinlan, J.R. (1993). C4.5: Programs for machine learning (San Francisco: Morgan Kaufmann Publishers Inc.).
Safran, M., Chalifa-Caspi, V., Shmueli, O., Olender, T., Lapidot, M., Rosen, N., Shmoish, M., Peter, Y., Glusman, G., Feldmesser, E., Adato, A., Peter, I., Khen, M., Atarot, T., Groner, Y., and Lancet, D. (2003). Human gene-centric databases at the Weizmann Institute of Science: GeneCards, UDB, CroW 21 and HORDE. Nucleic Acids Res. 31(1), 142-146 https://doi.org/10.1093/nar/gkg050
Slonim, N. and Tishby, N. (2000). Document clustering using word clusters via the information bottleneck method. In Proceedings of the 23rd annual international ACM SIGIR conference on Research and development in information retrieval, 208-215
Stapley, B.J. and Benoit, G. (2000). Biobibliometrics: information retrieval and visualization from co-occurrences of gene names in Medline abstracts. In Proc. of Pac. Symp. Biocomput., 529-540
Tanabe, L., Scherf, U., Smith, L.H., Lee, J.K., Hunter, L., and Weinstein, J.N. (1999). MedMiner: an internet text-mining tool for biomedical information, with application to gene expression profiling. BioTechniques 27, 1210-1217
Yu, L. and Liu, H. (2003). Feature selection for high dimensional data: a fast correlation-based filter solution. In Proceedings of the 20th International Conference on Machine Learning, 856-863
