http://dx.doi.org/10.3837/tiis.2019.03.030

Towards Effective Entity Extraction of Scientific Documents using Discriminative Linguistic Features

Hwang, Sangwon (Department of Computer & Telecommunications Engineering, Yonsei University)
Hong, Jang-Eui (Department of Computer Science, Chungbuk National University)
Nam, Young-Kwang (Department of Computer & Telecommunications Engineering, Yonsei University)

Publication Information

KSII Transactions on Internet and Information Systems (TIIS) / v.13, no.3, 2019, pp. 1639-1658

Abstract

Named entity recognition (NER) is an important technique for improving the performance of data mining and big data analytics. Previous NER systems have identified named entities using statistical methods based on prior information or linguistic features; however, such methods cannot recognize unregistered or unlearned objects. This paper proposes a method that extracts objects such as technologies, theories, or person names by analyzing the collocation relationships among words that co-occur around specific words in the abstracts of academic journals. The method proceeds as follows. First, the data are preprocessed using data cleaning and sentence detection to separate the text into single sentences. Next, part-of-speech (POS) tagging is applied to each sentence. Then, setting aside the entity candidates themselves (such as nouns), the appearance and collocation information of the remaining POS tags is analyzed. Finally, an entity recognition model is built by analyzing and classifying this information in the sentences.
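The first three steps of the pipeline described in the abstract can be illustrated with a minimal Python sketch. This is not the authors' implementation. The toy lexicon `TOY_LEXICON`, the helper names `split_sentences`, `tag`, and `context_features`, the Penn-Treebank-style tag labels, and the window size of 2 are all assumptions made for illustration, and a real system would use a trained sentence detector and POS tagger. The final modeling step is left out, since the abstract does not specify the classifier.

```python
import re
from collections import Counter

# Hypothetical toy lexicon standing in for a real POS tagger (illustrative only).
TOY_LEXICON = {
    "we": "PRP", "propose": "VBP", "a": "DT", "the": "DT",
    "called": "VBN", "based": "VBN", "is": "VBZ", "on": "IN",
    "method": "NN",
}

def split_sentences(text):
    """Step 1: clean whitespace, then split after '.', '!' or '?'."""
    text = re.sub(r"\s+", " ", text).strip()
    return [s for s in re.split(r"(?<=[.!?])\s+", text) if s]

def tag(sentence):
    """Step 2: tag each token; unknown words default to NN (entity candidate)."""
    tokens = re.findall(r"[A-Za-z][A-Za-z\-]*", sentence)
    return [(tok, TOY_LEXICON.get(tok.lower(), "NN")) for tok in tokens]

def context_features(tagged, window=2):
    """Step 3: for each noun candidate, count the non-noun POS tags that
    appear within `window` tokens, keyed by relative position."""
    features = []
    for i, (word, pos) in enumerate(tagged):
        if not pos.startswith("NN"):
            continue  # only nouns are treated as entity candidates here
        lo, hi = max(0, i - window), min(len(tagged), i + window + 1)
        ctx = Counter(
            f"{j - i:+d}:{tagged[j][1]}"
            for j in range(lo, hi)
            if j != i and not tagged[j][1].startswith("NN")
        )
        features.append((word, ctx))
    return features

if __name__ == "__main__":
    text = "We propose a method called BERT.  The method is based on attention."
    for sentence in split_sentences(text):
        for word, ctx in context_features(tag(sentence)):
            print(word, dict(ctx))
```

Under these assumptions, the positional tag counts collected for each candidate could serve as input features to whichever classifier is chosen for the final step.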

Keywords

Named entity recognition; entity extraction; data mining; data cleaning; sentence segmentation; information extraction
