http://dx.doi.org/10.13088/jiis.2019.25.1.043

Knowledge Extraction Methodology and Framework from Wikipedia Articles for Construction of Knowledge-Base  

Kim, JaeHun (Research Laboratory, LiST)
Lee, Myungjin (Research Laboratory, LiST)
Publication Information
Journal of Intelligence and Information Systems / v.25, no.1, 2019, pp. 43-61
Abstract
The development of artificial intelligence technologies has accelerated with the Fourth Industrial Revolution, and AI research has been actively conducted in a variety of fields such as autonomous vehicles, natural language processing, and robotics. Since the 1950s, this research has focused on solving cognitive problems related to human intelligence, such as learning and problem solving. Owing to recent interest in the technology and research on various algorithms, the field of artificial intelligence has achieved more technological advances than ever. The knowledge-based system is a sub-domain of artificial intelligence; it aims to enable artificial intelligence agents to make decisions by using machine-readable and processable knowledge constructed from complex and informal human knowledge and rules in various fields. A knowledge base is used to optimize information collection, organization, and retrieval, and it has recently been used together with statistical artificial intelligence such as machine learning. More recently, knowledge bases have also been used to express, publish, and share knowledge on the web by describing and connecting web resources such as pages and data. These knowledge bases are used for intelligent processing in various fields of artificial intelligence, such as the question answering systems of smart speakers. However, building a useful knowledge base is a time-consuming task that still requires considerable effort from experts. In recent years, much research and technology in knowledge-based artificial intelligence has used DBpedia, one of the largest knowledge bases, which aims to extract structured content from the various information in Wikipedia. DBpedia contains various information extracted from Wikipedia, such as titles, categories, and links, but the most useful knowledge comes from Wikipedia infoboxes, which are user-created summaries of some unifying aspect of an article.
This knowledge is created by mapping rules between infobox structures and the DBpedia ontology schema, defined in the DBpedia Extraction Framework. Because it generates knowledge from semi-structured infobox data created by users, DBpedia can expect high reliability in terms of the accuracy of its knowledge. However, since only about 50% of all wiki pages in Korean Wikipedia contain an infobox, DBpedia is limited in terms of knowledge scalability. This paper proposes a method to extract knowledge from text documents according to the ontology schema using machine learning. To demonstrate the appropriateness of this method, we describe a knowledge extraction model that follows the DBpedia ontology schema and is trained on Wikipedia infoboxes. Our knowledge extraction model consists of three steps: classification of documents into ontology classes, classification of the sentences suitable for extracting triples, and value selection and transformation into the RDF triple structure. The structure of Wikipedia infoboxes is defined by infobox templates that provide standardized information across related articles, and the DBpedia ontology schema can be mapped to these infobox templates. Based on these mapping relations, we classify the input document according to infobox categories, which correspond to ontology classes. After determining the class of the input document, we classify the appropriate sentences according to the attributes belonging to that class. Finally, we extract knowledge from the sentences classified as appropriate and convert it into triples. To train the models, we generated a training data set from the Wikipedia dump by adding BIO tags to sentences, and we trained models for about 200 classes and about 2,500 relations for extracting knowledge. Furthermore, we conducted comparative experiments with CRF and Bi-LSTM-CRF for the knowledge extraction process.
Through the proposed process, structured knowledge can be utilized by extracting knowledge from text documents according to the ontology schema. In addition, this methodology can significantly reduce the effort required of experts to construct instances according to the ontology schema.
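To make the BIO tagging and triple conversion described in the abstract concrete, the following is a minimal Python sketch and not the authors' implementation. It assumes whitespace-tokenized text and exact token matching against an infobox value, which is a simplification (Korean text would normally need morphological analysis). The function names `bio_tag` and `tags_to_triples`, the relation name `country`, and the example sentence are all hypothetical. In the paper, the tags for unseen sentences are predicted by a trained CRF or Bi-LSTM-CRF model rather than produced by string matching.

```python
from typing import List, Tuple


def bio_tag(tokens: List[str], value_tokens: List[str], relation: str) -> List[str]:
    """Build training labels: mark the first occurrence of an infobox value
    in a sentence with B-/I- tags for its relation; all other tokens get O."""
    tags = ["O"] * len(tokens)
    n = len(value_tokens)
    if n == 0:
        return tags
    for start in range(len(tokens) - n + 1):
        if tokens[start:start + n] == value_tokens:
            tags[start] = f"B-{relation}"
            for i in range(start + 1, start + n):
                tags[i] = f"I-{relation}"
            break
    return tags


def tags_to_triples(
    subject: str, tokens: List[str], tags: List[str]
) -> List[Tuple[str, str, str]]:
    """Convert a tagged sentence into (subject, relation, value) triples.
    The subject is assumed to be the entity the article describes."""
    triples: List[Tuple[str, str, str]] = []
    relation, span = None, []
    for token, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            if relation is not None:
                triples.append((subject, relation, " ".join(span)))
            relation, span = tag[2:], [token]
        elif tag.startswith("I-") and relation == tag[2:]:
            span.append(token)
        else:
            if relation is not None:
                triples.append((subject, relation, " ".join(span)))
            relation, span = None, []
    if relation is not None:
        triples.append((subject, relation, " ".join(span)))
    return triples


# Hypothetical example: the infobox of "Seoul" lists country = "South Korea".
tokens = "Seoul is the capital of South Korea".split()
tags = bio_tag(tokens, ["South", "Korea"], "country")
triples = tags_to_triples("Seoul", tokens, tags)
print(tags)
print(triples)
```

The same tag format serves both directions: `bio_tag` shows how infobox values could label sentences for training, and `tags_to_triples` shows how a tagged sentence could be turned into the subject, predicate, and object of an RDF-style triple.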
Keywords
Deep learning; Artificial Intelligence; Ontology; Knowledge base; Knowledge extraction