Semi-Automatic Ontology Construction from HTML Documents: A conversion of Text-formed Information into OWL 2

  • Received : 2016.04.28
  • Accepted : 2016.05.23
  • Published : 2016.06.28

Abstract

Ontology is known to be one of the most important technologies for achieving the semantic web. It is critical because it represents knowledge in a machine-readable form. The World Wide Web Consortium (W3C) has been contributing to the development of ontology for the last several years. However, the W3C recommendations leave out HTML despite the massive amount of information it contains. Also, keeping up with all the relevant technologies is difficult and time consuming, especially when constructing an ontology. Thus, we propose a module and methods that reuse HTML documents, extract the necessary information from HTML tags, and map it to OWL 2. We combine two approaches: structural refinement for building an ontology skeleton, and a linguistic approach for adding detailed information to that skeleton.

1. INTRODUCTION

The concept of the semantic web is to develop the web into an intelligent space where all information and knowledge is machine readable and refinable. Among the technologies recommended by the World Wide Web Consortium (W3C), ontology is one of the most important for representing and sharing knowledge [1].

There are several well-established upper ontologies, such as [2] and [3], which allow domain information to be linked and shared through formal ontologies. However, constructing a domain ontology is difficult and time consuming, even when working with a domain expert. Moreover, most research on automatic ontology construction is based on fundamental technologies such as XML, RDF(S), and OWL, while HTML is left aside despite the abundant information it contains [10]. Thus, research on making better use of the information in HTML documents is needed.

The move to HTML5 has given HTML documents more semantic meaning. However, the former version, HTML 4.01, is still widely used, and its predominance makes it harder to turn the web into an intelligent space. Thus, we propose a method that mainly uses the sequences of list tags that consist of
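
As a rough illustration of the structural step, the sketch below reads nested list markup and emits an OWL 2 class skeleton in functional-style syntax. It is not the authors' module. The mapping of list nesting to SubClassOf axioms, the helper names (`extract_pairs`, `to_owl_functional`, `to_class_name`), the sample markup, and the `http://example.org/onto#` IRI are all assumptions made for illustration, and the parser assumes every `<li>` element is explicitly closed, which HTML 4.01 does not require.

```python
import re
from html.parser import HTMLParser

# Hypothetical sample markup; not taken from the paper.
SAMPLE_HTML = """
<ul>
  <li>Vehicle
    <ul>
      <li>Car</li>
      <li>Truck</li>
    </ul>
  </li>
  <li>Person</li>
</ul>
"""


class ListSkeletonParser(HTMLParser):
    """Collect (parent_label, child_label) pairs from nested <ul>/<ol> lists."""

    def __init__(self):
        super().__init__()
        self.stack = []        # labels of the <li> items currently open
        self.pairs = []        # parent_label is None for top-level items
        self._expect_label = False

    def handle_starttag(self, tag, attrs):
        if tag == "li":
            self.stack.append(None)      # label is filled in by handle_data
            self._expect_label = True
        elif tag in ("ul", "ol"):
            self._expect_label = False   # text after a nested list is not a label

    def handle_endtag(self, tag):
        if tag == "li" and self.stack:
            self.stack.pop()
            self._expect_label = False

    def handle_data(self, data):
        text = data.strip()
        if text and self._expect_label and self.stack and self.stack[-1] is None:
            self.stack[-1] = text
            # Nearest enclosing item that has a label, if any.
            parent = next((s for s in reversed(self.stack[:-1]) if s), None)
            self.pairs.append((parent, text))


def extract_pairs(html):
    parser = ListSkeletonParser()
    parser.feed(html)
    return parser.pairs


def to_class_name(label):
    # "sports car" -> "SportsCar"; keeps existing capitals intact.
    words = re.findall(r"[A-Za-z0-9]+", label)
    return "".join(w[:1].upper() + w[1:] for w in words)


def to_owl_functional(pairs, base_iri="http://example.org/onto#"):
    # Assumption: a nested list item is read as a subclass of its parent item.
    lines = [f"Prefix(:=<{base_iri}>)", f"Ontology(<{base_iri.rstrip('#')}>"]
    for parent, child in pairs:
        lines.append(f"  Declaration(Class(:{to_class_name(child)}))")
        if parent is not None:
            lines.append(
                f"  SubClassOf(:{to_class_name(child)} :{to_class_name(parent)})"
            )
    lines.append(")")
    return "\n".join(lines)


if __name__ == "__main__":
    print(to_owl_functional(extract_pairs(SAMPLE_HTML)))
```

Whether list nesting really expresses a subclass relation depends on the page. A nested list may just as well encode part-of or membership relations, so a skeleton produced this way would still need the linguistic refinement described in the abstract.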

References

  1. Thomas R. GRUBER, “Toward principles for the design of ontologies used for knowledge sharing?,” International Journal of Human-Computer Studies, vol. 43, no. 5, 1995, pp. 907-928. https://doi.org/10.1006/ijhc.1995.1081
  2. Basic Formal Ontology, Overview, 2014. Online, http://infomis.uni-saarland.de/bfo/overview - [Last accessed Jul. 14, 2015]
  3. Suggested Upper Merged Ontology, Home, 2015. Online, http://www.adampease.org/OP/index.html - [Last accessed Jul. 14, 2015]
  4. Hyoun-Soo KWAK, Su-Kayoung Kim, Yeong-Geun Kim, and Kee-Hong Ann, “A Conversion System of HTML Document into OWL Ontology Language,” Korean journal Information Processing Society, vol. 11, no. 2, 2004, pp. 539-542.
  5. Taimao SUN, Yiyeon YOON, Wooju KIM, “A Conversion from HTML5 to OWL Ontology,” Journal of Society for e-Business Studies, vol. 18, no. 3, 2013. https://doi.org/10.7838/jsebs.2013.18.3.143
  6. Jsoup: Java HTML Parser, 2015. Online, http://jsoup.org/ - [Last accessed Aug. 25, 2015]
  7. TOUTANOVA, Kristina, et al., "Feature-rich part-of-speech tagging with a cyclic dependency network," In: Proceedings of the 2003 Conference of the North American Chapter of the Association for Computational Linguistics on Human Language Technology-Volume 1, Association for Computational Linguistics, 2003, pp. 173-180.
  8. Luciano DEL CORRO and Rainer GEMULLA, "Clausie: clause-based open information extraction," In: Proceedings of the 22nd international conference on World Wide Web, International World Wide Web Conferences Steering Committee, 2013, pp. 355-366.
  9. Gabor ANGELI, Melvin Johnson PREMKUMAR, and Christopher D. MANNING, "Leveraging Linguistic Structure for Open Domain Information Extraction," In: Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing of the Asian Federation of Natural Language Processing, ACL, 2015, pp. 26-31.
  10. Hoon HWANGBO and Hongchul LEE, "Reusing of information constructed in HTML documents: A conversion of HTML into OWL," In: Control, Automation and Systems, ICCAS 2008, International Conference on. IEEE, 2008, pp. 871-875.
  11. Uta PRISS, “Formal concept analysis in information science,” Arist, vol. 40, no. 1, 2006, pp. 521-543.
  12. Lauren WOOD, et al., Document Object Model (DOM) Level 3 Core Specification, 2000.
  13. Saikat MUKHERJEE, et al., "Automatic discovery of semantic structures in html documents," In: Proceedings of the Seventh International Conference on Document Analysis and Recognition-Volume 1, IEEE Computer Society, 2003, p. 245.
  14. Min-Gu Kim, "An Intelligent Taxonomy Relation Extraction System for Automatic Ontology Construction," Ph.D. Thesis, Ajou University, Suwon, Republic of Korea, p. 105.
  15. Bernardo Cuenca GRAU, et al., “OWL 2: The next step for OWL,” Web Semantics: Science, Services and Agents on the World Wide Web, vol. 6, no. 4, 2008, pp. 309-322. https://doi.org/10.1016/j.websem.2008.05.001
  16. The Stanford Natural Language Processing Group: Software, 2014. Online, http://nlp.stanford.edu/software/index.shtml - [Last accessed Jul. 22, 2015].
  17. George A. MILLER, “WordNet: a lexical database for English,” Communications of the ACM, vol. 38, no. 11, 1995, pp. 39-41. https://doi.org/10.1145/219717.219748
  18. Universal Dependencies, Universal dependency relations, 2014. Online, http://universaldependencies.github.io/docs/#language-u - [Last accessed Aug. 5, 2015].
  19. GLOMIS, What is GLOMIS?, 2014. Online, http://glomis.pcu.ac.kr/ - [Last accessed August 18, 2015].
  20. David NADEAU and Satoshi SEKINE, “A survey of named entity recognition and classification,” Lingvisticae Investigationes, vol. 30, no. 1, 2007, pp. 3-26. https://doi.org/10.1075/li.30.1.03nad
  21. The Stanford Natural Language Processing Group, Stanford Named Entity Recognizer, 2015. Online, http://nlp.stanford.edu/software/CRF-NER.html - [Last accessed Feb. 15, 2016].
  22. The Stanford Natural Language Processing Group, Stanford Open Information Extraction, 2015. Online, http://nlp.stanford.edu/software/openie.html - [Last accessed Feb. 15, 2016].
  23. Protégé, Products, 2015. Online, http://protege.stanford.edu/support.php - [Last accessed Feb. 15, 2016].