Semi-Automatic Ontology Construction from HTML Documents: A conversion of Text-formed Information into OWL 2

Im, Chan jong; Kim, Do wan

doi:10.5392/IJoC.2016.12.2.024

International Journal of Contents

Volume 12, Issue 2 / Pages 24-30 / 2016 / 1738-6764 (pISSN) / 2093-7504 (eISSN)

The Korea Contents Association

Im, Chan jong (Department of Information and Telecommunication Pai Chai University) ;
Kim, Do wan (Pai Chai University)

Received: 2016.04.28
Reviewed: 2016.05.23
Published: 2016.06.28

Abstract

Ontology is regarded as one of the most important technologies for achieving the semantic web, because it represents knowledge in a machine-readable form. The World Wide Web Consortium (W3C) has contributed to the development of ontology technologies for the last several years. However, the W3C recommendations leave out HTML despite the massive amount of information it contains. In addition, keeping up with all the relevant technologies is difficult and time-consuming, especially when constructing an ontology. We therefore propose a module and methods that reuse HTML documents, extract the necessary information from HTML tags, and map it to OWL 2. We combine two approaches: structural refinement, which builds an ontology skeleton, and a linguistic approach, which adds detailed information to that skeleton.

1. INTRODUCTION

The idea of the semantic web is to develop the web into an intelligent space where all information and knowledge are machine-readable and refinable. Among the technologies recommended by the World Wide Web Consortium (W3C), ontology is one of the most important for representing and sharing knowledge [1].

There are several well-established upper ontologies, such as [2] and [3], which allow domain information to be linked and shared through formal ontologies. However, constructing a domain ontology is difficult and time-consuming, even when working with a domain expert. Moreover, most research on automatic ontology construction is based on fundamental technologies such as XML, RDF(S), and OWL, while HTML is left aside despite the abundant information it contains [10]. Research on making better use of the information in HTML documents is therefore needed.

The development of HTML5 has given HTML documents more semantic meaning. However, the earlier version, HTML 4.01, is still widely used, and its predominance makes it harder to turn the web into an intelligent space. We therefore propose a method that mainly uses sequences of list tags to build a skeleton ontology containing information about the relationships among the HTML documents in a domain. In addition, more detailed information is extracted from the tags that hold content and is added to the skeleton through English syntax analysis, using natural language processing tools provided by Stanford University.
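As a rough illustration of the structural step, the sketch below reads nested list markup and returns parent/child pairs that could seed a skeleton. It is not the authors' module: the sample page, the `ListTreeParser` class, and the `edges` helper are hypothetical, only `ul`, `ol`, and `li` are handled, and every `li` is assumed to be closed explicitly (HTML 4.01 allows the end tag to be omitted, which this sketch does not handle).

```python
from html.parser import HTMLParser

# Hypothetical sample page; the markup and labels are invented for illustration.
SAMPLE_HTML = """
<ul>
  <li><a href="/about.html">About</a>
    <ul>
      <li><a href="/history.html">History</a></li>
      <li><a href="/staff.html">Staff</a></li>
    </ul>
  </li>
  <li><a href="/courses.html">Courses</a></li>
</ul>
"""


class ListTreeParser(HTMLParser):
    """Collect nested list items as a tree of {"label", "children"} nodes."""

    def __init__(self):
        super().__init__()
        self.root = {"label": "ROOT", "children": []}
        self._stack = [self.root]  # currently open <li> nodes, root at the bottom
        self._in_item = False      # True while reading an item's own text

    def handle_starttag(self, tag, attrs):
        if tag == "li":
            node = {"label": "", "children": []}
            self._stack[-1]["children"].append(node)
            self._stack.append(node)
            self._in_item = True
        elif tag in ("ul", "ol"):
            # A nested list begins: later text belongs to the child items.
            self._in_item = False

    def handle_endtag(self, tag):
        if tag == "li" and len(self._stack) > 1:
            self._stack.pop()
            self._in_item = False

    def handle_data(self, data):
        text = data.strip()
        if text and self._in_item:
            node = self._stack[-1]
            node["label"] = (node["label"] + " " + text).strip()


def extract_list_tree(html):
    """Parse the markup and return the root node of the list tree."""
    parser = ListTreeParser()
    parser.feed(html)
    parser.close()
    return parser.root


def edges(node):
    """Flatten the tree into (parent_label, child_label) pairs, depth first."""
    pairs = []
    for child in node["children"]:
        pairs.append((node["label"], child["label"]))
        pairs.extend(edges(child))
    return pairs


if __name__ == "__main__":
    for parent, child in edges(extract_list_tree(SAMPLE_HTML)):
        print(parent, "->", child)
```

The standard-library parser is used here only to keep the sketch dependency-free; a DOM-based parser would serve the same purpose.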

2. RELATED WORK

There are several approaches to building an ontology from HTML tags. The first is structural analysis, which defines simple and direct structural mappings from HTML to OWL. [4] presented mappings of the HTML table, checkbox, radio, and select tags into OWL; it proposed mapping rules for these tags, and the result passed OWL Lite validation. [5] classified the tags newly introduced in HTML5 and proposed schema-level mapping rules based on the semantic elements, together with instance mapping rules.

These works are helpful for drafting ontologies. However, the resulting ontologies are mapped directly from HTML tags without considering how the knowledge should be represented [1]. Because HTML was originally designed for presentation to humans and carries no semantic information, these structural mapping methods are unreasonable.

Assigning semantic information to HTML is a significant task in establishing the semantic web. For this purpose a second approach to building an ontology, the linguistic approach, was introduced; it relies on syntactic analysis as a part of natural language processing (NLP). One of the most recent methods for analyzing English sentences is part-of-speech (POS) tagging [7]. Furthermore, several attempts have been made to systematically convert the output of POS tagging into triples, as a part of Open Information Extraction (OIE), in works [8] and [9].
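To make the step from POS tags to triples concrete, here is a deliberately naive sketch. It is not the method of [8] or [9]: it assumes the input is a single simple clause that has already been tagged with Penn Treebank tags (hand-written below, not produced by a tagger), and the `naive_triple` function is a hypothetical name. It takes the first verb as the predicate, the nouns before it as the subject, and the nouns after it as the object.

```python
def naive_triple(tagged):
    """Return (subject, predicate, object) for a POS-tagged simple clause, or None.

    `tagged` is a list of (token, tag) pairs using Penn Treebank tags.
    """
    # Position of the first verb (VB, VBD, VBZ, ...), if any.
    verb_idx = next(
        (i for i, (_, tag) in enumerate(tagged) if tag.startswith("VB")), None
    )
    if verb_idx is None:
        return None

    def nouns(span):
        # Keep only noun tokens (NN, NNS, NNP, NNPS) from the span.
        return " ".join(tok for tok, tag in span if tag.startswith("NN"))

    subject = nouns(tagged[:verb_idx])
    obj = nouns(tagged[verb_idx + 1:])
    if not subject or not obj:
        return None
    return (subject, tagged[verb_idx][0], obj)


if __name__ == "__main__":
    # Hand-tagged example sentence, invented for illustration.
    sentence = [
        ("The", "DT"), ("library", "NN"), ("provides", "VBZ"),
        ("electronic", "JJ"), ("journals", "NNS"),
    ]
    print(naive_triple(sentence))
```

Real OIE systems work on clause and dependency structure rather than on a flat tag sequence, so this rule breaks on passives, coordination, and subordinate clauses.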

Hwangbo et al. [10] used structural mappings and some NLP tools to build an ontology. They pointed out that the fundamental semantic web technologies recommended by the W3C concentrate too heavily on XML and RDF(S), which leaves HTML unused for the semantic web. The recommended Gleaning Resource Descriptions from Dialects of Languages (GRDDL), a technology for converting HTML, was noted to be deficient because it needs valid style sheets and is limited in producing RDFS. The article therefore proposed procedural steps for building an ontology from HTML tags. As the structural mapping part, these steps comprise extracting information from general HTML documents, classifying the tags, defining rules for extracting the data of each tag group, and transforming the data into triples for both text-formed and mixed-formed information; as the linguistic part, the triples are analyzed with WordNet [17] and the To-and-To web application. Although these approaches seemed promising and realizable, they focused only on mapping the data contained in a single HTML document. In other words, they could not construct an ontology that includes the relationships among other pages in the same domain.

Since HTML documents lack semantic information, a refinement process is needed in order to extract data from them. This is done using the Document Object Model (DOM) [12]. The article [13] tried to discover the semantic pattern referred to as ‘similarity’ in the implicit fixed schema of template-driven HTML documents, and automatically generated a semantic partition tree. This was based on the observation of spatial locality of content in template-driven HTML documents. Although the approach seemed to work well in finding the semantic structure of a document, it was limited to template-driven documents and would need modification for ontology construction.

Another way to assign semantic information is relationship retrieval, which specifically concerns extracting key terms and arranging them into a concept hierarchy. Article [14] extracted hierarchical relationships based on a modified Formal Concept Analysis (FCA) theory [11]. The modified form of FCA allowed the attributes of a term to include a certain level of unnecessary values. In other words, a term could form a hierarchical relationship even when some of its attributes did not fit perfectly, as long as a chosen threshold value was met. Starting from keywords given by domain experts, the attributes of each keyword were chosen with a window size ranging from one to five. These attributes were rearranged into document-weight vectors and clustered with the k-means method to group similar attributes together. The rules of the modified FCA theory were then applied to the keywords and their attribute clusters to form a concept hierarchy.
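The thresholded hierarchy test can be pictured with a small sketch. The exact rule used in [14] is not given here, so the following is only one plausible reading: in classical FCA a term is placed under a parent when it carries all of the parent's attributes, and the relaxed version accepts the relation when the covered share of the parent's attributes reaches a threshold. The function name `is_subconcept`, the attribute sets, and the threshold values are all hypothetical.

```python
def is_subconcept(child_attrs, parent_attrs, threshold=1.0):
    """Relaxed subconcept test.

    threshold=1.0 reproduces the strict rule (the child must carry every
    parent attribute); lower values tolerate some missing attributes.
    """
    parent = set(parent_attrs)
    if not parent:
        return False  # an empty parent intent carries no evidence
    covered = len(parent & set(child_attrs)) / len(parent)
    return covered >= threshold


if __name__ == "__main__":
    # Invented attribute sets for illustration.
    vehicle = {"has_wheels", "moves", "carries_people", "uses_fuel"}
    bicycle = {"has_wheels", "moves", "carries_people", "has_pedals"}

    print(is_subconcept(bicycle, vehicle))                   # strict rule
    print(is_subconcept(bicycle, vehicle, threshold=0.75))   # relaxed rule
```

With these sets the bicycle covers three of the four vehicle attributes, so it fails the strict test and passes at a threshold of 0.75.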

The article shows good performance in assigning words and terms to higher-level concepts. However, hierarchical relationships alone are insufficient for building an ontology. In addition, the problem of choosing the keywords and the cluster size remained unsolved.

3. PROCEDURES FOR ONTOLOGY CONSTRUCTION

In order to build a domain ontology using OWL 2, which is more expressive than OWL 1 [15], we combine the structural and linguistic approaches in this paper. The two approaches complement each other in building an ontology. For example, the scarcity of semantic information in the structural mapping method can be supplemented by the linguistic approach. Conversely, problems of the linguistic approach, such as concept labeling or the limitation on document types, can be solved with comprehensively defined structural rules.
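One way to picture how the two approaches could meet in a single OWL 2 document is sketched below. The mapping choices are assumptions made for illustration, not the rules of this paper: each structural parent/child pair is written as a `SubClassOf` axiom, and each linguistically extracted triple is written as an `ObjectPropertyAssertion` between named individuals. The output uses OWL 2 functional-style syntax; the `to_owl2` and `local_name` helpers, the sample data, and the `http://example.org/onto` IRI are hypothetical.

```python
import re


def local_name(label):
    """Turn a free-text label into a simple local name, e.g. 'Staff list' -> 'Staff_list'."""
    return "_".join(re.findall(r"[A-Za-z0-9]+", label))


def to_owl2(edges, triples, iri="http://example.org/onto"):
    """Serialize skeleton edges and detail triples as OWL 2 functional-style syntax.

    edges   -- (parent_label, child_label) pairs from the structural step
    triples -- (subject, predicate, object) tuples from the linguistic step
    """
    lines = [f"Prefix(:=<{iri}#>)", f"Ontology(<{iri}>"]

    # Assumption: every label in the skeleton becomes a class.
    for name in sorted({local_name(x) for pair in edges for x in pair}):
        lines.append(f"  Declaration(Class(:{name}))")
    for parent, child in edges:
        lines.append(f"  SubClassOf(:{local_name(child)} :{local_name(parent)})")

    # Assumption: triples become property assertions between individuals.
    for prop in sorted({local_name(p) for _, p, _ in triples}):
        lines.append(f"  Declaration(ObjectProperty(:{prop}))")
    for ind in sorted({local_name(x) for s, _, o in triples for x in (s, o)}):
        lines.append(f"  Declaration(NamedIndividual(:{ind}))")
    for s, p, o in triples:
        lines.append(
            f"  ObjectPropertyAssertion(:{local_name(p)} :{local_name(s)} :{local_name(o)})"
        )

    lines.append(")")
    return "\n".join(lines)


if __name__ == "__main__":
    # Invented sample data for illustration.
    skeleton = [("Home", "About"), ("About", "History")]
    details = [("library", "provides", "journals")]
    print(to_owl2(skeleton, details))
```

The printed text can be saved to a file and opened in an OWL 2 editor for inspection; whether a list item is better modeled as a class or as an individual is a design decision that this sketch does not settle.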

Our goal is to make a basic ontology from general HTML documents, specifically from all the HTML documents in a domain. The proposed module is composed of three phases: extracting the nodes of trees, mapping them onto the ontology, and adding details to the skeleton ontology. All tags used in the first phase of our module are shown in Table 1. We left out all the unused tags from the tag classification in article [10]. As mentioned in the introduction, we use the tags intended for easy navigation to make the skeleton ontology for the specific domain. In the case of HTML 5, for example, all tags constructed for easy navigation are grouped together under
tag and it holds

References

[1] Thomas R. GRUBER, “Toward principles for the design of ontologies used for knowledge sharing?,” International Journal of Human-Computer Studies, vol. 43, no. 5, 1995, pp. 907-928. https://doi.org/10.1006/ijhc.1995.1081
[2] Basic Formal Ontology, Overview, 2014. Online, http://infomis.uni-saarland.de/bfo/overview - [Last accessed Jul. 14, 2015].
[4] Hyoun-Soo KWAK, Su-Kayoung Kim, Yeong-Geun Kim, and Kee-Hong Ann, “A Conversion System of HTML Document into OWL Ontology Language,” Korean journal Information Processing Society, vol. 11, no. 2, 2004, pp. 539-542.
[5] Taimao SUN, Yiyeon YOON, and Wooju KIM, “A Conversion from HTML5 to OWL Ontology,” Journal of Society for e-Business Studies, vol. 18, no. 3, 2013. https://doi.org/10.7838/jsebs.2013.18.3.143
[6] Jsoup: Java HTML Parser, 2015. Online, http://jsoup.org/ - [Last accessed Aug. 25, 2015].
[7] Kristina TOUTANOVA, et al., “Feature-rich part-of-speech tagging with a cyclic dependency network,” In: Proceedings of the 2003 Conference of the North American Chapter of the Association for Computational Linguistics on Human Language Technology-Volume 1, Association for Computational Linguistics, 2003, pp. 173-180.
[8] Luciano DEL CORRO and Rainer GEMULLA, “Clausie: clause-based open information extraction,” In: Proceedings of the 22nd international conference on World Wide Web, International World Wide Web Conferences Steering Committee, 2013, pp. 355-366.
[9] Gabor ANGELI, Melvin Johnson PREMKUMAR, and Christopher D. MANNING, “Leveraging Linguistic Structure for Open Domain Information Extraction,” In: Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing of the Asian Federation of Natural Language Processing, ACL, 2015, pp. 26-31.
[10] Hoon HWANGBO and Hongchul LEE, “Reusing of information constructed in HTML documents: A conversion of HTML into OWL,” In: Control, Automation and Systems, ICCAS 2008, International Conference on. IEEE, 2008, pp. 871-875.
[11] Uta PRISS, “Formal concept analysis in information science,” Annual Review of Information Science and Technology (ARIST), vol. 40, no. 1, 2006, pp. 521-543.
[12] Lauren WOOD, et al., Document Object Model (DOM) Level 3 Core Specification, 2000.
[13] Saikat MUKHERJEE, et al., “Automatic discovery of semantic structures in html documents,” In: Proceedings of the Seventh International Conference on Document Analysis and Recognition-Volume 1, IEEE Computer Society, 2003, p. 245.
[14] Min-Gu Kim, “An Intelligent Taxonomy Relation Extraction System for Automatic Ontology Construction,” Ph.D. Thesis, Ajou University, Suwon, Republic of Korea, p. 105.
[15] Bernardo Cuenca GRAU, et al., “OWL 2: The next step for OWL,” Web Semantics: science, services and agents on the World Wide Web, vol. 6, no. 4, 2008, pp. 309-322. https://doi.org/10.1016/j.websem.2008.05.001
[16] The Stanford Natural Language Processing Group: Software, 2014. Online, http://nlp.stanford.edu/software/index.shtml - [Last accessed Jul. 22, 2015].
[17] George A. MILLER, “WordNet: a lexical database for English,” Communications of the ACM, vol. 38, no. 11, 1995, pp. 39-41. https://doi.org/10.1145/219717.219748
[18] Universal Dependencies, Universal dependency relations, 2014. Online, http://universaldependencies.github.io/docs/#language-u - [Last accessed Aug. 5, 2015].
[19] GLOMIS, What is GLOMIS?, 2014. Online, http://glomis.pcu.ac.kr/ - [Last accessed Aug. 18, 2015].
[20] David NADEAU and Satoshi SEKINE, “A survey of named entity recognition and classification,” Lingvisticae Investigationes, vol. 30, no. 1, 2007, pp. 3-26. https://doi.org/10.1075/li.30.1.03nad
[21] The Stanford Natural Language Processing Group, Stanford Named Entity Recognizer, 2015. Online, http://nlp.stanford.edu/software/CRF-NER.html - [Last accessed Feb. 15, 2016].
[22] The Stanford Natural Language Processing Group, Stanford Open Information Extraction, 2015. Online, http://nlp.stanford.edu/software/openie.html - [Last accessed Feb. 15, 2016].
[23] Protégé, Products, 2015. Online, http://protege.stanford.edu/support.php - [Last accessed Feb. 15, 2016].


