• Title/Summary/Keyword: XPath Grouping

A Study of Main Contents Extraction from Web News Pages based on XPath Analysis

  • Sun, Bok-Keun
    • Journal of the Korea Society of Computer and Information / v.20 no.7 / pp.1-7 / 2015
  • Although data on the internet can be used in various fields, such as a data source for IR (Information Retrieval), data mining, and knowledge information services, it contains a lot of unnecessary information. Removing this unnecessary data is a problem that must be solved before studying knowledge-based information services built on web page data. In this paper, we address the problem by implementing XTractor (XPath Extractor). Since XPath is used to navigate the elements and attributes of an XML document, XTractor carries out an XPath analysis. XTractor extracts the main text by parsing the HTML, grouping XPaths, and detecting the XPath that contains the main data. In experiments on a large amount of data, the recognition rate and precision were 97.9% and 93.9%, respectively, with only a few exceptional cases, confirming that the main text of news pages can be extracted properly.
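
The three steps named in the abstract above (HTML parsing, XPath grouping, and detecting the XPath that holds the main data) could be sketched roughly as follows. This is not XTractor's actual algorithm, which the abstract does not specify. The names `XPathTextCollector` and `extract_main_text` are hypothetical, and the detection rule (choose the XPath group with the most text) is an assumed heuristic chosen only for illustration.

```python
from collections import defaultdict
from html.parser import HTMLParser

# Simplifying assumptions: tags that never close, and tags whose text is not content.
VOID_TAGS = {"br", "img", "hr", "meta", "link", "input"}
SKIP_TAGS = {"script", "style"}


class XPathTextCollector(HTMLParser):
    """Records each text node under the index-free XPath of its parent element."""

    def __init__(self):
        super().__init__()
        self.stack = []                  # open tags from the root to the current node
        self.groups = defaultdict(list)  # XPath -> text fragments found under it

    def handle_starttag(self, tag, attrs):
        if tag not in VOID_TAGS:
            self.stack.append(tag)

    def handle_endtag(self, tag):
        # Pop back to the matching open tag; tolerate unbalanced markup.
        if tag in self.stack:
            while self.stack.pop() != tag:
                pass

    def handle_data(self, data):
        text = data.strip()
        if text and not (SKIP_TAGS & set(self.stack)):
            self.groups["/" + "/".join(self.stack)].append(text)


def extract_main_text(html):
    """Return (xpath, text) for the XPath group with the most text (assumed heuristic)."""
    collector = XPathTextCollector()
    collector.feed(html)
    collector.close()
    xpath, fragments = max(
        collector.groups.items(),
        key=lambda item: sum(len(t) for t in item[1]),
    )
    return xpath, "\n".join(fragments)
```

Because the paths carry no positional indices, sibling elements such as consecutive `<p>` tags fall into the same group, which is what lets one XPath stand for a whole article body. A page with no text at all would make `max` raise `ValueError`; a fuller version would handle that case.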

Design and Adaptation for Internet News Data Extraction Middleware(INDEM) System

  • Sun, Bok-Keun
    • Journal of the Korea Society of Computer and Information / v.21 no.4 / pp.55-62 / 2016
  • In this paper, we propose the INDEM (Internet News Data Extraction Middleware) system for removing unnecessary data from internet news. Although data on the internet can be used in various fields, such as a data source for IR (Information Retrieval), data mining, and knowledge information services, it contains a lot of unnecessary information. Removing this unnecessary data is a problem that must be solved before studying knowledge-based information services built on web page data. The INDEM system parses the HTML, explores the XPaths, and performs the analysis. Users can use INDEM simply by implementing an abstract class that INDEM provides, and can then obtain the analysis information. Through this process, INDEM delivers the analysis information, including the main content of a news site, to its users. The INDEM system was applied to both a stand-alone system and a web service system and was evaluated on 16 news sites. The results show that the performance of the INDEM system is affected more by the size of the HTML source and the complexity of the HTML markup used than by the size of the main news content.
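
The usage pattern described in the abstract above, where the user implements an abstract class supplied by the middleware and receives the analysis information back, might look like the sketch below. The real INDEM API is not given in the abstract. Every name here (`NewsAnalyzer`, `load_html`, `on_analysis`, `InMemoryAnalyzer`) is hypothetical, and the "analysis" is a stand-in that only measures HTML and text sizes.

```python
from abc import ABC, abstractmethod
from html.parser import HTMLParser


class _TextLengthParser(HTMLParser):
    """Stand-in analysis: counts the characters of non-blank text nodes."""

    def __init__(self):
        super().__init__()
        self.text_chars = 0

    def handle_data(self, data):
        self.text_chars += len(data.strip())


class NewsAnalyzer(ABC):
    """Hypothetical middleware base class (not the real INDEM API)."""

    def process(self, site):
        html = self.load_html(site)      # supplied by the user's subclass
        parser = _TextLengthParser()
        parser.feed(html)
        parser.close()
        info = {
            "site": site,
            "html_size": len(html),
            "text_size": parser.text_chars,
        }
        self.on_analysis(info)           # analysis information handed back to the user
        return info

    @abstractmethod
    def load_html(self, site):
        """Return the HTML source for the given site."""

    @abstractmethod
    def on_analysis(self, info):
        """Receive the analysis information produced by the middleware."""


class InMemoryAnalyzer(NewsAnalyzer):
    """Example user subclass that reads pages from a dict instead of the network."""

    def __init__(self, pages):
        self.pages = pages
        self.received = []

    def load_html(self, site):
        return self.pages[site]

    def on_analysis(self, info):
        self.received.append(info)
```

Keeping the parsing inside the base class and exposing only two hooks means the same subclass could, in principle, be driven by either a stand-alone program or a web service wrapper.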

The Application and Integration of an Improvement Technique for Layers of NETCONF (NETCONF 계층에 대한 개선 기법 적용 및 통합)

  • Lee, YangMin;Lee, JaeKee
    • Journal of KIISE / v.43 no.2 / pp.256-268 / 2016
  • Modern networks consist of various heterogeneous equipment and are often installed in a distributed manner. The NETCONF standard was therefore established to manage networks centrally and efficiently. In this paper, we present a method that integrates each NETCONF layer into a single system based on the results of previous studies. In the RPC Layer, asynchronous communication channels and parallel processing are made possible by multi-threading. In the Operation Layer, operational efficiency is increased by using data groups that reflect the dependencies among equipment configuration data and by improving the data structure, which enables efficient processing of XML queries even with multiple managers. The data modeling techniques and grouping methods in the Content Layer are presented in detail for interoperability between the Operation Layer and the Content Layer. Finally, a GUI program was implemented and is described. We performed an experiment comparing the improved NETCONF with the standard NETCONF on factors such as query processing ratio, query processing speed, and CPU utilization. The improved NETCONF performed better in query processing ratio and query processing speed, whereas the standard NETCONF performed better in CPU utilization.