A Study of Main Contents Extraction from Web News Pages based on XPath Analysis

  • Sun, Bok-Keun (Dept. of Computer Engineering, Hoseo University)
  • Received : 2015.04.29
  • Accepted : 2015.06.29
  • Published : 2015.07.31

Abstract

Data on the internet can be used in various fields, such as a data source for IR (Information Retrieval), data mining, and knowledge information services, but it also contains a lot of unnecessary information. Removing this unnecessary data is a problem that must be solved before knowledge-based information services built on web page data can be studied. In this paper, we address the problem by implementing XTractor (XPath Extractor). Since XPath is used to navigate the elements and attributes of an XML document, XTractor carries out its analysis on XPaths. XTractor extracts the main text by parsing the HTML, grouping XPaths, and detecting the XPath that contains the main data. In experiments on a large amount of data, the recognition rate and precision were 97.9% and 93.9%, respectively, and, apart from a few cases, it was confirmed that the main text of news pages can be extracted properly.
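The three stages named in the abstract (HTML parsing, XPath grouping, and detection of the XPath that holds the main data) can be illustrated with a minimal Python sketch. This is not XTractor itself: the names `XPathTextCollector`, `extract_main_text`, and `SAMPLE`, the index-free path form used for grouping, and the "group with the most characters wins" selection rule are all assumptions made for illustration, since the abstract does not state the actual grouping or detection criteria.

```python
from collections import defaultdict
from html.parser import HTMLParser

# Assumption: text under these tags is never article content.
SKIP_TAGS = {"script", "style"}
# Void elements have no end tag, so they are never pushed on the stack.
VOID_TAGS = {"br", "img", "hr", "meta", "link", "input"}


class XPathTextCollector(HTMLParser):
    """Record the index-free XPath of every non-empty text node."""

    def __init__(self):
        super().__init__()
        self.stack = []                  # currently open tags
        self.groups = defaultdict(list)  # XPath -> text fragments

    def handle_starttag(self, tag, attrs):
        if tag not in VOID_TAGS:
            self.stack.append(tag)

    def handle_endtag(self, tag):
        # Pop back to the matching open tag; ignore stray end tags.
        if tag in self.stack:
            while self.stack.pop() != tag:
                pass

    def handle_data(self, data):
        text = data.strip()
        if text and not (self.stack and self.stack[-1] in SKIP_TAGS):
            self.groups["/" + "/".join(self.stack)].append(text)


def extract_main_text(html):
    """Return (xpath, text) for the XPath group holding the most characters.

    Hypothetical selection rule, chosen only to keep the sketch short.
    """
    collector = XPathTextCollector()
    collector.feed(html)
    collector.close()
    groups = collector.groups
    if not groups:
        return "", ""
    best = max(groups, key=lambda path: sum(len(t) for t in groups[path]))
    return best, "\n".join(groups[best])


# Hypothetical news page: navigation, article body, footer.
SAMPLE = """
<html><body>
  <div id="nav"><a href="/">Home</a><a href="/world">World</a></div>
  <div id="article">
    <p>The city council approved the new transit plan on Monday.</p>
    <p>Construction is expected to begin early next year.</p>
  </div>
  <div id="footer"><span>Copyright notice</span></div>
</body></html>
"""

if __name__ == "__main__":
    xpath, text = extract_main_text(SAMPLE)
    print(xpath)
    print(text)
```

On the sample page, the two article paragraphs share one path and together outweigh the navigation links and the footer, so their group is selected. A real news page would likely need a finer rule, for example one that also considers link density or element attributes.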
