A Study of Main Contents Extraction from Web News Pages based on XPath Analysis

Sun, Bok-Keun

doi:10.9708/jksci.2015.20.7.001

Journal of the Korea Society of Computer and Information (한국컴퓨터정보학회논문지)

Volume 20 Issue 7 / Pages 1-7 / 2015 / 1598-849X (pISSN) / 2383-9945 (eISSN)

Korean Society of Computer Information (한국컴퓨터정보학회)

Sun, Bok-Keun (Dept. of Computer Engineering, Hoseo University)

Received : 2015.04.29
Accepted : 2015.06.29
Published : 2015.07.31

https://doi.org/10.9708/jksci.2015.20.7.001

Abstract

Although data on the Internet can serve as a source for fields such as IR (Information Retrieval), data mining, and knowledge information services, it also contains a large amount of unnecessary information. Removing this unnecessary data is a problem that must be solved before knowledge-based information services can be built on web page data. In this paper, we address the problem by implementing XTractor (XPath Extractor). Because XPath is used to navigate the elements and attributes of an XML document, XTractor performs its analysis on XPaths. XTractor extracts the main text in three steps: HTML parsing, XPath grouping, and detection of the XPath that contains the main data. In experiments on a large amount of data, the recognition rate and precision were 97.9% and 93.9%, respectively, and apart from a few cases the main text of the news pages was extracted properly.
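
The three steps named in the abstract (HTML parsing, XPath grouping, detection of the XPath holding the main data) can be illustrated with a minimal Python sketch. This is not the XTractor implementation, whose details are not given here. The class name XPathTextCollector, the function extract_main_text, the index-free XPath form, and the "largest total text length wins" detection rule are all assumptions made for illustration.

```python
from collections import defaultdict
from html.parser import HTMLParser

# Elements that never have a closing tag, so they must not stay on the stack.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
             "link", "meta", "source", "track", "wbr"}
# Elements whose text is never article content.
SKIP_TAGS = {"script", "style"}


class XPathTextCollector(HTMLParser):
    """Step 1 and 2: parse HTML and group text nodes by the XPath of their parent.

    The XPath is index-free (e.g. /html/body/div/p), an assumed simplification
    so that sibling paragraphs fall into the same group.
    """

    def __init__(self):
        super().__init__()
        self.stack = []                  # names of open elements, root first
        self.groups = defaultdict(list)  # XPath -> list of text fragments

    def handle_starttag(self, tag, attrs):
        if tag not in VOID_TAGS:
            self.stack.append(tag)

    def handle_endtag(self, tag):
        # Pop back to the nearest matching open tag; tolerate unclosed tags.
        if tag in self.stack:
            while self.stack.pop() != tag:
                pass

    def handle_data(self, data):
        text = data.strip()
        if text and not SKIP_TAGS.intersection(self.stack):
            self.groups["/" + "/".join(self.stack)].append(text)


def extract_main_text(html):
    """Step 3: pick the XPath group assumed to hold the main text.

    Assumed heuristic: the group with the largest total character count.
    Returns (xpath, text); both are empty strings if the page has no text.
    """
    collector = XPathTextCollector()
    collector.feed(html)
    collector.close()
    if not collector.groups:
        return "", ""
    best = max(collector.groups,
               key=lambda path: sum(len(t) for t in collector.groups[path]))
    return best, "\n".join(collector.groups[best])


# Hypothetical news page used only to exercise the sketch.
page = """
<html><body>
  <div id="nav"><a href="/">Home</a><a href="/world">World</a></div>
  <div id="article">
    <p>The first paragraph of the news story is long enough to dominate.</p>
    <p>The second paragraph continues the same story with more detail.</p>
  </div>
  <div id="footer"><span>Copyright notice</span></div>
</body></html>
"""
xpath, text = extract_main_text(page)
print(xpath)
print(text)
```

Grouping by an index-free path keeps all paragraphs of one article together, while navigation links and footer text end up under different paths. A text-length score is only one possible detection rule; real news pages would likely need additional signals such as link density.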
