Korean Web Content Extraction using Tag Rank Position and Gradient Boosting

Mo, Jonghoon; Yu, Jae-Myung

doi:10.5626/JOK.2017.44.6.581

Journal of KIISE (정보과학회 논문지)

Vol. 44, No. 6 / pp. 581-586 / 2017 / pISSN 2383-630X / eISSN 2383-6296

Korean Institute of Information Scientists and Engineers (한국정보과학회)


Original Korean title: 태그 서열 위치와 경사 부스팅을 활용한 한국어 웹 본문 추출

Mo, Jonghoon (QuantLab);
Yu, Jae-Myung (QuantLab)

Received: 2017.01.02
Reviewed: 2017.04.07
Published: 2017.06.15

https://doi.org/10.5626/JOK.2017.44.6.581


Abstract

Automatically crawling web documents makes it easy to gather large amounts of information. For this to work, unnecessary components such as menus and advertisements must be removed from each web page and the main content extracted automatically. Korean web documents in particular, unlike those of the English-speaking web, rarely include metadata and have complex designs, so a content extraction method suited to the Korean web is needed. Existing content extraction methods mainly use the textual and structural features of content blocks, because processing visual features requires heavy computation for rendering and image processing. Position is nevertheless informative: a content block tends to be located in the middle of a web page. In this paper, we propose a new content extraction method that uses tag positions in HTML as a quasi-visual feature. Because a raw tag position varies with the length of the text, we develop the tag rank position, a form of tag position that is not affected by text length, and show that gradient boosting with the tag rank position extracts content accurately. The results can be used to automatically collect the high-quality documents needed for text analysis from web pages of various forms.
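The abstract does not spell out how the tag rank position is computed, so the following is only a minimal sketch under stated assumptions. It takes the "tag position" to be a start tag's character offset divided by the document length, and the "tag rank position" to be the tag's ordinal index among all start tags divided by the last index. The function names (`tag_positions`, `position_of`), both formulas, and the sample documents are hypothetical and are not taken from the paper. The sketch only illustrates why a rank-based position is unaffected by text length while a character-based position is not.

```python
from html.parser import HTMLParser


def tag_positions(html):
    """Return, for each start tag, its character-based and rank-based position.

    Both definitions are illustrative assumptions, not the paper's formulas.
    """
    # Map each line number to the character offset where that line begins,
    # so the (line, column) pair reported by HTMLParser becomes one offset.
    line_starts = [0]
    for i, ch in enumerate(html):
        if ch == "\n":
            line_starts.append(i + 1)

    records = []  # (tag name, character offset), in document order

    class Collector(HTMLParser):
        def handle_starttag(self, tag, attrs):
            line, col = self.getpos()  # line is 1-based, col is 0-based
            records.append((tag, line_starts[line - 1] + col))

    parser = Collector()
    parser.feed(html)
    parser.close()

    last_rank = max(len(records) - 1, 1)  # avoid dividing by zero
    return [
        {
            "tag": tag,
            "char_pos": offset / len(html),  # shifts when text length changes
            "rank_pos": rank / last_rank,    # depends only on tag order
        }
        for rank, (tag, offset) in enumerate(records)
    ]


def position_of(html, tag):
    """Positions of the first start tag with the given name."""
    return next(p for p in tag_positions(html) if p["tag"] == tag)


# Two hypothetical pages with identical tag structure but different text length.
short_doc = "<html><body><div>menu</div><p>body</p><div>footer</div></body></html>"
long_doc = short_doc.replace("menu", "menu " * 50)

for name, doc in (("short", short_doc), ("long", long_doc)):
    p = position_of(doc, "p")
    print(name, round(p["char_pos"], 3), round(p["rank_pos"], 3))
```

In this sketch the `<p>` block keeps the same rank position in both documents, while its character-based position moves toward the end once the menu text grows. Per-tag values of this kind could then be passed, together with textual and structural features, to a gradient boosting classifier that labels each block as content or boilerplate. That step is omitted here because the abstract gives no details of the feature set or model configuration.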


Project information

Research project managing agency: Small and Medium Business Administration (중소기업청)
