[KSCI] Korea Science Citation Index Service

http://dx.doi.org/10.9728/dcs.2015.16.3.445

Text Extraction Algorithm using the HTML Logical Structure Analysis

Jeon, Hyun-Gee (Seoul National University of Science & Technology)
Koh, Chan (Seoul National University of Science & Technology)

Publication Information

Journal of Digital Contents Society / v.16, no.3, 2015, pp. 445-455

Abstract

As Internet and computer technology develops, the amount of information has increased exponentially. With the emergence of a variety of web authoring tools and new web standards, and with web accessibility becoming more convenient, a wide variety of web content is being produced very quickly. However, web documents cover a variety of topics and are divided into blocks, each of which may deal with a topic unrelated to the others, and they also include content that is not part of the main text, such as navigation menus, simple decorations, advertisements, and copyright notices. To solve this problem and meet user requirements, this study extracts only the exact body area of a web document so that the information can be used effectively. We then propose a method for reconstructing the extracted content so that web search systems can be optimized and documents can be managed systematically.

Keywords

HTML; Data Mining; Text Extraction; HTML Structure

