http://dx.doi.org/10.9728/dcs.2015.16.3.445

Text Extraction Algorithm using the HTML Logical Structure Analysis  

Jeon, Hyun-Gee (Seoul National University of Science & Technology)
KOH, Chan (Seoul National University of Science & Technology)
Publication Information
Journal of Digital Contents Society / v.16, no.3, 2015, pp. 445-455
Abstract
As internet and computer technology has developed, the amount of information has increased exponentially. Driven by a variety of web authoring tools, the emergence of new web standards, and more convenient access to web content, a wide variety of web content is being produced very quickly. However, web documents are divided into several blocks, each of which may deal with a topic unrelated to the others, and they also contain content that is not part of the main text, such as navigation menus, simple decorations, advertisements, and copyright notices. To solve this problem and meet user requirements, this study extracts only the exact body area of a web document so that the useful information can be examined. As a subsequent reconstruction method, we propose a web search system that can optimize and systematically manage documents.
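The abstract does not describe the proposed algorithm in detail, so the following is only a minimal sketch of the general idea of separating a document body from navigation, advertisement, and copyright blocks. It is not the authors' method: it assumes a simple link-density heuristic, and the tag sets, the `max_link_ratio` threshold, and the names `BlockCollector` and `extract_main_text` are all hypothetical choices made for illustration.

```python
from html.parser import HTMLParser

# Hypothetical tag sets chosen for this illustration only.
SKIP_TAGS = {"script", "style", "nav", "header", "footer", "aside"}
BLOCK_TAGS = {"div", "p", "td", "article", "section", "li"}


class BlockCollector(HTMLParser):
    """Collect text per block-level element and count its link characters."""

    def __init__(self):
        super().__init__()
        self.blocks = []       # finished blocks as (text, link_chars)
        self._text = []        # text pieces of the block being read
        self._link_chars = 0   # characters of the current block inside <a>
        self._skip_depth = 0   # > 0 while inside a SKIP_TAGS element
        self._in_link = 0      # > 0 while inside an <a> element

    def flush(self):
        """Close the current block and store it if it has any text."""
        text = " ".join(" ".join(self._text).split())
        if text:
            self.blocks.append((text, self._link_chars))
        self._text, self._link_chars = [], 0

    def handle_starttag(self, tag, attrs):
        if tag in SKIP_TAGS:
            self.flush()
            self._skip_depth += 1
        elif tag in BLOCK_TAGS:
            self.flush()
        elif tag == "a":
            self._in_link += 1

    def handle_endtag(self, tag):
        if tag in SKIP_TAGS:
            self._skip_depth = max(0, self._skip_depth - 1)
        elif tag in BLOCK_TAGS:
            self.flush()
        elif tag == "a":
            self._in_link = max(0, self._in_link - 1)

    def handle_data(self, data):
        if self._skip_depth:
            return  # ignore text inside navigation, footer, script, etc.
        self._text.append(data)
        if self._in_link:
            self._link_chars += len(data.strip())


def extract_main_text(html, max_link_ratio=0.5):
    """Return the longest block whose link density is below the threshold."""
    parser = BlockCollector()
    parser.feed(html)
    parser.close()
    parser.flush()
    candidates = [
        text for text, link_chars in parser.blocks
        if link_chars / len(text) <= max_link_ratio
    ]
    return max(candidates, key=len, default="")


SAMPLE_PAGE = """
<html><body>
<nav><a href="/">Home</a> <a href="/news">News</a></nav>
<div><a href="/a">Link one</a> <a href="/b">Link two</a></div>
<div><p>The main article text is longer than any other block on this page.</p></div>
<footer>Copyright 2015</footer>
</body></html>
"""

if __name__ == "__main__":
    print(extract_main_text(SAMPLE_PAGE))
```

On the sample page, the navigation and footer blocks are skipped by tag name and the link-only block is rejected by its link density, leaving the article paragraph as the extracted body. Real pages would need a more robust treatment of nested blocks and of bodies that span several elements.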
Keywords
HTML; Data Mining; Text Extraction; HTML Structure;
Citations & Related Records
Times Cited By KSCI: 2