Main Content Extraction from Web Pages Based on Node Characteristics

  • Liu, Qingtang (School of Educational Information Technology, Central China Normal University)
  • Shao, Mingbo (School of Educational Information Technology, Central China Normal University)
  • Wu, Linjing (School of Educational Information Technology, Central China Normal University)
  • Zhao, Gang (School of Educational Information Technology, Central China Normal University)
  • Fan, Guilin (School of Educational Information Technology, Central China Normal University)
  • Li, Jun (School of Information Engineering, Hubei University for Nationalities)
  • Received: 2016.10.16
  • Accepted: 2017.06.07
  • Published: 2017.06.30

Abstract

Main content extraction from web pages is widely used in search engines, web content aggregation, and mobile Internet browsing. However, web pages contain a large amount of irrelevant information, such as advertisements, unrelated navigation links, and other junk content, which reduces the efficiency of content-based applications. This paper proposes an automatic method for extracting the main content of web pages. The method uses two indicators to describe the characteristics of a web page: text density and hyperlink density. Because similar content tends to be distributed continuously on a page, an estimation algorithm judges whether a node is a content node or a noise node based on the characteristics of the node itself and of its neighboring nodes. This algorithm filters out advertisement nodes and irrelevant navigation. Experimental results on 10 news websites show that the algorithm achieves an average acceptable rate of 96.34%.
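
To make the idea concrete, the following is a minimal Python sketch of classifying nodes by text density and hyperlink density while taking neighboring nodes into account. The abstract does not give the formulas, so everything here is an assumption made for illustration: the `Block` structure, the definitions of the two densities (characters per tag, and share of text inside links), the way they are combined into a score, and the `high` and `low` thresholds are hypothetical and are not the authors' estimation algorithm.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Block:
    """One candidate node, listed in document order (hypothetical structure)."""
    text_chars: int  # characters of visible text in the node
    link_chars: int  # characters of that text inside hyperlinks
    tag_count: int   # number of tags in the node's subtree


def text_density(b: Block) -> float:
    # Assumed definition: characters of text per tag.
    return b.text_chars / max(b.tag_count, 1)


def hyperlink_density(b: Block) -> float:
    # Assumed definition: fraction of the node's text that is link text.
    return b.link_chars / b.text_chars if b.text_chars else 0.0


def node_score(b: Block) -> float:
    # Dense text with few links scores high; link-heavy nodes score near zero.
    return text_density(b) * (1.0 - hyperlink_density(b))


def classify(blocks: List[Block], high: float = 50.0, low: float = 5.0) -> List[bool]:
    """Return True for blocks judged to be content, False for noise.

    A block is content if its own score reaches `high`, or if it reaches
    `low` and its immediate neighbors average at least `high`. Both
    thresholds are illustrative values, not taken from the paper.
    """
    scores = [node_score(b) for b in blocks]
    labels = []
    for i, s in enumerate(scores):
        neighbors = scores[max(0, i - 1):i] + scores[i + 1:i + 2]
        neighbor_avg = sum(neighbors) / len(neighbors) if neighbors else 0.0
        labels.append(s >= high or (s >= low and neighbor_avg >= high))
    return labels


# Made-up page: navigation bar, long paragraph, short paragraph,
# long paragraph, advertisement, footer.
sample = [
    Block(text_chars=60, link_chars=60, tag_count=12),
    Block(text_chars=400, link_chars=10, tag_count=4),
    Block(text_chars=30, link_chars=0, tag_count=3),
    Block(text_chars=350, link_chars=0, tag_count=3),
    Block(text_chars=40, link_chars=40, tag_count=8),
    Block(text_chars=50, link_chars=45, tag_count=10),
]
labels = classify(sample)
```

In this sketch the short paragraph would fail the `high` threshold on its own, but it is kept because it sits between two dense paragraphs, which mirrors the abstract's observation that similar content is distributed continuously on a page. The link-heavy navigation, advertisement, and footer blocks are rejected regardless of their neighbors.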
