
Web Information Retrieval Exploiting Markup Pattern  

Kim, Min-Soo (Ajou University, Information and Communication)
Kim, Min-Koo (Ajou University, Information and Communication)
Abstract
Over the years, considerable attention has been paid to exploiting the inherent semantics of HTML in web document retrieval. Although HTML is mainly presentation-oriented, HTML tags implicitly carry useful semantics that can capture the meaning of text. Building on this idea, in this paper we define the 'markup pattern' and try to improve the performance of web document retrieval using markup patterns. A markup pattern reflects the intent of the web document's publisher and the internal semantics of the text in the document. To discover and exploit markup patterns, we propose a new scheme for extracting concepts and weighting documents. For evaluation, we select two domains, the BBC and CNN web sites, and use their search engines to gather domain documents. We re-weight and re-score the documents using the proposed scheme and show a performance improvement in both domains.
Keywords
Markup Pattern; Web Document Retrieval; Information Retrieval
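
The general idea of letting HTML tags influence term weights can be illustrated with a minimal sketch. This is not the scheme proposed in the paper, whose details are not given in the abstract. It is a generic tag-weighted term count written under stated assumptions: the tag weights, the names `TAG_WEIGHTS`, `TagWeightedTerms`, and `score`, and the rule of taking the highest weight among the enclosing tags are all hypothetical choices made for illustration.

```python
import re
from collections import Counter
from html.parser import HTMLParser

# Hypothetical weights, chosen only for illustration (not taken from the paper).
TAG_WEIGHTS = {"title": 3.0, "h1": 2.5, "h2": 2.0, "strong": 1.5, "b": 1.5, "a": 1.2}
DEFAULT_WEIGHT = 1.0
# Void elements never get a closing tag, so they are not tracked as open.
VOID_TAGS = {"br", "img", "hr", "meta", "link", "input"}


def tokenize(text):
    """Lowercase the text and split it into alphanumeric terms."""
    return re.findall(r"[a-z0-9]+", text.lower())


class TagWeightedTerms(HTMLParser):
    """Accumulates a weight per term based on the tags that enclose it."""

    def __init__(self):
        super().__init__()
        self.open_tags = []            # stack of currently open tags
        self.term_weights = Counter()  # term -> accumulated weight

    def handle_starttag(self, tag, attrs):
        if tag not in VOID_TAGS:
            self.open_tags.append(tag)

    def handle_endtag(self, tag):
        # Pop back to the most recent matching open tag; ignore stray closers.
        if tag in self.open_tags:
            while self.open_tags.pop() != tag:
                pass

    def handle_data(self, data):
        # Assumption: a term takes the highest weight among its enclosing tags.
        weight = max(
            (TAG_WEIGHTS.get(t, DEFAULT_WEIGHT) for t in self.open_tags),
            default=DEFAULT_WEIGHT,
        )
        for term in tokenize(data):
            self.term_weights[term] += weight


def score(html, query):
    """Sum the tag-weighted counts of the query terms found in the document."""
    parser = TagWeightedTerms()
    parser.feed(html)
    parser.close()
    return sum(parser.term_weights[t] for t in tokenize(query))


doc_a = "<html><head><title>Election results</title></head><body><p>Weather today</p></body></html>"
doc_b = "<html><head><title>Weather today</title></head><body><p>Election results</p></body></html>"
print(score(doc_a, "election"), score(doc_b, "election"))
```

In this sketch a query term that appears in a document's title contributes more to the score than the same term in a body paragraph, so `doc_a` would rank above `doc_b` for the query "election". A fuller treatment would also need to skip `script` and `style` content and normalize for document length.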