A Research for Web Documents Genre Classification using STW  

Ko, Byeong-Kyu (Dept. of Computer Engineering, Chosun University)
Oh, Kun-Seok (Dept. of Hospital Information Management, Gwangju Health University)
Kim, Pan-Koo (Dept. of Computer Engineering, Chosun University)
Abstract
Many researchers have studied how to let machines understand the meaning of human natural language, using text-based, page-rank-based, and other approaches. In particular, URL and HTML tag information in web documents is again attracting attention as a means of automatically analyzing large volumes of web documents. In this paper, we propose an STW (Semantic Term Weight) approach, based on the syntactic and linguistic structure of web documents, for classifying their genres. For the evaluation, we used more than 1,000 documents from the 20-Genre-collection corpus to train an SVM-based classifier. We then tested on the KI-04 corpus to evaluate the performance of the proposed method, comparing classification accuracy in one experiment using STW and one without it. The results show that the proposed STW-based approach achieved accuracy approximately 10.2% higher than the approach without STW.
Keywords
Web Document Genre Classification; SVM; Semantic Term Weight;
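
The abstract does not define how STW is computed, so the following is only a minimal sketch of the general idea of weighting terms by the HTML structure they appear in. The tag weights, the `TAG_WEIGHTS` table, and the `weighted_terms` helper are all hypothetical names and values chosen for illustration; they are not the authors' method.

```python
from collections import Counter
from html.parser import HTMLParser

# Hypothetical tag weights, chosen only for illustration; the paper's
# actual STW values are not given in the abstract.
TAG_WEIGHTS = {"title": 3.0, "h1": 2.0, "a": 1.5}
DEFAULT_WEIGHT = 1.0

# Void elements never get a closing tag, so they are not tracked.
VOID_TAGS = {"br", "img", "hr", "meta", "link", "input"}


class TagWeightedTerms(HTMLParser):
    """Accumulate term weights, scaled by the innermost enclosing tag."""

    def __init__(self):
        super().__init__()
        self.stack = []            # currently open tags
        self.weights = Counter()   # term -> accumulated weight

    def handle_starttag(self, tag, attrs):
        if tag not in VOID_TAGS:
            self.stack.append(tag)

    def handle_endtag(self, tag):
        # Pop back to the matching open tag, tolerating unclosed tags.
        if tag in self.stack:
            while self.stack.pop() != tag:
                pass

    def handle_data(self, data):
        if self.stack:
            weight = TAG_WEIGHTS.get(self.stack[-1], DEFAULT_WEIGHT)
        else:
            weight = DEFAULT_WEIGHT
        for term in data.lower().split():
            self.weights[term] += weight


def weighted_terms(html):
    """Return a dict mapping each term to its tag-weighted frequency."""
    parser = TagWeightedTerms()
    parser.feed(html)
    parser.close()
    return dict(parser.weights)
```

A dictionary produced this way could serve as a feature vector for an SVM classifier of the kind the abstract mentions; that training step is omitted here.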