Automatic Classification of Web Documents According to Their Styles

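  • Kong Joo Lee (이공주), Department of Computer Information, Kyungin Women's College
  • Chul Su Lim (임철수), Department of Computer Science, Graduate School, Korea Advanced Institute of Science and Technology (KAIST); Semantic Quest Inc. (research)
  • Jae-Hoon Kim (김재훈), Department of Computer Engineering, Korea Maritime University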
  • Published : 2004.08.01

Abstract

A genre, or style, offers a view of documents that differs from subject or topic, and it can also serve as a criterion for classifying them. Several studies have addressed style detection for textual documents, but only a few have dealt with web documents. In this paper, we propose sets of features for detecting the styles of web documents. Web documents differ from plain textual documents in that they contain URLs and HTML tags. We introduce features specific to web documents, extracted from URLs and HTML tags. Experimental results allow us to evaluate the characteristics and performance of these features.

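Style, or genre, can serve as another perspective on a document, distinct from its subject. For this reason, a document's style can be used as a criterion for document classification. A number of studies have examined systems that automatically classify documents by style. Most of them, however, targeted general text documents, and only a few addressed the style classification of web documents. Unlike general documents, web documents have URLs and HTML. In this study, we use features extracted from URLs and HTML for the style classification of web documents. Through experiments, we examine how these features affect the style classification of web documents.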
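
The abstract does not list the concrete features, so the following is only a minimal sketch of what extracting URL-based and HTML-tag-based features might look like. The function name `extract_features`, the helper class `TagCounter`, and every feature in the returned dictionary (URL depth, file extension, tag counts, and so on) are assumptions chosen for illustration, not the feature set used in the paper. Only the Python standard library is used.

```python
from collections import Counter
from html.parser import HTMLParser
from urllib.parse import urlparse


class TagCounter(HTMLParser):
    """Counts opening tags seen while parsing an HTML page."""

    def __init__(self):
        super().__init__()
        self.tag_counts = Counter()

    def handle_starttag(self, tag, attrs):
        # The parser passes tag names in lower case.
        self.tag_counts[tag] += 1


def extract_features(url, html):
    """Return illustrative URL and HTML-tag features (hypothetical set)."""
    parsed = urlparse(url)
    # Non-empty path components, e.g. "/a/b/c.html" -> ["a", "b", "c.html"].
    path_segments = [s for s in parsed.path.split("/") if s]
    last = path_segments[-1] if path_segments else ""
    extension = last.rsplit(".", 1)[1].lower() if "." in last else ""

    counter = TagCounter()
    counter.feed(html)
    counter.close()
    tags = counter.tag_counts
    total_tags = sum(tags.values())

    return {
        # URL-derived features (assumed, not taken from the paper)
        "url_depth": len(path_segments),
        "url_extension": extension,
        "url_has_query": bool(parsed.query),
        "url_has_tilde": "~" in parsed.path,
        # HTML-tag-derived features (assumed, not taken from the paper)
        "tag_total": total_tags,
        "tag_a_ratio": tags["a"] / total_tags if total_tags else 0.0,
        "tag_img": tags["img"],
        "tag_table": tags["table"],
        "tag_form": tags["form"],
    }
```

A feature dictionary of this kind could then be turned into a vector and passed to any classifier. Which features actually help to separate styles is the empirical question the abstract says the experiments address.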
