Web Page Classification System based upon Ontology

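  • 최재혁 (Network Business Division, Samsung Electronics) ;
  • 서혜성 (Department of Information and Communication Engineering, Graduate School of Information and Communication, Ajou University) ;
  • 노상욱 (School of Computer Science and Information Engineering, The Catholic University of Korea) ;
  • 최경희 (Graduate School of Information and Communication, Ajou University) ;
  • 정기현 (School of Electronics Engineering, Ajou University)
  • Published: 2004.10.01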

Abstract

In this paper, we present an automated Web page classification system based on ontology. In the first step, to identify the terms that represent the category to which each Web page belongs, we compute the product of term frequency and document frequency for each term. In the second step, we compute the information gain of each term selected in the first step and give priority to the terms most likely to discriminate between categories. From the selected terms and the class labels of the Web pages, machine learning algorithms generate compiled rules. These rules classify arbitrary Web pages into the categories defined by a domain ontology. In our experiments on a given set of Web pages, 78 terms were ultimately selected out of an average of 240 terms per category, and the classification rules were generated from them. The average classification accuracy of the proposed system was about 83.52%.
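The two term-selection steps described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract does not give the exact formulas, so the sketch assumes that the first-step score is the total term frequency within a category multiplied by the number of that category's pages containing the term, and that information gain is measured on the presence or absence of a term. The function names, the toy pages, and the category labels are all hypothetical. The rule-learning stage is not shown.

```python
import math
from collections import Counter


def tf_df_scores(docs):
    """Score each term in one category's pages as (term frequency) x (document frequency).

    Assumption: TF is the total count over the category's pages and DF is the
    number of those pages that contain the term.
    """
    tf, df = Counter(), Counter()
    for tokens in docs:
        tf.update(tokens)        # every occurrence counts toward TF
        df.update(set(tokens))   # each page counts once toward DF
    return {term: tf[term] * df[term] for term in tf}


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((c / total) * math.log2(c / total) for c in Counter(labels).values())


def information_gain(term, docs, labels):
    """Entropy reduction obtained by splitting the pages on presence of `term`."""
    present = [lab for tokens, lab in zip(docs, labels) if term in tokens]
    absent = [lab for tokens, lab in zip(docs, labels) if term not in tokens]
    remainder = sum(
        len(part) / len(labels) * entropy(part) for part in (present, absent) if part
    )
    return entropy(labels) - remainder


# Hypothetical toy corpus: tokenized pages and their category labels.
docs = [
    ["game", "team", "score"],
    ["team", "coach", "game", "game"],
    ["stock", "market", "price"],
    ["market", "bank", "stock"],
]
labels = ["sports", "sports", "finance", "finance"]

# Step 1: candidate terms for the "sports" category, ranked by TF x DF.
sports_docs = [d for d, lab in zip(docs, labels) if lab == "sports"]
sports_scores = tf_df_scores(sports_docs)

# Step 2: re-rank the candidates by information gain over all pages.
ranked = sorted(sports_scores, key=lambda t: information_gain(t, docs, labels), reverse=True)
```

In this toy corpus, "game" appears in both sports pages and in no finance page, so it separates the two categories perfectly and ranks ahead of a term such as "score", which appears in only one page.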
