Automatic Document Classification Based on k-NN Classifier and Object-Based Thesaurus  

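Bang Sun-Iee (Department of Computer Science and Statistical Information, Jeonbuk National University)
Yang Jae-Dong (Division of Electronics and Information Engineering, Jeonbuk National University)
Yang Hyung-Jeong (Department of Computer Science, Carnegie Mellon University)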
Abstract
Numerous statistical and machine learning techniques have been studied for automatic text classification. However, because they train classifiers using only the feature vectors of documents, ambiguity between two candidate categories significantly degrades classification precision. To remedy this drawback, we propose a new method that incorporates relationship information among categories into existing classifiers. In this paper, we first classify documents with the k-NN classifier, which is generally known to perform relatively well in spite of its simplicity. We then use relationship information from an object-based thesaurus to reduce the ambiguity. By referencing the various relationships in the thesaurus that correspond to the structured categories, the method removes the ambiguity and drastically improves the precision of k-NN classification. Experimental results show that the method improves precision by up to 13.86% over k-NN classification while preserving its recall.
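The two-stage idea in the abstract (k-NN first, then thesaurus relationships to break ties between two candidate categories) can be illustrated with a minimal sketch. The abstract does not say how the thesaurus relationships are scored, so everything below is an assumption made for illustration: the term-count vectors, the `margin` that defines "ambiguous", the `boost` weight, the toy `TRAINING` set, and the `RELATED_TERMS` map that stands in for an object-based thesaurus are all hypothetical and are not the authors' implementation.

```python
import math
from collections import Counter, defaultdict


def vectorize(text):
    """Bag-of-words term counts (a stand-in for real feature vectors)."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    na = math.sqrt(sum(w * w for w in a.values()))
    nb = math.sqrt(sum(w * w for w in b.values()))
    return dot / (na * nb) if na and nb else 0.0


def knn_scores(query, training, k):
    """Sum the similarities of the k nearest training documents per category."""
    nearest = sorted(
        ((cosine(query, vec), cat) for vec, cat in training),
        key=lambda pair: pair[0],
        reverse=True,
    )[:k]
    scores = defaultdict(float)
    for sim, cat in nearest:
        scores[cat] += sim
    return dict(scores)


def classify(text, training, related_terms, k=4, margin=0.1, boost=0.5):
    """k-NN vote, then a thesaurus-based tie-break (assumed scoring rule)."""
    query = vectorize(text)
    scores = knn_scores(query, training, k)
    ranked = sorted(scores.items(), key=lambda pair: pair[1], reverse=True)
    # Treat the result as ambiguous when the top two categories are close.
    if len(ranked) >= 2 and ranked[0][1] - ranked[1][1] <= margin:
        for cat, _ in ranked[:2]:
            # Count query terms that the thesaurus relates to this category.
            overlap = sum(query[t] for t in related_terms.get(cat, ()))
            scores[cat] += boost * overlap
    return max(scores, key=scores.get)


# Hypothetical toy data: four labeled documents and a tiny "thesaurus".
TRAINING = [
    (vectorize("bank loan interest"), "finance"),
    (vectorize("bank account loan"), "finance"),
    (vectorize("bank river water"), "geography"),
    (vectorize("river bank erosion"), "geography"),
]
RELATED_TERMS = {
    "finance": {"deposit", "loan", "interest"},
    "geography": {"river", "shore", "erosion"},
}
```

With this toy data, a query such as "bank deposit" is equally similar to all four training documents, so the k-NN vote alone cannot separate the two categories; only the related-term lookup does. A real system would replace the term sets with the typed relationships of the thesaurus and tune `margin` and `boost` on held-out data.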
Keywords
Document Classification; Nearest Neighbor Classification; Object-Based Thesaurus