Text Classification based on a Feature Projection Technique with Robustness from Noisy Data

오류 데이타에 강한 자질 투영법 기반의 문서 범주화 기법

  • Published : 2004.04.01

Abstract

This paper presents a new text classifier based on a feature projection technique. In feature projections, training documents are represented as their projections onto each feature. Classification is carried out on the individual feature projections, and the final class is determined by summing the individual classifications of each feature. In our experiments, the proposed classifier showed high performance. In particular, it is faster and more robust to noisy data than k-NN and SVM, which are among the state-of-the-art text classifiers. Because the algorithm of the proposed classifier is very simple, both its implementation and its training process are straightforward. It can therefore be a useful classifier for text classification tasks that require fast execution, robustness, and high performance.

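This paper proposes a new text classifier that uses a feature projection technique. The proposed classifier represents training documents as projections onto each feature. A document is classified by voting from each of the projected features. In experiments, the proposed classifier shows high performance despite its simple structure. In particular, compared with k-nearest neighbor (k-NN) and support vector machines (SVM), which have shown high performance among existing text categorization methods, it has the advantages of fast execution speed and high performance in environments with a large amount of noisy data. In addition, because the algorithm of the proposed classifier is very simple, its implementation and training can be carried out easily. For these reasons, the proposed classifier should be useful in text categorization applications that require fast execution speed, robustness, and high performance.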
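
The abstract describes the method only in outline: each training document is projected onto its features, each feature of a test document casts a vote, and the votes are summed. The following Python sketch illustrates that general idea under stated assumptions. The function names `train` and `classify`, the use of relative term frequency as the projection weight, and the per-feature vote normalization are all illustrative choices made here, not the weighting or voting scheme of the paper.

```python
from collections import defaultdict


def train(docs):
    """Build feature projections from (tokens, label) pairs.

    Each feature (term) maps to a list of (weight, label) points, one per
    training document that contains it. The weight is relative term
    frequency, an assumption made for this sketch.
    """
    projections = defaultdict(list)
    for tokens, label in docs:
        counts = defaultdict(int)
        for tok in tokens:
            counts[tok] += 1
        for tok, count in counts.items():
            projections[tok].append((count / len(tokens), label))
    return projections


def classify(projections, tokens):
    """Sum the per-feature votes and return the best label (or None)."""
    scores = defaultdict(float)
    for tok in set(tokens):
        # Collect the weight each class has on this feature's projection.
        votes = defaultdict(float)
        for weight, label in projections.get(tok, []):
            votes[label] += weight
        total = sum(votes.values())
        if total == 0:
            continue  # feature unseen in training: it casts no vote
        # Normalize so every feature contributes one unit of vote in total.
        for label, vote in votes.items():
            scores[label] += vote / total
    if not scores:
        return None
    return max(scores, key=scores.get)


# Toy data, invented purely for illustration.
training_docs = [
    (["ball", "goal", "team"], "sports"),
    (["team", "score", "ball"], "sports"),
    (["vote", "party", "law"], "politics"),
    (["law", "court", "vote"], "politics"),
]
model = train(training_docs)
print(classify(model, ["ball", "team", "vote"]))
```

In this sketch training is a single pass that fills a dictionary, and classification touches only the features present in the test document, which is consistent with the simplicity and speed the abstract claims. Because each feature votes independently, a few mislabeled training documents can sway only the features they contain; this is one plausible reading of the robustness claim, not a result verified here.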
