A Text Categorization Method Improved by Removing Noisy Training Documents

Improving the Performance of a Text Categorization Method by Removing Noisy Training Documents

  • Published : 2005.09.01

Abstract

When binary classification is applied to multi-class text categorization, the One-Against-All method is generally used. However, this method has a problem: the documents in a negative set are not labeled by humans, so the training data can include many noisy documents. To solve this problem, we propose applying the Sliding Window technique and the EM algorithm to binary text classification. The proposed method improves binary text classification by extracting noisy documents from the training data with the Sliding Window technique and re-assigning the categories of these documents with the EM algorithm.

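When binary classification is applied to multi-class classification in text categorization, the One-Against-All method is generally used. However, this method has one problem: the documents in the positive set were assigned their category directly by humans, whereas the documents in the negative set were not, so the negative set can contain many noisy documents. To address this problem, this paper proposes applying the sliding window technique and the EM algorithm to text categorization based on binary classification. The proposed method first extracts noisy documents using the sliding window technique and then re-assigns their categories using the EM algorithm, thereby improving the performance of text categorization based on binary classification.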
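The abstract names the two steps (extract noisy documents with a sliding window, then re-assign their categories with EM) but does not say how the window is scored or how EM is initialized. The Python sketch below is one possible reading, not the authors' implementation. Its assumptions: a small multinomial naive Bayes classifier ranks the training documents by P(positive | document); a window slides over that ranked list and flags negative-labeled documents wherever the labels inside the window are mixed (high label entropy); the flagged documents are then treated as unlabeled and re-estimated with naive Bayes EM. The function names, the window size, the entropy threshold, and the toy data are all hypothetical.

```python
import math
from collections import Counter

def train_nb(weighted_docs, vocab):
    """Fit a two-class multinomial naive Bayes model from (tokens, P(positive)) pairs."""
    pos_mass = sum(w for _, w in weighted_docs)
    prior_pos = (pos_mass + 1.0) / (len(weighted_docs) + 2.0)
    counts = {0: Counter(), 1: Counter()}
    for tokens, w in weighted_docs:
        for tok in tokens:
            counts[1][tok] += w          # soft count toward the positive class
            counts[0][tok] += 1.0 - w    # remaining mass toward the negative class
    log_prob = {}
    for c in (0, 1):
        total = sum(counts[c].values()) + len(vocab)  # Laplace smoothing
        log_prob[c] = {t: math.log((counts[c][t] + 1.0) / total) for t in vocab}
    log_prior = {1: math.log(prior_pos), 0: math.log(1.0 - prior_pos)}
    return {"log_prior": log_prior, "log_prob": log_prob}

def posterior_pos(model, tokens):
    """Return P(positive | tokens); tokens outside the vocabulary are ignored."""
    score = {c: model["log_prior"][c]
                + sum(model["log_prob"][c].get(t, 0.0) for t in tokens)
             for c in (0, 1)}
    top = max(score.values())
    e_pos, e_neg = math.exp(score[1] - top), math.exp(score[0] - top)
    return e_pos / (e_pos + e_neg)

def label_entropy(window_labels):
    """Binary entropy (in bits) of the 0/1 labels inside one window."""
    p = sum(window_labels) / len(window_labels)
    if p in (0.0, 1.0):
        return 0.0
    return -(p * math.log2(p) + (1 - p) * math.log2(1 - p))

def find_noisy_negatives(scores, labels, window=3, min_entropy=0.9):
    """Assumed criterion: slide a window over documents ranked by score and
    flag negative-labeled documents in windows whose labels are mixed."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    noisy = set()
    for start in range(len(order) - window + 1):
        idx = order[start:start + window]
        if label_entropy([labels[i] for i in idx]) >= min_entropy:
            noisy.update(i for i in idx if labels[i] == 0)
    return noisy

def reassign_with_em(docs, labels, noisy, vocab, iterations=5):
    """Treat flagged documents as unlabeled and re-estimate them with EM."""
    clean = [(docs[i], float(labels[i])) for i in range(len(docs)) if i not in noisy]
    model = train_nb(clean, vocab)  # initial model from trusted labels only
    post = {}
    for _ in range(iterations):
        post = {i: posterior_pos(model, docs[i]) for i in noisy}           # E-step
        model = train_nb(clean + [(docs[i], p) for i, p in post.items()], vocab)  # M-step
    return post

# Toy One-Against-All training set for a hypothetical "sports" category.
docs = [
    ["goal", "match", "team"],    # 0: labeled sports (positive)
    ["team", "score", "goal"],    # 1: positive
    ["match", "score", "team"],   # 2: positive
    ["stock", "market", "bank"],  # 3: negative
    ["bank", "loan", "market"],   # 4: negative
    ["stock", "loan", "bank"],    # 5: negative
    ["goal", "team", "score"],    # 6: negative by construction, but reads like sports
]
labels = [1, 1, 1, 0, 0, 0, 0]
vocab = {tok for doc in docs for tok in doc}

initial = train_nb([(d, float(y)) for d, y in zip(docs, labels)], vocab)
scores = [posterior_pos(initial, d) for d in docs]
noisy = find_noisy_negatives(scores, labels, window=3, min_entropy=0.9)
posteriors = reassign_with_em(docs, labels, noisy, vocab)
new_labels = [int(posteriors[i] >= 0.5) if i in noisy else labels[i]
              for i in range(len(docs))]
print(sorted(noisy), new_labels)
```

In this toy run, document 6 sits among the positive documents in the ranked list, so the window flags it and EM moves it to the positive class. The window may also flag a true negative that borders the positive region, which EM then keeps in the negative class. Only negative-labeled documents are ever flagged, following the abstract's point that positive labels come from humans while negative labels do not.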
