Combining Multiple Classifiers for Automatic Classification of Email Documents

전자우편 문서의 자동분류를 위한 다중 분류기 결합

  • 이지행 (Natural Language Processing Research Institute, Daumsoft Inc.);
  • 조성배 (Department of Computer Science, Yonsei University)
  • Published : 2002.04.01

Abstract

Automated text classification is an important means of managing and processing the large and continuously growing volume of documents in digital form. Recently, text classification has been addressed with machine learning techniques such as k-nearest neighbor, decision trees, support vector machines, and neural networks. However, most studies evaluate on well-organized text corpora rather than on real problems, and therefore do not demonstrate practical usefulness. This paper proposes and analyzes text classification methods for a real application, the classification of email documents. First, we propose a method that combines multiple neural networks, improving performance by merging their outputs with a maximum rule or with a combining neural network. Second, we present a strategy for combining multiple machine learning classifiers, in which voting, Borda count, and a neural network combiner improve overall classification performance. Experimental results show the usefulness of the proposed methods in a real application domain, yielding precision above 90%.

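As documents in digital form spread widely and keep increasing, the importance of automatic document classification for processing them is widely recognized. Recent work on automatic document classification has used a variety of machine learning techniques, such as k-nearest neighbor, decision trees, Support Vector Machine, and neural networks. However, many studies report results on well-organized data sets and place little weight on applicability to real problems. This paper proposes methods for combining multiple classifiers that can be applied to an automatic question-answering system, an application of document classification, and uses them to solve a real email document classification problem. First, we propose document classification using multiple neural networks; the proposed method improves performance through maximum-value combination and neural network combination. Second, we improve classification performance by combining several different classifiers, applying voting, Borda combination, and neural network combination. Experiments analyzing practical feasibility showed precision above 90%, indicating that the proposed methods can be practical.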
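
The three non-neural combination rules named in the abstract (maximum, voting, and Borda count) can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the class labels, and the assumption that each classifier returns a dictionary mapping class label to score are all hypothetical, and the neural network combiner is omitted because the abstract gives no details of it. Ties are broken by first occurrence, which is an arbitrary choice made here.

```python
from collections import Counter


def max_combine(scores_per_classifier):
    """Return the class holding the single highest score over all classifiers."""
    best_label, best_score = None, float("-inf")
    for scores in scores_per_classifier:
        for label, score in scores.items():
            if score > best_score:
                best_label, best_score = label, score
    return best_label


def majority_vote(scores_per_classifier):
    """Each classifier votes for its top-scoring class; the most-voted class wins."""
    votes = Counter(max(scores, key=scores.get) for scores in scores_per_classifier)
    return votes.most_common(1)[0][0]


def borda_count(scores_per_classifier):
    """Each classifier ranks the classes; rank r (0 = best) earns n_classes - 1 - r points."""
    totals = Counter()
    for scores in scores_per_classifier:
        ranked = sorted(scores, key=scores.get, reverse=True)
        for rank, label in enumerate(ranked):
            totals[label] += len(ranked) - 1 - rank
    return max(totals, key=totals.get)


# Hypothetical scores from three classifiers for one email message.
example = [
    {"billing": 0.50, "login": 0.40, "spam": 0.10},
    {"billing": 0.45, "login": 0.40, "spam": 0.15},
    {"billing": 0.06, "login": 0.90, "spam": 0.04},
]

print(max_combine(example))    # login: one classifier is very confident
print(majority_vote(example))  # billing: two of three classifiers rank it first
print(borda_count(example))    # billing: 5 points against 4 for login
```

In this toy input the maximum rule follows the single most confident classifier, while voting and Borda count follow the majority. That difference is one reason to compare several combination rules on the same data.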
