Improving Text Categorization with High Quality Bigrams

Improving Text Categorization Performance Using High-Quality Bigrams

  • Published : 2002.08.01

Abstract

This paper presents an efficient text categorization algorithm that generates high quality bigrams by using the information gain metric, combined with various frequency thresholds. The bigrams, along with unigrams, are then given as features to a Naive Bayes classifier. The experimental results suggest that the bigrams, while small in number, can substantially contribute to improving text categorization. Upon close examination of the results, we conclude that the algorithm is most successful in correctly classifying more positive documents, but may cause more negative documents to be classified incorrectly.

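This paper proposes an efficient text categorization algorithm that uses information gain to generate high-quality bigrams. In the experiments, adding a small number of bigrams to the unigrams and applying them to a Naive Bayes classifier considerably improved the categorization success rate. Analysis of the results suggests that the proposed algorithm is better at classifying positive documents.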
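
The pipeline summarized in the abstract (score candidate bigrams by information gain, filter them with frequency thresholds, then feed the surviving bigrams together with the unigrams to a Naive Bayes classifier) can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract does not give the actual thresholds or selection procedure, so the document-frequency cutoff `min_df`, the `top_k` limit, the helper names, and the toy corpus below are all assumptions made for illustration.

```python
import math
from collections import Counter

def tokens(text):
    return text.lower().split()

def bigrams(words):
    return [f"{a}_{b}" for a, b in zip(words, words[1:])]

def entropy(pos, neg):
    """Binary entropy (in bits) of a set with pos/neg document counts."""
    total, h = pos + neg, 0.0
    for c in (pos, neg):
        if c:
            h -= (c / total) * math.log2(c / total)
    return h

def information_gain(df_pos, df_neg, n_pos, n_neg):
    """Information gain of splitting documents on 'feature present or absent'."""
    n = n_pos + n_neg
    present = df_pos + df_neg
    h_after = (present / n) * entropy(df_pos, df_neg)
    h_after += ((n - present) / n) * entropy(n_pos - df_pos, n_neg - df_neg)
    return entropy(n_pos, n_neg) - h_after

def select_bigrams(docs, labels, min_df=2, top_k=1):
    """Keep bigrams seen in at least min_df documents, ranked by information gain.
    min_df and top_k are hypothetical stand-ins for the paper's thresholds."""
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    df_pos, df_neg = Counter(), Counter()
    for text, y in zip(docs, labels):
        for bg in set(bigrams(tokens(text))):
            (df_pos if y else df_neg)[bg] += 1
    candidates = [bg for bg in set(df_pos) | set(df_neg)
                  if df_pos[bg] + df_neg[bg] >= min_df]
    ranked = sorted(candidates, key=lambda bg: (
        -information_gain(df_pos[bg], df_neg[bg], n_pos, n_neg), bg))
    return ranked[:top_k]

def featurize(text, kept):
    """All unigrams plus only the selected bigrams."""
    words = tokens(text)
    return words + [bg for bg in bigrams(words) if bg in kept]

def train_nb(docs, labels, kept):
    """Multinomial Naive Bayes counts; labels are 0 or 1, both must occur."""
    counts = {0: Counter(), 1: Counter()}
    for text, y in zip(docs, labels):
        counts[y].update(featurize(text, kept))
    vocab = set(counts[0]) | set(counts[1])
    return counts, Counter(labels), vocab

def predict_nb(model, text, kept):
    counts, class_docs, vocab = model
    n_docs = sum(class_docs.values())
    scores = {}
    for y in (0, 1):
        total = sum(counts[y].values())
        score = math.log(class_docs[y] / n_docs)
        for f in featurize(text, kept):
            if f in vocab:  # features unseen in training are ignored
                # Laplace (add-one) smoothing
                score += math.log((counts[y][f] + 1) / (total + len(vocab)))
        scores[y] = score
    return max(scores, key=scores.get)

# Toy corpus (invented for illustration): 1 = finance, 0 = other.
DOCS = [
    "interest rate cut boosts bond market",
    "central bank holds interest rate steady",
    "interest rate rise worries investors",
    "team shows little interest in trade",
    "high heart rate after the match",
    "coach praises team after the match",
]
LABELS = [1, 1, 1, 0, 0, 0]

KEPT = set(select_bigrams(DOCS, LABELS, min_df=2, top_k=1))
MODEL = train_nb(DOCS, LABELS, KEPT)
print(KEPT, predict_nb(MODEL, "bank changes interest rate", KEPT))
```

In this toy corpus the words "interest" and "rate" each appear in both classes, while the bigram built from them appears only in the positive class, which is the kind of signal a small set of selected bigrams can add on top of unigrams.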
