N-gram Feature Selection for Text Classification Based on Symmetrical Conditional Probability and TF-IDF

  • Choi, Woo-Sik (Department of Industrial Management Engineering, Korea University)
  • Kim, Seoung Bum (Department of Industrial Management Engineering, Korea University)
  • Received : 2015.02.09
  • Accepted : 2015.05.11
  • Published : 2015.08.15

Abstract

The rapid growth of the World Wide Web and online information services has produced a huge number of text documents and made them widely accessible. Selecting important keywords is an essential step in analyzing such texts. In this paper, we propose a feature selection method that combines term frequency-inverse document frequency (TF-IDF) with symmetrical conditional probability. The proposed method can identify N-gram features, that is, sequences of consecutive words. Its effectiveness is demonstrated on a real text dataset from the UCI Machine Learning Repository (University of California, Irvine).
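
As a rough illustration of the two ingredients named in the abstract, the sketch below scores bigrams with symmetrical conditional probability (SCP), using the commonly cited form SCP(x, y) = p(x, y)^2 / (p(x) p(y)), and with a simple corpus-level TF-IDF weight. The abstract does not state how the paper combines the two scores, so the product used in `rank_bigrams` is purely an assumption, and all function names here are hypothetical.

```python
import math
from collections import Counter


def bigrams(tokens):
    """Return the adjacent word pairs of a token list."""
    return list(zip(tokens, tokens[1:]))


def scp_scores(docs):
    """SCP(x, y) = p(x, y)^2 / (p(x) * p(y)) for every bigram in the corpus.

    Probabilities are relative frequencies over all unigrams and all bigrams.
    """
    unigram_counts = Counter(w for doc in docs for w in doc)
    bigram_counts = Counter(bg for doc in docs for bg in bigrams(doc))
    n_uni = sum(unigram_counts.values())
    n_bi = sum(bigram_counts.values())
    scores = {}
    for (x, y), count in bigram_counts.items():
        p_xy = count / n_bi
        p_x = unigram_counts[x] / n_uni
        p_y = unigram_counts[y] / n_uni
        scores[(x, y)] = p_xy ** 2 / (p_x * p_y)
    return scores


def tfidf_scores(docs):
    """Corpus-level TF-IDF per bigram: total count * log(N / document frequency)."""
    n_docs = len(docs)
    tf = Counter(bg for doc in docs for bg in bigrams(doc))
    df = Counter(bg for doc in docs for bg in set(bigrams(doc)))
    return {bg: tf[bg] * math.log(n_docs / df[bg]) for bg in tf}


def rank_bigrams(docs):
    """Rank bigrams by SCP * TF-IDF.

    ASSUMPTION: the product is an illustrative way to combine the scores,
    not the combination rule used in the paper.
    """
    scp = scp_scores(docs)
    tfidf = tfidf_scores(docs)
    combined = {bg: scp[bg] * tfidf[bg] for bg in scp}
    return sorted(combined, key=combined.get, reverse=True)
```

Calling `rank_bigrams` on a list of tokenized documents returns the bigrams ordered from strongest to weakest candidate feature; the top entries could then be passed to a classifier as selected features. Longer N-grams would need a generalized cohesion measure, which this sketch does not attempt.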
