http://dx.doi.org/10.5351/KJAS.2019.32.2.265

Comparison of term weighting schemes for document classification  

Jeong, Ho Young (Department of Statistics, Pusan National University)
Shin, Sang Min (Department of Management Information Systems, Dong-A University)
Choi, Yong-Seok (Department of Statistics, Pusan National University)
Publication Information
The Korean Journal of Applied Statistics / v.32, no.2, 2019, pp. 265-276
Abstract
The document-term frequency matrix is the standard data representation in text mining. In this study, we introduce the traditional term weighting scheme TF-IDF (term frequency-inverse document frequency), which is applied to the document-term frequency matrix and used for text classification. We also introduce and compare two more recently proposed schemes, TF-IDF-ICSDF and TF-IGM. In addition, we provide a keyword extraction method that improves the quality of text classification. Using the extracted keywords, we applied a support vector machine to classify the texts. To compare the term weighting schemes, we used precision, recall, and F1-score as performance metrics. The results indicate that the TF-IGM scheme achieved high values on these metrics and was optimal for text classification.
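As a rough illustration of the quantities named in the abstract, the sketch below weights a toy document-term frequency matrix with a textbook TF-IDF (raw term frequency times log(N/df)) and computes precision, recall, and F1-score for one class. The toy matrix, the labels, and the function names `tf_idf` and `precision_recall_f1` are assumptions made for this example; the paper's exact TF-IDF variant, its TF-IDF-ICSDF and TF-IGM weights, and its support vector machine setup are not reproduced here.

```python
import math


def tf_idf(dtm):
    """Weight a document-term frequency matrix (rows = documents, columns = terms).

    Uses the textbook form tf * log(N / df); this is an assumed variant,
    not necessarily the one used in the paper.
    """
    n_docs = len(dtm)
    n_terms = len(dtm[0])
    # Document frequency: number of documents in which each term occurs.
    df = [sum(1 for row in dtm if row[j] > 0) for j in range(n_terms)]
    # Terms that never occur get an IDF of 0 to avoid division by zero.
    idf = [math.log(n_docs / d) if d > 0 else 0.0 for d in df]
    return [[row[j] * idf[j] for j in range(n_terms)] for row in dtm]


def precision_recall_f1(y_true, y_pred, positive):
    """Precision, recall, and F1-score for the class labelled `positive`."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


# Hypothetical toy data: 3 documents, 3 terms.
dtm = [
    [2, 1, 0],
    [0, 1, 3],
    [0, 1, 0],
]
weighted = tf_idf(dtm)

# Hypothetical true and predicted class labels for four documents.
y_true = ["a", "a", "b", "b"]
y_pred = ["a", "b", "b", "b"]
precision, recall, f1 = precision_recall_f1(y_true, y_pred, positive="a")
```

In this toy matrix the second term occurs in every document, so its IDF is log(3/3) = 0 and it contributes nothing to any document's weighted vector. Discounting terms that do not discriminate between documents is the effect a term weighting scheme is meant to have.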
Keywords
term weighting; document classification; text mining; TF-IDF; keyword extraction
References
1 Cho, S. G., Cho, J. H., and Kim, S. B. (2015). Discovering meaningful trends in the inaugural addresses of United States Presidents via text mining, Journal of Korean Institute of Industrial Engineers, 41, 453-460.
2 Dumais, S. (1991). Improving the retrieval of information from external sources, Behavior Research Methods, Instruments & Computers, 23, 229-236.
3 Hornik, K., Meyer, D., and Karatzoglou, A. (2006). Support vector machines in R, Journal of Statistical Software, 15, 1-28.
4 Jung, M. J. (2017). A study on clustering methods for proximity data in text mining (Master thesis), Pusan National University.
5 Lee, M. R. and Bae, H. K. (2002). Design of keyword extraction system using TFIDF, The Korean Society for Cognitive Science, 13, 1-11.
6 Miner, G., Elder, J., and Hill, T. (2012). Practical Text Mining and Statistical Analysis for Non-Structured Text Data Applications, Academic Press, Seoul.
7 Chen, K. and Zong, C. (2003). A new weighting algorithm for linear classifier. In Proceedings of 2003 International Conference on Natural Language Processing and Knowledge Engineering, 650-655.
8 Chen, K., Zhang, Z., Long, J., and Zhang, H. (2016). Turning from TF-IDF to TF-IGM for term weighting in text classification, Expert Systems with Applications, 66, 245-260.
9 Nakov, P., Popova, A., and Mateev, P. (2001). Weight functions impact on LSA performance. In Proceedings of the Recent Advances in Natural Language Processing, Bulgaria, 187-193.
10 Ren, F. and Sohrab, M. G. (2013). Class-indexing-based term weighting for automatic text classification, Information Sciences, 236, 109-125.
11 Satopaa, V., Albrecht, J., Irwin, D., and Raghavan, B. (2011). Finding a "kneedle" in a haystack: detecting knee points in system behavior. In Proceedings of the 2011 31st International Conference on Distributed Computing Systems Workshops (ICDCSW), IEEE, 166-171.
12 Yang, Y. and Liu, X. (1999). A re-examination of text categorization methods. In Proceedings of the ACM SIGIR Conference on Research and Development in Information Retrieval, 42-49.