http://dx.doi.org/10.5351/KJAS.2022.35.4.469

Properties of chi-square statistic and information gain for feature selection of imbalanced text data  

Mun, Hye In (Department of Statistics, Dankook University)
Son, Won (Department of Statistics, Dankook University)
Publication Information
The Korean Journal of Applied Statistics, v.35, no.4, 2022, pp. 469-484
Abstract
Because a large text corpus contains hundreds of thousands of unique words, text data is a typical example of high-dimensional data. Various feature selection methods have therefore been proposed for dimension reduction. Feature selection can improve prediction accuracy, and the reduced data size also improves computational efficiency. The chi-square statistic and the information gain are two of the most popular measures for identifying interesting terms in text data. In this paper, we investigate the theoretical properties of the chi-square statistic and the information gain. We show that the two filtering metrics share theoretical properties such as non-negativity and convexity. However, they differ in that the information gain tends to select more negative features than the chi-square statistic in imbalanced text data.
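For readers unfamiliar with the two metrics, the following is a minimal sketch of how each can be computed for a single term from a 2x2 term-by-class count table. It uses the standard textbook definitions (Pearson's chi-square and entropy-based information gain), which may differ in notation from the paper itself. The function names and the cell labels `a`, `b`, `c`, `d` are assumptions made for illustration only.

```python
import math


def chi_square(a, b, c, d):
    """Pearson chi-square statistic for a 2x2 term/class count table.

    Hypothetical cell labels (not from the paper):
      a: positive-class documents containing the term
      b: negative-class documents containing the term
      c: positive-class documents without the term
      d: negative-class documents without the term
    """
    n = a + b + c + d
    denom = (a + b) * (c + d) * (a + c) * (b + d)
    if denom == 0:
        # A zero margin means the term or the class is constant.
        return 0.0
    return n * (a * d - b * c) ** 2 / denom


def entropy(counts):
    """Shannon entropy in bits of a distribution given as raw counts."""
    total = sum(counts)
    if total == 0:
        return 0.0
    return -sum((k / total) * math.log2(k / total) for k in counts if k > 0)


def information_gain(a, b, c, d):
    """Class entropy minus class entropy conditional on term presence."""
    n = a + b + c + d
    h_class = entropy([a + c, b + d])
    h_given_term = ((a + b) / n) * entropy([a, b]) \
        + ((c + d) / n) * entropy([c, d])
    return h_class - h_given_term


# Illustrative imbalanced table: 50 positive and 950 negative documents.
# The term is absent from every positive document (a "negative feature").
counts = (0, 300, 50, 650)
print(chi_square(*counts), information_gain(*counts))
```

Both functions return zero when the term and the class are independent and a non-negative value otherwise (up to floating-point rounding), which matches the non-negativity property mentioned in the abstract.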
Keywords
chi-square statistic; feature selection; imbalanced data; information gain; text data