[KSCI] Korea Science Citation Index Service

http://dx.doi.org/10.5351/KJAS.2022.35.4.469

Properties of chi-square statistic and information gain for feature selection of imbalanced text data

Mun, Hye In (Department of Statistics, Dankook University)
Son, Won (Department of Statistics, Dankook University)

Publication Information

The Korean Journal of Applied Statistics / v.35, no.4, 2022, pp. 469-484

Abstract

Because a large text corpus can contain hundreds of thousands of unique words, text data are a typical example of high-dimensional data, and various feature selection methods have been proposed for dimension reduction. Feature selection can improve prediction accuracy, and the reduced data size also improves computational efficiency. The chi-square statistic and the information gain are two of the most popular measures for identifying interesting terms in text data. In this paper, we investigate the theoretical properties of the chi-square statistic and the information gain. We show that the two filtering metrics share theoretical properties such as non-negativity and convexity. However, they differ in that the information gain tends to select more negative features than the chi-square statistic in imbalanced text data.
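To make the two filtering metrics concrete, the following is a minimal sketch of how they might be computed for a single term from a 2x2 term-by-class table of document counts. The cell names (a, b, c, d), the function names, and the use of base-2 logarithms are assumptions made for illustration; they are not taken from the paper.

```python
import math


def chi_square(a, b, c, d):
    """Pearson chi-square statistic for a 2x2 term/class table.

    Assumed (hypothetical) cell layout:
      a: positive-class documents containing the term
      b: negative-class documents containing the term
      c: positive-class documents without the term
      d: negative-class documents without the term
    """
    n = a + b + c + d
    denom = (a + b) * (c + d) * (a + c) * (b + d)
    if denom == 0:
        # A zero margin means the table carries no association information.
        return 0.0
    return n * (a * d - b * c) ** 2 / denom


def entropy(counts):
    """Shannon entropy (base 2, an assumed choice) of a list of counts."""
    total = sum(counts)
    return -sum((k / total) * math.log2(k / total) for k in counts if k > 0)


def information_gain(a, b, c, d):
    """Class entropy minus class entropy conditional on term presence."""
    n = a + b + c + d
    h_class = entropy([a + c, b + d])
    h_cond = 0.0
    for row in ([a, b], [c, d]):  # term present, term absent
        m = sum(row)
        if m > 0:
            h_cond += (m / n) * entropy(row)
    return h_class - h_cond


if __name__ == "__main__":
    # Made-up imbalanced example: 100 positive and 900 negative documents.
    table = (40, 10, 60, 890)
    print("chi-square:", chi_square(*table))
    print("information gain:", information_gain(*table))
```

In this layout, the sign of a*d - b*c indicates whether the presence of the term is associated with the positive class (positive sign) or the negative class (negative sign). Both metrics ignore that sign, which is why each is non-negative and equals zero when the term and the class are independent.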

Keywords

chi-square statistic; feature selection; imbalanced data; information gain; text data

