
An Enhanced Feature Selection Method Based on the Impurity of Words Considering Unbalanced Distribution of Documents  

Kang, Jin-Beom (Dept. of Computer Science and Engineering, Hanyang University)
Yang, Jae-Young (RFID/USN Part Manager, Dongbu Information Technology)
Choi, Joong-Min (Dept. of Computer Science and Engineering, Hanyang University)
Abstract
Sample training data for machine learning often contain irrelevant or redundant information, and the original data may also include noise. If the information collected for constructing a learning model is unreliable, the system has difficulty finding accurate relations or regularities between features and categories in the learning phase. Feature selection removes irrelevant or redundant information before the learning model is constructed, in order to improve its performance. Existing feature selection methods assume that the distribution of documents is balanced, both in the number of documents per class and in the length of each document. In practice, however, it is difficult not only to prepare a set of documents of almost equal length, but also to define classes with a fixed number of documents. In this paper, we propose a new feature selection method that considers the impurity of words together with the unbalanced distribution of documents across categories. Feature candidates are first obtained using word impurity, and the final features are then selected by taking the unbalanced document distribution into account. Experiments demonstrate that our method performs better than existing methods.
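The impurity-based candidate selection described in the abstract can be illustrated with a minimal sketch. The snippet below uses a generic Gini-style impurity over each word's per-class document counts; this is an assumption for illustration, not the paper's exact formulation, and the function names (`word_impurity`, `select_features`) are hypothetical.

```python
from collections import defaultdict

def word_impurity(docs, labels):
    """Gini-style impurity of each word's class distribution.

    docs: list of token lists; labels: parallel list of class labels.
    A word concentrated in few classes has low impurity and is a
    better discriminating feature. (Sketch only; the paper's actual
    measure, including its unbalanced-distribution weighting, may differ.)
    """
    counts = defaultdict(lambda: defaultdict(int))  # word -> class -> doc count
    for tokens, label in zip(docs, labels):
        for w in set(tokens):  # count document frequency, not term frequency
            counts[w][label] += 1
    impurity = {}
    for w, per_class in counts.items():
        total = sum(per_class.values())
        impurity[w] = 1.0 - sum((c / total) ** 2 for c in per_class.values())
    return impurity

def select_features(docs, labels, k):
    """Return the k words with the lowest (purest) impurity."""
    imp = word_impurity(docs, labels)
    return sorted(imp, key=imp.get)[:k]
```

A word occurring only in one class scores 0, while a word spread evenly over two classes scores 0.5, so ranking ascending by impurity keeps the most class-indicative words as feature candidates.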
Keywords
feature selection; machine learning; classification; word impurity; unbalanced distribution of documents;
Citations & Related Records
  • Reference
1 I. H. Witten and E. Frank, Data Mining, Morgan Kaufmann Publishers, 2000
2 K. Kira and L. A. Rendell, "The feature selection problem: Traditional methods and a new algorithm," Proc. of the 10th National Conference on Artificial Intelligence, pp. 129-134, MIT Press, 1992
3 G. H. John, R. Kohavi, and K. Pfleger, "Irrelevant Features and the Subset Selection Problem," Proc. of ICML94, pp. 121-129, Morgan Kaufmann Publishers, San Francisco, CA, 1994
4 S. Roweis, "NIPS Conference Papers Vols. 0-12," http://www.cs.toronto.edu/~roweis/data/nips12raw_str602.tgz
5 R. Caruana and D. Freitag, "Greedy attribute selection," Proc. of ICML94, pp. 28-36, 1994
6 M. A. Hall, "Correlation-based Feature Selection for Machine Learning," Ph.D. dissertation, Department of Computer Science, Waikato University, Hamilton, NZ, 1999
7 D. D. Lewis, "Reuters-21578 Text Categorization Test Collection Distribution 1.0 README file (v1.3)," 2004, http://www.daviddlewis.com/resources/testcollections/reuters21578/readme.txt
8 Tom Mitchell, Machine Learning, McGraw Hill, 1996
9 Y. Yang and J. O. Pedersen, "A Comparative Study on Feature Selection in Text Categorization," Proc. of ICML97, pp. 412-420, 1997
10 M. Dash, K. Choi, P. Scheuermann, and H. Li, "Feature selection for clustering - a filter solution," Proc. of IEEE-ICDM, pp. 115-122, 2002