• Title/Summary/Keyword: unbalanced distribution of documents

Search Result 1, Processing Time 0.017 seconds

An Enhanced Feature Selection Method Based on the Impurity of Words Considering Unbalanced Distribution of Documents (문서의 불균등 분포를 고려한 단어 불순도 기반 특징 선택 방법)

  • Kang, Jin-Beom;Yang, Jae-Young;Choi, Joong-Min
    • Journal of KIISE:Software and Applications
    • /
    • v.34 no.9
    • /
    • pp.804-816
    • /
    • 2007
  • Sample training data for machine learning often contain irrelevant information or redundant concept. It is also the case that the original data may include noise. If the information collected for constructing learning model is not reliable, it is difficult to obtain accurate information. So the system attempts to find relations or regulations between features and categories in the teaming phase. The feature selection is to remove irrelevant or redundant information before constructing teaming model. for improving its performance. Existing feature selection methods assume that the distribution of documents is balanced in terms of the number of documents for each class and the length of each document. In practice, however, it is difficult not only to prepare a set of documents with almost equal length, but also to define a number of classes with fixed number of document elements. In this paper, we propose a new feature selection method that considers the impurities among the words and unbalanced distribution of documents in categories. We could obtain feature candidates using the word impurity and eventually select the features through unbalanced distribution of documents. We demonstrate that our method performs better than other existing methods via some experiments.