http://dx.doi.org/10.7472/jksii.2013.14.5.49

Combined Feature Set and Hybrid Feature Selection Method for Effective Document Classification  

In, Joo-Ho (Computer Engineering, Korea Aerospace Univ.)
Kim, Jung-Ho (Computer Engineering, Korea Aerospace Univ.)
Chae, Soo-Hoan (School of Electronics Telecommunication & Computer Engineering, Korea Aerospace Univ.)
Publication Information
Journal of Internet Computing and Services / v.14, no.5, 2013, pp. 49-57
Abstract
This paper proposes a novel approach to feature selection, an important preprocessing task in online document classification. Previous studies have selected features using information from a single population only. In this paper, a mixed feature set is constructed by selecting features from multiple populations as well as a single population, drawing on several kinds of information. The mixed feature set consists of two feature sets: the original feature set, made up of the words in the documents, and the transformed feature set, made up of features generated by latent semantic analysis (LSA). A hybrid feature selection method that combines filter and wrapper methods is then used to obtain an optimal feature set from the mixed feature set. We performed classification experiments using the resulting optimal feature sets. The experiments confirmed our expectation that the approach improves classification performance, yielding over 90% accuracy. In particular, recall and precision both exceeded 90%, with low deviation between categories.
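The pipeline described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not name the filter criterion, the wrapper search, or the classifier, so this example assumes a Fisher-style score for the filter step, a greedy forward search for the wrapper step, and a nearest-centroid classifier scored by leave-one-out accuracy. The function names, the toy term-count matrix, and the parameters `k`, `keep`, and `max_features` are all hypothetical.

```python
import numpy as np

def lsa_features(X, k):
    """Document coordinates in a k-dimensional latent space (truncated SVD)."""
    U, s, _ = np.linalg.svd(X, full_matrices=False)
    return U[:, :k] * s[:k]

def filter_scores(F, y):
    """Filter step (assumed): between-class over within-class variance per feature."""
    classes = np.unique(y)
    overall = F.mean(axis=0)
    between = sum((y == c).sum() * (F[y == c].mean(axis=0) - overall) ** 2
                  for c in classes)
    within = sum(((F[y == c] - F[y == c].mean(axis=0)) ** 2).sum(axis=0)
                 for c in classes)
    return between / (within + 1e-12)  # small constant avoids division by zero

def loo_accuracy(F, y):
    """Leave-one-out accuracy of a nearest-centroid classifier (assumed wrapper model)."""
    hits = 0
    for i in range(len(y)):
        mask = np.arange(len(y)) != i
        classes = np.unique(y[mask])
        cents = np.array([F[mask][y[mask] == c].mean(axis=0) for c in classes])
        pred = classes[np.argmin(((cents - F[i]) ** 2).sum(axis=1))]
        hits += int(pred == y[i])
    return hits / len(y)

def hybrid_select(F, y, keep, max_features):
    """Keep the `keep` best-scoring features, then grow a subset greedily."""
    candidates = [int(j) for j in np.argsort(filter_scores(F, y))[::-1][:keep]]
    chosen, best = [], 0.0
    while candidates and len(chosen) < max_features:
        acc, j = max((loo_accuracy(F[:, chosen + [j]], y), j) for j in candidates)
        if acc <= best:  # stop when no candidate improves accuracy
            break
        chosen.append(j)
        candidates.remove(j)
        best = acc
    return chosen, best

# Toy term-count matrix: 6 documents x 5 words, two categories (made-up data).
X = np.array([
    [3, 2, 0, 0, 1],
    [2, 3, 0, 1, 1],
    [3, 1, 1, 0, 1],
    [0, 0, 3, 2, 1],
    [0, 1, 2, 3, 1],
    [1, 0, 3, 2, 1],
], dtype=float)
y = np.array([0, 0, 0, 1, 1, 1])

# Mixed feature set: original word features followed by LSA features.
mixed = np.hstack([X, lsa_features(X, k=2)])
selected, accuracy = hybrid_select(mixed, y, keep=4, max_features=3)
print(selected, accuracy)
```

In this sketch, column indices 0 to 4 of `mixed` refer to the original word features and indices 5 and 6 to the LSA features, so the selected subset can draw on either population. The filter step is cheap and prunes the candidate pool before the more expensive wrapper search evaluates subsets with the classifier.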
Keywords
document classification; feature selection; mixed feature set; LSA; hybrid feature selection
Citations & Related Records
Times Cited By KSCI: 1