Combined Feature Set and Hybrid Feature Selection Method for Effective Document Classification

  • In, Joo-Ho (Computer Engineering, Korea Aerospace Univ.) ;
  • Kim, Jung-Ho (Computer Engineering, Korea Aerospace Univ.) ;
  • Chae, Soo-Hoan (School of Electronics Telecommunication & Computer Engineering, Korea Aerospace Univ.)
  • Received : 2012.12.11
  • Accepted : 2013.06.28
  • Published : 2013.10.31

Abstract

A novel feature selection approach is proposed for an important preprocessing step of on-line document classification. In most previous studies, features have been selected from a single population, using only the information that population provides. In this paper, a combined feature set is constructed by selecting features from multiple populations as well as from a single population, so that the selection draws on several kinds of information. The combined feature set consists of two parts: the original feature set, made up of the words extracted from the documents, and the transformed feature set, made up of features newly generated from the original set by latent semantic analysis (LSA). A hybrid feature selection method that combines a filter method with a wrapper method is then applied to the combined feature set to obtain an optimal feature subset, and document classification experiments are performed with the resulting subsets. The expectation that considering information from features of multiple populations improves classification performance was confirmed: in experiments on Internet news articles, the proposed approach achieved over 90% classification accuracy, and both recall and precision exceeded 90% with low deviation between categories.
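As a rough sketch only, not the authors' implementation, the pipeline described above could be assembled with scikit-learn along the following lines: TF-IDF word vectors serve as the original feature set, truncated SVD supplies the LSA-transformed features, the two are concatenated into the combined set, and a filter step (ANOVA F-score) followed by a wrapper step (recursive feature elimination around a linear SVM) approximates the hybrid selection. The dataset, feature counts, and the particular filter and wrapper methods are illustrative assumptions, not details taken from the paper.

```python
# Minimal sketch of a combined feature set + hybrid (filter -> wrapper) selection.
# All concrete choices (corpus, feature counts, F-score filter, RFE wrapper) are
# assumptions for illustration, not the paper's actual configuration.
import numpy as np
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_selection import SelectKBest, f_classif, RFE
from sklearn.svm import LinearSVC

news = fetch_20newsgroups(subset="train", categories=["sci.space", "rec.autos"])

# 1) Original feature set: the words occurring in the documents (TF-IDF weighted).
vectorizer = TfidfVectorizer(max_features=5000, stop_words="english")
X_words = vectorizer.fit_transform(news.data)

# 2) Transformed feature set: new features generated from the original set by LSA.
lsa = TruncatedSVD(n_components=100, random_state=0)
X_lsa = lsa.fit_transform(X_words)

# 3) Combined feature set: original and LSA-derived features side by side.
X_combined = np.hstack([X_words.toarray(), X_lsa])

# 4) Hybrid selection: a cheap filter pass first narrows the candidates ...
X_filtered = SelectKBest(f_classif, k=1000).fit_transform(X_combined, news.target)

# ... then a wrapper pass scores feature subsets with the classifier itself.
# RFE around a linear SVM stands in for whatever wrapper search is actually used.
wrapper = RFE(LinearSVC(dual=False), n_features_to_select=200, step=0.1)
X_selected = wrapper.fit_transform(X_filtered, news.target)

print(X_selected.shape)  # (n_documents, 200): the selected feature subset
```

Any subset search strategy (for example, a genetic algorithm) could replace RFE in the wrapper stage without changing the overall structure of the sketch.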

Keywords

References

  1. Guyon, I., Elisseeff, A., "An Introduction to Variable and Feature Selection", Journal of Machine Learning Research, Vol. 3, pp. 1157-1182, 2003.
  2. Dasgupta, A., Drineas, P., Harb, B., Josifovski, V., Mahoney, M. W., "Feature Selection Methods for Text Categorization", Proceedings of the 13th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 230-239, 2007.
  3. Landauer, T. K., Dumais, S. T., "A solution to Plato's problem: The latent semantic analysis theory of acquisition, induction, and representation of knowledge", Psychological Review, Vol. 104, No. 2, pp. 211-240, 1997. https://doi.org/10.1037/0033-295X.104.2.211
  4. Deerwester, S. C., Dumais, S. T., Landauer, T. K., Furnas, G. W., Harshman, R. A., "Indexing by latent semantic analysis", Journal of the American Society for Information Science, Vol. 41, No. 6, pp. 391-407, 1990. https://doi.org/10.1002/(SICI)1097-4571(199009)41:6<391::AID-ASI1>3.0.CO;2-9
  5. Chakraborti, S., Lothian, R., Wiratunga, N., Watt, S., "Sprinkling: Supervised Latent Semantic Indexing", Advances in Information Retrieval, pp. 510-514, 2006.
  6. Liu, H., Yu, L., "Toward Integrating Feature Selection Algorithms for Classification and Clustering", IEEE Transactions on Knowledge and Data Engineering, Vol. 17, No. 4, pp. 491-502, 2005. https://doi.org/10.1109/TKDE.2005.66
  7. Chen, C. M., Lee, H. M., Chang, V. J., "Two novel feature selection approaches for web page classification", Expert Systems with Applications, Vol. 36, No. 1, pp. 260-272, 2009. https://doi.org/10.1016/j.eswa.2007.09.008
  8. Selamat, A., Omatu, S., "Web Page Feature Selection and Classification Using Neural Networks", Information Sciences, Vol. 158, pp. 69-88, 2004. https://doi.org/10.1016/j.ins.2003.03.003
  9. Yang, Y., Pedersen, J. O., "A Comparative Study on Feature Selection in Text Categorization", Proceedings of the 14th International Conference on Machine Learning (ICML '97), pp. 412-420, 1997.
  10. Peng, H., Long, F., Ding, C., "Feature Selection Based on Mutual Information: Criteria of Max-Dependency, Max-Relevance, and Min-Redundancy", IEEE Transactions on Pattern Analysis and Machine Intelligence, Vol. 27, No. 8, pp. 1226-1238, Aug. 2005. https://doi.org/10.1109/TPAMI.2005.159
  11. John, G., Kohavi, R., Pfleger, K., "Irrelevant Features and the Subset Selection Problem", Proceedings of the 11th International Conference on Machine Learning, pp. 121-129, 1994.
  12. Luukka, P., "Feature selection using fuzzy entropy measures with similarity classifier", Expert Systems with Applications, Vol. 38, No. 4, pp. 4600-4607, 2011. https://doi.org/10.1016/j.eswa.2010.09.133
  13. Gheyas, I. A., Smith, L. S., "Feature subset selection in large dimensionality domains", Pattern Recognition, Vol. 43, No. 1, pp. 5-13, 2010. https://doi.org/10.1016/j.patcog.2009.06.009
  14. Kim, J., In, J., Chae, S., "Semantic-based Genetic Algorithm for Feature Selection", Journal of Korean Society for Internet Information, Vol. 13, No. 4, pp. 1-10, 2012. https://doi.org/10.7472/jksii.2012.13.4.1
  15. Liu, Y. N., Wang, G., Zhu, X. D., "Feature selection based on adaptive multi-population genetic algorithm", Journal of Jilin University (Engineering and Technology Edition), Vol. 41, No. 6, pp. 1690-1693, 2011.
  16. Sun, J. T., Chen, Z., Zeng, H. J., Lu, Y. C., Shi, C. Y., Ma, W. Y., "Supervised Latent Semantic Indexing for Document Categorization", Fourth IEEE International Conference on Data Mining (ICDM '04), pp. 535-538, 2004.

Cited by

  1. A Methodology for Automatic Multi-Categorization of Single-Categorized Documents, vol. 20, no. 3, 2014, https://doi.org/10.13088/jiis.2014.20.3.077
  2. Feature-selection algorithm based on genetic algorithms using unstructured data for attack mail identification, vol. 20, no. 1, 2019, https://doi.org/10.7472/jksii.2019.20.1.01