A Study on Feature Selection for kNN Classifier using Document Frequency and Collection Frequency  

Lee, Yong-Gu (Department of Library and Information Science, Keimyung University)
Publication Information
Journal of Korean Library and Information Science Society, v.44, no.1, 2013, pp. 27-47
Abstract
This study investigated the classification performance of a kNN classifier under feature selection methods based on document frequency (DF) and collection frequency (CF). Experiments on the HKIB-20000 data set produced three findings. First, selecting high-frequency terms and removing low-frequency terms by the CF criterion achieved better classification performance than doing so by the DF criterion. Second, neither the DF nor the CF methods performed well when low-frequency terms were selected first in the feature selection process. Last, combining the CF and DF criteria did not improve classification performance over using either criterion alone.
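To make the two criteria concrete, here is a minimal sketch of frequency-based feature selection. It assumes the usual definitions, which the abstract itself does not spell out: DF is the number of documents containing a term, and CF is the total number of occurrences of the term across the whole collection. The function names, the toy documents, and the alphabetical tie-breaking rule are hypothetical and are not taken from the study.

```python
from collections import Counter


def frequency_tables(docs):
    """Return (df, cf) Counters for a list of tokenized documents."""
    df, cf = Counter(), Counter()
    for tokens in docs:
        cf.update(tokens)       # CF: every occurrence of a term counts
        df.update(set(tokens))  # DF: each document counts once per term
    return df, cf


def select_top(freq, k):
    """Keep the k highest-frequency terms; ties are broken alphabetically."""
    ranked = sorted(freq.items(), key=lambda item: (-item[1], item[0]))
    return [term for term, _ in ranked[:k]]


# Hypothetical toy collection: "library" is frequent overall but
# concentrated in a single document.
docs = [
    ["library", "library", "library", "index"],
    ["index", "catalog"],
    ["catalog", "index"],
]

df, cf = frequency_tables(docs)
top_df = select_top(df, 2)  # features kept under the DF criterion
top_cf = select_top(cf, 2)  # features kept under the CF criterion
print(top_df, top_cf)
```

In this toy collection the two criteria keep different feature sets: a term repeated many times in one document ranks high by CF but low by DF. The selected terms would then serve as the vector dimensions for the kNN classifier.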
Keywords
Automatic classification; Feature selection; kNN classifier; Document frequency; Collection frequency