A Hybrid Efficient Feature Selection Model for High Dimensional Data Set based on KNHANES (2013~2015)

  • Kwon, Tae Il (BigSun Systems Co., Ltd.) ;
  • Li, Dingkun (Database/Bioinformatics Lab, School of Electrical & Computer Engineering, Chungbuk National University) ;
  • Park, Hyun Woo (Database/Bioinformatics Lab, School of Electrical & Computer Engineering, Chungbuk National University) ;
  • Ryu, Kwang Sun (Database/Bioinformatics Lab, School of Electrical & Computer Engineering, Chungbuk National University) ;
  • Kim, Eui Tak (Database/Bioinformatics Lab, School of Electrical & Computer Engineering, Chungbuk National University) ;
  • Piao, Minghao (Agency of Smart Factory, Chungbuk National University)
  • Received : 2018.04.05
  • Accepted : 2018.04.25
  • Published : 2018.04.30

Abstract

With large feature spaces, feature selection has become an extremely important step in the data mining process, and traditional single-stage feature selection methods may no longer be adequate for it. In this paper, we propose a hybrid, efficient feature selection model for high-dimensional data. We applied the model to the KNHANES data set; the results show that it outperforms many existing methods, improving accuracy by at least 5%.
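
The abstract does not describe the stages of the hybrid model, so the following is only a minimal sketch of one common reading of "hybrid" feature selection: a cheap filter stage that ranks features, followed by a wrapper stage that searches the surviving candidates with a classifier. Everything here is an assumption made for illustration: the filter criterion (symmetric uncertainty), the wrapper classifier (leave-one-out 1-nearest-neighbour), the function names `filter_stage` and `wrapper_stage`, and the toy data, which is invented and is not KNHANES data. It should not be read as the authors' model.

```python
import math
from collections import Counter


def entropy(values):
    """Shannon entropy (in bits) of a sequence of discrete values."""
    n = len(values)
    return -sum((c / n) * math.log2(c / n) for c in Counter(values).values())


def symmetric_uncertainty(x, y):
    """SU(x, y) = 2 * I(x; y) / (H(x) + H(y)), a value in [0, 1]."""
    hx, hy = entropy(x), entropy(y)
    if hx + hy == 0:
        return 0.0
    mutual_info = hx + hy - entropy(list(zip(x, y)))
    return 2.0 * mutual_info / (hx + hy)


def filter_stage(rows, labels, keep):
    """Rank feature indices by SU with the label and keep the top `keep`."""
    scores = {
        j: symmetric_uncertainty([r[j] for r in rows], labels)
        for j in range(len(rows[0]))
    }
    return sorted(scores, key=scores.get, reverse=True)[:keep]


def loo_accuracy(rows, labels, features):
    """Leave-one-out accuracy of 1-nearest-neighbour with Hamming distance."""
    correct = 0
    for i, row in enumerate(rows):
        best_dist, best_label = None, None
        for k, other in enumerate(rows):
            if k == i:
                continue
            dist = sum(row[j] != other[j] for j in features)
            if best_dist is None or dist < best_dist:
                best_dist, best_label = dist, labels[k]
        correct += best_label == labels[i]
    return correct / len(rows)


def wrapper_stage(rows, labels, candidates):
    """Greedy forward selection: add the best feature while accuracy improves."""
    selected, best = [], 0.0
    improved = True
    while improved:
        improved = False
        for j in candidates:
            if j in selected:
                continue
            acc = loo_accuracy(rows, labels, selected + [j])
            if acc > best:
                best, best_feature, improved = acc, j, True
        if improved:
            selected.append(best_feature)
    return selected, best


# Invented toy data: feature 0 copies the label, feature 1 is noise,
# feature 2 is constant, feature 3 is partly informative.
labels = [0, 0, 0, 0, 1, 1, 1, 1]
rows = [
    (0, 0, 0, 0), (0, 1, 0, 0), (0, 0, 0, 0), (0, 1, 0, 1),
    (1, 0, 0, 1), (1, 1, 0, 1), (1, 0, 0, 1), (1, 1, 0, 0),
]

candidates = filter_stage(rows, labels, keep=2)
selected, accuracy = wrapper_stage(rows, labels, candidates)
print(candidates, selected, accuracy)
```

The point of the two-stage design is cost: the filter scores each feature once and discards most of them, so the more expensive wrapper search only has to evaluate classifier accuracy over a small candidate set.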

  33. P, Li, Y. Piao, H. S. Shon, K. H. Ryu, "Comparing the normalization methods for the differential analysis of Illumina high-throughput RNA-Seq data", BMC bioinformatics, , 16(1): 347, 2015. https://doi.org/10.1186/s12859-015-0778-7