Algorithms for Handling Incomplete Data in SVM and Deep Learning

  • Lee, Jong-Chan (Dept. of Computer Engineering, Chungwoon University)
  • Received : 2019.12.09
  • Accepted : 2020.03.20
  • Published : 2020.03.28

Abstract

This paper introduces two different techniques for handling incomplete data, together with algorithms for learning from such data. The first method processes the incomplete data by assigning each missing value an equal probability over the values the missing variable can take, and then learns the data with an SVM. This ensures that the more frequently a variable is missing, the higher its entropy, so that the variable is not selected in the decision tree. The method is characterized by ignoring all information remaining in the missing variable and assigning a new value outright. The new method, in contrast, computes a probability distribution from the information that remains after the missing values are excluded, and uses it as an estimate of the missing variable. In other words, it uses the abundant information that was not lost from the incomplete training data to recover some of the lost information, and then learns with deep learning. The performance of the two methods is measured by selecting one variable at a time from the training data and repeatedly comparing the measured results while varying the proportion of data lost in that variable.
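
To make the contrast between the two strategies concrete, here is a minimal sketch under assumptions of our own: each categorical variable is encoded as a probability vector over its possible values, observed entries as one-hot vectors, and missing entries as None. The function names (one_hot, impute_uniform, impute_from_observed) and the toy column are illustrative, not taken from the paper.

```python
import numpy as np

def one_hot(value, n_values):
    """Encode an observed categorical value as a degenerate probability vector."""
    vec = np.zeros(n_values)
    vec[value] = 1.0
    return vec

def impute_uniform(column, n_values):
    """Method 1: every missing entry (None) becomes a uniform distribution
    over the n_values the variable can take, discarding whatever
    information remains in the variable."""
    return np.array([np.full(n_values, 1.0 / n_values) if v is None
                     else one_hot(v, n_values) for v in column])

def impute_from_observed(column, n_values):
    """Method 2: every missing entry is estimated from the empirical
    distribution of the entries that were NOT lost in the same variable."""
    observed = [v for v in column if v is not None]
    empirical = np.bincount(observed, minlength=n_values) / len(observed)
    return np.array([empirical if v is None
                     else one_hot(v, n_values) for v in column])

def entropy(p):
    """Shannon entropy in bits, with the convention 0 * log 0 = 0."""
    p = p[p > 0]
    return float(-(p * np.log2(p)).sum())

# Toy column of a 3-valued attribute with two lost entries.
col = [0, 2, None, 1, None, 0]
print(entropy(impute_uniform(col, 3)[2]))        # log2(3) ~= 1.585 (maximal)
print(entropy(impute_from_observed(col, 3)[2]))  # 1.5, from counts [2, 1, 1]
```

The printed values show the abstract's point: uniform imputation pushes a heavily missing variable toward maximal entropy (so an entropy-based splitting criterion avoids it), while the observed-data estimate retains some of the variable's structure.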

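The evaluation protocol in the last sentence of the abstract can be sketched the same way. This is a hedged illustration rather than the paper's actual experiment: the toy dataset, the loss ratios, and the mask_column routine are assumptions, and the training and measurement step is left as a comment.

```python
import numpy as np

rng = np.random.default_rng(0)

def mask_column(column, fraction):
    """Replace a given fraction of one variable's entries with None,
    simulating data lost in that single variable."""
    column = list(column)
    lost = rng.choice(len(column), size=int(round(fraction * len(column))),
                      replace=False)
    for i in lost:
        column[i] = None
    return column

# Toy dataset: 8 samples, 3 categorical variables with values in {0, 1, 2}.
data = rng.integers(0, 3, size=(8, 3))

# Select one variable in turn and vary the proportion of lost data in it.
for var in range(data.shape[1]):
    for fraction in (0.1, 0.3, 0.5):  # illustrative loss ratios
        masked = mask_column(data[:, var], fraction)
        n_lost = sum(v is None for v in masked)
        print(f"variable {var}, loss {fraction:.0%}: {n_lost} entries masked")
        # at this point one would impute with method 1 or 2, retrain the
        # SVM or deep model, and record the measurement for comparison
```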
