Probability Estimation Method for Imputing Missing Values in Data Expansion Technique

  • Lee, Jong Chan (Dept. of Computer Engineering, Chungwoon University)
  • Received : 2021.10.06
  • Accepted : 2021.11.20
  • Published : 2021.11.28

Abstract

This paper applies a data extension technique, originally designed for the rule refinement problem, to the handling of incomplete data. In this technique, each event can carry a weight indicating its importance, and each variable can be expressed as a probability value. The key problem is to find the probability that best approximates a missing value and to replace the missing value with that probability. Three different algorithms are therefore used to estimate the probability for each missing value, and the results are stored in this data structure format. To evaluate each resulting probability structure, an SVM classifier is trained to classify each information area, and its output is compared with the original information to measure how closely the two agree. The three imputation algorithms share the same data structure but differ in their approach, so they are expected to be useful for different purposes depending on the application field.
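
The data structure described in the abstract can be illustrated with a minimal sketch. The abstract does not specify the three estimation algorithms, so the sketch substitutes a simple weighted relative-frequency estimate, which is an assumption and not the author's method. The names `MISSING`, `one_hot`, `frequency_estimate`, and `expand`, as well as the toy data, are hypothetical.

```python
from collections import Counter

MISSING = None  # hypothetical marker for a missing value


def one_hot(value, domain):
    """Express a known categorical value as a degenerate probability vector."""
    return [1.0 if value == category else 0.0 for category in domain]


def frequency_estimate(rows, column, domain):
    """Weighted relative frequency of each category among observed values.

    This is a stand-in estimator (an assumption); the paper's three
    algorithms are not described in the abstract.
    """
    totals = Counter()
    for weight, values in rows:
        if values[column] is not MISSING:
            totals[values[column]] += weight
    observed = sum(totals.values())
    if observed == 0:
        # Nothing observed for this variable: fall back to a uniform vector.
        return [1.0 / len(domain)] * len(domain)
    return [totals[category] / observed for category in domain]


def expand(rows, domains):
    """Convert (weight, values) rows into (weight, probability vectors) events."""
    estimates = [frequency_estimate(rows, c, d) for c, d in enumerate(domains)]
    events = []
    for weight, values in rows:
        vectors = [
            estimates[c] if v is MISSING else one_hot(v, domains[c])
            for c, v in enumerate(values)
        ]
        events.append((weight, vectors))
    return events


# Toy data: two categorical variables, one missing value in the second row.
domains = [["sunny", "rain"], ["high", "low"]]
rows = [
    (1.0, ["sunny", "high"]),
    (1.0, ["rain", MISSING]),
    (2.0, ["sunny", "low"]),
]
events = expand(rows, domains)
for weight, vectors in events:
    print(weight, vectors)
```

In this sketch, a known value becomes a vector with all probability on one category, while a missing value becomes a vector spread over the categories. Concatenating the vectors of an event gives a fixed-length numeric feature vector of the kind an SVM classifier accepts; whether the paper flattens the structure this way is not stated in the abstract.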

