
Improvement of the Generalization of Linear Models through Data Augmentation Based on the Central Limit Theorem


  • Du-Hwan Hwang (Department of Defense System Science, Korea Army Academy at Yeongcheon)
  • Received : 2022.03.05
  • Accepted : 2022.04.07
  • Published : 2022.06.30

Abstract

In machine learning, the available data are usually split into a training set and a test set: the model is trained on the training data, and the test data are used to assess its accuracy and generalization performance. A model with low generalization performance shows a sharp drop in prediction accuracy on new data; such a model is said to be overfitted. This study proposes a method that generates training data based on the central limit theorem and combines it with the existing training data, increasing the normality of the data and, by training models on the combined set, improving generalization performance. To this end, exploiting the central limit theorem, new observations were generated from the sample mean and standard deviation of each feature and merged with the existing training data to form a new training set. To quantify the increase in normality, the Kolmogorov-Smirnov normality test was conducted, and it confirmed that the new training data exhibited greater normality than the existing data. Generalization performance was measured as the difference between prediction accuracy on the training data and on the test data. Applying the method to K-Nearest Neighbors (KNN), Logistic Regression, and Linear Discriminant Analysis (LDA) showed that generalization performance improved for KNN, a non-parametric technique, and for LDA, which assumes normality in its model construction.
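The abstract describes the pipeline only in words. The Python sketch below is one plausible reading of that procedure, not the authors' code: the function clt_augment, its parameters n_new and subsample, and the toy non-normal data are all illustrative assumptions. It builds synthetic rows as feature-wise means of random class-conditional subsamples (approximately normal by the central limit theorem), checks per-feature normality with a Kolmogorov-Smirnov test before and after augmentation, and compares the train-test accuracy gap for KNN, Logistic Regression, and LDA.

```python
import numpy as np
from scipy.stats import kstest
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

rng = np.random.default_rng(0)

def clt_augment(X, y, n_new=200, subsample=30):
    """Append synthetic rows built as feature-wise means of random
    class-conditional subsamples; by the CLT these means are roughly
    normal, so the augmented set moves closer to normality.
    n_new and subsample are illustrative, not values from the paper."""
    X_new, y_new = [], []
    for cls in np.unique(y):
        Xc = X[y == cls]
        for _ in range(n_new):
            idx = rng.integers(0, len(Xc), size=subsample)
            X_new.append(Xc[idx].mean(axis=0))
            y_new.append(cls)
    return np.vstack([X, np.array(X_new)]), np.concatenate([y, y_new])

# Toy stand-in for the paper's data: deliberately non-normal features.
X = rng.exponential(scale=2.0, size=(300, 4))
y = (X.sum(axis=1) > np.median(X.sum(axis=1))).astype(int)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)
X_aug, y_aug = clt_augment(X_tr, y_tr)

# Kolmogorov-Smirnov normality check per (standardized) feature.
for j in range(X.shape[1]):
    z_old = (X_tr[:, j] - X_tr[:, j].mean()) / X_tr[:, j].std()
    z_aug = (X_aug[:, j] - X_aug[:, j].mean()) / X_aug[:, j].std()
    print(f"feature {j}: KS p-value before={kstest(z_old, 'norm').pvalue:.3f}, "
          f"after={kstest(z_aug, 'norm').pvalue:.3f}")

# Generalization gap = training accuracy minus test accuracy.
models = [("KNN", KNeighborsClassifier()),
          ("LogisticRegression", LogisticRegression(max_iter=1000)),
          ("LDA", LinearDiscriminantAnalysis())]
for name, model in models:
    for label, Xt, yt in [("original", X_tr, y_tr), ("augmented", X_aug, y_aug)]:
        model.fit(Xt, yt)
        gap = model.score(Xt, yt) - model.score(X_te, y_te)
        print(f"{name} trained on {label} data: gap = {gap:.3f}")
```

Because the synthetic rows are sample means, they concentrate around each class's feature means; the subsample size controls the trade-off between how normal the augmented data become and how much of the original spread they retain.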

