A Comparative Study of Classification Methods Using Data with Label Noise

레이블 노이즈가 존재하는 자료의 판별분석 방법 비교연구

  • Kwon, So Young (Department of Statistics, College of Natural Sciences, Sungshin Women's University) ;
  • Kim, Kyoung Hee (Department of Statistics, College of Natural Sciences, Sungshin Women's University)
  • 권소영 (성신여자대학교 자연과학대학 통계학과) ;
  • 김경희 (성신여자대학교 자연과학대학 통계학과)
  • Received : 2018.11.07
  • Accepted : 2018.12.12
  • Published : 2018.12.31

Abstract

Discriminant analysis predicts the class label of a new observation whose label is unknown, using information from existing labeled data. Observed labels therefore play a critical role in the analysis, and they are usually assumed to be correct. When observed labels contain errors, the data are said to have label noise. Label noise occurs frequently in real data and can degrade classification performance. To examine this effect, we carried out a comparative study using simulated data with label noise. We considered four classification techniques: LDA (linear discriminant analysis), QDA (quadratic discriminant analysis), KNN (k-nearest neighbours), and SVM (support vector machine). We evaluated each method by its average accuracy on data generated under various scenarios. The effect of label noise was investigated through its occurrence rate and its type (noise location). We confirmed that label noise is a significant factor influencing classification performance.
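As a rough illustration of what "occurrence rate" and "noise location" could mean in such a simulation, the sketch below flips a given fraction of binary labels either uniformly at random or at the points closest to a class boundary. The function name `flip_labels`, the two noise types, and the one-dimensional feature are assumptions made for this example; the abstract does not specify how the noise was generated.

```python
import random

def flip_labels(xs, ys, rate, kind="uniform", boundary=0.0, seed=0):
    """Return a noisy copy of binary labels ys (0/1) with a fraction `rate` flipped.

    kind="uniform":  the flipped points are chosen uniformly at random.
    kind="boundary": the points whose 1-D feature in xs lies closest to
                     `boundary` are flipped (hypothetical "noise location").
    """
    n = len(ys)
    n_flip = round(rate * n)          # number of labels to corrupt
    if kind == "uniform":
        rng = random.Random(seed)
        idx = rng.sample(range(n), n_flip)
    elif kind == "boundary":
        idx = sorted(range(n), key=lambda i: abs(xs[i] - boundary))[:n_flip]
    else:
        raise ValueError("kind must be 'uniform' or 'boundary'")
    noisy = list(ys)                  # leave the original labels untouched
    for i in idx:
        noisy[i] = 1 - noisy[i]       # flip 0 <-> 1
    return noisy

# Small usage example with hand-written data.
xs = [-3.0, -2.0, -0.5, -0.1, 0.2, 0.4, 1.5, 2.5, 3.0, 4.0]
ys = [0, 0, 0, 0, 1, 1, 1, 1, 1, 1]
noisy_uniform = flip_labels(xs, ys, rate=0.2, kind="uniform", seed=1)
noisy_boundary = flip_labels(xs, ys, rate=0.2, kind="boundary")
```

Keeping the rate and the location as separate arguments lets the two factors be varied independently, which is the kind of comparison the abstract describes.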

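Discriminant analysis is a method for predicting which group a new observation belongs to. Because it predicts new observations from labels, the labels are central to the analysis. Label noise means that the observed labels contain errors; it arises easily in real data and is an important factor that can affect classification performance. Label noise and models that are robust to it have been studied in order to address this problem, but few studies have considered the factors that may influence classification performance when label noise is present and compared their effects. This paper therefore examines how label noise affects classification performance using LDA, QDA, KNN, and SVM, four methods that are widely used in classification problems. In particular, we used simulations to examine classification performance as a function of the occurrence rate of label noise, the form in which it occurs, and the number of observations, all of which are expected to be related to the performance of discriminant analysis. The results confirm that the degree to which label noise affects classification performance differs depending on the form of the data and the analysis method.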
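The simulation design described above (vary the noise rate, train on noisy labels, report average accuracy) can be sketched as follows. This is a minimal, dependency-free illustration and not the authors' code: it uses only a hand-written KNN classifier rather than all four methods, and the two-Gaussian data model, the sample sizes, `k=5`, and the noise rates are all assumptions chosen for the example.

```python
import random

def make_data(n, rng):
    """Draw n points per class from two unit-variance 2-D Gaussians (assumed setup)."""
    X, y = [], []
    for label, mean in ((0, 0.0), (1, 2.0)):
        for _ in range(n):
            X.append((rng.gauss(mean, 1.0), rng.gauss(mean, 1.0)))
            y.append(label)
    return X, y

def add_uniform_noise(y, rate, rng):
    """Flip a fraction `rate` of binary labels chosen uniformly at random."""
    noisy = list(y)
    for i in rng.sample(range(len(y)), round(rate * len(y))):
        noisy[i] = 1 - noisy[i]
    return noisy

def knn_predict(X_train, y_train, x, k=5):
    """Majority vote among the k nearest training points (squared Euclidean distance)."""
    nearest = sorted(
        range(len(X_train)),
        key=lambda i: (X_train[i][0] - x[0]) ** 2 + (X_train[i][1] - x[1]) ** 2,
    )[:k]
    votes = sum(y_train[i] for i in nearest)
    return 1 if 2 * votes > k else 0

def average_accuracy(rate, n_train=50, n_test=50, reps=5, seed=0):
    """Average test accuracy over `reps` replications at a given label-noise rate."""
    rng = random.Random(seed)
    accs = []
    for _ in range(reps):
        X_tr, y_tr = make_data(n_train, rng)
        X_te, y_te = make_data(n_test, rng)          # test labels stay clean
        y_noisy = add_uniform_noise(y_tr, rate, rng)  # noise only in training labels
        hits = sum(knn_predict(X_tr, y_noisy, x) == t for x, t in zip(X_te, y_te))
        accs.append(hits / len(y_te))
    return sum(accs) / reps

# Average accuracy at three assumed noise rates.
results = {rate: average_accuracy(rate) for rate in (0.0, 0.2, 0.4)}
for rate, acc in results.items():
    print(f"noise rate {rate:.1f}: average accuracy {acc:.3f}")
```

Noise is injected only into the training labels while the test labels stay clean, so the reported accuracy measures how much the corrupted labels harm the fitted classifier. Extending the sketch to the number of observations would mean looping over `n_train` as well.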

Acknowledgement

Supported by: Sungshin Women's University
