http://dx.doi.org/10.5392/JKCA.2021.21.11.135

Denoising Self-Attention Network for Mixed-type Data Imputation  

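Lee, Do-Hoon (School of Electrical and Computer Engineering, University of Seoul)
Kim, Han-Joon (School of Electrical and Computer Engineering, University of Seoul)
Chun, Joonghoon (Division of Convergence Software, Myongji University)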
Abstract
Data-driven decision-making has recently become a key technology in the data industry, and the machine learning behind it requires high-quality training datasets. However, real-world data contains missing values for various reasons, which degrades the performance of prediction models trained on it. Therefore, to build high-performance models from real-world datasets, many studies have investigated automatically imputing missing values in the initial training data. Many conventional machine learning-based imputation techniques are time-consuming and cumbersome because they apply only to numeric columns or require a separate predictive model for each column. This paper therefore proposes a new data imputation technique called 'Denoising Self-Attention Network (DSAN)', which can be applied to mixed-type datasets containing both numerical and categorical columns. DSAN learns robust feature representation vectors by combining self-attention and denoising techniques, and imputes multiple missing variables in parallel through multi-task learning. To verify the proposed technique, imputation experiments were performed after arbitrarily generating missing values in several mixed-type training datasets. We then show the validity of the proposed technique by comparing the performance of binary classification models trained on the imputed data, together with the errors between the original and imputed values.
Keywords
Machine Learning; Deep Learning; Data Quality; Missing Values; Data Imputation; Attention;
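The evaluation protocol described in the abstract (arbitrarily generate missing values, impute them, then measure the error between original and imputed values) can be illustrated with a minimal sketch. This is not the DSAN model: a simple mean/mode baseline imputer stands in for it, and the function names (`mask_mcar`, `impute_mean_mode`, `imputation_errors`), the toy dataset, and the choice of RMSE and categorical error rate as metrics are assumptions made for illustration only.

```python
import math
import random
from collections import Counter

def mask_mcar(rows, rate, rng):
    """Copy `rows`, replacing each cell with None with probability `rate`."""
    return [[None if rng.random() < rate else v for v in row] for row in rows]

def impute_mean_mode(rows, categorical):
    """Baseline stand-in (NOT DSAN): column mean for numeric columns,
    column mode for categorical columns."""
    fill = []
    for j in range(len(rows[0])):
        observed = [r[j] for r in rows if r[j] is not None]
        if j in categorical:
            fill.append(Counter(observed).most_common(1)[0][0] if observed else None)
        else:
            fill.append(sum(observed) / len(observed) if observed else 0.0)
    return [[fill[j] if v is None else v for j, v in enumerate(r)] for r in rows]

def imputation_errors(original, masked, imputed, categorical):
    """Return (RMSE over masked numeric cells, error rate over masked categorical cells)."""
    sq_sum, n_num, wrong, n_cat = 0.0, 0, 0, 0
    for o_row, m_row, i_row in zip(original, masked, imputed):
        for j, (o, m, i) in enumerate(zip(o_row, m_row, i_row)):
            if m is not None:
                continue  # only cells that were masked are scored
            if j in categorical:
                n_cat += 1
                wrong += int(i != o)
            else:
                n_num += 1
                sq_sum += (i - o) ** 2
    rmse = math.sqrt(sq_sum / n_num) if n_num else 0.0
    error_rate = wrong / n_cat if n_cat else 0.0
    return rmse, error_rate

# Hypothetical mixed-type toy data: two numeric columns and one categorical column.
CATEGORICAL = {2}
data = [
    [25.0, 3100.0, "red"],
    [31.0, 4200.0, "blue"],
    [47.0, 5300.0, "red"],
    [52.0, 6100.0, "green"],
    [38.0, 4800.0, "red"],
]

rng = random.Random(0)
masked = mask_mcar(data, rate=0.2, rng=rng)
imputed = impute_mean_mode(masked, CATEGORICAL)
rmse, error_rate = imputation_errors(data, masked, imputed, CATEGORICAL)
```

Swapping `impute_mean_mode` for a learned imputer would leave the masking and scoring steps unchanged, which is what makes this kind of protocol convenient for comparing imputation methods on the same masked data.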