http://dx.doi.org/10.9717/kmms.2021.24.10.1403

An Evaluation Study on Artificial Intelligence Data Validation Methods and Open-source Frameworks  

Yun, Changhee (AI Future Strategy Center, National Information-Society Agency)
Shin, Hokyung (School of Computer Science and Engineering, Kyungpook National University)
Choo, Seung-Yeon (School of Architecture, Kyungpook National University)
Kim, Jaeil (School of Computer Science and Engineering, Kyungpook National University)
Abstract
In this paper, we survey automated data validation techniques for artificial intelligence training data and introduce open-source frameworks, such as Google's TensorFlow Data Validation (TFDV), that support automated data validation in the AI model development process. We also report an experimental study on public data sets that demonstrates the effectiveness of the open-source data validation framework. In particular, we present experimental results for the data validation functions used in schema testing and discuss the limitations of current open-source frameworks on semantic data. Finally, we introduce recent studies on semantic data validation using machine learning techniques.
Keywords
Artificial Intelligence; Data Validation; Data Quality Management; Open-source Framework; Review Study