A Study of Big Data Domain Automatic Classification Using Machine Learning  

Kong, Seongwon ((주)위세아이텍)
Hwang, Deokyoul ((주)위세아이텍)
Publication Information
The Journal of Bigdata / v.3, no.2, 2018, pp. 11-18
Abstract
This study addresses automatic domain classification for domain-based quality diagnosis, a key element of big data quality diagnosis. As the value and use of big data grow and the Fourth Industrial Revolution advances, efforts are under way worldwide to create new value by applying big data in fields that converge with IT, such as law, medicine, and finance. However, analysis based on low-reliability data causes critical problems in both the process and the results, and judgments based on those results are hard to trust. Although the need for highly reliable data has grown, research on data quality and its outcomes has been insufficient. The purpose of this study is to shorten working time by using machine learning to automate the domain classification task, which has so far been performed manually, in domain-based quality diagnosis, a key element of diagnostic evaluation for improving data quality. The approach extracts information about the characteristics of data stored in a database, identifies its domain, converts that information into features, and automates domain classification using machine learning. The results will be used for big data quality diagnosis and are expected to contribute to quality improvement.
Keywords
Big Data; Data Quality Diagnosis; Domain; Machine Learning; Random Forest;
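
The pipeline outlined in the abstract (extract characteristics of stored column data, turn them into features, then classify the domain with machine learning) can be illustrated with a minimal sketch. The abstract does not list the features or the model settings actually used, so everything below is an assumption for illustration: the feature set (`FEATURE_NAMES`), the date pattern, the function names `featurize_column` and `train_domain_classifier`, and the choice of scikit-learn's `RandomForestClassifier` (suggested only by the "Random Forest" keyword).

```python
import re
from statistics import mean

# Hypothetical feature set; the paper's actual features are not given in the abstract.
FEATURE_NAMES = [
    "mean_length",      # average string length of the values
    "digit_ratio",      # share of characters that are digits
    "alpha_ratio",      # share of characters that are letters
    "distinct_ratio",   # distinct values / total values
    "date_like_ratio",  # share of values shaped like YYYY-MM-DD
]

# Assumed date format, chosen only for this illustration.
_DATE_RE = re.compile(r"^\d{4}-\d{2}-\d{2}$")


def featurize_column(values):
    """Summarize one database column as a fixed-length numeric feature vector."""
    strings = [str(v) for v in values if v is not None]
    if not strings:
        return [0.0] * len(FEATURE_NAMES)
    total_chars = sum(len(s) for s in strings) or 1  # avoid division by zero
    digits = sum(ch.isdigit() for s in strings for ch in s)
    alphas = sum(ch.isalpha() for s in strings for ch in s)
    date_like = sum(1 for s in strings if _DATE_RE.match(s))
    return [
        mean(len(s) for s in strings),
        digits / total_chars,
        alphas / total_chars,
        len(set(strings)) / len(strings),
        date_like / len(strings),
    ]


def train_domain_classifier(columns, domains):
    """Fit a random forest on featurized columns.

    `columns` is a list of value lists and `domains` the matching list of
    manually assigned domain labels. Requires scikit-learn.
    """
    from sklearn.ensemble import RandomForestClassifier

    X = [featurize_column(col) for col in columns]
    clf = RandomForestClassifier(n_estimators=100, random_state=0)
    clf.fit(X, domains)
    return clf
```

Given a set of columns whose domains were labeled by hand (for example "date", "phone number", "amount"), `train_domain_classifier` returns a model whose `predict` method can be applied to `featurize_column(new_column)` for columns that have not been labeled yet. This is a sketch of the general idea and not the authors' implementation.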