Valid Data Conditions and Discrimination for Machine Learning: Case study on Dataset in the Public Data Portal

  • Oh, Hyo-Jung (Dept. of Library & Information Science, Jeonbuk National University)
  • Yun, Bo-Hyun (Div. of Software Liberal Arts, Mokwon University)
  • Received : 2021.11.20
  • Accepted : 2022.01.15
  • Published : 2022.02.28

Abstract

The foundation of AI technology is learnable data. The types and volume of data collected and produced by governments and private companies have recently grown exponentially; however, this growth has not yet translated into a supply of verified data that can actually be used for machine learning. This study discusses the conditions that data must meet to be usable for machine learning and identifies, through case studies, the factors that degrade data quality. To this end, two representative cases of developing a prediction model from public big data were selected, the data needed to solve the actual problems were collected from the public data portal, and their quality was examined. The study then shows how the results differ once valid-data screening criteria and post-processing are applied. The ultimate purpose of this study is to lay the groundwork for managing data quality and accumulating valid data, which must come before the development of machine learning technology, the core of artificial intelligence.

Acknowledgement

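This paper was supported by research funds of Jeonbuk National University in 2021. This paper is part of the results of research supported by the National Research Foundation of Korea in 2021 (NRF-2021R1I1A3047435).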
