A Study on the Artificial Intelligence (AI) Training Data Quality: Fuzzy-set Qualitative Comparative Analysis (fsQCA) Approach

  • Hyunmok Oh (National Information Society Agency) ;
  • Seoyoun Lee (School of Management and Economics, Beijing Institute of Technology) ;
  • Younghoon Chang (Nottingham University Business School China, University of Nottingham Ningbo China)
  • Received : 2023.10.18
  • Accepted : 2023.11.23
  • Published : 2024.02.29

Abstract

This empirical study aims to improve understanding of the artificial intelligence (AI) training data construction project in South Korea and the public release of its data. It focuses on the data quality concerns of policy-executing institutions, data construction companies, and organizations that use AI training data, since training data quality is central to developing reliable AI algorithms. As an academic contribution, the study offers a theoretical foundation and a research model for understanding AI training data quality and its antecedents, along with the data-specific and ethical aspects unique to AI. The model includes key antecedents of AI training data quality: data attribute factors, data construction environment factors, and data type-related factors. Survey data were collected from 393 practitioners at companies that build AI training data and companies that develop AI services. The data were analyzed using fuzzy-set qualitative comparative analysis (fsQCA) and artificial neural network (ANN) analysis, and the results yield academic and practical implications for AI training data quality.
