Preliminary Inspection Prediction Model to select the on-Site Inspected Foreign Food Facility using Multiple Correspondence Analysis

A Study on a Preliminary Inspection Prediction Model for Foreign Food Facilities Using Dimensionality Reduction

  • 박혜진 (National Food Safety Information Service)
  • 최재석 (Graduate School of Business IT, Kookmin University)
  • 조상구 (Department of Big Data, Kyungbok University)
  • Received : 2022.11.14
  • Accepted : 2022.12.18
  • Published : 2023.03.31

Abstract

As the number and weight of imported food shipments steadily increase, safety management of imported food to prevent food safety incidents is becoming more important. The Ministry of Food and Drug Safety conducts on-site inspections of foreign food facilities before customs clearance, in addition to import inspections at the customs clearance stage. However, on-site inspections are time-consuming and costly and resources are limited, so a data-driven safety management approach for imported food is needed. In this study, we aimed to make on-site inspections more efficient by building a machine learning prediction model that pre-selects, before the inspection takes place, the facilities expected to be found nonconforming. We collected basic information on 303,272 foreign food manufacturing and processing facilities from the Integrated Food Safety Information Network, together with 1,689 on-site inspection records from 2019 to April 2022. After preprocessing the facility data, we used the foreign food facility_code to extract only the facilities subject to on-site inspection, which yielded a dataset of 1,689 records and 103 variables. Of the 103 variables, those with a Theil-U index of 0 were removed, and the remainder were reduced with Multiple Correspondence Analysis, giving 49 final feature variables. We built eight different models, tuned their hyperparameters with 5-fold cross-validation, which also guards against overfitting, and compared their performance. The objective in selecting facilities for on-site inspection is to maximize recall, the probability that a nonconforming facility is judged nonconforming. Among the machine learning algorithms applied, the Random Forest model achieved the highest Recall_macro, AUROC, Average PR, F1-score, and Balanced Accuracy and was judged the best model. Finally, we applied Kernel SHAP (SHapley Additive exPlanations) to present the reasons individual facilities were flagged as nonconforming, and discussed the applicability of the model to the on-site inspection facility selection system. We expect these results to contribute to the efficient use of limited resources such as personnel and budget by establishing an imported food management system built on a data-driven, scientific risk management model.
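The abstract does not spell out how the Theil-U screening step was computed. The following is a minimal sketch of one plausible reading, assuming the index is the uncertainty coefficient U(target | feature) = (H(target) - H(target | feature)) / H(target) and that a feature is dropped when this value is 0. The function names, the tolerance, and the toy data are hypothetical and are not taken from the study.

```python
from collections import Counter
from math import log


def entropy(values):
    """Shannon entropy (in nats) of a sequence of categorical values."""
    n = len(values)
    return -sum((c / n) * log(c / n) for c in Counter(values).values())


def conditional_entropy(x, y):
    """H(X | Y): the entropy of x that remains once y is known."""
    n = len(x)
    y_counts = Counter(y)
    xy_counts = Counter(zip(x, y))
    return -sum((c / n) * log(c / y_counts[yv]) for (_, yv), c in xy_counts.items())


def theils_u(x, y):
    """Uncertainty coefficient U(X | Y), a value between 0 and 1."""
    h_x = entropy(x)
    if h_x == 0:
        # Convention chosen for this sketch: a constant x has no uncertainty to explain.
        return 1.0
    return (h_x - conditional_entropy(x, y)) / h_x


def drop_uninformative(features, target, tol=1e-12):
    """Keep only the features whose U(target | feature) exceeds tol (hypothetical helper)."""
    return {name: col for name, col in features.items() if theils_u(target, col) > tol}


# Toy data, invented for illustration only.
result = ["fail", "fail", "pass", "pass"]
features = {
    "country": ["A", "A", "B", "B"],   # fully determines the result
    "flag": ["x", "y", "x", "y"],      # independent of the result
    "constant": ["k", "k", "k", "k"],  # carries no information
}
kept = drop_uninformative(features, result)
```

In this toy case only `country` survives the filter. The tolerance stands in for the exact comparison with 0 so that floating-point noise does not keep an uninformative feature.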

Acknowledgement

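This research was supported by a 2022 research and development grant (21163MFDS516) from the Ministry of Food and Drug Safety, for which the authors are grateful.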
