[KSCI] Korea Science Citation Index Service

http://dx.doi.org/10.5668/JEHS.2017.43.4.298

Data Cleaning and Integration of Multi-year Dietary Survey in the Korea National Health and Nutrition Examination Survey (KNHANES) using Database Normalization Theory

Kwon, Namji (CHEM.I.NET, Ltd.)
Suh, Jihye (CHEM.I.NET, Ltd.)
Lee, Hunjoo (CHEM.I.NET, Ltd.)

Publication Information

Journal of Environmental Health Sciences / v.43, no.4, 2017, pp. 298-306

Abstract

Objectives: Since 1998, the Korea National Health and Nutrition Examination Survey (KNHANES) has been conducted to investigate the health and nutritional status of Koreans. The individual food intake data in the KNHANES have also been used as a source dataset for assessing the risk of chemical exposure via food. To improve the reliability of intake estimates and to avoid missing data for rarely reported foods, a well-structured, integrated multi-year dataset is important. However, multi-year survey datasets are difficult to merge, because cleaning processes for the large number of codes assigned to each food item are ineffective and dietary habits change over time. This study therefore aims at 1) establishing a cleaning process for abnormal data, 2) generating integrated multi-year raw data, and 3) contributing to the production of consistent dietary exposure factors.

Methods: Codebooks, the guideline book, and raw intake data from KNHANES V and VI were used for analysis. The codebooks were tested for violations of the primary key constraint, and the structure of the raw data was tested for violations of the first through third normal forms (1NF-3NF) of relational database theory. The codes were then corrected and integrated, and the raw data were cleaned using the integrated codes.

Results: Duplicated key records and abnormalities in the table structures were observed. After adjustment by the method described above, the codes were corrected and integrated codes were newly created. Finally, we were able to clean the raw data provided by respondents to the KNHANES.

Conclusion: By clarifying, testing, and verifying the primary key, the integrity of the codes, and the structure of the raw data according to database normalization theory, the results of this study will contribute to the integration of multi-year datasets and help improve the production system for national health data.
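The two kinds of checks named in the Methods can be illustrated with a minimal Python sketch. It is not the authors' implementation: the field names (`survey`, `food_code`, `food_name`), the sample rows, and both helper functions are hypothetical and chosen only for illustration. The first helper looks for key values shared by more than one row (a primary key violation); the second looks for a determinant that maps to more than one dependent value, which is the kind of broken functional dependency that normal-form testing is meant to expose.

```python
from collections import defaultdict

# Hypothetical codebook rows; the field names and values are illustrative
# assumptions, not actual KNHANES codes.
codebook = [
    {"survey": "V", "food_code": "01001", "food_name": "rice, white"},
    {"survey": "V", "food_code": "01001", "food_name": "rice, polished"},
    {"survey": "VI", "food_code": "01001", "food_name": "rice, white"},
    {"survey": "VI", "food_code": "02010", "food_name": "kimchi, cabbage"},
]


def find_duplicate_keys(rows, key_fields):
    """Return key values that occur in more than one row (primary key violations)."""
    groups = defaultdict(list)
    for row in rows:
        groups[tuple(row[f] for f in key_fields)].append(row)
    return {key: grp for key, grp in groups.items() if len(grp) > 1}


def find_fd_violations(rows, determinant, dependent):
    """Return determinant values that map to more than one dependent value."""
    seen = defaultdict(set)
    for row in rows:
        seen[tuple(row[f] for f in determinant)].add(row[dependent])
    return {key: vals for key, vals in seen.items() if len(vals) > 1}


# Assumed candidate key: (survey, food_code).
duplicates = find_duplicate_keys(codebook, ["survey", "food_code"])
# Assumed dependency: food_code -> food_name across survey cycles.
conflicts = find_fd_violations(codebook, ["food_code"], "food_name")

print(sorted(duplicates))
print({key: sorted(vals) for key, vals in conflicts.items()})
```

In this toy data the pair (`V`, `01001`) appears twice, and code `01001` carries two different names, so both checks flag a record that would need correction before cycles could be merged under integrated codes.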

Keywords

Database; data cleaning; exposure factor; food intake; normalization

