http://dx.doi.org/10.5668/JEHS.2017.43.4.298

Data Cleaning and Integration of Multi-year Dietary Survey in the Korea National Health and Nutrition Examination Survey (KNHANES) using Database Normalization Theory  

Kwon, Namji (CHEM.I.NET, Ltd.)
Suh, Jihye (CHEM.I.NET, Ltd.)
Lee, Hunjoo (CHEM.I.NET, Ltd.)
Publication Information
Journal of Environmental Health Sciences / v.43, no.4, 2017, pp. 298-306
Abstract
Objectives: Since 1998, the Korea National Health and Nutrition Examination Survey (KNHANES) has been conducted to investigate the health and nutritional status of Koreans. Its individual food intake data have also been used as a source dataset for assessing the risk of chemical exposure via food. To improve the reliability of intake estimates and to prevent missing data for foods reported by few respondents, it is important to build an integrated dataset spanning multiple survey years. However, merging multi-year survey datasets is difficult because the cleaning processes for handling the large number of codes assigned to each food item are ineffective and because dietary habits change over time. This study therefore aims to 1) establish a cleaning process for abnormal data, 2) generate integrated multi-year raw data, and 3) contribute to the production of consistent dietary exposure factors.
Methods: Codebooks, the guideline book, and raw intake data from KNHANES V and VI were analyzed. The codebooks were tested for violations of the primary key constraint, and the structure of the raw data was tested for violations of the first through third normal forms (1NF-3NF) of relational database theory. The codes were then integrated, and the raw data were cleaned using the integrated codes.
Results: Duplicate key records and abnormal table structures were observed. After adjustment by the method described above, the codes were corrected and new integrated codes were created. Finally, we were able to clean the raw data provided by KNHANES respondents.
Conclusion: By clarifying, testing, and verifying the primary keys, code integrity, and raw data structure of national health data according to database normalization theory, the results of this study will contribute to the integration of multi-year datasets and help improve the data production system.
Keywords
Database; data cleaning; exposure factor; food intake; normalization
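The primary key and first normal form tests named in the Methods can be illustrated with a minimal Python sketch. It is not the authors' implementation: the field names (`food_code`, `food_name`, `survey_year`), the sample rows, and both helper functions are hypothetical and are assumed only for illustration.

```python
from collections import defaultdict


def find_key_violations(rows, key_fields):
    """Group rows by their key value and return only the keys that occur more than once."""
    groups = defaultdict(list)
    for row in rows:
        key = tuple(row[field] for field in key_fields)
        groups[key].append(row)
    return {key: dup for key, dup in groups.items() if len(dup) > 1}


def find_non_atomic(rows, field, delimiter=","):
    """Return rows whose field packs several values into one cell (a 1NF violation)."""
    return [row for row in rows if delimiter in row[field]]


# Hypothetical codebook rows; real KNHANES codebooks use different fields and codes.
codebook = [
    {"food_code": "F001", "food_name": "white rice", "survey_year": "2010"},
    {"food_code": "F001", "food_name": "brown rice", "survey_year": "2010"},
    {"food_code": "F002", "food_name": "kimchi", "survey_year": "2010,2011"},
]

# "F001" is assigned to two different foods, so food_code alone is not a valid key.
duplicates = find_key_violations(codebook, ["food_code"])

# The "F002" row stores two survey years in a single cell, which breaks atomicity.
non_atomic = find_non_atomic(codebook, "survey_year")
```

Rows flagged by either check would need a corrected or newly assigned code before datasets from different survey years are merged.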