• Title/Summary/Keyword: Imbalanced data


A Study of the Nutritional Status According to the State of Depression of Allergic Disease Patients: Based on the Korea National Health and Nutrition Examination Survey (알레르기성 질환자의 우울증 유무에 따른 영양 상태 연구: 국민건강영양조사 데이터를 이용하여)

  • Oh, Soo-Yeun
    • Journal of the Korean Dietetic Association / v.28 no.4 / pp.227-246 / 2022
  • This study was conducted on the nutritional status of 1,805 patients with allergic diseases (atopic dermatitis, allergic rhinitis, and asthma) aged 19 to 64 years according to their state of depression, based on the data from the Korea National Health and Nutrition Examination Survey (KNHANES). The Patient Health Questionnaire-9 (PHQ-9) was used to diagnose depression. Subjects with a score of 10 or more were categorized into the depression group (n=152) and the rest into the non-depression group (n=1,653). The results of this study were as follows: The proportion of women (75.7%) was higher than that of men (24.3%) in the depressed group (P<0.01). In terms of energy intake per 1,000 kcal, both men and women in the depressed group showed a lower energy intake than the non-depressed group and this intake was less than the estimated energy requirement (EER). The nutrient intakes of protein, calcium, phosphorus, iron, vitamin A, thiamine, riboflavin, niacin, folic acid, and vitamin C were below the estimated average requirement (EAR). Also, the intakes of fiber and potassium were less than the adequate intake (AI) (P<0.001). In the lifestyle parameters, the ratio of eating alone at lunch was 54.1%:33.1%, indicating that more than half of the depression group ate alone. In conclusion, it was observed that the nutritional status of allergic disease patients was imbalanced. The nutritional imbalance was due to insufficient energy intake and inadequate intake of nutrients, which was below the average requirements of vitamins and minerals and this was more evident in the depression group than in the non-depression group.

Predicting Highway Concrete Pavement Damage using XGBoost (XGBoost를 활용한 고속도로 콘크리트 포장 파손 예측)

  • Lee, Yongjun;Sun, Jongwan
    • Korean Journal of Construction Engineering and Management / v.21 no.6 / pp.46-55 / 2020
  • Highway pavement maintenance costs are steadily increasing as the road network continues to expand and as more aging routes exceed their service period. As a result, there is a need for a method of minimizing costs through preventive maintenance. Preventive maintenance requires a strategic plan based on accurate prediction of damage to aging highway pavement. Therefore, in this study, XGBoost, a machine learning classification model, was used to develop a highway pavement damage prediction model. First, we addressed the imbalanced data issue through data sampling, then developed a predictive model using XGBoost. The model was evaluated with performance indicators such as accuracy and F1 score. The over-sampling method showed the best performance. The main variables affecting road damage were, in order of importance, the number of years in service, ESAL, and the number of days with a minimum temperature below -2 degrees Celsius. If the performance of the prediction model is improved through more data accumulation and detailed data pre-processing in the future, more accurate prediction of sections requiring maintenance is expected to become possible. In addition, the model is expected to provide important basic information for estimating the highway pavement maintenance budget.

A study on the classification of body types for female junior high school students - Focused on the development of school uniforms - (여자 중학생의 체형분류에 관한 연구 - 교복패턴개발을 중심으로 -)

  • Shin, Jang-Hee
    • Journal of the Korea Fashion and Costume Design Association / v.22 no.3 / pp.99-110 / 2020
  • The growth patterns of junior high school girls in early adolescence are imbalanced and rapid, unlike childhood, when growth is relatively balanced, and the high school years, when the normal adult body type is nearly reached. In fact, size changes by body part differ significantly from individual to individual. Therefore, it has been hard for junior high school students to select the proper size when buying school uniforms. This study attempted to acquire basic data needed to address adolescent body shapes and school uniform patterns for junior high school girls, using the data from the 7th Size Korea Survey (2015). Specifically, it provides basic data for the development of school uniform patterns by classifying body shapes into particular types, after extracting body shape components and performing a cluster analysis with ANOVA. According to a factor analysis conducted to determine body shape components, six factors were obtained: Factor 1: bulk and horizontal size, Factor 2: body height and length, Factor 3: shoulder shape and length, Factor 4: shape of upper body, Factor 5: lower drop, and Factor 6: upper drop. Together, these factors accounted for a variance of 81.46%. To classify junior high school girls' body shapes and determine their characteristics, a cluster analysis was performed with the variables obtained from the factor analysis. Body shape was classified into three types. Type 1 accounted for 30.7%. This was a short, slender body with the smallest bulk, size, and upper drop. Type 2 accounted for 24.9%. This type was the largest in bulk and horizontal size, and in height and length as well. Type 3 accounted for 44.5%. This type was close to average in horizontal size, length, and height, and had high drop values. To develop school uniforms with great accuracy and body fit for junior high school students, there should be further studies on changes in body shape and their causes. 
The study results can serve as basic data for comparing branded school uniform patterns for junior high school girls and developing school uniform patterns based on body shape, using 3D virtual clothing simulations.

Effect of Sleep Duration on Dietary Habits and Body Composition of University Students (대학생의 수면시간에 따른 식습관 및 체조성에 관한 연구)

  • Kim, KyungHee;Cho, HeeSook
    • Journal of the Korean Society of Food Culture / v.28 no.5 / pp.539-546 / 2013
  • The aim of this study was to investigate the effect of sleep duration on dietary habits and body composition of university students. Sleep duration has recently been added to the list of risk factors for obesity. However, studies on this topic are fairly limited, particularly in Korea. We studied the relationship between the duration of sleep and obesity, principally based on body mass index and %body fat, in university students. For this purpose, a survey was conducted on a total of 312 university students. The subjects enrolled in this study were divided into two groups: (1) those with sleep duration of <7 hours (148 students) and (2) those with sleep duration of >7 hours (164 students). Using a self-reporting method, the participants filled out the questionnaires for more than 20 minutes. Based on the overall data obtained, we observed that most students (52.88%) skipped breakfast. This was mainly due to shortage of time (60.58%). We also observed that self-reported dietary habits included eating irregular meals (49.04%), overeating (19.55%), imbalanced diet (16.35%), and skipping meals (9.94%). Cookies were the favorite snack of half of the participants (50%). Our data reveal that the body mass index, fat mass, visceral fat, and subcutaneous fat of the shorter sleep duration group (<7 h/day) were 23.78 $kg/m^2$, 19.13 kg, 2.23 kg, and 11.15 kg, respectively. In contrast, in the control group (7 h/day), these values were 21.84 $kg/m^2$, 13.88 kg, 1.56 kg, and 12.11 kg. We also observed significant correlations of sleep duration with body mass index (p<0.05), fat mass (p<0.01), visceral fat (p<0.01), and Beck depression score (p<0.01). 
Our data suggest that the body mass index in the shorter sleep duration group was higher than that of the control group; however, %fat, visceral fat, and subcutaneous fat in the shorter sleep duration group were found to be higher than those of the control group. The data obtained through our study suggest that short sleep duration is clearly associated with a modest increase in general and abdominal obesity particularly in university students.

Development of a Gangwon Province Forest Fire Prediction Model using Machine Learning and Sampling (머신러닝과 샘플링을 이용한 강원도 지역 산불발생예측모형 개발)

  • Chae, Kyoung-jae;Lee, Yu-Ri;Cho, Yong-ju;Park, Ji-Hyun
    • The Journal of Bigdata / v.3 no.2 / pp.71-78 / 2018
  • This study applies machine learning techniques to increase the accuracy of a forest fire prediction model. It used 14 years of data, from 2003 to 2016, from Gangwon-do, where forest fires were the most frequent. To reduce weather data errors, Gangwon-do was divided into nine areas and the weather data from each area was used. However, dividing the forecast model into nine zones creates a large imbalance between the number of days on which a fire occurred and the number of days on which none occurred, and such imbalance can degrade model performance. To address this, several sampling methods were applied. To increase the accuracy of the model, five indices from the Canadian Forest Fire Weather Index (FWI) were used as derived variables. The modeling used logistic regression as a statistical method and random forest and XGBoost as machine learning methods. The final model for each zone was selected in consideration of accuracy, sensitivity, and specificity. Across the nine zones, the models correctly predicted 80 of the 104 fires that occurred and 7,426 of the 9,758 non-fires. Overall accuracy was 76.1%.
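
The reported figures can be checked with a few lines of arithmetic. The sketch below uses only the four counts given in the abstract; the variable names and the "always predict no fire" baseline are illustrative additions, not part of the study.

```python
# Counts reported in the abstract (all nine zones combined).
fires_detected = 80        # fire days correctly predicted
fires_total = 104          # fire days that actually occurred
non_fires_detected = 7426  # non-fire days correctly predicted
non_fires_total = 9758     # non-fire days in the evaluation data

# Sensitivity: share of actual fires that the models caught.
sensitivity = fires_detected / fires_total
# Specificity: share of non-fire days correctly labeled as such.
specificity = non_fires_detected / non_fires_total
# Accuracy: share of all days predicted correctly.
accuracy = (fires_detected + non_fires_detected) / (fires_total + non_fires_total)

# Illustrative baseline (not from the study): a model that never predicts
# a fire is right on every non-fire day and wrong on every fire day.
always_no_fire_accuracy = non_fires_total / (fires_total + non_fires_total)

print(f"sensitivity = {sensitivity:.3f}")
print(f"specificity = {specificity:.3f}")
print(f"accuracy    = {accuracy:.3f}")
print(f"baseline accuracy (never predict fire) = {always_no_fire_accuracy:.3f}")
```

The baseline line illustrates why accuracy alone is a weak criterion on data this imbalanced, and why the abstract also lists sensitivity and specificity as selection criteria.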

The Optimization of Ensembles for Bankruptcy Prediction (기업부도 예측 앙상블 모형의 최적화)

  • Myoung Jong Kim;Woo Seob Yun
    • Information Systems Review / v.24 no.1 / pp.39-57 / 2022
  • This paper proposes the GMOPTBoost algorithm to improve the performance of the AdaBoost algorithm for bankruptcy prediction, in which the class imbalance problem is inherent. The AdaBoost algorithm has the advantage of providing a robust learning opportunity for misclassified samples. However, it is limited in addressing the class imbalance problem because the concept of arithmetic mean accuracy is embedded in it. GMOPTBoost optimizes the geometric mean accuracy and effectively addresses the class imbalance problem by applying Gaussian gradient descent. The samples are constructed in two phases. First, five class-imbalanced datasets are constructed to verify the effect of the class imbalance problem on the performance of the prediction model and the performance improvement effect of GMOPTBoost. Second, class-balanced data are constructed through data sampling techniques to verify the performance improvement effect of GMOPTBoost. The main results of 30 rounds of cross-validation are as follows. First, the class imbalance problem degrades the performance of ensembles. Second, GMOPTBoost contributes to performance improvements of AdaBoost ensembles trained on imbalanced datasets. Third, data sampling techniques have a positive impact on performance. Finally, GMOPTBoost contributes to significant performance improvement of AdaBoost ensembles trained on balanced datasets.
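
The contrast between arithmetic mean accuracy and geometric mean accuracy can be illustrated with a small sketch. It assumes the common definition of geometric mean accuracy as the square root of sensitivity times specificity; the abstract does not spell out its exact formula, and the confusion-matrix counts below are hypothetical, not taken from the paper.

```python
import math

def accuracy(tp: int, fn: int, tn: int, fp: int) -> float:
    """Arithmetic accuracy: correct predictions over all predictions."""
    return (tp + tn) / (tp + fn + tn + fp)

def geometric_mean_accuracy(tp: int, fn: int, tn: int, fp: int) -> float:
    """Assumed definition: sqrt(sensitivity * specificity)."""
    sensitivity = tp / (tp + fn)  # recall on the minority (bankrupt) class
    specificity = tn / (tn + fp)  # recall on the majority (healthy) class
    return math.sqrt(sensitivity * specificity)

# Hypothetical data: 50 bankrupt firms, 950 healthy firms.
# Classifier A labels every firm as healthy.
a = dict(tp=0, fn=50, tn=950, fp=0)
# Classifier B catches 40 of 50 bankruptcies at the cost of 190 false alarms.
b = dict(tp=40, fn=10, tn=760, fp=190)

for name, counts in (("A", a), ("B", b)):
    print(name,
          f"accuracy={accuracy(**counts):.2f}",
          f"g-mean={geometric_mean_accuracy(**counts):.2f}")
```

Under these assumed numbers, classifier A wins on arithmetic accuracy while detecting no bankruptcies at all, whereas the geometric mean drops to zero as soon as one class is entirely missed.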

Parallel Method for HEVC Deblocking Filter based on Coding Unit Depth Information (코딩 유닛 깊이 정보를 이용한 HEVC 디블록킹 필터의 병렬화 기법)

  • Jo, Hyun-Ho;Ryu, Eun-Kyung;Nam, Jung-Hak;Sim, Dong-Gyu;Kim, Doo-Hyun;Song, Joon-Ho
    • Journal of Broadcast Engineering / v.17 no.5 / pp.742-755 / 2012
  • In this paper, we propose a parallel deblocking algorithm that resolves the workload imbalance that arises when the deblocking filter of a high efficiency video coding (HEVC) decoder is parallelized. In HEVC, the deblocking filter, one of the in-loop filters, performs two-step filtering, first on vertical edges and then on horizontal edges. Deblocking can be performed at high speed through data-level parallelism because there is no dependency between adjacent edges in the filtering process. However, workloads can be imbalanced among regions even when the same amount of data is allocated to each region, which causes performance loss in decoder parallelization. We solve this workload imbalance by predicting the complexity of deblocking filtering from coding unit (CU) depth information in each coding tree block (CTB) and allocating the same amount of workload to each core. Experimental results show that the proposed method achieves an average time saving (ATS) of 64.3% compared to single-core deblocking filtering, and an ATS of 6.7% on average and 13.5% at maximum compared to conventional uniform data-level parallelism.
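
The idea of allocating work by predicted cost rather than by data volume can be sketched as follows. The abstract gives neither the cost model nor the allocation rule, so both are assumptions here: `estimated_cost` is a hypothetical model in which each extra CU depth level doubles the filtering work, and the allocation is a generic greedy "largest job to the least-loaded core" heuristic, not the paper's algorithm.

```python
import heapq

def estimated_cost(cu_depth: int) -> int:
    # Hypothetical cost model: deeper CU splits mean more edges to filter.
    return 2 ** cu_depth

def balance_by_cost(ctb_depths: list[int], num_cores: int):
    """Assign CTBs to cores so that the estimated workloads are roughly equal."""
    heap = [(0, core) for core in range(num_cores)]  # (current load, core id)
    heapq.heapify(heap)
    assignment = {core: [] for core in range(num_cores)}
    # Visit the most expensive CTBs first, always giving work to the idlest core.
    order = sorted(range(len(ctb_depths)),
                   key=lambda i: estimated_cost(ctb_depths[i]), reverse=True)
    for ctb in order:
        load, core = heapq.heappop(heap)
        assignment[core].append(ctb)
        heapq.heappush(heap, (load + estimated_cost(ctb_depths[ctb]), core))
    loads = {core: sum(estimated_cost(ctb_depths[i]) for i in ctbs)
             for core, ctbs in assignment.items()}
    return assignment, loads

# Eight CTBs: four finely split (depth 3) followed by four unsplit (depth 0).
depths = [3, 3, 3, 3, 0, 0, 0, 0]

# Uniform data-level split: the same number of CTBs per core.
uniform_loads = [sum(estimated_cost(d) for d in depths[:4]),
                 sum(estimated_cost(d) for d in depths[4:])]
assignment, balanced_loads = balance_by_cost(depths, num_cores=2)

print("uniform split loads:", uniform_loads)
print("cost-based loads:   ", balanced_loads)
```

With these assumed costs, the uniform split leaves one core with eight times the work of the other, while the cost-based assignment gives both cores the same estimated load.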

Semi-supervised learning for sentiment analysis in mass social media (대용량 소셜 미디어 감성분석을 위한 반감독 학습 기법)

  • Hong, Sola;Chung, Yeounoh;Lee, Jee-Hyong
    • Journal of the Korean Institute of Intelligent Systems / v.24 no.5 / pp.482-488 / 2014
  • This paper aims to automatically analyze users' emotions by analyzing Twitter, a representative social network service (SNS). To build sentiment analysis models with machine learning techniques, sentiment labels that represent positive or negative emotions are required. However, it is very expensive to obtain sentiment labels for tweets. In this paper, we therefore propose a sentiment analysis model that uses a self-training technique in order to utilize "data without sentiment labels" as well as "data with sentiment labels". In self-training, the labels of "data without sentiment labels" are determined using "data with sentiment labels", and the model is then updated using both the "data with sentiment labels" and the newly labeled data. This technique improves sentiment analysis performance gradually. However, misclassifications of unlabeled data at an early stage affect model updates throughout the whole learning process, because the labels of unlabeled data never change once they are determined. Thus, the labels of "data without sentiment labels" need to be determined carefully. To achieve high performance with self-training, we propose three policies for updating the "data with sentiment labels" and conduct a comparative analysis. The first policy selects, from the newly labeled data, the data whose confidence is higher than a given threshold. The second policy chooses the same number of positive and negative examples from the newly labeled data in order to avoid the imbalanced class learning problem. The third policy limits the newly labeled data to a given maximum number, so that the model is updated gradually rather than with a large amount of data at once. Experiments are conducted using the Stanford data set, which is classified into positive and negative. 
As a result, the learned model achieves higher performance than models learned using only "data with sentiment labels" and than self-training with a regular model update policy.
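
The three update policies can be sketched as filters over a batch of newly pseudo-labeled tweets. This is a minimal illustration, not the authors' implementation: the function names, the `(text, label, confidence)` tuple layout, and the choice to keep the most confident items when trimming are all assumptions.

```python
from typing import List, Tuple

# (tweet text, predicted label, classifier confidence); the layout is assumed.
Prediction = Tuple[str, str, float]

def select_by_threshold(preds: List[Prediction], threshold: float) -> List[Prediction]:
    """Policy 1: keep only predictions whose confidence exceeds the threshold."""
    return [p for p in preds if p[2] > threshold]

def select_balanced(preds: List[Prediction]) -> List[Prediction]:
    """Policy 2: keep the same number of positive and negative predictions.
    Assumption: the most confident items of the larger class are kept."""
    by_confidence = sorted(preds, key=lambda p: p[2], reverse=True)
    pos = [p for p in by_confidence if p[1] == "pos"]
    neg = [p for p in by_confidence if p[1] == "neg"]
    n = min(len(pos), len(neg))
    return pos[:n] + neg[:n]

def select_capped(preds: List[Prediction], max_new: int) -> List[Prediction]:
    """Policy 3: add at most max_new predictions per update.
    Assumption: the most confident predictions are preferred."""
    return sorted(preds, key=lambda p: p[2], reverse=True)[:max_new]

# Hypothetical pseudo-labels produced in one self-training round.
batch = [
    ("great phone", "pos", 0.95),
    ("love it", "pos", 0.90),
    ("it is fine", "pos", 0.60),
    ("terrible service", "neg", 0.85),
    ("meh", "neg", 0.55),
]

print(len(select_by_threshold(batch, 0.8)))   # confident items only
print(len(select_balanced(batch)))            # equal positives and negatives
print(len(select_capped(batch, 2)))           # capped update size
```

Each function returns the subset that would be merged into the labeled pool before the next retraining step; the labeled pool and the retraining loop are left out to keep the sketch short.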

Conditional Generative Adversarial Network based Collaborative Filtering Recommendation System (Conditional Generative Adversarial Network(CGAN) 기반 협업 필터링 추천 시스템)

  • Kang, Soyi;Shin, Kyung-shik
    • Journal of Intelligence and Information Systems / v.27 no.3 / pp.157-173 / 2021
  • With the development of information technology, the amount of available information increases daily. However, having access to so much information makes it difficult for users to easily find the information they seek. Users want a visualized system that reduces information retrieval and learning time, saving them from personally reading and judging all available information. As a result, recommendation systems are increasingly important technologies that are essential to business. Collaborative filtering is used in various fields with excellent performance because recommendations are made based on similar user interests and preferences. However, limitations do exist. Sparsity occurs when user-item preference information is insufficient, and is the main limitation of collaborative filtering. The rating values in the user-item matrix may be distorted depending on the popularity of the product, or there may be new users who have not yet rated any items. The lack of historical data to identify consumer preferences is referred to as data sparsity, and various methods have been studied to address these problems. However, most attempts to solve the sparsity problem are not optimal because they can only be applied when additional data such as users' personal information, social networks, or characteristics of items are included. Another problem is that real-world score data are mostly biased toward high scores, resulting in severe imbalances. One cause of this imbalanced distribution is purchasing bias: only users who rate a product highly purchase it, so those who rate it low are less likely to purchase it and thus do not leave negative product reviews. Due to these characteristics, reviews by users who purchase products are more likely to be positive, unlike most users' actual preferences. 
Therefore, the actual rating data is over-learned in many classes with high incidence due to its biased characteristics, distorting the market. Applying collaborative filtering to these imbalanced data leads to poor recommendation performance due to excessive learning of biased classes. Traditional oversampling techniques to address this problem are likely to cause overfitting because they repeat the same data, which acts as noise in learning and reduces recommendation performance. In addition, most existing pre-processing methods for data imbalance problems are designed and used for binary classes. Binary class imbalance techniques are difficult to apply to multi-class problems because they cannot model multi-class cases, such as objects at cross-class boundaries or objects overlapping multiple classes. To solve this problem, research has been conducted on converting multi-class problems into binary class problems. However, simplifying multi-class problems can cause potential classification errors when combined with the results of classifiers learned from other sub-problems, resulting in loss of important information about relationships beyond the selected items. Therefore, it is necessary to develop more effective methods to address multi-class imbalance problems. We propose a collaborative filtering model that uses CGAN to generate realistic virtual data to populate the empty user-item matrix. The conditional vector y is used to identify the distributions of minority classes and to generate data reflecting their characteristics. Collaborative filtering then maximizes the performance of the recommendation system via hyperparameter tuning. This process should improve the accuracy of the model by addressing the sparsity problem of collaborative filtering implementations while mitigating data imbalances arising from real data. Our model has superior recommendation performance over existing oversampling techniques and over the original real-world data with data sparsity. 
SMOTE, Borderline SMOTE, SVM-SMOTE, ADASYN, and GAN were used as comparative models, and our model showed the highest prediction accuracy on the RMSE and MAE evaluation measures. Through this study, oversampling based on deep learning can further refine the performance of recommendation systems using actual data and can be used to build business recommendation systems.
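
For readers unfamiliar with the two evaluation measures named above, the following sketch computes RMSE and MAE from their standard definitions. The ratings are made-up illustrative values, not data from the study; lower values mean predictions closer to the actual ratings.

```python
import math
from typing import Sequence

def rmse(actual: Sequence[float], predicted: Sequence[float]) -> float:
    """Root mean squared error: penalizes large errors more heavily."""
    squared = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared) / len(squared))

def mae(actual: Sequence[float], predicted: Sequence[float]) -> float:
    """Mean absolute error: average size of the error, sign ignored."""
    absolute = [abs(a - p) for a, p in zip(actual, predicted)]
    return sum(absolute) / len(absolute)

# Hypothetical 1-5 star ratings and a recommender's predictions for them.
actual_ratings = [5, 4, 1, 3]
predicted_ratings = [4, 4, 3, 3]

print(f"RMSE = {rmse(actual_ratings, predicted_ratings):.3f}")
print(f"MAE  = {mae(actual_ratings, predicted_ratings):.3f}")
```

In this toy case the single two-star miss on the low rating dominates the RMSE, which is one reason errors on rare low ratings matter when rating data is skewed toward high scores.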

Experimental Comparison of Network Intrusion Detection Models Solving Imbalanced Data Problem (데이터의 불균형성을 제거한 네트워크 침입 탐지 모델 비교 분석)

  • Lee, Jong-Hwa;Bang, Jiwon;Kim, Jong-Wouk;Choi, Mi-Jung
    • KNOM Review / v.23 no.2 / pp.18-28 / 2020
  • With the development of the virtual community, the benefits that IT provides to people in fields such as healthcare, industry, communication, and culture are increasing, and the quality of life is also improving. At the same time, various malicious attacks target this developed network environment. Firewalls and intrusion detection systems exist to detect these attacks in advance, but they are limited in detecting malicious attacks that evolve day by day. To solve this problem, intrusion detection research using machine learning is being actively conducted, but false positives and false negatives occur due to imbalance in the training dataset. In this paper, a Random Oversampling method is used to solve the imbalance problem of the UNSW-NB15 dataset used for network intrusion detection. Through experiments, we compared and analyzed the accuracy, precision, recall, F1-score, training and prediction time, and hardware resource consumption of the models. Building on this study, we plan to develop more efficient network intrusion detection models using other methods and high-performance models that can solve the imbalanced data problem.
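
Random oversampling itself is simple enough to sketch in a few lines: minority-class records are duplicated at random until every class is as large as the largest one. The sketch below is a generic standard-library illustration, not the authors' pipeline; the function name, the seed parameter, and the toy records are assumptions.

```python
import random
from collections import Counter
from typing import Hashable, List, Sequence, Tuple

def random_oversample(samples: Sequence, labels: Sequence[Hashable],
                      seed: int = 0) -> Tuple[List, List]:
    """Duplicate randomly chosen minority samples until all classes match
    the size of the largest class."""
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    by_class = {}
    for sample, label in zip(samples, labels):
        by_class.setdefault(label, []).append(sample)
    target = max(len(items) for items in by_class.values())

    out_samples, out_labels = [], []
    for label, items in by_class.items():
        # Draw with replacement from the class's own records.
        extras = [rng.choice(items) for _ in range(target - len(items))]
        for sample in items + extras:
            out_samples.append(sample)
            out_labels.append(label)
    return out_samples, out_labels

# Toy traffic records: 8 normal flows and 2 attack flows.
flows = [f"flow-{i}" for i in range(10)]
kinds = ["normal"] * 8 + ["attack"] * 2

balanced_flows, balanced_kinds = random_oversample(flows, kinds)
print(Counter(kinds))           # before: 8 normal, 2 attack
print(Counter(balanced_kinds))  # after: 8 normal, 8 attack
```

Oversampling is normally applied only to the training split, so that duplicated records do not leak into the data used to measure accuracy, precision, recall, and F1-score.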