• Title/Summary/Keyword: Imbalanced Data

A study on intrusion detection performance improvement through imbalanced data processing (불균형 데이터 처리를 통한 침입탐지 성능향상에 관한 연구)

  • Jung, Il Ok;Ji, Jae-Won;Lee, Gyu-Hwan;Kim, Myo-Jeong
    • Convergence Security Journal / v.21 no.3 / pp.57-66 / 2021
  • As the detection performance of deep learning and machine learning in the intrusion detection field has been verified, their use is increasing day by day. However, the data required for training are difficult to collect, and the imbalance of the collected data makes it difficult to carry machine learning performance over to real-world settings. To address this problem, this paper proposes a mixed sampling technique that uses t-SNE visualization for imbalanced data processing. First, intrusion detection events, including the payload, are separated into fields according to their characteristics, and TF-IDF-based features are extracted from the separated fields. After the mixed sampling technique is applied to the extracted features, a data set optimized for intrusion detection with imbalanced data is obtained through data visualization using t-SNE. Nine sampling techniques were applied to the open intrusion detection dataset CSIC2012, and the F-score and G-mean evaluation indicators verified that the proposed sampling technique improves detection performance.
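The two evaluation indicators named in this abstract can be sketched as follows. This is a minimal illustration that assumes binary labels with 1 as the minority (intrusion) class. The function names are hypothetical and are not taken from the paper.

```python
import math


def confusion_counts(y_true, y_pred, positive=1):
    """Count true/false positives and negatives for a binary problem."""
    tp = fp = tn = fn = 0
    for t, p in zip(y_true, y_pred):
        if p == positive:
            if t == positive:
                tp += 1
            else:
                fp += 1
        else:
            if t == positive:
                fn += 1
            else:
                tn += 1
    return tp, fp, tn, fn


def f_score(y_true, y_pred, beta=1.0, positive=1):
    """F-beta score: weighted harmonic mean of precision and recall."""
    tp, fp, tn, fn = confusion_counts(y_true, y_pred, positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return 0.0
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)


def g_mean(y_true, y_pred, positive=1):
    """Geometric mean of sensitivity and specificity."""
    tp, fp, tn, fn = confusion_counts(y_true, y_pred, positive)
    sensitivity = tp / (tp + fn) if tp + fn else 0.0
    specificity = tn / (tn + fp) if tn + fp else 0.0
    return math.sqrt(sensitivity * specificity)
```

Unlike plain accuracy, both measures stay low when the minority class is poorly detected, which is why they are commonly reported for imbalanced data.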

A divide-oversampling and conquer algorithm based support vector machine for massive and highly imbalanced data (불균형의 대용량 범주형 자료에 대한 분할-과대추출 정복 서포트 벡터 머신)

  • Bang, Sungwan;Kim, Jaeoh
    • The Korean Journal of Applied Statistics / v.35 no.2 / pp.177-188 / 2022
  • The support vector machine (SVM) has been successfully applied to various classification areas with a high level of classification accuracy. However, it is infeasible to use the SVM for analyzing massive data because of its heavy computational burden. Furthermore, when analyzing imbalanced data with different class sizes, the classification accuracy of the SVM for the minority class may drop significantly because the classifier can be biased toward the majority class. To overcome these problems, we propose the DOC-SVM method, which uses divide-oversampling and conquer techniques. The proposed DOC-SVM divides the majority class into a few subsets and applies an oversampling technique to the minority class in order to produce balanced subsets. The DOC-SVM then obtains the final classifier by aggregating all SVM classifiers obtained from the balanced subsets. Simulation studies are presented to demonstrate the satisfactory performance of the proposed method.
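The divide and oversample steps described above can be sketched as below. The abstract does not say which oversampling technique or aggregation rule DOC-SVM uses, so random oversampling with replacement and a sign-of-sum vote are assumptions made only for illustration. The SVM fitting itself is left out, and the function names are hypothetical.

```python
import random


def balanced_subsets(majority, minority, n_subsets, seed=0):
    """Divide the majority class and pair each part with an oversampled minority set."""
    rng = random.Random(seed)
    shuffled = majority[:]
    rng.shuffle(shuffled)
    # Divide: deal the shuffled majority samples into n_subsets roughly equal parts.
    chunks = [shuffled[i::n_subsets] for i in range(n_subsets)]
    subsets = []
    for chunk in chunks:
        # Oversample (assumed: random draws with replacement) until the minority
        # set is as large as this majority chunk.
        extra = max(len(chunk) - len(minority), 0)
        oversampled = minority + [rng.choice(minority) for _ in range(extra)]
        subsets.append((chunk, oversampled))
    return subsets


def aggregate_votes(predictions):
    """Conquer (assumed rule): combine per-subset labels in {-1, +1} by the sign of their sum."""
    return 1 if sum(predictions) >= 0 else -1
```

Each returned pair would then be used to train one classifier, so every individual fit sees balanced classes and only a fraction of the full data.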

Weighted L1-Norm Support Vector Machine for the Classification of Highly Imbalanced Data (불균형 자료의 분류분석을 위한 가중 L1-norm SVM)

  • Kim, Eunkyung;Jhun, Myoungshic;Bang, Sungwan
    • The Korean Journal of Applied Statistics / v.28 no.1 / pp.9-21 / 2015
  • The support vector machine (SVM) has been successfully applied to various classification areas due to its flexibility and high level of classification accuracy. However, when analyzing imbalanced data with uneven class sizes, the classification accuracy of the SVM may drop significantly for the minority class because SVM classifiers are undesirably biased toward the majority class. The weighted $L_2$-norm SVM was developed for the analysis of imbalanced data; however, it cannot identify irrelevant input variables due to the characteristics of the ridge penalty. Therefore, we propose the weighted $L_1$-norm SVM, which uses the lasso penalty to select important input variables and uses weights to differentiate the misclassification of data points between classes. We demonstrate the satisfactory performance of the proposed method through simulation studies and a real data analysis.
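A generic form of the objective described above is given below as a sketch, not as the paper's exact formulation. The notation is assumed: labels $y_i \in \{-1, +1\}$, inputs $x_i \in \mathbb{R}^p$, class-dependent weights $w_i > 0$ (larger for the minority class), and tuning parameter $\lambda \ge 0$.

```latex
\min_{\beta_0,\,\beta}\;
\sum_{i=1}^{n} w_i \bigl[\, 1 - y_i(\beta_0 + x_i^{\top}\beta) \,\bigr]_{+}
\;+\; \lambda \lVert \beta \rVert_1,
\qquad
\lVert \beta \rVert_1 = \sum_{j=1}^{p} \lvert \beta_j \rvert
```

Replacing the penalty with $\lambda \lVert \beta \rVert_2^2$ gives the weighted $L_2$-norm (ridge) version. The lasso penalty can shrink some coefficients exactly to zero, and that is what allows irrelevant input variables to be dropped.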

A Study on the Improvement of Image Classification Performance in the Defense Field through Cost-Sensitive Learning of Imbalanced Data (불균형데이터의 비용민감학습을 통한 국방분야 이미지 분류 성능 향상에 관한 연구)

  • Jeong, Miae;Ma, Jungmok
    • Journal of the Korea Institute of Military Science and Technology / v.24 no.3 / pp.281-292 / 2021
  • With the development of deep learning technology, researchers and practitioners keep attempting to apply deep learning in various industrial and academic fields, including defense. Most of these attempts assume that the data are balanced. In reality, much of the data is imbalanced, so the classifier is not built properly and model performance can be low. Therefore, this study proposes cost-sensitive learning as a solution to the imbalanced data problem of image classification in the defense field. In the proposed model, cost-sensitive learning assigns a higher weight to the minority class in the cost function. Across 160 experiments using a submarine/non-submarine dataset and a warship/non-warship dataset, the test F1-score is higher with cost-sensitive learning than with general learning. Furthermore, statistical tests show that the differences are significant.
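The idea of giving the minority class a higher weight in the cost function can be illustrated with a class-weighted binary cross-entropy. This is a minimal sketch: the abstract does not state the paper's actual loss or weight values, so the function name and the default `minority_weight` are assumptions.

```python
import math


def weighted_bce(y_true, p_pred, minority_weight=5.0, eps=1e-12):
    """Mean binary cross-entropy with a larger penalty on minority-class (label 1) errors."""
    total = 0.0
    for y, p in zip(y_true, p_pred):
        # Clip probabilities so that log() never receives 0.
        p = min(max(p, eps), 1.0 - eps)
        if y == 1:
            total += -minority_weight * math.log(p)
        else:
            total += -math.log(1.0 - p)
    return total / len(y_true)
```

With `minority_weight=1.0` this reduces to the ordinary cross-entropy. Larger values make a missed minority sample cost more than a false alarm on the majority class.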

SMOTE by Mahalanobis distance using MCD in imbalanced data (불균형 자료에서 MCD를 활용한 마할라노비스 거리에 의한 SMOTE)

  • Jieun Jung;Yong-Seok Choi
    • The Korean Journal of Applied Statistics / v.37 no.4 / pp.455-465 / 2024
  • SMOTE (synthetic minority over-sampling technique) is the most widely used solution to the problem of imbalanced data. SMOTE selects nearest neighbors based on Euclidean distance. However, Euclidean distance has the disadvantage of not considering the correlation between variables. In contrast, the Mahalanobis distance takes the covariance of the variables into account. However, outliers usually distort the calculation of the Mahalanobis distance. To solve this problem, we compute the Mahalanobis distance with a covariance matrix estimated by MCD (minimum covariance determinant), and then apply this MCD-based Mahalanobis distance to SMOTE to create new data. We show that in most cases this method yields high performance indicators for classifying imbalanced data.
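The sketch below shows how a Mahalanobis distance can replace the Euclidean distance in SMOTE's neighbor search. It assumes the inverse covariance matrix `cov_inv` is supplied by the caller, for example from a robust MCD estimate, and the MCD computation itself is not shown. The function names and parameters are hypothetical, and the minority set must contain at least two points.

```python
import random


def mahalanobis_sq(a, b, cov_inv):
    """Squared Mahalanobis distance (a - b)^T cov_inv (a - b)."""
    d = [ai - bi for ai, bi in zip(a, b)]
    n = len(d)
    return sum(d[i] * cov_inv[i][j] * d[j] for i in range(n) for j in range(n))


def smote_mahalanobis(minority, cov_inv, k=2, n_new=5, seed=0):
    """Create synthetic minority points by interpolating toward Mahalanobis-nearest neighbors."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # Rank the other minority points by Mahalanobis distance to the base point.
        others = [m for m in minority if m is not base]
        others.sort(key=lambda m: mahalanobis_sq(base, m, cov_inv))
        neighbor = rng.choice(others[:k])
        # Standard SMOTE step: pick a random point on the segment base -> neighbor.
        gap = rng.random()
        synthetic.append([b + gap * (nb - b) for b, nb in zip(base, neighbor)])
    return synthetic
```

Passing the identity matrix as `cov_inv` recovers ordinary Euclidean SMOTE, which makes the two variants easy to compare on the same data.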

On sampling algorithms for imbalanced binary data: performance comparison and some caveats (불균형적인 이항 자료 분석을 위한 샘플링 알고리즘들: 성능비교 및 주의점)

  • Kim, HanYong;Lee, Woojoo
    • The Korean Journal of Applied Statistics / v.30 no.5 / pp.681-690 / 2017
  • Imbalanced binary classification problems arise in many settings, such as fraud detection in banking operations, spam mail detection, and defective product prediction. Several sampling methods, such as over-sampling, under-sampling, and SMOTE, have been developed to overcome the poor prediction performance of binary classifiers when one group dominates. In this study, we investigate the prediction performance of logistic regression, Lasso, random forest, boosting, and support vector machine in combination with these sampling methods for binary imbalanced data. Four real data sets are analyzed to see if there is a substantial improvement in prediction performance. We also emphasize some precautions to take when the sampling methods are implemented.

Selecting the optimal threshold based on impurity index in imbalanced classification (불균형 자료에서 불순도 지수를 활용한 분류 임계값 선택)

  • Jang, Shuin;Yeo, In-Kwon
    • The Korean Journal of Applied Statistics / v.34 no.5 / pp.711-721 / 2021
  • In this paper, we propose a method of adjusting thresholds using impurity indices in classification analysis of imbalanced data. Suppose that, for imbalanced binary data, the minority category is Positive and the majority category is Negative. When categories are assigned using the commonly used threshold of 0.5, the specificity tends to be high in imbalanced data while the sensitivity is relatively low. Increasing sensitivity matters when correctly classifying objects in the minority category is relatively important. We explore how to increase sensitivity by adjusting thresholds. Existing studies have adjusted thresholds based on measures such as G-Mean and F1-score, but in this paper, we propose a method to select optimal thresholds using the chi-square statistic of CHAID, the Gini index of CART, and the entropy of C4.5. We also describe how to obtain a unique value when multiple optimal thresholds are found. Empirical analysis, using classification performance metrics, shows the improvements over the results based on the 0.5 threshold.
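One of the three criteria named above, the Gini index, can be used for threshold selection as sketched below. This is only an illustration under stated assumptions: labels are 0/1, the candidate thresholds are supplied by the caller, and ties are broken by keeping the first candidate. The tie-breaking rule proposed in the paper is not described in the abstract. The function names are hypothetical.

```python
def gini(labels):
    """Gini impurity of a list of 0/1 labels: 2p(1 - p)."""
    n = len(labels)
    if n == 0:
        return 0.0
    p = sum(labels) / n
    return 2.0 * p * (1.0 - p)


def best_threshold_by_gini(y_true, scores, candidates):
    """Pick the score threshold whose split has the lowest weighted Gini impurity."""
    best_t, best_impurity = None, float("inf")
    n = len(y_true)
    for t in candidates:
        # Samples scoring below t are predicted Negative, the rest Positive.
        below = [y for y, s in zip(y_true, scores) if s < t]
        above = [y for y, s in zip(y_true, scores) if s >= t]
        impurity = (len(below) * gini(below) + len(above) * gini(above)) / n
        if impurity < best_impurity:  # strict '<' keeps the first of tied candidates
            best_t, best_impurity = t, impurity
    return best_t, best_impurity
```

When minority samples receive predicted probabilities well below 0.5, as often happens with imbalanced data, the selected threshold falls below 0.5 and the sensitivity rises accordingly.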

Machine Learning Based Intrusion Detection Systems for Class Imbalanced Datasets (클래스 불균형 데이터에 적합한 기계 학습 기반 침입 탐지 시스템)

  • Cheong, Yun-Gyung;Park, Kinam;Kim, Hyunjoo;Kim, Jonghyun;Hyun, Sangwon
    • Journal of the Korea Institute of Information Security & Cryptology / v.27 no.6 / pp.1385-1395 / 2017
  • This paper aims to develop an IDS (Intrusion Detection System) that takes class-imbalanced datasets into account. To this end, we first built a set of training data sets from the Kyoto 2006+ dataset, in which the amounts of normal data and abnormal (intrusion) data are not balanced. We then ran a number of tests to evaluate the effectiveness of machine learning techniques for detecting intrusions. Our evaluation results demonstrate that the Random Forest algorithm achieved the best performance.

The Quality of a Traditional Dietary Pattern in Relation to Metabolic Syndrome in Elderly South Koreans

  • Oh, Chorong;No, Jaekyung
    • Journal of Obesity & Metabolic Syndrome / v.27 no.4 / pp.254-261 / 2018
  • Background: The most beneficial dietary pattern for managing metabolic syndrome (MetS) in the elderly has not been ascertained. The aim of this study is to classify dietary patterns and to examine associations between dietary pattern, MetS, and body composition in elderly Koreans. Methods: This study was conducted among Koreans 65 years or older using data from the Korea National Health and Nutrition Examination Survey in 2009. A total of 1,567 study subjects were included. All statistical analyses were conducted using SPSS version 20.0, and dietary patterns were classified by cluster analysis. Results: Three dietary patterns were derived by cluster analysis in this study. We observed that most South Korean elderly still maintain a traditional dietary pattern. Dietary patterns were classified as balanced (31%), imbalanced (40%), or very imbalanced (30%), with the majority of subjects having an imbalanced dietary pattern in which their total energy and nutrient intake was insufficient compared with the Dietary Reference Intake for Koreans. Those in the very imbalanced group had a ratio of macronutrients (carbohydrates:fats:protein) of 81.15:7.18:11.50 and a 54% higher likelihood of having hypertriglyceridemia (P=0.025) compared with those in the balanced group. Conclusion: The current findings indicate that the diets of South Korean elderly are nutritionally imbalanced, including high carbohydrate consumption, which confers a high risk of hypertriglyceridemia. These findings highlight the effect of nutritional imbalance in elderly people with MetS.

Severity-based Software Quality Prediction using Class Imbalanced Data

  • Hong, Euy-Seok;Park, Mi-Kyeong
    • Journal of the Korea Society of Computer and Information / v.21 no.4 / pp.73-80 / 2016
  • Most fault prediction models have class imbalance problems because training data usually contain many more non-fault class modules than fault class ones. This imbalanced distribution makes it difficult for the models to learn the minority class module data. Data imbalance is much higher when severity-based fault prediction is used, because high-severity fault modules are a smaller subset of the fault modules. In this paper, we propose severity-based models that address these problems using three sampling methods: Resample, SpreadSubSample, and SMOTE. Empirical results show that the Resample method has typical over-fitting problems, and the SpreadSubSample method cannot enhance the prediction performance of the models. Unlike these two methods, the SMOTE method shows good performance in terms of AUC and FNR values. In particular, the J48 decision tree model using SMOTE outperforms the other prediction models.
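The two measures used in this comparison, AUC and FNR, can be computed as in the following sketch. It assumes 0/1 labels with 1 as the fault class and uses the pairwise-comparison definition of AUC. The function names are hypothetical and are not taken from the paper or its tooling.

```python
def false_negative_rate(y_true, y_pred):
    """Share of actual fault modules (label 1) that were predicted as non-fault (label 0)."""
    positives = sum(1 for t in y_true if t == 1)
    missed = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return missed / positives if positives else 0.0


def auc(y_true, scores):
    """AUC as the probability that a random positive outscores a random negative (ties count 0.5)."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))
```

A low FNR means few faulty modules are missed, which is usually the priority in fault prediction. AUC summarizes how well the scores rank modules, independent of any single threshold.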