Search | Korea Science

On sampling algorithms for imbalanced binary data: performance comparison and some caveats (불균형적인 이항 자료 분석을 위한 샘플링 알고리즘들: 성능비교 및 주의점)

Kim, HanYong;Lee, Woojoo
- The Korean Journal of Applied Statistics, v.30 no.5, pp.681-690, 2017
Imbalanced binary classification problems arise in many settings, such as fraud detection in banking operations, spam mail detection, and prediction of defective products. Several sampling methods, such as over-sampling, under-sampling, and SMOTE, have been developed to overcome the poor prediction performance of binary classifiers when one group dominates. In this study, we investigate the prediction performance of logistic regression, Lasso, random forest, boosting, and support vector machines in combination with these sampling methods for imbalanced binary data. Four real data sets are analyzed to see whether there is a substantial improvement in prediction performance. We also emphasize some precautions to take when the sampling methods are implemented.
https://doi.org/10.5351/KJAS.2017.30.5.681
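As a rough illustration of the over-sampling and under-sampling ideas named in the abstract above, the following sketch balances a toy data set with NumPy. The function names, the toy data, and the split are all hypothetical. Resampling only the training rows is a commonly recommended precaution and an assumption here, since the abstract does not list its caveats. SMOTE is not shown.

```python
import numpy as np

def random_oversample(X, y, rng):
    """Duplicate minority-class rows (with replacement) until classes are balanced."""
    classes, counts = np.unique(y, return_counts=True)
    target = counts.max()
    parts = []
    for c, n in zip(classes, counts):
        idx = np.flatnonzero(y == c)
        if n < target:
            # Draw the missing rows at random from the existing ones.
            idx = np.concatenate([idx, rng.choice(idx, size=target - n, replace=True)])
        parts.append(idx)
    idx_all = np.concatenate(parts)
    return X[idx_all], y[idx_all]

def random_undersample(X, y, rng):
    """Keep a random subset of each class, sized to the smallest class."""
    classes, counts = np.unique(y, return_counts=True)
    target = counts.min()
    idx_all = np.concatenate([
        rng.choice(np.flatnonzero(y == c), size=target, replace=False)
        for c in classes
    ])
    return X[idx_all], y[idx_all]

rng = np.random.default_rng(0)

# Toy data (hypothetical): 1000 rows, 5% positives.
X = rng.normal(size=(1000, 3))
y = np.array([1] * 50 + [0] * 950)

# Split first, then resample the training part only, so that the
# held-out rows keep the original class proportions.
train_mask = np.arange(1000) % 10 < 7
X_train, y_train = X[train_mask], y[train_mask]
X_test, y_test = X[~train_mask], y[~train_mask]

X_over, y_over = random_oversample(X_train, y_train, rng)
X_under, y_under = random_undersample(X_train, y_train, rng)
```

Any classifier would then be fitted on `X_over` or `X_under` and evaluated on the untouched `X_test`.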

Comparison of nomograms designed to predict hypertension with a complex sample (고혈압 예측을 위한 노모그램 구축 및 비교)

Kim, Min Ho;Shin, Min Seok;Lee, Jea Young
- The Korean Journal of Applied Statistics, v.33 no.5, pp.555-567, 2020
The incidence of hypertension is steadily increasing, and hypertension is a risk factor for secondary diseases such as cardiovascular disease. Therefore, it is important to predict the incidence rate of the disease. In this study, we constructed nomograms that can predict the incidence rate of hypertension. We used data from the Korean National Health and Nutrition Examination Survey (KNHANES) for 2013-2016. Because the data come from a complex sample, a Rao-Scott chi-squared test was used to identify 10 risk factors for hypertension. Smoking and exercise variables were not statistically significant in the logistic regression; therefore, eight effects were selected as risk factors for hypertension. Logistic and Bayesian nomograms constructed from the selected risk factors were proposed and compared. The constructed nomograms were then verified using a receiver operating characteristic curve and a calibration plot.
https://doi.org/10.5351/KJAS.2020.33.5.555

Performance Comparison for Radar Target Classification of Monostatic RCS and Bistatic RCS (모노스태틱 RCS와 바이스태틱 RCS의 표적 구분 성능 분석)

Lee, Sung-Jun;Choi, In-Sik
- The Journal of Korean Institute of Electromagnetic Engineering and Science, v.21 no.12, pp.1460-1466, 2010
In this paper, we analyze the performance of radar target classification using the monostatic and bistatic radar cross section (RCS) of four different wire targets. The short-time Fourier transform (STFT) and the continuous wavelet transform (CWT) were used for feature extraction from the monostatic RCS and the bistatic RCS of each target, and a multi-layered perceptron (MLP) neural network was used as the classifier. Results show that CWT yields better performance than STFT for both the monostatic RCS and the bistatic RCS. When STFT was used, the performance of the bistatic RCS was slightly better than that of the monostatic RCS; when CWT was used, the performance of the monostatic RCS was slightly better than that of the bistatic RCS. These results show that the bistatic RCS is a good candidate for radar target classification in combination with the monostatic RCS.
https://doi.org/10.5515/KJKIEES.2010.21.12.1460

Validation Technique of Trace-Driven Simulation Model Using Weighted F-measure (가중 F 척도를 이용한 Trace-Driven 시뮬레이션 모델의 검증 방법)

HwangBo, Hoon;Cheon, Hyeon-Jae;Lee, Hong-Chul
- Journal of the Korea Society for Simulation, v.18 no.4, pp.185-195, 2009
As systems grow more complicated, system analysis using simulation has drawn increasing attention. One of the core parts of simulation analysis is validation of the simulation model, which establishes how well the model represents the real system. Differences between the input data of the two systems affect the comparison between a simulation model and a real system at the validation stage, and results obtained under such differences are not enough to ensure high credibility of the model. Accordingly, in this paper, we construct a model based on trace-driven simulation, which uses the same input data as the real system. In addition, to validate the model class by class rather than by a single statistic, we validate the model using a metric transformed from the F-measure, which estimates the performance of a classifier in the data mining field. This procedure enables precise validation of a model and supports model modification by offering feedback at the validation phase.
https://doi.org/10.9709/JKSS.2009.18.4.185
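The abstract above does not say how its metric is derived from the F-measure, so the sketch below shows only the standard weighted F-measure (F-beta) computed from true-positive, false-positive, and false-negative counts. The function name and the example counts are hypothetical and are not taken from the paper.

```python
def f_beta(tp, fp, fn, beta=1.0):
    """Standard weighted F-measure; beta > 1 weights recall more than precision."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision == 0.0 and recall == 0.0:
        return 0.0
    b2 = beta * beta
    return (1.0 + b2) * precision * recall / (b2 * precision + recall)

# Hypothetical counts for one class: precision = 0.8, recall = 0.5.
f1 = f_beta(tp=8, fp=2, fn=8)
f2 = f_beta(tp=8, fp=2, fn=8, beta=2.0)      # favors recall
f_half = f_beta(tp=8, fp=2, fn=8, beta=0.5)  # favors precision
```

Because recall is lower than precision in this example, the score falls as `beta` grows.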

Intelligent System for the Prediction of Heart Diseases Using Machine Learning Algorithms with a New Mixed Feature Creation (MFC) Technique

Rawia Elarabi;Abdelrahman Elsharif Karrar;Murtada El-mukashfi El-taher
- International Journal of Computer Science & Network Security, v.23 no.5, pp.148-162, 2023
Classification systems can significantly assist the medical sector by allowing for the precise and quick diagnosis of diseases. As a result, both doctors and patients will save time. A possible way of identifying risk variables is to use machine learning algorithms. Non-surgical technologies, such as machine learning, are trustworthy and effective in categorizing healthy and heart-disease patients, and they save time and effort. The goal of this study is to create a medical intelligent decision support system based on machine learning for the diagnosis of heart disease. We used a mixed feature creation (MFC) technique to generate new features from the UCI Cleveland Cardiology dataset. We select the most suitable features by using the Least Absolute Shrinkage and Selection Operator (LASSO), Recursive Feature Elimination with Random Forest feature selection (RFE-RF), and the best features of both LASSO and RFE-RF (BLR). Cross-validation and grid search are used to optimize the parameters of the estimator used in applying these algorithms. Classifier performance assessment metrics, including classification accuracy, specificity, sensitivity, precision, and F1-score, are reported for each classification model along with execution time and RMSE, and the results are presented independently for comparison. Our proposed work finds the best potential outcome across all available prediction models and improves the system's performance, allowing physicians to diagnose heart patients more accurately.
https://doi.org/10.22937/IJCSNS.2023.23.5.17
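The assessment metrics listed in the abstract above all follow from the four cells of a binary confusion matrix. The sketch below uses the standard definitions. The function name and the labels are hypothetical, and nothing here reproduces the paper's data or results.

```python
def binary_metrics(y_true, y_pred):
    """Accuracy, sensitivity, specificity, precision, and F1 for 0/1 labels."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)

    def ratio(num, den):
        # Return 0.0 instead of dividing by zero when a class is absent.
        return num / den if den else 0.0

    sensitivity = ratio(tp, tp + fn)  # recall on the positive class
    specificity = ratio(tn, tn + fp)  # recall on the negative class
    precision = ratio(tp, tp + fp)
    return {
        "accuracy": ratio(tp + tn, len(pairs)),
        "sensitivity": sensitivity,
        "specificity": specificity,
        "precision": precision,
        "f1": ratio(2 * precision * sensitivity, precision + sensitivity),
    }

# Hypothetical labels: 4 patients with disease (1), 6 without (0).
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 0, 0, 0, 0, 1, 1]
metrics = binary_metrics(y_true, y_pred)
```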

Improving Field Crop Classification Accuracy Using GLCM and SVM with UAV-Acquired Images

Seung-Hwan Go;Jong-Hwa Park
- Korean Journal of Remote Sensing, v.40 no.1, pp.93-101, 2024
Accurate field crop classification is essential for various agricultural applications, yet existing methods face challenges due to diverse crop types and complex field conditions. This study aimed to address these issues by combining support vector machine (SVM) models with multi-seasonal unmanned aerial vehicle (UAV) images, texture information extracted from Gray Level Co-occurrence Matrix (GLCM), and RGB spectral data. Twelve high-resolution UAV image captures spanned March-October 2021, while field surveys on three dates provided ground truth data. We focused on data from August (-A), September (-S), and October (-O) images and trained four support vector classifier (SVC) models (SVC-A, SVC-S, SVC-O, SVC-AS) using visual bands and eight GLCM features. Farm maps provided by the Ministry of Agriculture, Food and Rural Affairs proved efficient for open-field crop identification and served as a reference for accuracy comparison. Our analysis showcased the significant impact of hyperparameter tuning (C and gamma) on SVM model performance, requiring careful optimization for each scenario. Importantly, we identified models exhibiting distinct high-accuracy zones, with SVC-O trained on October data achieving the highest overall and individual crop classification accuracy. This success likely stems from its ability to capture distinct texture information from mature crops. Incorporating GLCM features proved highly effective for all models, significantly boosting classification accuracy. Among these features, homogeneity, entropy, and correlation consistently demonstrated the most impactful contribution. However, balancing accuracy with computational efficiency and feature selection remains crucial for practical application. Performance analysis revealed that SVC-O achieved exceptional results in overall and individual crop classification, while soybeans and rice were consistently classified well by all models.
Challenges were encountered with cabbage due to its early growth stage and low field cover density. The study demonstrates the potential of utilizing farm maps and GLCM features in conjunction with SVM models for accurate field crop classification. Careful parameter tuning and model selection based on specific scenarios are key for optimizing performance in real-world applications.
https://doi.org/10.7780/kjrs.2024.40.1.9
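To make the GLCM texture features named in the abstract above more concrete, the sketch below builds a co-occurrence matrix for one pixel offset and derives homogeneity, entropy, and correlation from it. This is a minimal NumPy illustration under common textbook definitions. The function names, the 4x4 patch, the offset, and the number of gray levels are assumptions and do not reflect the settings used in the study.

```python
import numpy as np

def glcm(image, levels, dy=0, dx=1):
    """Symmetric, normalized co-occurrence matrix for a non-negative offset (dy, dx)."""
    img = np.asarray(image, dtype=int)
    h, w = img.shape
    ref = img[:h - dy, :w - dx]  # reference pixels
    nbr = img[dy:, dx:]          # neighbors at the given offset
    m = np.zeros((levels, levels), dtype=float)
    np.add.at(m, (ref.ravel(), nbr.ravel()), 1.0)  # count each gray-level pair
    m = m + m.T                  # count every pair in both directions
    return m / m.sum()

def glcm_features(p):
    """Homogeneity, entropy (base 2), and correlation of a normalized GLCM."""
    i, j = np.indices(p.shape)
    nonzero = p[p > 0]
    mu_i, mu_j = np.sum(i * p), np.sum(j * p)
    sd_i = np.sqrt(np.sum((i - mu_i) ** 2 * p))
    sd_j = np.sqrt(np.sum((j - mu_j) ** 2 * p))
    return {
        "homogeneity": float(np.sum(p / (1.0 + (i - j) ** 2))),
        "entropy": float(-np.sum(nonzero * np.log2(nonzero))),
        # Undefined for a constant patch, where both deviations are zero.
        "correlation": float(np.sum((i - mu_i) * (j - mu_j) * p) / (sd_i * sd_j)),
    }

# Hypothetical 4x4 patch already quantized to 4 gray levels.
patch = [[0, 0, 1, 1],
         [0, 0, 1, 1],
         [0, 2, 2, 2],
         [2, 2, 3, 3]]
p = glcm(patch, levels=4)
features = glcm_features(p)
```

In a pipeline like the one described, such features would be computed per image window and combined with the RGB band values as classifier inputs.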

Enhancing Alzheimer's Disease Classification using 3D Convolutional Neural Network and Multilayer Perceptron Model with Attention Network

Enoch A. Frimpong;Zhiguang Qin;Regina E. Turkson;Bernard M. Cobbinah;Edward Y. Baagyere;Edwin K. Tenagyei
- KSII Transactions on Internet and Information Systems (TIIS), v.17 no.11, pp.2924-2944, 2023
Alzheimer's disease (AD) is a neurological condition that is recognized as one of the primary causes of memory loss. AD currently has no cure. Therefore, developing an efficient, high-precision model for timely detection of the disease is essential. When AD is detected early, treatment is more likely to be successful. The most often utilized indicators for AD identification are the Mini-Mental State Examination (MMSE) and the clinical dementia. However, the use of these indicators as ground truth marking could be imprecise for AD detection. Researchers have proposed several computer-aided frameworks, and lately supervised models are mostly used. In this study, we propose a novel 3D Convolutional Neural Network Multilayer Perceptron (3D CNN-MLP) based model for AD classification. The model uses an attention mechanism to automatically extract relevant features from magnetic resonance images (MRI) to generate probability maps, which serve as input for the MLP classifier. Three MRI scan categories were considered: AD dementia patients, Mild Cognitive Impairment (MCI) patients, and Normal Control (NC) or healthy patients. The performance of the model is assessed by comparing it with basic CNN, VGG16, and DenseNet models and other state-of-the-art works. The models were adjusted to fit the 3D images before the comparison was done. Our model exhibited excellent classification performance, with an accuracy of 91.27% for AD and NC, 80.85% for MCI and NC, and 87.34% for AD and MCI.
https://doi.org/10.3837/tiis.2023.11.002

Performance Comparison of CNN-Based Image Classification Models for Drone Identification System (드론 식별 시스템을 위한 합성곱 신경망 기반 이미지 분류 모델 성능 비교)

YeongWan Kim;DaeKyun Cho;GunWoo Park
- The Journal of the Convergence on Culture Technology, v.10 no.4, pp.639-644, 2024
Recent developments in the use of drones on battlefields, extending beyond reconnaissance to firepower support, have greatly increased the importance of technologies for early automatic drone identification. In this study, to identify an effective image classification model that can distinguish drones from other aerial targets of similar size and appearance, such as birds and balloons, we utilized a dataset of 3,600 images collected from the internet. We adopted a transfer learning approach that combines the feature extraction capabilities of three pre-trained convolutional neural network models (VGG16, ResNet50, InceptionV3) with an additional classifier. Specifically, we conducted a comparative analysis of the performance of these three pre-trained models to determine the most effective one. The results showed that the InceptionV3 model achieved the highest accuracy at 99.66%. This research represents a new endeavor in utilizing existing convolutional neural network models and transfer learning for drone identification, which is expected to make a significant contribution to the advancement of drone identification technologies.
https://doi.org/10.17703/JCCT.2024.10.4.639

Principal Discriminant Variate (PDV) Method for Classification of Multicollinear Data: Application to Diagnosis of Mastitic Cows Using Near-Infrared Spectra of Plasma Samples

Jiang, Jian-Hui;Tsenkova, Roumiana;Yu, Ru-Qin;Ozaki, Yukihiro
- Proceedings of the Korean Society of Near Infrared Spectroscopy Conference, 2001.06a, pp.1244-1244, 2001
In linear discriminant analysis there are two important properties concerning the effectiveness of discriminant function modeling. The first is the separability of the discriminant function for different classes. The separability reaches its optimum by maximizing the ratio of between-class to within-class variance. The second is the stability of the discriminant function against noise present in the measurement variables. One can optimize the stability by exploring the discriminant variates in a principal variation subspace, i.e., the directions that account for a majority of the total variation of the data. An unstable discriminant function will exhibit inflated variance in the prediction of future unclassified objects and is exposed to a significantly increased risk of erroneous prediction. Therefore, an ideal discriminant function should not only separate different classes with a minimum misclassification rate for the training set, but also possess good stability such that the prediction variance for unclassified objects is as small as possible. In other words, an optimal classifier should find a balance between separability and stability. This is of special significance for multivariate spectroscopy-based classification, where multicollinearity always leads to discriminant directions located in low-spread subspaces. A new regularized discriminant analysis technique, the principal discriminant variate (PDV) method, has been developed for effectively handling multicollinear data commonly encountered in multivariate spectroscopy-based classification. The motivation behind this method is to seek a sequence of discriminant directions that not only optimize the separability between different classes, but also account for a maximized variation present in the data. Three different formulations for the PDV methods are suggested, and an effective computing procedure is proposed for a PDV method.
Near-infrared (NIR) spectra of blood plasma samples from mastitic and healthy cows have been used to evaluate the behavior of the PDV method in comparison with principal component analysis (PCA), discriminant partial least squares (DPLS), soft independent modeling of class analogies (SIMCA) and Fisher linear discriminant analysis (FLDA). Results obtained demonstrate that the PDV method exhibits improved stability in prediction without significant loss of separability. The NIR spectra of blood plasma samples from mastitic and healthy cows are clearly discriminated by the PDV method. Moreover, the proposed method provides superior performance to PCA, DPLS, SIMCA and FLDA, indicating that PDV is a promising tool in discriminant analysis of spectra-characterized samples with only small compositional difference, thereby providing a useful means for spectroscopy-based clinical applications.
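For reference, the separability criterion mentioned in the abstract above (the ratio of between-class to within-class variance) is commonly written as the Fisher criterion. The notation is an assumption made for illustration and is not taken from the paper: $w$ is a discriminant direction, $S_B$ and $S_W$ are the between-class and within-class scatter matrices, $C_k$ is the set of samples in class $k$, $n_k$ and $\bar{x}_k$ are the size and mean of that class, and $\bar{x}$ is the overall mean. The PDV formulations themselves are not given in the abstract and are not reproduced here.

```latex
\[
J(w) = \frac{w^{\top} S_B\, w}{w^{\top} S_W\, w},
\qquad
S_B = \sum_{k=1}^{K} n_k\, (\bar{x}_k - \bar{x})(\bar{x}_k - \bar{x})^{\top},
\qquad
S_W = \sum_{k=1}^{K} \sum_{i \in C_k} (x_i - \bar{x}_k)(x_i - \bar{x}_k)^{\top}
\]
```

With strongly multicollinear spectra, $S_W$ is close to singular, so maximizing $J(w)$ alone can favor directions along which the data vary very little. This is consistent with the instability in low-spread subspaces that the abstract describes.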

PRINCIPAL DISCRIMINANT VARIATE (PDV) METHOD FOR CLASSIFICATION OF MULTICOLLINEAR DATA WITH APPLICATION TO NEAR-INFRARED SPECTRA OF COW PLASMA SAMPLES

Jiang, Jian-Hui;Yuqing Wu;Yu, Ru-Qin;Yukihiro Ozaki
- Proceedings of the Korean Society of Near Infrared Spectroscopy Conference, 2001.06a, pp.1042-1042, 2001
In linear discriminant analysis there are two important properties concerning the effectiveness of discriminant function modeling. The first is the separability of the discriminant function for different classes. The separability reaches its optimum by maximizing the ratio of between-class to within-class variance. The second is the stability of the discriminant function against noise present in the measurement variables. One can optimize the stability by exploring the discriminant variates in a principal variation subspace, i.e., the directions that account for a majority of the total variation of the data. An unstable discriminant function will exhibit inflated variance in the prediction of future unclassified objects and is exposed to a significantly increased risk of erroneous prediction. Therefore, an ideal discriminant function should not only separate different classes with a minimum misclassification rate for the training set, but also possess good stability such that the prediction variance for unclassified objects is as small as possible. In other words, an optimal classifier should find a balance between separability and stability. This is of special significance for multivariate spectroscopy-based classification, where multicollinearity always leads to discriminant directions located in low-spread subspaces. A new regularized discriminant analysis technique, the principal discriminant variate (PDV) method, has been developed for effectively handling multicollinear data commonly encountered in multivariate spectroscopy-based classification. The motivation behind this method is to seek a sequence of discriminant directions that not only optimize the separability between different classes, but also account for a maximized variation present in the data. Three different formulations for the PDV methods are suggested, and an effective computing procedure is proposed for a PDV method.
Near-infrared (NIR) spectra of blood plasma samples from daily monitoring of two Japanese cows have been used to evaluate the behavior of the PDV method in comparison with principal component analysis (PCA), discriminant partial least squares (DPLS), soft independent modeling of class analogies (SIMCA) and Fisher linear discriminant analysis (FLDA). Results obtained demonstrate that the PDV method exhibits improved stability in prediction without significant loss of separability. The NIR spectra of blood plasma samples from the two cows are clearly discriminated by the PDV method. Moreover, the proposed method provides superior performance to PCA, DPLS, SIMCA and FLDA, indicating that PDV is a promising tool in discriminant analysis of spectra-characterized samples with only small compositional difference.
