• Title/Summary/Keyword: class imbalance

A quantification study of blood test results for dyspnea patients (호흡곤란 환자에 대한 혈액검사 결과들의 수량화 연구)

  • Park, Cheol-Yong
    • Journal of the Korean Data and Information Science Society / v.22 no.3 / pp.477-485 / 2011
  • Park et al. (2010) proposed a statistical model for deciding whether to admit or discharge 668 patients with a chief complaint of dyspnea, based on how many of 11 blood test results fall within the corresponding discharge intervals. Because this method does not take into account the importance of each blood test result, its performance may not be optimal. In this study, we employ a quantification method to evaluate the importance of those blood test results, and then provide a new statistical model that takes this importance into consideration. The results show that the new model performs slightly better than the model by Park et al. (2010).

A Novel Feature Selection Method in the Categorization of Imbalanced Textual Data

  • Pouramini, Jafar;Minaei-Bidgoli, Behrouze;Esmaeili, Mahdi
    • KSII Transactions on Internet and Information Systems (TIIS) / v.12 no.8 / pp.3725-3748 / 2018
  • Text data distribution is often imbalanced. Imbalanced data is one of the challenges in text classification, as it degrades classifier performance. Many studies have addressed this problem, and the proposed solutions fall into several general categories, including sampling-based and algorithm-based methods. Recent studies have also considered feature selection as a solution to the imbalance problem. In this paper, a novel one-sided feature selection method known as probabilistic feature selection (PFS) is presented for imbalanced text classification. PFS is a probabilistic method that is calculated using the feature distribution. Compared to similar methods, PFS has more parameters. To evaluate the proposed method, the feature selection methods Gini, MI, FAST and DFS were implemented for comparison, and classifiers such as the C4.5 decision tree and Naive Bayes were used. Test results on the Reuters-21578 and WebKB datasets, measured by F-measure, suggest that the proposed feature selection significantly improves classifier performance.

High Voltage SMPS Design based on Dual-Excitation Flyback Converter (이중 여자 플라이백 기반 고압 SMPS 설계)

  • Yang, Hee-Won;Kim, Seong-Ae;Park, Seong-Mi;Park, Sung-Jun
    • Journal of the Korean Society of Industry Convergence / v.20 no.2 / pp.115-124 / 2017
  • This paper aims to develop an SMPS topology for handling a high range of input voltages based on a DC-DC flyback converter circuit. For this purpose, two capacitors of the same specification were connected in series on the input terminal side, with a flyback converter of the same circuit configuration connected to each of them, so as to accept a high input voltage by dividing it. The series-connected flyback converters have a transformer turns ratio of 1:1, and their coils are wound on a single transformer, which is a characteristic of the doubly-fed configuration and enables the correction of input capacitor voltage imbalance. In addition, a pulse transformer was designed and fabricated to provide isolation and noise robustness for the PWM output signal of the PWM controller that applies gate voltage to the individual flyback converter switches. A PSIM simulation was carried out to verify this structure, and a 100 W-class stack was fabricated and used to confirm the feasibility of the proposed high-voltage SMPS topology.

Default Prediction for Real Estate Companies with Imbalanced Dataset

  • Dong, Yuan-Xiang;Xiao, Zhi;Xiao, Xue
    • Journal of Information Processing Systems / v.10 no.2 / pp.314-333 / 2014
  • When analyzing default predictions in real estate companies, the number of non-defaulted cases always greatly exceeds the number of defaulted ones, which creates a two-class imbalance problem. This lowers the ability of prediction models to distinguish the default samples. To avoid this sample selection bias and improve the prediction model, this paper applies a minority sample generation approach to create new minority samples. Logistic regression, support vector machine (SVM), and neural network (NN) classifiers trained on the imbalanced dataset serve as benchmarks for single prediction models trained on a dataset balanced by the minority sample generation approach. Instead of prediction-oriented tests and overall accuracy, the true positive rate (TPR), the true negative rate (TNR), G-mean, and F-score are used to measure the performance of default prediction models on imbalanced datasets. We describe an empirical experiment on a sample of 14 default and 315 non-default listed real estate companies in China and report that most single prediction models performed better with the balanced dataset than with the imbalanced one.
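
The four measures named in this abstract can be sketched in a few lines. This is only an illustration: the function name `imbalance_metrics` and the "predict everything as non-default" baseline are assumptions introduced here, not taken from the paper; only the 14/315 class counts come from the abstract.

```python
import math

def imbalance_metrics(y_true, y_pred, positive=1):
    """Return accuracy, TPR, TNR, G-mean, and F-score for binary labels."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    tn = sum(1 for t, p in pairs if t != positive and p != positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)

    tpr = tp / (tp + fn) if tp + fn else 0.0        # recall on the minority class
    tnr = tn / (tn + fp) if tn + fp else 0.0        # recall on the majority class
    precision = tp / (tp + fp) if tp + fp else 0.0
    f_score = (2 * precision * tpr / (precision + tpr)) if precision + tpr else 0.0
    return {
        "accuracy": (tp + tn) / len(pairs),
        "tpr": tpr,
        "tnr": tnr,
        "g_mean": math.sqrt(tpr * tnr),
        "f_score": f_score,
    }

# Hypothetical baseline: label all 329 companies as non-default (class 0).
y_true = [1] * 14 + [0] * 315
y_pred = [0] * 329
baseline = imbalance_metrics(y_true, y_pred)
```

Under this assumed baseline, accuracy is 315/329 (about 0.957) while TPR, G-mean, and F-score are all 0, which is why overall accuracy alone is a poor measure on such data.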

Semi-supervised Software Defect Prediction Model Based on Tri-training

  • Meng, Fanqi;Cheng, Wenying;Wang, Jingdong
    • KSII Transactions on Internet and Information Systems (TIIS) / v.15 no.11 / pp.4028-4042 / 2021
  • To address the difficulty of software defect prediction caused by insufficient labelled defect samples and class imbalance, a semi-supervised software defect prediction model is proposed that combines feature normalization, oversampling, and the Tri-training algorithm. First, feature normalization is used to smooth the feature data and eliminate the influence of excessively large or small feature values on the model's classification performance. Second, oversampling is used to expand the data, which addresses the class imbalance among the labelled samples. Finally, the Tri-training algorithm learns from the training samples and establishes a defect prediction model. The novelty of this model is that it effectively combines feature normalization, oversampling, and the Tri-training algorithm to address both the shortage of labelled samples and the class imbalance problem. Simulation experiments using the NASA software defect prediction dataset show that the proposed method outperforms four existing supervised and semi-supervised learning methods in terms of Precision, Recall, and F-Measure.
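
The first two steps described above (normalize, then oversample) might look roughly like the sketch below. The abstract does not say which normalization or oversampling variant the authors used, so min-max scaling and random duplication of minority rows are assumptions chosen here for simplicity; the function names are hypothetical, and the Tri-training step is not shown.

```python
import random

def min_max_normalize(rows):
    """Scale each feature column to [0, 1]; constant columns map to 0.0."""
    cols = list(zip(*rows))
    lows = [min(c) for c in cols]
    highs = [max(c) for c in cols]
    return [
        [(v - lo) / (hi - lo) if hi > lo else 0.0
         for v, lo, hi in zip(row, lows, highs)]
        for row in rows
    ]

def random_oversample(rows, labels, seed=0):
    """Duplicate randomly chosen rows of smaller classes until all classes match the largest."""
    rng = random.Random(seed)
    by_class = {}
    for row, label in zip(rows, labels):
        by_class.setdefault(label, []).append(row)
    target = max(len(members) for members in by_class.values())

    out_rows, out_labels = [], []
    for label, members in by_class.items():
        extra = [rng.choice(members) for _ in range(target - len(members))]
        for row in members + extra:
            out_rows.append(row)
            out_labels.append(label)
    return out_rows, out_labels

# Toy data: three non-defective modules (0) and one defective module (1).
features = [[0, 10], [5, 20], [10, 30], [2, 10]]
labels = [0, 0, 0, 1]
scaled = min_max_normalize(features)
balanced_rows, balanced_labels = random_oversample(scaled, labels)
```

After oversampling, the toy set holds three rows per class, so a downstream learner no longer sees a 3:1 skew in the labelled data.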

A Study on the Improvement of Image Classification Performance in the Defense Field through Cost-Sensitive Learning of Imbalanced Data (불균형데이터의 비용민감학습을 통한 국방분야 이미지 분류 성능 향상에 관한 연구)

  • Jeong, Miae;Ma, Jungmok
    • Journal of the Korea Institute of Military Science and Technology / v.24 no.3 / pp.281-292 / 2021
  • With the development of deep learning technology, researchers and practitioners keep attempting to apply deep learning in various industrial and academic fields, including defense. Most of these attempts assume that the data are balanced. In reality, much of the data is imbalanced, so the classifier is not built properly and the model's performance can be low. Therefore, this study proposes cost-sensitive learning as a solution to the imbalanced data problem in image classification for the defense field. In the proposed model, cost-sensitive learning assigns a high weight to the minority class in the cost function. Across 160 experiments using a submarine/non-submarine dataset and a warship/non-warship dataset, the test F1-score is higher with cost-sensitive learning than with general learning. Furthermore, statistical tests are conducted and show that the results are significant.
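
Giving the minority class a higher weight in the cost function can be illustrated with a weighted binary cross-entropy. The abstract does not state the weighting scheme, so the inverse-frequency weights below are an assumption, and `class_weights` and `weighted_bce` are hypothetical helper names.

```python
import math

def class_weights(labels):
    """Inverse-frequency weights: n_samples / (n_classes * class_count)."""
    counts = {}
    for y in labels:
        counts[y] = counts.get(y, 0) + 1
    n, k = len(labels), len(counts)
    return {c: n / (k * cnt) for c, cnt in counts.items()}

def weighted_bce(labels, probs, weights, eps=1e-12):
    """Weighted binary cross-entropy; probs are predicted P(y = 1)."""
    total, weight_sum = 0.0, 0.0
    for y, p in zip(labels, probs):
        p = min(max(p, eps), 1 - eps)              # avoid log(0)
        loss = -math.log(p) if y == 1 else -math.log(1 - p)
        total += weights[y] * loss
        weight_sum += weights[y]
    return total / weight_sum

# Toy data: nine majority samples (0) and one minority sample (1).
labels = [0] * 9 + [1]
weights = class_weights(labels)

# Case A misses the single minority sample; case B misses one majority sample.
loss_miss_minority = weighted_bce(labels, [0.1] * 10, weights)
loss_miss_majority = weighted_bce(labels, [0.1] * 8 + [0.9, 0.9], weights)
```

With these assumed weights the minority class gets weight 5.0 and the majority class about 0.56, so missing the one minority sample is penalized more heavily than missing one majority sample.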

A Study on Optimization of Classification Performance through Fourier Transform and Image Augmentation (푸리에 변환 및 이미지 증강을 통한 분류 성능 최적화에 관한 연구)

  • Kihyun Kim;Seong-Mok Kim;Yong Soo Kim
    • Journal of Korean Society for Quality Management / v.51 no.1 / pp.119-129 / 2023
  • Purpose: This study proposes a classification model for implementing condition-based maintenance (CBM) by monitoring the real-time status of a machine using acceleration sensor data collected from a vehicle. Methods: The classification model's performance was improved by applying Fourier transform to convert the acceleration sensor data from the time domain to the frequency domain. Additionally, the Generative Adversarial Network (GAN) algorithm was used to augment images and further enhance the classification model's performance. Results: Experimental results demonstrate that the GAN algorithm can effectively serve as an image augmentation technique to enhance the performance of the classification model. Consequently, the proposed approach yielded a significant improvement in the classification model's accuracy. Conclusion: While this study focused on the effectiveness of the GAN algorithm as an image augmentation method, further research is necessary to compare its performance with other image augmentation techniques. Additionally, it is essential to consider the potential for performance degradation due to class imbalance and conduct follow-up studies to address this issue.
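
The time-to-frequency conversion mentioned in the Methods can be sketched with a plain discrete Fourier transform. The synthetic 5 Hz sine wave, the 64 Hz sampling rate, and the function name `dft_magnitudes` are assumptions for illustration; the paper's actual sensor data, image construction, and GAN augmentation are not reproduced here.

```python
import cmath
import math

def dft_magnitudes(signal):
    """Return normalized DFT magnitudes for frequency bins 0..N//2."""
    n = len(signal)
    mags = []
    for k in range(n // 2 + 1):
        coeff = sum(x * cmath.exp(-2j * math.pi * k * t / n)
                    for t, x in enumerate(signal))
        mags.append(abs(coeff) / n)
    return mags

# Hypothetical acceleration trace: one second of a 5 Hz vibration sampled at 64 Hz.
sample_rate = 64
signal = [math.sin(2 * math.pi * 5 * t / sample_rate) for t in range(sample_rate)]

mags = dft_magnitudes(signal)
# With one second of data, bin k corresponds to k Hz.
peak_bin = max(range(len(mags)), key=mags.__getitem__)
```

Here the dominant bin is 5, matching the vibration frequency that is hard to read off the raw time series. In practice a library FFT would replace this O(N²) loop.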

Two-Branch Classifier for Retinal Imaging Analysis (망막 영상 분석을 위한 두 갈래 분류기)

  • Oh, Young-tack;Park, Hyunjin
    • Proceedings of the Korean Institute of Information and Communication Sciences Conference / 2021.05a / pp.614-616 / 2021
  • The world faces difficulties in eye care, including treatment, quality of prevention, vision rehabilitation services, and a scarcity of trained eye care experts. However, it is difficult to develop a method for classifying various ocular diseases because existing public retinal image datasets do not cover the variety of diseases found in clinical practice. We propose a method for classifying ocular diseases using the Retinal Fundus Multi-disease Image Dataset (RFMiD), a dataset published in the ISBI-2021 challenge. Our goal is to develop a robust and generalizable model for screening retinal images into normal and abnormal categories. The proposed model achieves an area under the curve (AUC) score of 0.9782 on the test dataset.
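
For readers unfamiliar with the AUC score reported above, it can be computed as the probability that a randomly chosen abnormal image receives a higher score than a randomly chosen normal one. The sketch below uses made-up scores and a hypothetical function name `roc_auc`; it is not the authors' evaluation code.

```python
def roc_auc(labels, scores):
    """AUC via pairwise comparison: P(score of positive > score of negative), ties count half."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative sample")
    wins = 0.0
    for p in pos:
        for q in neg:
            if p > q:
                wins += 1.0
            elif p == q:
                wins += 0.5
    return wins / (len(pos) * len(neg))

# Made-up example: 0 = normal, 1 = abnormal, scores = predicted abnormality.
example_labels = [0, 0, 1, 1]
example_scores = [0.1, 0.4, 0.35, 0.8]
example_auc = roc_auc(example_labels, example_scores)  # 3 of 4 pairs ranked correctly
```

Because AUC depends only on the ranking of scores, it does not require choosing a decision threshold and is less sensitive to class proportions than accuracy.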

Cost-Sensitive Learning for Cardio-Cerebrovascular Disease Risk Prediction (심혈관질환 위험 예측을 위한 비용민감 학습 모델)

  • Yu Na Lee;Kyung-Hee Lee;Wan-Sup Cho
    • The Journal of Bigdata / v.6 no.2 / pp.161-168 / 2021
  • In this study, we propose a cardiovascular disease prediction model using machine learning. First, a multidimensional analysis of various differences between the two groups is performed and the results are visualized. In particular, we propose a predictive model using cost-sensitive learning, which can improve sensitivity when there is a high class imbalance between the normal and patient groups, as is typical of disease data. The predictive model is developed using CART and XGBoost, two representative machine learning techniques, and their predictions and performance are compared on cardiovascular disease patient data. According to the study results, CART showed higher accuracy and specificity than XGBoost, with an accuracy of about 70% to 74%.

A Study of Analysis on the Menu Concept of the Hotel Semi Buffet Restaurants - Focusing on the 1st class hotels in Seoul - (호텔 세미뷔페 레스토랑의 메뉴 컨셉 분석 - 서울시내 특1급 호텔을 중심으로 -)

  • Min, Kye-Hong;Choi, Young-Ki
    • Journal of the Korean Society of Food Culture / v.22 no.5 / pp.597-602 / 2007
  • The hotel industry faces growing management difficulties caused by rising costs and labor costs, the imbalance between supply and demand, and stiffening competition between hotels. Hotels have therefore planned major changes to attract customers, breaking away from existing forms of management in order to secure competitiveness in the food and beverage field. For that purpose, we investigate preferences for buffet restaurants in ten five-star hotels in Seoul. Through this analysis, we also present the menu concepts that stand out and are preferred by customers in managing semi-buffet restaurants. Because the linear and planar coordinate values of the H Hotels and I Hotels both came out positive (+) in a similarity analysis using MOS, we can predict that they are positioned on the same dimension. Furthermore, from an analysis of the menus most preferred by customers at each station of a semi-buffet restaurant, we can predict that antipasto, sushi, sashimi and desserts are positioned on the same dimension. Based on these results, the menus of buffet restaurants require continuous supervision.