• Title/Summary/Keyword: Imbalance training

Ensemble Learning with Support Vector Machines for Bond Rating (회사채 신용등급 예측을 위한 SVM 앙상블학습)

  • Kim, Myoung-Jong
    • Journal of Intelligence and Information Systems
    • /
    • v.18 no.2
    • /
    • pp.29-45
    • /
    • 2012
  • Bond rating is regarded as an important event for measuring the financial risk of companies and for determining the investment returns of investors. As a result, predicting companies' credit ratings with statistical and machine learning techniques has been a popular research topic. Statistical techniques, including multiple regression, multiple discriminant analysis (MDA), logistic models (LOGIT), and probit analysis, have traditionally been used in bond rating. However, one major drawback is that they rest on strict assumptions. These include linearity, normality, independence among predictor variables, and pre-existing functional forms relating the criterion variables and the predictor variables. Such assumptions have limited the application of traditional statistics to the real world. Machine learning techniques used in bond rating prediction models include decision trees (DT), neural networks (NN), and the Support Vector Machine (SVM). In particular, SVM is recognized as a new and promising classification and regression analysis method. SVM learns a separating hyperplane that maximizes the margin between two categories. SVM is simple enough to be analyzed mathematically and achieves high performance in practical applications. SVM implements the structural risk minimization principle and seeks to minimize an upper bound of the generalization error. In addition, the solution of SVM may be a global optimum, and thus overfitting is unlikely to occur with SVM. Furthermore, SVM does not require many data samples for training, since it builds prediction models using only representative samples near the boundaries, called support vectors. A number of experimental studies have indicated that SVM has been successfully applied in a variety of pattern recognition fields. However, there are three major drawbacks that can degrade SVM's performance. 
First, SVM was originally proposed for solving binary classification problems. Methods for combining SVMs for multi-class classification, such as One-Against-One and One-Against-All, have been proposed, but they do not perform as well in multi-class classification as SVM does in binary classification. Second, approximation algorithms (e.g., decomposition methods and the sequential minimal optimization algorithm) can be used to reduce computation time in multi-class problems, but they can deteriorate classification performance. Third, multi-class prediction suffers from the data imbalance problem, which occurs when the number of instances in one class greatly outnumbers the number of instances in another class. Such data sets often yield a default classifier with a skewed boundary and thus reduced classification accuracy. SVM ensemble learning is one machine learning method for coping with the above drawbacks. Ensemble learning is a method for improving the performance of classification and prediction algorithms. AdaBoost is one of the most widely used ensemble learning techniques. It constructs a composite classifier by sequentially training classifiers while increasing the weight on misclassified observations through iterations. Observations that are incorrectly predicted by previous classifiers are chosen more often than those that are correctly predicted. Thus boosting attempts to produce new classifiers that are better able to predict examples for which the current ensemble's performance is poor. In this way, it can reinforce the training of misclassified observations of the minority class. This paper proposes a multiclass Geometric Mean-based Boosting (MGM-Boost) to resolve the multiclass prediction problem. 
Since MGM-Boost introduces the notion of the geometric mean into AdaBoost, its learning process can consider the geometric mean-based accuracy and errors across multiple classes. This study applies MGM-Boost to a real-world bond rating case for Korean companies to examine its feasibility. 10-fold cross-validation is performed three times with different random seeds to ensure that the comparison among the three classifiers does not happen by chance. For each 10-fold cross-validation, the entire data set is first partitioned into ten equal-sized sets, and then each set is in turn used as the test set while the classifier trains on the other nine sets. That is, cross-validated folds have been tested independently of each algorithm. Through these steps, we have obtained results for the classifiers on each of the 30 experiments. In the comparison of arithmetic mean-based prediction accuracy between individual classifiers, MGM-Boost (52.95%) shows higher prediction accuracy than both AdaBoost (51.69%) and SVM (49.47%). MGM-Boost (28.12%) also shows higher prediction accuracy than AdaBoost (24.65%) and SVM (15.42%) in terms of geometric mean-based prediction accuracy. A t-test is used to examine whether the performance of each classifier over the 30 folds is significantly different. The results indicate that the performance of MGM-Boost is significantly different from that of the AdaBoost and SVM classifiers at the 1% level. These results mean that MGM-Boost can provide robust and stable solutions to multi-class problems such as bond rating.
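
The gap between the arithmetic and geometric mean-based accuracies reported above can be illustrated with a minimal sketch. It assumes that geometric mean-based accuracy means the geometric mean of per-class recalls, which is a common definition but is not spelled out in the abstract. The rating labels and predictions are hypothetical.

```python
from collections import defaultdict
from math import prod


def per_class_recall(y_true, y_pred):
    """Return {class: fraction of that class's samples predicted correctly}."""
    total = defaultdict(int)
    correct = defaultdict(int)
    for truth, guess in zip(y_true, y_pred):
        total[truth] += 1
        if truth == guess:
            correct[truth] += 1
    return {label: correct[label] / total[label] for label in total}


def arithmetic_accuracy(y_true, y_pred):
    """Plain accuracy: share of all samples predicted correctly."""
    hits = sum(t == p for t, p in zip(y_true, y_pred))
    return hits / len(y_true)


def geometric_mean_accuracy(y_true, y_pred):
    """Geometric mean of per-class recalls (assumed definition).

    The value is zero as soon as one class is missed entirely.
    """
    recalls = list(per_class_recall(y_true, y_pred).values())
    return prod(recalls) ** (1 / len(recalls))


# Hypothetical ratings: class "A" dominates, "C" is rare.
y_true = ["A"] * 8 + ["B"] * 3 + ["C"] * 1
always_a = ["A"] * 12  # a default classifier that ignores minority classes
balanced = ["A"] * 6 + ["B"] * 2 + ["B"] * 2 + ["A"] + ["C"]

for name, y_pred in [("always_a", always_a), ("balanced", balanced)]:
    print(name,
          round(arithmetic_accuracy(y_true, y_pred), 3),
          round(geometric_mean_accuracy(y_true, y_pred), 3))
```

The default classifier still reaches a moderate arithmetic accuracy but a geometric mean of zero, which is why a geometric mean-based criterion penalizes ignoring minority classes.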

A Study on Improvement of the Pilot Certification System for stabilizing Supply and Demand of Harbour Pilots (도선사수급안정화를 위한 도선사 자격제도 개선에 관한 연구)

  • Jeon, Yeong-Woo;Kim, Tae-goun;Ji, Sangwon;Kim, JinKwan
    • Journal of the Korean Society of Marine Environment & Safety
    • /
    • v.23 no.7
    • /
    • pp.834-846
    • /
    • 2017
  • An increase in the number of retiring experienced pilots, together with the rapid aging of new pilots, will deepen the imbalance between the supply of and demand for pilots over the next 7 years, which could cause serious problems for the safety of harbour pilotage. This study proposes improvements to the legal system to ease the imbalance between supply and demand of pilots and to help secure more experienced pilots. A survey and analysis of the current state, statistical analysis, a questionnaire survey on foreign countries, and in-depth consultation with experts were carried out to support this research. The study first proposes an amendment relaxing the minimum requirement of 5 years of seagoing service as a master to sit for the pilot exam to 2 years (which must include at least 1 year of master's seagoing service within the most recent 5 years), while raising the minimum requirement of pilotage service for obtaining a higher class of pilot certificate from 1 year to 1 year and 6 months. Second, to rationalize the additional incentive point system, it proposes an amendment granting 1 additional point per year of seagoing service as a master beyond the minimum period of 2 years, up to a maximum of 10 points. To secure experienced pilots and resolve the legal conflict between the certificate revalidation system and the retirement system, it also proposes an amendment revoking the retirement system and, where an applicant is over a certain age, limiting the validity of any certificate issued or revalidated to the age of 68. Promotional work, such as collecting opinions from interested parties and generating positive public awareness, should be carried out in the future. A study on the training pilot exam system will also be necessary.

Demand Characteristics and Analysis of Changes in Spatial Accessibility of Public Sports Facilities (공공체육시설 수요특성 및 공간적 접근성 분석)

  • Kim, Seong-Hee;Kim, Yong-Jin
    • The Journal of the Korea Contents Association
    • /
    • v.17 no.7
    • /
    • pp.283-293
    • /
    • 2017
  • This study analyzed the actual use of public sports facilities and the characteristics of their users through surveys, and measured the spatial imbalance of the currently supplied public sports facilities using a gravity potential model. It also suggests evaluation criteria that may be considered for efficient location selection by examining the future change in accessibility to facilities that meet users' needs. According to the questionnaire survey, unlike current usage, users hoped for badminton, weight training, and swimming, which confirmed the demand for expanding multi-purpose indoor gyms in the area where such activities can be carried out. The analysis of differences in accessibility to public sports facilities showed large variations among regions, indicating that a balanced supply of facilities is needed in terms of equity. In particular, when the population estimates for 2025 are considered, accessibility is projected to fall to about 60% of its 2015 level. In addition, locating facilities in the Munsan area, where population growth is expected, is evaluated as the best alternative in terms of overall efficiency.

A Hybrid Under-sampling Approach for Better Bankruptcy Prediction (부도예측 개선을 위한 하이브리드 언더샘플링 접근법)

  • Kim, Taehoon;Ahn, Hyunchul
    • Journal of Intelligence and Information Systems
    • /
    • v.21 no.2
    • /
    • pp.173-190
    • /
    • 2015
  • The purpose of this study is to improve bankruptcy prediction models by using a novel hybrid under-sampling approach. Most prior studies have tried to enhance the accuracy of bankruptcy prediction models by improving the classification methods involved. In contrast, we focus on appropriate data preprocessing as a means of enhancing accuracy. In particular, we aim to develop an effective sampling approach for bankruptcy prediction, since most prediction models suffer from class imbalance problems. The approach proposed in this study is a hybrid under-sampling method that combines the k-Reverse Nearest Neighbor (k-RNN) and one-class support vector machine (OCSVM) approaches. k-RNN can effectively eliminate outliers, while OCSVM contributes to the selection of informative training samples from majority class data. To validate our proposed approach, we have applied it to data from H Bank's non-external auditing companies in Korea, and compared the performances of the classifiers with the proposed under-sampling and random sampling data. The empirical results show that the proposed under-sampling approach generally improves the accuracy of classifiers, such as logistic regression, discriminant analysis, decision tree, and support vector machines. They also show that the proposed under-sampling approach reduces the risk of false negative errors, which lead to higher misclassification costs.
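
The outlier-removal half of the hybrid approach can be sketched as follows. This is only one plausible reading of k-RNN filtering: a point is treated as an outlier when no other point lists it among its k nearest neighbors. The threshold `min_count`, the sample coordinates, and all function names are assumptions, and the OCSVM selection step is not shown.

```python
from math import dist


def k_nearest(points, i, k):
    """Indices of the k points closest to points[i], excluding i itself."""
    others = [j for j in range(len(points)) if j != i]
    others.sort(key=lambda j: dist(points[i], points[j]))
    return others[:k]


def reverse_neighbor_counts(points, k):
    """For each point, count how many other points list it among their k nearest."""
    counts = [0] * len(points)
    for i in range(len(points)):
        for j in k_nearest(points, i, k):
            counts[j] += 1
    return counts


def drop_outliers(points, k, min_count=1):
    """Keep points that appear in at least min_count neighbor lists (assumed rule)."""
    counts = reverse_neighbor_counts(points, k)
    return [p for p, c in zip(points, counts) if c >= min_count]


# Hypothetical majority-class samples: a tight cluster plus one distant point.
majority = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (0.1, 0.1), (5.0, 5.0)]
kept = drop_outliers(majority, k=2)
print(kept)  # the distant point is nobody's near neighbor, so it is dropped
```

This brute-force version compares every pair of points, so it suits only small samples; a spatial index would be the usual choice for larger data.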

Occlusal Analysis in the Policemen with Temporomandibular Disorders Using T-scan II System (경찰 종사자의 측두하악장애환자에서 T-scan II System을 이용한 교합분석)

  • Lim, Hyun-Dae;Jung, Seung-Ah;Lee, You-Mee
    • Journal of Oral Medicine and Pain
    • /
    • v.31 no.4
    • /
    • pp.365-373
    • /
    • 2006
  • This study analysed occlusal contact time, occlusal contact number, and occlusal force by evaluating the occlusal pattern of policemen with temporomandibular disorders, in order to suggest correction of excessive mouth opening or maximum occlusal contact. Conditions specific to the police community that affect mouth opening and maximal occlusal contact may influence the development and progress of temporomandibular disorders. Repeated training or changes in lifestyle may cause imbalance of the stomatognathic system, including the masticatory muscles, and then produce or aggravate pain in the temporomandibular joints and associated structures. This study used the T-scan II system (Tekscan Co., USA) to evaluate occlusal patterns that may influence temporomandibular disorders; subjects were fitted with the sensor at 20 mm of mouth opening to record maximal occlusal contact force. Compared with the general population, policemen with temporomandibular disorders showed a longer time to maximum contact, a shorter time to end contact, and greater end contact force. They thus close the mouth in a shorter time and with more force, transferring the remaining load to the temporomandibular joint. There was no statistically significant relationship between the affected side and the occlusal pattern in terms of occlusal contact time and force. Left-right dental arch imbalance appeared toward the right dental arch when the affected side was right and toward the left dental arch when the affected side was left. Given these results, it is worth considering that policemen with temporomandibular disorders should achieve smoother mandibular movement and less force at the maximal occlusal contact position.

Development of Classification Model for hERG Ion Channel Inhibitors Using SVM Method (SVM 방법을 이용한 hERG 이온 채널 저해제 예측모델 개발)

  • Gang, Sin-Moon;Kim, Han-Jo;Oh, Won-Seok;Kim, Sun-Young;No, Kyoung-Tai;Nam, Ky-Youb
    • Journal of the Korean Chemical Society
    • /
    • v.53 no.6
    • /
    • pp.653-662
    • /
    • 2009
  • Developing effective tools for predicting the absorption, distribution, metabolism, excretion, and toxicity (ADME/T) properties of new chemical entities in the early stage of drug design is one of the most important tasks in drug discovery and development today. As one such attempt, support vector machines (SVM) have recently been exploited for the prediction of ADME/T-related properties. However, two problems in SVM modeling, feature selection and parameter setting, are still far from solved. Both have been shown to be crucial to the efficiency and accuracy of SVM classification. In particular, feature selection and optimal SVM parameter setting influence each other, which indicates that they should be dealt with simultaneously. In this account, we present an integrated practical solution in which a genetic-based algorithm (GA) is used for feature selection and a grid search (GS) method for parameter optimization. hERG ion-channel inhibitor classification models were built to assess and test the proposed GA-GS-SVM. We generated 6 models, 3 single models and 3 ensemble models, using a training set of 1891 compounds and validated them with an external test set of 175 compounds. We compared the single models with the ensemble models to address the data imbalance problem. Using the ensemble models improved prediction accuracy.
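
The grid search (GS) step mentioned above can be sketched generically. The parameter names `C` and `gamma`, their ranges, and the scoring function are all hypothetical stand-ins. In a real pipeline the score would be a cross-validated SVM accuracy computed on the feature subset proposed by the genetic algorithm, which is not reproduced here.

```python
from itertools import product


def grid_search(score, grid):
    """Evaluate score(**params) on every combination in grid; return the best."""
    names = list(grid)
    best_params, best_score = None, float("-inf")
    for values in product(*(grid[name] for name in names)):
        params = dict(zip(names, values))
        value = score(**params)
        if value > best_score:
            best_params, best_score = params, value
    return best_params, best_score


def toy_validation_score(C, gamma):
    """Stand-in for cross-validated accuracy; peaks at C=10, gamma=0.01."""
    return 1.0 - 0.05 * abs(C - 10) / 10 - 5.0 * abs(gamma - 0.01)


# Hypothetical search ranges for two SVM parameters.
param_grid = {"C": [0.1, 1, 10, 100], "gamma": [0.001, 0.01, 0.1]}
best_params, best_score = grid_search(toy_validation_score, param_grid)
print(best_params, best_score)
```

Because the abstract notes that feature selection and parameter setting influence each other, a search like this would typically be rerun for each candidate feature subset rather than once up front.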

Effect on the Activity and Ratio of the Serratus Anterior, Pectoralis Major, and Upper Trapezius according to the Angle of Abduction and External Weight During Shoulder Protraction Exercise for Winged Scapular Subjects (날개 어깨뼈 대상자들에게 어깨 내밈 운동시 벌림 각도와 외부 무게에 따른 앞톱니근, 큰가슴근, 위 등세모근의 활성도 및 비율에 미치는 영향)

  • BadamKhorl, Yadam;Kim, Tae-ho;Park, Han-kyu
    • Physical Therapy Korea
    • /
    • v.26 no.3
    • /
    • pp.1-10
    • /
    • 2019
  • Background: Winged scapula (WS) causes muscle imbalance with abnormal patterns when moving the arm. In particular, over-activation of the upper trapezius (UT) and decreased activity of the lower trapezius (LT) and serratus anterior (SA) produce abnormal scapulohumeral rhythm. Therefore, the SA requires special attention in all shoulder rehabilitation programs. In fact, many previous studies have been devoted to the SA muscle strength training needed for WS correction. Objectives: The purpose of this study was to investigate the effect of the angle of shoulder abduction and external weight on shoulder girdle muscle activity and ratio in the supine position. Methods: Twenty-three WS patients participated in this experiment. They performed scapular protraction exercise in the supine position with weights of 0 kg, 1 kg, 1.5 kg, and 2 kg at shoulder abduction angles of $0^{\circ}$, $30^{\circ}$, $60^{\circ}$, and $90^{\circ}$. The angle and weight applications were randomized. Surface electromyography (EMG) was used to collect EMG data from the SA, pectoralis major (PM), and UT during the exercise. The ratios of PM/SA and UT/SA were determined. Two-way repeated-measures analyses of variance were used to determine the statistical significance of SA, PM, and UT and the ratios of PM/SA and UT/SA. Results: There was a significant difference in SA according to angle (p<.05). Significant differences were also identified depending on the angle and weight (p<.05). Abduction angles of $0^{\circ}$ and $30^{\circ}$ with a weight of 2 kg showed the highest SA activity. However, there was no significant difference between PM and UT (p>.05). There was a significant difference between PM/SA and UT/SA in the ratio of muscle activity according to angle (p<.05). Significant differences were found for PM/SA at angles of $30^{\circ}$, $60^{\circ}$, and $90^{\circ}$ (p<.05). For UT/SA, a significant difference was observed only at $90^{\circ}$ (p<.05). 
Conclusion: Based on the results of this study, in order to strengthen the SA, it was found to be most effective to use 1 and 1.5 kg weights with abduction angles of $0^{\circ}$ and $30^{\circ}$ at shoulder protraction in supine position.

The Performance Improvement of U-Net Model for Landcover Semantic Segmentation through Data Augmentation (데이터 확장을 통한 토지피복분류 U-Net 모델의 성능 개선)

  • Baek, Won-Kyung;Lee, Moung-Jin;Jung, Hyung-Sup
    • Korean Journal of Remote Sensing
    • /
    • v.38 no.6_2
    • /
    • pp.1663-1676
    • /
    • 2022
  • Recently, a number of deep learning-based land cover segmentation studies have been introduced. Some studies reported that land cover segmentation performance deteriorated due to insufficient training data. In this study, we verified the improvement in land cover segmentation performance through data augmentation. U-Net was implemented as the segmentation model, and a 2020 satellite-derived land cover dataset was used as the study data. The pixel accuracies were 0.905 and 0.923 for U-Net trained on the original and augmented data, respectively. The mean F1 scores of those models were 0.720 and 0.775, respectively, indicating better performance with data augmentation. In addition, the F1 scores for the building, road, paddy field, upland field, forest, and unclassified area classes were 0.770, 0.568, 0.433, 0.455, 0.964, and 0.830 for the U-Net trained on the original data. Data augmentation proved effective in that the F1 scores of every class improved, to 0.838, 0.660, 0.791, 0.530, 0.969, and 0.860, respectively. Although we applied data augmentation without considering class balance, the comparison between the two models shows that data augmentation can mitigate biased segmentation performance caused by data imbalance. This study is expected to help demonstrate the importance and effectiveness of data augmentation in various image processing fields.
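
Per-class and mean F1 scores of the kind reported above can be computed as in the following minimal sketch. It assumes the mean F1 is an unweighted (macro) average over classes, which the abstract does not state explicitly. The pixel labels are hypothetical and far smaller than a real segmentation map.

```python
def f1_per_class(y_true, y_pred):
    """Return {class: F1}, where F1 = 2*TP / (2*TP + FP + FN)."""
    scores = {}
    for label in sorted(set(y_true) | set(y_pred)):
        tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
        fp = sum(t != label and p == label for t, p in zip(y_true, y_pred))
        fn = sum(t == label and p != label for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        scores[label] = 2 * tp / denom if denom else 0.0
    return scores


def mean_f1(scores):
    """Unweighted (macro) mean, so rare classes count as much as common ones."""
    return sum(scores.values()) / len(scores)


def pixel_accuracy(y_true, y_pred):
    """Share of pixels whose predicted class matches the reference class."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


# Hypothetical pixel labels: "forest" is common, "road" is rare.
y_true = ["forest"] * 6 + ["road"] * 2
y_pred = ["forest"] * 6 + ["forest", "road"]
scores = f1_per_class(y_true, y_pred)
print(scores, mean_f1(scores), pixel_accuracy(y_true, y_pred))
```

In this toy case pixel accuracy is higher than the mean F1 because the rare class is half missed, which mirrors how imbalance can hide weak minority-class performance behind a high overall accuracy.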

Experimental Comparison of Network Intrusion Detection Models Solving Imbalanced Data Problem (데이터의 불균형성을 제거한 네트워크 침입 탐지 모델 비교 분석)

  • Lee, Jong-Hwa;Bang, Jiwon;Kim, Jong-Wouk;Choi, Mi-Jung
    • KNOM Review
    • /
    • v.23 no.2
    • /
    • pp.18-28
    • /
    • 2020
  • With the development of the virtual community, the benefits that IT provides to people in fields such as healthcare, industry, communication, and culture are increasing, and the quality of life is also improving. Accordingly, various malicious attacks target the developed network environment. Firewalls and intrusion detection systems exist to detect these attacks in advance, but there is a limit to detecting malicious attacks that evolve day by day. To solve this problem, intrusion detection research using machine learning is being actively conducted, but false positives and false negatives occur due to imbalance in the training dataset. In this paper, a Random Oversampling method is used to resolve the imbalance of the UNSW-NB15 dataset used for network intrusion detection. Through experiments, we compared and analyzed the accuracy, precision, recall, F1-score, training and prediction time, and hardware resource consumption of the models. Building on this study, we plan to develop more efficient network intrusion detection models using other methods for handling imbalanced data and higher-performance models.
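
Random oversampling, as named in the abstract above, duplicates minority-class records until the classes are balanced. The sketch below is a minimal standard-library version. The record layout, labels, and function name are hypothetical, and it is not the authors' implementation.

```python
import random
from collections import Counter


def random_oversample(rows, labels, seed=0):
    """Duplicate randomly chosen minority rows until every class matches the largest."""
    rng = random.Random(seed)
    by_class = {}
    for row, label in zip(rows, labels):
        by_class.setdefault(label, []).append(row)
    target = max(len(group) for group in by_class.values())
    out_rows, out_labels = [], []
    for label, group in by_class.items():
        # Sample with replacement from the class's own rows to fill the gap.
        extra = [rng.choice(group) for _ in range(target - len(group))]
        out_rows.extend(group + extra)
        out_labels.extend([label] * target)
    return out_rows, out_labels


# Hypothetical flow records: 6 normal, 2 attack.
rows = [[i] for i in range(8)]
labels = ["normal"] * 6 + ["attack"] * 2
new_rows, new_labels = random_oversample(rows, labels)
counts = Counter(new_labels)
print(counts)
```

Oversampling is normally applied to the training split only. Duplicating records before splitting would let copies of the same record land in both training and test data and inflate the measured scores.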

Machine learning-based corporate default risk prediction model verification and policy recommendation: Focusing on improvement through stacking ensemble model (머신러닝 기반 기업부도위험 예측모델 검증 및 정책적 제언: 스태킹 앙상블 모델을 통한 개선을 중심으로)

  • Eom, Haneul;Kim, Jaeseong;Choi, Sangok
    • Journal of Intelligence and Information Systems
    • /
    • v.26 no.2
    • /
    • pp.105-129
    • /
    • 2020
  • This study uses corporate data from 2012 to 2018, when K-IFRS was applied in earnest, to predict default risks. The data used in the analysis totaled 10,545 rows and 160 columns, including 38 from the statement of financial position, 26 from the statement of comprehensive income, 11 from the statement of cash flows, and 76 financial ratio indices. Unlike most prior studies, which used the default event as the basis for learning about default risk, this study calculated default risk from the market capitalization and stock price volatility of each company based on the Merton model. This addressed the data imbalance caused by the scarcity of default events, which had been pointed out as a limitation of the existing methodology, and made it possible to reflect differences in default risk among ordinary companies. Because learning used only corporate information that is also available for unlisted companies, the default risks of unlisted companies without stock price information can be appropriately derived. The approach can therefore provide stable default risk assessment services to companies whose default risk is difficult to determine with traditional credit rating models, such as small and medium-sized companies and startups. Although predicting corporate default risk with machine learning has been studied actively in recent years, model bias issues exist because most studies make predictions based on a single model. A stable and reliable valuation methodology is required for calculating default risk, given that an entity's default risk information is very widely used in the market and sensitivity to differences in default risk is high. Strict standards are also required for the calculation methods. 
The credit rating method stipulated by the Financial Services Commission in the Financial Investment Regulations calls for the preparation of evaluation methods, including verification of their adequacy, in consideration of past statistical data and experience on credit ratings and changes in future market conditions. This study reduced the bias of individual models by using stacking ensemble techniques that combine various machine learning models. This makes it possible to capture complex nonlinear relationships between default risk and various corporate information and to maximize the advantages of machine learning-based default risk prediction models, which take less time to calculate. To produce the sub-model forecasts used as input data for the Stacking Ensemble model, the training data were divided into seven pieces, and the sub-models were trained on the divided sets to produce forecasts. To compare the predictive power of the Stacking Ensemble model, Random Forest, MLP, and CNN models were trained on the full training data, and the predictive power of each model was then verified on the test set. The analysis showed that the Stacking Ensemble model exceeded the predictive power of the Random Forest model, which had the best performance among the single models. Next, to check for statistically significant differences between the forecasts of the Stacking Ensemble model and those of each individual model, pairs were constructed between the Stacking Ensemble model and each individual model. Because the Shapiro-Wilk normality test showed that none of the pairs followed a normal distribution, the nonparametric Wilcoxon rank sum test was used to check whether the two model forecasts in each pair differed significantly. The analysis showed that the forecasts of the Stacking Ensemble model differed significantly from those of the MLP and CNN models. 
In addition, this study can provide a methodology that allows existing credit rating agencies to apply machine learning-based bankruptcy risk prediction, given that traditional credit rating models can also be included as sub-models when calculating the final default probability. The Stacking Ensemble techniques proposed in this study can also help design models that meet the requirements of the Financial Investment Business Regulations through the combination of various sub-models. We hope that this research will be used as a resource for increasing practical use by overcoming and improving on the limitations of existing machine learning-based models.
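
The Merton-based default risk mentioned at the start of this abstract can be illustrated with the textbook distance-to-default formula. This is a simplified sketch, not the authors' procedure: it takes asset value and asset volatility as direct inputs, whereas the study derives its inputs from market capitalization and stock price volatility in a way the abstract does not detail. All numbers and names are hypothetical.

```python
from math import erf, log, sqrt


def normal_cdf(x):
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + erf(x / sqrt(2.0)))


def distance_to_default(asset_value, debt, drift, asset_vol, horizon=1.0):
    """Merton-style distance to default, in standard deviations of asset value."""
    numerator = log(asset_value / debt) + (drift - 0.5 * asset_vol ** 2) * horizon
    return numerator / (asset_vol * sqrt(horizon))


def default_probability(asset_value, debt, drift, asset_vol, horizon=1.0):
    """Probability that asset value ends below the debt level at the horizon."""
    dd = distance_to_default(asset_value, debt, drift, asset_vol, horizon)
    return normal_cdf(-dd)


# Hypothetical firm: assets 120, debt 100, 5% drift, 25% vs 50% asset volatility.
pd_calm = default_probability(120.0, 100.0, 0.05, 0.25)
pd_volatile = default_probability(120.0, 100.0, 0.05, 0.50)
print(round(pd_calm, 3), round(pd_volatile, 3))
```

Because the output is a continuous probability rather than a rare yes/no default event, a target of this kind gives every company a usable label, which is how the abstract describes avoiding the class imbalance of default-event data.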