• Title/Summary/Keyword: Imbalanced data

Application of Random Over Sampling Examples(ROSE) for an Effective Bankruptcy Prediction Model (효과적인 기업부도 예측모형을 위한 ROSE 표본추출기법의 적용)

  • Ahn, Cheolhwi;Ahn, Hyunchul
    • The Journal of the Korea Contents Association / v.18 no.8 / pp.525-535 / 2018
  • When one class occurs far more frequently than the others in a classification problem, the resulting data imbalance distorts machine learning. Corporate bankruptcy prediction often suffers from data imbalance because the proportion of insolvent companies is generally very low, whereas the proportion of solvent companies is very high. Mitigating this problem requires an appropriate sampling technique. Until now, oversampling techniques, which adjust the class distribution of a data set by sampling the minority class with replacement, have been widely used. However, they carry a risk of overfitting. Against this background, this study applies the ROSE (Random Over Sampling Examples) technique, proposed by Menardi and Torelli in 2014, to corporate bankruptcy prediction. ROSE creates new training samples by synthesizing them from the existing training samples, so it can improve the prediction accuracy of classifiers while reducing the risk of overfitting. Specifically, our study combines ROSE with the SVM (support vector machine), which is often regarded as one of the best binary classifiers. We applied the proposed method to a real-world bankruptcy prediction case from a major Korean bank and compared its performance with that of other sampling techniques. Experimental results showed that ROSE improved the prediction accuracy of the SVM in bankruptcy prediction compared with the other techniques, with statistical significance. These results suggest that ROSE can be a good alternative for resolving data imbalance in prediction problems in other areas of social science beyond bankruptcy prediction.
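
As a rough illustration of the idea described in the abstract above, the following Python sketch draws synthetic training samples with a smoothed bootstrap in the spirit of ROSE: pick a class with equal probability, pick one of its rows at random, then sample a new point from a Gaussian centred on that row. The function name `rose_sample`, the `shrink` parameter, the normal-reference bandwidth rule, and the toy data are assumptions made for this sketch; they are not taken from the paper.

```python
import random
import statistics


def rose_sample(X, y, n_new, seed=0, shrink=1.0):
    """Draw n_new synthetic rows with a ROSE-style smoothed bootstrap.

    X is a list of numeric feature rows and y the matching class labels.
    `shrink` (hypothetical name) scales the kernel width; 0 reduces the
    procedure to plain resampling of existing rows.
    """
    rng = random.Random(seed)

    # Group the rows by class label.
    by_class = {}
    for row, label in zip(X, y):
        by_class.setdefault(label, []).append(row)
    labels = sorted(by_class)
    d = len(X[0])

    # Assumed bandwidth: a normal-reference rule per class and feature.
    bandwidth = {}
    for label, rows in by_class.items():
        n = len(rows)
        factor = (4.0 / ((d + 2) * n)) ** (1.0 / (d + 4))
        bandwidth[label] = [
            shrink * factor * (statistics.stdev(col) if n > 1 else 0.0)
            for col in zip(*rows)
        ]

    X_new, y_new = [], []
    for _ in range(n_new):
        label = rng.choice(labels)            # classes drawn with equal probability
        centre = rng.choice(by_class[label])  # a random existing row of that class
        X_new.append([rng.gauss(c, h) for c, h in zip(centre, bandwidth[label])])
        y_new.append(label)
    return X_new, y_new


if __name__ == "__main__":
    # Toy data: 8 solvent firms (label 0) and 2 insolvent firms (label 1).
    X = [[1.0, 2.0], [1.2, 1.9], [0.9, 2.1], [1.1, 2.2],
         [1.3, 2.0], [0.8, 1.8], [1.0, 2.3], [1.2, 2.1],
         [3.0, 0.5], [3.2, 0.7]]
    y = [0] * 8 + [1] * 2
    X_bal, y_bal = rose_sample(X, y, n_new=200)
    print({label: y_bal.count(label) for label in set(y_bal)})
```

Because the class is drawn before the row, the synthetic set is roughly balanced regardless of the original class ratio, and the Gaussian jitter means the minority rows are not simply duplicated.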

Improving target recognition of active sonar multi-layer processor through deep learning of a small amounts of imbalanced data (소수 불균형 데이터의 심층학습을 통한 능동소나 다층처리기의 표적 인식성 개선)

  • Young-Woo Ryu;Jeong-Goo Kim
    • The Journal of the Acoustical Society of Korea / v.43 no.2 / pp.225-233 / 2024
  • Active sonar transmits sound waves to detect covertly maneuvering underwater objects and detects the signals reflected back from the target. However, in addition to the target's echo, the signal received by the active sonar is mixed with seafloor and sea surface reverberation, biological noise, and other noise, making target recognition difficult. Conventional techniques that detect signals above a threshold not only cause false detections or missed targets depending on the threshold chosen, but also require an appropriate threshold to be set for each underwater environment. To overcome this, research has been conducted on automatic calculation of threshold values through techniques such as Constant False Alarm Rate (CFAR) and on the application of advanced tracking filters and association techniques, but these have limitations in environments where a significant number of detections occur. With the recent development of deep learning, efforts have been made to apply it to underwater target detection, but active sonar data for training a discriminator are very difficult to acquire: the data are scarce, and they contain only a very small number of targets and a relatively large number of non-targets, so the resulting data imbalance makes learning difficult. In this paper, an image of the energy distribution of the detection signal is used, and a classifier is trained in a way that takes the data imbalance into account to distinguish between targets and non-targets; the classifier is then added to the existing technique. With the proposed technique, target misclassification was minimized and non-targets were eliminated, making target recognition easier for active sonar operators. The effectiveness of the proposed technique was verified using sea experiment data obtained in the East Sea.

Illegal Cash Accommodation Detection Modeling Using Ensemble Size Reduction (신용카드 불법현금융통 적발을 위한 축소된 앙상블 모형)

  • Lee, Hwa-Kyung;Han, Sang-Bum;Jhee, Won-Chul
    • Journal of Intelligence and Information Systems / v.16 no.1 / pp.93-116 / 2010
  • An ensemble approach is applied to detection modeling of illegal cash accommodation (ICA), a well-known type of fraudulent credit card use in Far East nations that has not been addressed in the academic literature. The performance of a fraud detection model (FDM) suffers from the imbalanced data problem, which can be remedied to some extent by using an ensemble of many classifiers. It is generally accepted that ensembles of classifiers produce better accuracy than a single classifier, provided there is diversity in the ensemble. Furthermore, recent research suggests that it may be better to ensemble a selected subset of classifiers instead of all the classifiers at hand. For effective detection of ICA, we adopt an ensemble size reduction technique that prunes the ensemble of all classifiers using accuracy and diversity measures. Diversity in an ensemble manifests itself as disagreement or ambiguity among its members. The data imbalance intrinsic to FDM affects our approach to ICA detection in two ways. First, we suggest a training procedure with over-sampling methods to obtain diverse training data sets. Second, we use variants of the accuracy and diversity measures that focus on the fraud class. We also calculate the diversity measure dynamically, using Forward Addition and Backward Elimination. In our experiments, Neural Networks, Decision Trees, and Logit Regressions are the base models used as ensemble members, and the performance of homogeneous ensembles is compared with that of heterogeneous ensembles. The experimental results show that the reduced-size ensemble is, on average over the data sets tested, as accurate as the non-pruned version, which provides benefits in terms of application efficiency and reduced ensemble complexity.

The Development of Park Analysis Indicators and Current Status: A Case Study of Daejeon Metropolitan City (공원 분석 지표 개발 및 현황 분석: 대전광역시를 중심으로)

  • Hwang, Jae-Yeon;Gwak, Seung-Yeon;Kim, Sang-Kyu;Park, Min-Ju
    • Land and Housing Review / v.13 no.1 / pp.99-112 / 2022
  • Securing urban parks and enhancing their accessibility is becoming increasingly important because of irrational residential developments and apartment construction. Accordingly, Daejeon Metropolitan City has carried out urban park management projects to improve the quality of parks and create new parks. Daejeon Metropolitan City generates and manages park data by administrative district for management purposes. However, these datasets take different forms in each administrative district. This study integrates the park data generated by the administrative districts of Daejeon into a single format and generates geographic information data containing the area of each park for analysis. The analysis results show that urban parks are severely imbalanced across administrative districts, requiring new policy measures. In addition, by normalizing the park analysis results and then ranking them, this study compares them in detail with the actual park information to confirm the soundness of the dataset. The analysis results provide implications for improving the management of urban parks. This study proposes integrated datasets, and their continued management in each administrative district, that include the essential data needed to describe parks objectively along with park evaluation indicators based on previous studies.

The Study of Correlationship of the Fukuda Stepping Test to Determine Type of Idiopathic Scoliosis Curve (척주측만증 환자의 만곡과 후쿠다 검사의 상관성에 관한 연구)

  • Lee, Sang-Yeol;Jo, Marg-Eun;Ko, Min-Ji;Kim, Young-Ju;Lee, Seung-Min
    • Journal of the Korean Society of Physical Medicine / v.11 no.2 / pp.13-16 / 2016
  • PURPOSE: The Fukuda test can be used at home and in school to diagnose scoliosis at an early stage and prevent serious curvature of the spine. This study aimed to use the Fukuda test to detect scoliosis. An additional aim was to raise national interest in the imbalanced postures and habits that result in scoliosis by providing data obtained in periodic assessments. METHODS: The study included 35 idiopathic scoliosis patients (22 in the right lumbar spinal region and 13 in the left lumbar spinal region). The distance of displacement and the angle of displacement were measured following the Fukuda test. A correlation analysis was then used to examine the distance of displacement and the angles of displacement and rotation with regard to the direction of the curve in scoliosis. RESULTS: There was a significant negative correlation (p<0.00) between the direction of the curve in scoliosis and the angle of displacement, but there was no correlation between the Cobb angle and the distance of displacement or between the Cobb angle and the angle of rotation. CONCLUSION: The Fukuda test did not capture changes in spinal curvature, such as the Cobb angle, or subsequent changes in the muscles. Thus, the Fukuda test is not suited to examining the direction or status of the thoracic curve in scoliosis patients. Simple methods to objectively measure scoliosis need to be developed.

A Study on Book Categorization in Social Sciences Using kNN Classifiers and Table of Contents Text (목차 정보와 kNN 분류기를 이용한 사회과학 분야 도서 자동 분류에 관한 연구)

  • Lee, Yong-Gu
    • Journal of the Korean Society for Information Management / v.37 no.1 / pp.1-21 / 2020
  • This study applied automatic classification using table of contents (TOC) text to 6,253 social science books from a list of new arrivals collected by a university library. The k-nearest neighbors (kNN) algorithm was used as the classifier, and the ten divisions on the second level of the DDC's main class 300, as assigned to the books by the library, were used as classes (labels). The features used in this study were keywords extracted from the titles and TOCs of the books. The TOCs were obtained through the OpenAPI of an Internet bookstore. The results showed that the TOC features improved both classification recall and precision. With its rich features, the TOC was shown to reduce the overfitting problem of imbalanced data. Law and education have high topic specificity within the social sciences, so title features alone can yield good classification performance in these fields.
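
The classification setup in the abstract above can be sketched in a few lines of Python: each book becomes a bag of keywords drawn from its title and TOC, and a query book takes the majority label of its k most similar neighbours. This is only a minimal sketch; the whitespace tokenizer, the cosine similarity measure, the value of k, and the six toy books with their DDC labels are all assumptions chosen for illustration, not the study's actual features or data.

```python
import math
from collections import Counter


def tokenize(text):
    """Lower-case whitespace tokenizer (a stand-in for real keyword extraction)."""
    return [word for word in text.lower().split() if word.isalpha()]


def cosine(a, b):
    """Cosine similarity between two keyword-count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def knn_predict(train, query_text, k=3):
    """Return the majority label among the k most similar training books."""
    query = Counter(tokenize(query_text))
    nearest = sorted(train, key=lambda item: cosine(query, item[0]), reverse=True)[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


# Hypothetical books: (title, TOC text, DDC division label).
books = [
    ("contract law", "formation of contracts remedies for breach", "340"),
    ("criminal law", "offences defences criminal procedure courts", "340"),
    ("teaching methods", "curriculum design classroom assessment teachers", "370"),
    ("school leadership", "teachers curriculum reform school management", "370"),
    ("microeconomics", "markets prices consumer demand supply", "330"),
    ("labor economics", "wages employment labor markets supply", "330"),
]
train = [(Counter(tokenize(title + " " + toc)), label) for title, toc, label in books]

if __name__ == "__main__":
    query = "introduction to courts" + " " + "criminal procedure defences remedies"
    print(knn_predict(train, query, k=3))
```

Concatenating the TOC text with the title gives each book many more keywords to match on, which is one plausible reading of why TOC features helped the sparsely populated classes in the study.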

The Efficiency Rating Prediction for Cultural Tourism Festival Based of DEA (DEA를 적용한 문화관광축제의 효율성 등급 예측모형)

  • Kim, Eun-Mi;Hong, Tae-Ho
    • The Journal of Information Systems / v.29 no.3 / pp.145-157 / 2020
  • Purpose: This study proposes an approach for predicting the efficiency rating of cultural tourism festivals using DEA and machine learning techniques. The best cultural tourism festivals are selected through peer reviews by tourism experts. However, only 10% of the festivals held in a year can be evaluated, and the evaluation considers effectiveness without considering the efficiency of the festivals. Design/methodology/approach: Efficiency scores were derived from the results of DEA for the prediction of efficiency ratings. This study used BCC models to reflect the size effect of festivals and classified the festivals into four ratings according to their efficiency scores. Multi-class classification methods were used to build the prediction model for the four ratings. We used neural networks and SVMs with OAO (one-against-one), OAR (one-against-rest), and C&S (Crammer & Singer) on Korean festival data from 2013 to 2018. Findings: Analysis of the DEA results for the four ratings showed that the total number of visitors was larger in the low-efficiency ratings than in the high-efficiency ratings, although total visitor expenditure was highest in the most efficient rating. The SVM with the OAO model showed the best accuracy, while the SVM with the OAR model was not trained well because of the imbalanced distribution between the efficient rating and the other ratings. Our approach can predict the efficiency of festivals that were not included in the review process for cultural tourism festivals without rebuilding the DEA models each time. This enables festivals to be managed efficiently with the proposed machine learning models.
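
One way to see why the one-against-rest decomposition mentioned in the findings above can suffer more from imbalance than one-against-one is to count the positive and negative examples each binary sub-problem receives. The Python sketch below does only that counting; the helper names and the rating counts (10 festivals in rating A, 30 in each of B, C, and D) are hypothetical and are not the study's data.

```python
from collections import Counter
from itertools import combinations


def one_against_rest(labels):
    """For each class, return (positives, negatives) of its class-vs-rest problem."""
    counts = Counter(labels)
    total = len(labels)
    return {cls: (n, total - n) for cls, n in counts.items()}


def one_against_one(labels):
    """For each pair of classes, return the sizes of the two classes involved."""
    counts = Counter(labels)
    return {(a, b): (counts[a], counts[b]) for a, b in combinations(sorted(counts), 2)}


if __name__ == "__main__":
    # Hypothetical rating distribution with one small class.
    labels = ["A"] * 10 + ["B"] * 30 + ["C"] * 30 + ["D"] * 30

    print("one-against-rest:")
    for cls, (pos, neg) in sorted(one_against_rest(labels).items()):
        print(f"  {cls} vs rest: {pos} positive, {neg} negative")

    print("one-against-one:")
    for (a, b), (n_a, n_b) in one_against_one(labels).items():
        print(f"  {a} vs {b}: {n_a} vs {n_b}")
```

With four classes, one-against-rest trains four binary classifiers and one-against-one trains six. Under these assumed counts the small class faces a 10-to-90 split in its one-against-rest problem but only a 10-to-30 split in each of its pairwise problems.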

Changes in Mitogen-activated Protein Kinase Activities During Acidification-induced Apoptosis in CHO Cells

  • Kim, Jin-Young;Jeong, Dae-Won;Roh, Sang-Ho;Min, Byung-Moo
    • International Journal of Oral Biology / v.30 no.3 / pp.85-90 / 2005
  • Homeostatic pH is very important for various cellular processes, including metabolism, survival, and death. An imbalanced pH might induce cellular acidosis, which is involved in many abnormal events such as apoptosis and malignancy. One of several factors contributing to the onset of metabolic acidosis is the production of lactate and protons by lactate dehydrogenase (LDH) in anaerobic glycolysis. LDH is an important enzyme that catalyzes the reversible conversion of pyruvate to lactate. This study sought to examine whether decreases in extracellular pH induce apoptosis of CHO cells, and to elucidate the role of mitogen-activated protein kinases (MAPKs) in acidification-induced apoptosis. To test apoptotic signaling by acidification, we used CHO dhfr cells, which are sensitive to acidification, and CHO/anti-LDH cells, which are resistant to acidification-induced apoptosis and have reduced LDH activity owing to stable expression of LDH antisense mRNA. In the present study, cellular lactic acid-induced acidification and the role of MAPK signaling in acidification-induced apoptosis were investigated. Acidification caused by $\mathrm{HCO_3^-}$-free conditions induced apoptosis and activation of MAPKs (ERK, JNK, and p38). However, MAPKs were only slightly activated under acidic conditions in the CHO/anti-LDH cells, indicating that lactic acid-induced acidification induces activation of MAPKs. Treatment with a p38 inhibitor, PD169316, increased acidification-induced apoptosis, but apoptosis was not affected by inhibitors of ERK (U0126) or JNK (SP600125). Thus, these data support the hypothesis that activation of p38 MAPK during acidification-induced apoptosis contributes to cell survival.

Bridging the Chasm between Design and Marketing: Problems and Solutions in the Integration Between Design and Marketing (디자인과 마케팅 협업의 틈새관리: 디자인과 마케팅의 협업시 통합의 문제와 해결방안)

  • Im, Subin;Joo, Jaewoo;Linder, Martin;Nam, Kiyoung
    • Journal of the Korea Academia-Industrial cooperation Society / v.16 no.2 / pp.1026-1035 / 2015
  • Although integrating design and marketing is critical for successful new product development (NPD), the academic literature has paid limited attention to the problems that can arise during the NPD process and their possible solutions. To narrow this gap, our study conducted a series of surveys of an interdisciplinary class project between marketing and design students over a two-year period at a U.S. university. From the survey data collected from a total of 65 students who participated in the collaboration projects, we identified the two most common problems: (1) conflict arising from functional background, and (2) conflict arising from imbalanced decision-making authority between design and marketing. To resolve such conflict, we found two contrasting solutions: (1) facilitating communication and (2) prohibiting communication. Our findings contribute to the formation of a theoretical basis for research on design-marketing integration.

Identifying Supply-demand Relationships on Ecosystem Services Using Socio-ecological Approach in Gyeong-gi Province (사회-생태계 이론을 활용한 경기도 지역 생태계서비스 공급-수요관계 분석)

  • Park, Yoon-Sun;Kim, Choong-Ki;Lee, Jae-Hyuck;Song, Young-Keun;Hong, Hyun-Jeong
    • Journal of Korean Society of Rural Planning / v.27 no.3 / pp.35-46 / 2021
  • Ecosystem services play a role in promoting sustainable development by contributing to human welfare. Sustainable development requires a balance between the supply of and demand for ecosystem services. In this study, factor analysis was performed using measured ecosystem services to represent supply and national statistical data on socio-economic factors to represent demand. The results of the analysis for Gyeong-gi Province are as follows. Service supply, based on the measured ecosystem services, was divided into mixed service provisioning as factor1, food provisioning as factor2, and the P retention service provisioning area as factor3. Service demand, based on socio-economic factors, was divided into urbanized areas as factor1, forest development areas as factor2, and agricultural activity development areas as factor3. The local governments evaluated as balanced were Pocheon, Yangpyeong, Icheon, Pyeongtaek, Goyang, Suwon, Gwangmyeong, and Osan, and those evaluated as imbalanced were Gimpo, Uiwang, Anseong, and Yeoju. A management plan to maintain the balance between supply and demand of ecosystem services was suggested. The analysis method and results of this study are expected to be applicable to various local governments through regional expansion.