• Title/Summary/Keyword: 불균형 (不均衡; "imbalance")

Search Result 2,197, Processing Time 0.036 seconds

Prediction of Good Seller in Overseas sales of Domestic Books Using Big Data (빅데이터를 활용한 국내 도서의 해외 판매시 굿셀러 예측)

  • Kim, Nayeon;Kim, Doyoung;Kim, Miryeo;Jung, Jiyeong;Kim, Hyon Hee
    • Proceedings of the Korea Information Processing Society Conference / 2022.05a / pp.401-404 / 2022
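  • As Korean literature reaches readers around the world, establishing a foothold in overseas markets has become important. This study aimed to predict whether books published overseas during the five years from 2016 to 2020 reached cumulative sales of 5,000 copies or more, the threshold for classification as a good seller. Because good sellers make up only a small share of all translated books, the data were imbalanced; the study addressed this by applying the SMOTE technique and ensemble algorithms. Performance improved as the class ratio approached 1:1, and the LightGBM model achieved an AUC of 99.83%, the best predictive performance among the ensemble algorithms compared. The author was the most important variable for predicting whether cumulative sales reached 5,000 copies, and the country of publication, together with online factors such as average rating and number of raters, also proved to be significant variables for sales prediction.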
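The abstract above names SMOTE but does not describe how it was configured. As a rough illustration only, the core idea of SMOTE (placing synthetic minority samples on the line segment between a minority point and one of its nearest minority neighbours) can be sketched as follows. The function name `smote_sample`, the toy data, and the parameters `k`, `n_new`, and `seed` are hypothetical and are not taken from the paper.

```python
import random

def smote_sample(minority, k=2, n_new=4, seed=0):
    """Create n_new synthetic points by interpolating between a minority
    point and one of its k nearest minority neighbours (illustrative only)."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        x = rng.choice(minority)
        # Rank the other minority points by squared Euclidean distance to x.
        others = [p for p in minority if p is not x]
        others.sort(key=lambda p: sum((a - b) ** 2 for a, b in zip(x, p)))
        neighbour = rng.choice(others[:k])
        gap = rng.random()  # interpolation factor in [0, 1)
        synthetic.append(tuple(a + gap * (b - a) for a, b in zip(x, neighbour)))
    return synthetic

# Hypothetical two-feature minority class (e.g. the "good seller" rows).
minority = [(1.0, 1.0), (1.2, 0.9), (0.8, 1.1), (1.1, 1.3)]
new_points = smote_sample(minority)
```

Every synthetic point lies between two existing minority points, so oversampling this way moves the class ratio toward 1:1 without duplicating rows exactly.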

The Consumption Structure of Korean Elderly Households Depending on Poverty Status and Family Type (빈곤지위와 가구유형에 따른 노인가구의 소비특성 차이 분석)

  • Baek, Hakyoung
    • 한국노년학 / v.30 no.3 / pp.911-931 / 2010
  • This study assesses the consumption structure of elderly households in Korea, focusing on how it differs by poverty status and family type. The results show that poor elderly households spend primarily on necessities: health care, food, clothing, and shelter. In particular, poor elderly people living alone and poor elderly couples living independently show a seriously unbalanced consumption pattern. Based on these findings, the study recommends introducing support schemes that help these households purchase necessary goods and thereby improve their economic well-being, as well as schemes that promote their social role as consumers.

Mitigating Data Imbalance via Ensembled Data Augmentation: An Explainable Credit Scoring Models (데이터 증강 기법의 앙상블을 통한 레이블 불균형 해소: 설명 가능한 신용평가 모델을 중심으로)

  • Ji-Young Chung;So-Yeon Lee;Ye-Lin Yong;Min-Jun Kim
    • Proceedings of the Korea Information Processing Society Conference / 2023.11a / pp.483-486 / 2023
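  • The financial sector has recently seen growing attention to the black-box problem caused by the complexity of predictive models and to financial regulation. Accordingly, the industry emphasizes reliability and transparency, and research on explainable models is especially active in credit scoring. The field also stresses the data imbalance problem, in which a model fails to learn the minority class sufficiently and may overfit to the majority class. This problem becomes more prominent when Type 2 errors must be minimized, and it has emerged as a key issue in personal credit scoring, where customers with low loan repayment ability must be identified as fully as possible. This paper uses an attention mechanism to improve model explainability and to help interpret the analysis results. It further experiments with a total of five data augmentation techniques, including SMOTE, GAN, and ADASYN, and confirms that ensembling them substantially improves classification accuracy on the minority class label.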

AI Performance Based On Learning-Data Labeling Accuracy (인공지능 학습데이터 라벨링 정확도에 따른 인공지능 성능)

  • Ji-Hoon Lee;Jieun Shin
    • Journal of Industrial Convergence / v.22 no.1 / pp.177-183 / 2024
  • The study investigates the impact of data quality on the performance of artificial intelligence (AI). To this end, the impact of labeling error levels on the performance of artificial intelligence was compared and analyzed through simulation, taking into account the similarity of data features and the imbalance of class composition. As a result, data with high similarity between characteristic variables were found to be more sensitive to labeling accuracy than data with low similarity between characteristic variables. It was observed that artificial intelligence accuracy tended to decrease rapidly as class imbalance increased. This will serve as the fundamental data for evaluating the quality criteria and conducting related research on artificial intelligence learning data.

LSTM-based fraud detection system framework using real-time data resampling techniques (실시간 리샘플링 기법을 활용한 LSTM 기반의 사기 거래 탐지 시스템)

  • Seo-Yi Kim;Yeon-Ji Lee;Il-Gu Lee
    • Proceedings of the Korea Information Processing Society Conference / 2024.05a / pp.505-508 / 2024
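  • The digital transformation of the financial industry offers convenience to users but has introduced security vulnerabilities that did not previously exist. To address this, research on fraud detection systems that apply machine learning is active. However, data imbalance arising during model training makes training slow and degrades detection performance. This paper proposes a new fraud detection system (FDS) that uses real-time data oversampling to resolve the data imbalance problem in detecting anomalous transactions and to reduce model training time. Compared with a conventional LSTM-based FDS model, the proposed FDS framework, which is based on an LSTM (Long Short-Term Memory) algorithm with SMOTE (Synthetic Minority Oversampling Technique) applied, reduced data size by 96.5% and improved precision, recall, and F1-score by 34.81%, 11.14%, and 22.51%, respectively.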
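The abstract above reports precision, recall, and F1-score without restating their definitions. A minimal sketch of how these three metrics are computed from binary predictions follows; the helper name `precision_recall_f1` and the toy labels are hypothetical and do not reproduce the paper's data or results.

```python
def precision_recall_f1(y_true, y_pred, positive=1):
    """Compute precision, recall, and F1 for the chosen positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    # F1 is the harmonic mean of precision and recall.
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1

# Hypothetical imbalanced labels: 3 fraudulent (1) and 7 normal (0) transactions.
y_true = [1, 1, 1, 0, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 1, 0, 0, 0, 0, 0, 0]
precision, recall, f1 = precision_recall_f1(y_true, y_pred)
```

On imbalanced data these metrics are more informative than plain accuracy, because a classifier that always predicts the majority class would score 70% accuracy here while achieving zero recall on fraud.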

The Demand and Supply of Nutritionist Workforce in Korea and Policy Recommendations (국민영양관리를 위한 영양사 인력의 적정수급에 관한 연구)

  • Oh, Young-Ho
    • Journal of Nutrition and Health / v.43 no.5 / pp.533-542 / 2010
  • The objective of this study is to provide basic information and policy implications needed to balance the supply of and demand for dietitians by projecting both. Data from the Ministry of Health Welfare and Family on the number of licensed nutritionists, resident registration data of the Ministry of Public Administration and Security, and health insurance qualification data of the National Health Insurance Corporation were used to examine the current status of supply. To project the supply of the nutritionist workforce, the in-out moves method and the demographic method were used. The ratios of nutritionists to population and to GDP, and the corresponding ratios of other countries, were applied as demand projection methods. According to the study results, the projected imbalance between the supply of and demand for dietitians by year 2021 differs depending on the method used. First, according to the results based on the age-adjusted population ratio, there is an oversupply of 1,643 dietitians in year 2010 and of 2,076 dietitians in year 2020. Second, although the projected imbalance differs depending on whether GDP is calculated in won (₩) or dollars ($), an oversupply is expected in general. Third, in the scenarios using nutritionist ratios from foreign countries, an oversupply of dietitians in Korea is likely under any scenario when the supply projection is compared with the demand projection based on the nutritionist ratio in the United States. However, the projection varies by scenario when the European nutritionist ratio is applied: under European 'scenario 1' an oversupply is expected, whereas under 'scenario 2' a shortage is expected. Supply and demand projections based on the criteria of other countries should be interpreted with care, because dietitians assume different roles and functions in each country.
Although a slight oversupply of the nutritionist workforce is projected, it should not cause a major problem, because the demand for diet therapy is expected to rise with population aging and the increase in chronic diseases, and because the demand for clinical dietitians in hospitals is increasing. However, the nutritionist qualification is much too open in Korea, and this has a negative effect on the quality of the nutritionist workforce. Therefore, it is important to reinforce nutritionist qualifications and requirements in the future, to raise the quality of the nutritionist supply, and to maintain the balance between supply and demand.

Ensemble Learning with Support Vector Machines for Bond Rating (회사채 신용등급 예측을 위한 SVM 앙상블학습)

  • Kim, Myoung-Jong
    • Journal of Intelligence and Information Systems / v.18 no.2 / pp.29-45 / 2012
  • Bond rating is regarded as an important event for measuring the financial risk of companies and for determining the investment returns of investors. As a result, predicting companies' credit ratings by applying statistical and machine learning techniques has been a popular research topic. Statistical techniques, including multiple regression, multiple discriminant analysis (MDA), logistic models (LOGIT), and probit analysis, have traditionally been used in bond rating. However, one major drawback is that they rest on strict assumptions. Such assumptions include linearity, normality, independence among predictor variables, and pre-existing functional forms relating the criterion variables and the predictor variables. These strict assumptions of traditional statistics have limited their application to the real world. Machine learning techniques used in bond rating prediction models include decision trees (DT), neural networks (NN), and Support Vector Machine (SVM). In particular, SVM is recognized as a new and promising classification and regression analysis method. SVM learns a separating hyperplane that maximizes the margin between two categories. SVM is simple enough to be analyzed mathematically and leads to high performance in practical applications. SVM implements the structural risk minimization principle and seeks to minimize an upper bound of the generalization error. In addition, the solution of SVM may be a global optimum, and thus overfitting is unlikely to occur with SVM. Moreover, SVM does not require many data samples for training, since it builds prediction models using only some representative samples near the boundaries, called support vectors. A number of experimental studies have indicated that SVM has been successfully applied in a variety of pattern recognition fields. However, there are three major drawbacks that can potentially degrade SVM's performance.
First, SVM was originally proposed for solving binary-class classification problems. Methods for combining SVMs for multi-class classification, such as One-Against-One and One-Against-All, have been proposed, but they do not improve performance in multi-class classification problems as much as SVM does for binary-class classification. Second, approximation algorithms (e.g. decomposition methods, the sequential minimal optimization algorithm) can be used to reduce computation time in multi-class problems, but they can deteriorate classification performance. Third, multi-class prediction problems suffer from the data imbalance problem, which occurs when the number of instances in one class greatly outnumbers the number of instances in another class. Such data sets often yield a default classifier with a skewed boundary, which reduces classification accuracy. SVM ensemble learning is one of the machine learning methods for coping with the above drawbacks. Ensemble learning is a method for improving the performance of classification and prediction algorithms. AdaBoost is one of the most widely used ensemble learning techniques. It constructs a composite classifier by sequentially training classifiers while increasing the weight on misclassified observations through iterations. Observations that are incorrectly predicted by previous classifiers are chosen more often than those that are correctly predicted. Thus boosting attempts to produce new classifiers that are better able to predict examples for which the current ensemble's performance is poor. In this way, it can reinforce the training of misclassified observations of the minority class. This paper proposes a multiclass Geometric Mean-based Boosting (MGM-Boost) to resolve the multiclass prediction problem.
Since MGM-Boost introduces the notion of geometric mean into AdaBoost, its learning process can consider the geometric mean-based accuracy and errors across multiple classes. This study applies MGM-Boost to a real-world bond rating case for Korean companies to examine the feasibility of MGM-Boost. 10-fold cross validation is performed three times with different random seeds in order to ensure that the comparison among the three classifiers does not happen by chance. For each 10-fold cross validation, the entire data set is first partitioned into ten equal-sized sets, and then each set is in turn used as the test set while the classifier trains on the other nine sets. That is, cross-validated folds have been tested independently of each algorithm. Through these steps, we have obtained the results for the classifiers on each of the 30 experiments. In the comparison of arithmetic mean-based prediction accuracy between individual classifiers, MGM-Boost (52.95%) shows higher prediction accuracy than both AdaBoost (51.69%) and SVM (49.47%). MGM-Boost (28.12%) also shows higher prediction accuracy than AdaBoost (24.65%) and SVM (15.42%) in terms of geometric mean-based prediction accuracy. A t-test is used to examine whether the performance of each classifier over the 30 folds is significantly different. The results indicate that the performance of MGM-Boost is significantly different from that of the AdaBoost and SVM classifiers at the 1% level. These results mean that MGM-Boost can provide robust and stable solutions to multi-class problems such as bond rating.
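The abstract above contrasts arithmetic mean-based and geometric mean-based accuracy but does not give formulas. The sketch below assumes both are taken over per-class recall, which is one common reading and may differ from the paper's exact definition; the function names and the toy three-class labels are hypothetical.

```python
import math
from collections import defaultdict

def per_class_recall(y_true, y_pred):
    """Fraction of each true class that was predicted correctly."""
    total = defaultdict(int)
    correct = defaultdict(int)
    for t, p in zip(y_true, y_pred):
        total[t] += 1
        if t == p:
            correct[t] += 1
    return {c: correct[c] / total[c] for c in total}

def arithmetic_mean_accuracy(y_true, y_pred):
    recalls = list(per_class_recall(y_true, y_pred).values())
    return sum(recalls) / len(recalls)

def geometric_mean_accuracy(y_true, y_pred):
    recalls = list(per_class_recall(y_true, y_pred).values())
    # One poorly predicted class pulls the product down sharply.
    return math.prod(recalls) ** (1 / len(recalls))

# Hypothetical imbalanced 3-class problem: 6 x "A", 3 x "B", 2 x "C".
y_true = ["A"] * 6 + ["B"] * 3 + ["C"] * 2
y_pred = ["A"] * 6 + ["B", "B", "A"] + ["C", "A"]
arith = arithmetic_mean_accuracy(y_true, y_pred)
geo = geometric_mean_accuracy(y_true, y_pred)
```

Because the geometric mean is a product, it penalizes a classifier that neglects any single class, which is consistent with the geometric-mean figures reported above being much lower than the arithmetic ones.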

Formalization of Productivity Metrics for Equipment in Multi-sectioned Road Construction Projects (다(多)공구 도로 공사 현장 장비들의 운영 실태 파악을 위한 생산성 지표 정립에 관한 연구)

  • Kim, Hong-Yeul;Koo, Bon-Sang
    • Korean Journal of Construction Engineering and Management / v.13 no.4 / pp.100-109 / 2012
  • Large road construction projects are typically partitioned into sections that are then contracted individually to contractors. Each section requires similar heavy equipment, including excavators, dump trucks, and pavers, which constitutes the highest cost. Normally the equipment is not shared between sections, as each contractor wishes to have its equipment readily available. However, such practices result in very low utilization of this equipment. The goal of this research is to develop a programmatic resource sharing system in which contractors can share equipment depending on the changing needs of a multi-sectioned road project. This paper introduces the results of a survey performed to investigate how contractors currently manage the supply and demand of equipment and which equipment is practical to share across a project. More importantly, the paper describes a set of metrics (DPR, nDPR, SDI) needed to quantify the amount of supply/demand variance occurring in each section. The metrics were applied to an actual road construction project, and the results show that each section suffers from an imbalance between its monthly planned and actual utilization of equipment. The results also indicate that sharing equipment can lead to potentially large savings, as equipment requirements can be met within the project rather than through short-term leasing from outside vendors.

Building practical treatment protocol by comparing the effect of adjustment between Thompson Terminal Technique and Exercise in malpositioned pelvic which induces imbalance of body (골반변위에 따른 신체 불균형에 대한 톰슨터미널테크닉과 운동요법의 교정 효과비교분석을 통한 임상치료프로토콜의 구성)

  • Park, Joon-Ki;Choi, Eun-Seok;Kim, Min-Jung;Lee, Man-Su;Lee, Min-Sun
    • Journal of Digital Convergence / v.14 no.5 / pp.445-457 / 2016
  • The purpose of the study is to provide a framework for an efficient diagnostic and treatment protocol for people with a malpositioned pelvis that causes imbalance of the body. Study subjects were divided into experimental, comparison, and control groups. Each group consisted of five men and five women, randomly assigned. The experimental group was treated with the Thompson Terminal Technique and tested for its corrective effect and for how well the treatment effect was maintained. There was a 43.01%p difference in effectiveness between the Thompson Terminal Technique and the Muscle Energy Technique, indicating that the Thompson Terminal Technique is more effective in treating pelvic misalignment than the Muscle Energy Technique. The results show that chiropractic and resistance exercises are effective for treating imbalance of the body. To maximize the treatment effect, it is preferable to apply the Muscle Energy Technique after applying the Thompson Terminal Technique.

Transmission and Disequilibrium Tests Based on Sibship Data (형제 및 자매의 유전자형 자료에 기초한 전달불균형 검정법에 관한 연구)

  • Kim, Jin-Heum;Jang, Yang-Soo
    • The Korean Journal of Applied Statistics / v.21 no.1 / pp.81-94 / 2008
  • Family-based tests such as the transmission and disequilibrium tests (TDT) have proved to be powerful tools in the search for disease genes. Unlike case-control studies, these tests are not affected by population admixture, which can lead to spurious association of multiple highly linked markers with disease-susceptibility genes. The tests have largely required knowledge of parental marker genotypes. However, parental data are often not available for late-onset diseases. In this article we propose sib-TDTs that overcome this problem by using marker data from unaffected sib(s) instead of parents. To this end, we first define a Mantel-Haenszel-type statistic for each haplotype and then propose two tests based on this statistic. Simulation studies suggest that the proposed tests are robust to population admixture and that their power increases monotonically with relative risk, irrespective of the mode of inheritance. We also illustrate the proposed tests with data from the Yonsei Cardiovascular Genome Center.
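For orientation, the classical parental TDT that the abstract above takes as its starting point is a McNemar-type statistic: among heterozygous parents, if b transmit the candidate allele to an affected child and c transmit the other allele, then (b - c)^2 / (b + c) is approximately chi-square distributed with 1 degree of freedom under the null hypothesis. The sketch below shows only this classical form, not the sib-based Mantel-Haenszel statistic the paper proposes; the function name and the counts are hypothetical.

```python
def tdt_statistic(b, c):
    """Classical TDT: McNemar-type chi-square statistic with 1 degree of freedom.

    b: heterozygous parents transmitting the candidate allele
    c: heterozygous parents transmitting the other allele
    """
    if b + c == 0:
        raise ValueError("no informative (heterozygous) transmissions")
    return (b - c) ** 2 / (b + c)

# Hypothetical counts: 60 transmissions of the candidate allele versus 40.
chi2 = tdt_statistic(b=60, c=40)

# 3.841 is the 5% critical value of the chi-square distribution with 1 df.
significant_at_5_percent = chi2 > 3.841
```

With these made-up counts the statistic is 4.0, just above the 5% critical value, so the null hypothesis of equal transmission would be rejected at that level.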