• Title/Summary/Keyword: Decision Tree Regression

Convergence analysis for geographic variations and risk factors in the prevalence of hyperlipidemia using measures of Korean Community Health Survey (지역사회건강조사 지표를 이용한 고지혈증 유병율의 지역 간 변이와 위험 요인의 융복합적 분석)

  • Kim, Yoo-Mi;Kang, Sung-Hong
    • Journal of Digital Convergence / v.13 no.8 / pp.419-429 / 2015
  • We investigate how health-related and socioeconomic factors affect the regional prevalence of hyperlipidemia, with special emphasis on geographic variation. We model the likelihood of hyperlipidemia as a function of various region-specific attributes. We analyze a data set covering 249 small administrative districts, collected in the 2012 Korean Community Health Survey by the Korea Centers for Disease Control and Prevention. For estimation, we use several methods, including correlation analysis, multiple regression, and a decision tree model. We find that the average prevalence of hyperlipidemia across the 249 districts is 9.6% and its coefficient of variation is 28.3%. Prevalence is higher in inland and capital regions than in southeast coastal regions. Further findings from the decision tree model suggest that variation in hyperlipidemia prevalence between regions is more likely to be associated with the regional rate of employment, level of stress, and prevalence of hypertension, angina pectoris, and osteoarthritis.
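To make the reported dispersion concrete: the coefficient of variation is the standard deviation divided by the mean, so the two figures in the abstract imply a standard deviation of roughly 2.7 percentage points across districts. The sketch below is illustrative only. The function names are hypothetical, and it assumes the sample standard deviation, since the abstract does not say which estimator the authors used.

```python
import statistics


def coefficient_of_variation(values):
    """Sample standard deviation divided by the mean, in percent.

    Assumption: sample (n - 1) standard deviation; the abstract does not
    state which estimator was used.
    """
    return 100.0 * statistics.stdev(values) / statistics.mean(values)


def implied_sd(mean, cv_percent):
    """Back out the standard deviation from a mean and a CV given in percent."""
    return mean * cv_percent / 100.0


# Figures reported in the abstract: mean prevalence 9.6%, CV 28.3%.
sd = implied_sd(9.6, 28.3)
print(f"implied SD: {sd:.2f} percentage points")
```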

A study on the prediction of korean NPL market return (한국 NPL시장 수익률 예측에 관한 연구)

  • Lee, Hyeon Su;Jeong, Seung Hwan;Oh, Kyong Joo
    • Journal of Intelligence and Information Systems / v.25 no.2 / pp.123-139 / 2019
  • The Korean NPL (non-performing loan) market was formed by the government and foreign capital shortly after the 1997 IMF crisis. However, the market's history is short, as bad debt began to increase only after the global financial crisis in 2009, owing to the recession in the real economy. NPLs have become a major investment in recent years, as domestic capital-market investors began to enter the NPL market in earnest. Although the domestic NPL market has received considerable attention because of its recent overheating, research on it remains scarce because the history of capital-market investment in the domestic NPL market is short. In addition, declining profitability and price fluctuations driven by the real estate business cycle call for decision-making based on more scientific and systematic analysis. In this study, we propose a prediction model that determines whether a benchmark yield will be achieved, using NPL market data in line with this market demand. To build the model, we used Korean NPL data covering about four years, from December 2013 to December 2017. The data comprised 2,291 items. As independent variables, only those related to the dependent variable were selected from 11 variables describing the characteristics of the real estate. To select the variables, one-to-one t-tests, stepwise logistic regression, and a decision tree were used. Seven independent variables were selected: purchase year, SPC (special purpose company), municipality, appraisal value, purchase cost, OPB (outstanding principal balance), and HP (holding period). The dependent variable is a binary variable indicating whether the benchmark rate of return is reached. A binary target was chosen because models predicting a binary variable are more accurate than models predicting a continuous variable, and the accuracy of these models bears directly on their usefulness. In addition, for a special purpose company the main concern is whether or not to purchase the property, so knowing whether a certain level of return will be achieved is enough to make a decision. To ascertain whether 12%, the standard rate of return used in the industry, is a meaningful reference value, we constructed and compared predictive models with the dependent variable computed at different threshold values. The average hit ratio of the model built on the dependent variable computed at the 12% standard rate of return was the best, at 64.60%. To propose an optimal prediction model based on the chosen dependent variable and the seven independent variables, we built and compared prediction models using five methodologies: discriminant analysis, logistic regression, decision tree, artificial neural network, and a genetic algorithm linear model. To do this, 10 sets of training and testing data were extracted using 10-fold validation. After the models were built on these data, the hit ratio of each set was averaged and performance was compared. The average hit ratios of the models built with discriminant analysis, logistic regression, decision tree, artificial neural network, and genetic algorithm linear model were 64.40%, 65.12%, 63.54%, 67.40%, and 60.51%, respectively, so the artificial neural network model performed best. This study shows that using the seven independent variables with an artificial neural network prediction model is effective for the NPL market going forward. The proposed model predicts in advance whether new items will achieve the 12% return, which will help special purpose companies make investment decisions. Furthermore, we anticipate that the NPL market will become more liquid as transactions proceed at appropriate prices.
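The evaluation protocol described above (10 train/test sets, hit ratio averaged over the sets) can be sketched roughly as follows. This is a minimal illustration, not the authors' code. The helper names, the majority-class baseline, and the toy labels are all hypothetical, and the real study used 2,291 NPL items with five different model families.

```python
import random


def k_fold_indices(n, k, seed=0):
    """Shuffle range(n) and deal it into k nearly equal test folds (assumes n >= k)."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]


def mean_hit_ratio(X, y, fit, predict, k=10, seed=0):
    """Average the per-fold hit ratio (accuracy) over k train/test splits."""
    ratios = []
    for test in k_fold_indices(len(y), k, seed):
        held_out = set(test)
        train = [i for i in range(len(y)) if i not in held_out]
        model = fit([X[i] for i in train], [y[i] for i in train])
        hits = sum(predict(model, X[i]) == y[i] for i in test)
        ratios.append(hits / len(test))
    return sum(ratios) / k


# Hypothetical stand-in model: always predict the majority class of the training set.
def fit_majority(X, y):
    return max(set(y), key=y.count)


def predict_majority(model, x):
    return model


# Toy data (made up): 60 items reach the benchmark return, 40 do not.
y = [1] * 60 + [0] * 40
X = [[0.0]] * 100
print(mean_hit_ratio(X, y, fit_majority, predict_majority))
```

Any of the five model families named in the abstract could be plugged in through the `fit` and `predict` arguments; the majority-class baseline is only there to keep the sketch self-contained.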

Relationship Between Above-and Below-Ground Biomass for Norway Spruce (Picea abies) : Estimating Root System Biomass from Breast Height Diameter (독일가문비나무(Picea abies [L.] Karst)의 지상부(地上部)와 지하부(地下部) 생체량(生體量)에 관(關)한 연구(硏究) : 흉고직경(胸高直徑)에 의한 뿌리생체량(生體量) 추정(推定))

  • Lee, Do-Hyung
    • Journal of Korean Society of Forest Science / v.90 no.3 / pp.338-345 / 2001
  • This study was conducted to elucidate the relationship between the root structure and the crown structure of Norway spruce (Picea abies [L.] Karst), and then to obtain regression equations for estimating relative root and needle biomass from tree height and diameter at breast height (DBH), without measuring root and needle biomass directly. The study site was the Barbis stands in the Harz region of central Germany. Five dominant and three co-dominant Norway spruce trees, 30 to 40 years old, were selected. For the above-ground part, tree height, DBH, clear bole length, total needle and branch weight, and cross-sectional and sapwood area at breast height were measured; for the below-ground part, root length, root number, root weight, root cross-sectional area, and related variables were measured separately for horizontal and vertical roots. Most above-ground biomass variables were significantly correlated with the below-ground ones. For total root weight as a function of DBH, the regression equation was Y = 3.56X - 45.94, with a coefficient of determination of 0.96, indicating a strong correlation. The total weight of branches and needles, tree height, and other above-ground variables showed a strong positive relationship with below-ground biomass. The results can be used to estimate below-ground biomass from above-ground variables such as DBH in 30- to 40-year-old Norway spruce stands.
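As a worked illustration of the reported equation, the sketch below evaluates Y = 3.56X - 45.94, reading X as DBH and Y as total root weight. The abstract does not state the units, so none are assumed here, and the function name and the example DBH value are hypothetical. Note that the line crosses zero at about X = 12.9, so it is only meaningful within the DBH range of the sampled trees.

```python
def predict_root_weight(dbh):
    """Evaluate the abstract's regression line Y = 3.56 * X - 45.94.

    X is read as diameter at breast height and Y as total root weight;
    units are not given in the abstract, so none are assumed.
    """
    return 3.56 * dbh - 45.94


# Hypothetical DBH value, chosen only for illustration.
print(round(predict_root_weight(20.0), 2))

# DBH at which the fitted line reaches zero (below this, predictions go negative).
print(round(45.94 / 3.56, 1))
```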

Attribute Selection in Mixed Databases for Data Mining (데이터마이닝을 위한 혼합 데이터베이스에서의 속성선택)

  • Cha, Un-Ok;Heo, Mun-Yeol
    • Proceedings of the Korean Statistical Society Conference / 2003.05a / pp.103-108 / 2003
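  • Attribute selection is widely used among the methods for reducing large databases for data mining. In this paper, we use three attribute selection methods to reduce the number of condition attributes by more than 60%, apply the reduced attribute sets to decision trees and logistic regression models, and compare their efficiency. The three attribute selection methods are MDI, information gain, and ReliefF. The decision tree methods used are QUEST, CART, and C4.5. The classification accuracy of the attribute selection methods was evaluated by 10-fold cross-validation on the Credit Approval database and the German Credit database from the UCI database.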
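Of the three attribute selection methods named in the abstract, information gain is the simplest to illustrate: it scores an attribute by how much splitting on it reduces the entropy of the class labels. The sketch below is a generic textbook version for categorical attributes, not the paper's implementation. The function names and the toy data are hypothetical.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())


def information_gain(attribute_values, labels):
    """Entropy of the labels minus the weighted entropy after splitting on the attribute."""
    n = len(labels)
    groups = {}
    for value, label in zip(attribute_values, labels):
        groups.setdefault(value, []).append(label)
    remainder = sum(len(g) / n * entropy(g) for g in groups.values())
    return entropy(labels) - remainder


# Toy data (made up): one attribute separates the classes perfectly, the other not at all.
labels = ["approve", "approve", "reject", "reject"]
informative = ["high", "high", "low", "low"]
uninformative = ["a", "b", "a", "b"]
print(information_gain(informative, labels))
print(information_gain(uninformative, labels))
```

Ranking attributes by this score and keeping the top fraction is one common way to reach a target reduction such as the 60% mentioned in the abstract.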

A Study on Improving the predict accuracy rate of Hybrid Model Technique Using Error Pattern Modeling : Using Logistic Regression and Discriminant Analysis

  • Cho, Yong-Jun;Hur, Joon
    • Journal of the Korean Data and Information Science Society / v.17 no.2 / pp.269-278 / 2006
  • This paper presents a new hybrid data mining technique that uses error pattern modeling to improve classification accuracy. The proposed method improves classification accuracy by combining two different supervised learning methods. The main algorithm models the error patterns between the two supervised learning methods (e.g., neural networks, decision trees, and logistic regression). The proposed modeling method was applied in a simulation of 10,000 data sets generated from normal and exponential random distributions. The simulation results show that the proposed method outperforms existing methods such as logistic regression and discriminant analysis.

Hybrid Learning Architectures for Advanced Data Mining: An Application to Binary Classification for Fraud Management (개선된 데이터마이닝을 위한 혼합 학습구조의 제시)

  • Kim, Steven H.;Shin, Sung-Woo
    • Journal of Information Technology Application / v.1 / pp.173-211 / 1999
  • The task of classification permeates all walks of life, from business and economics to science and public policy. In this context, nonlinear techniques from artificial intelligence have often proven to be more effective than the methods of classical statistics. The objective of knowledge discovery and data mining is to support decision making through the effective use of information. The automated approach to knowledge discovery is especially useful when dealing with large data sets or complex relationships. For many applications, automated software may find subtle patterns which escape the notice of manual analysis, or whose complexity exceeds the cognitive capabilities of humans. This paper explores the utility of a collaborative learning approach involving integrated models in the preprocessing and postprocessing stages. For instance, a genetic algorithm effects feature-weight optimization in a preprocessing module. Moreover, an inductive tree, artificial neural network (ANN), and k-nearest neighbor (kNN) techniques serve as postprocessing modules. More specifically, the postprocessors act as second-order classifiers which determine the best first-order classifier on a case-by-case basis. In addition to the second-order models, a voting scheme is investigated as a simple, but efficient, postprocessing model. The first-order models consist of statistical and machine learning models such as logistic regression (logit), multivariate discriminant analysis (MDA), ANN, and kNN. The genetic algorithm, inductive decision tree, and voting scheme act as kernel modules for collaborative learning. These ideas are explored against the background of a practical application relating to financial fraud management which exemplifies a binary classification problem.
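The voting scheme mentioned in the abstract can be pictured as a plain majority vote over the outputs of the first-order classifiers. The sketch below is an assumption about the general idea only. The abstract does not specify the voting rule, and the function name, tie-breaking behavior, and example predictions are hypothetical.

```python
from collections import Counter


def majority_vote(predictions):
    """Return the most common label; on a tie, the label seen first wins."""
    return Counter(predictions).most_common(1)[0][0]


# Hypothetical outputs of four first-order models (logit, MDA, ANN, kNN) for one case,
# with 1 meaning "fraud" and 0 meaning "not fraud".
case_predictions = [1, 0, 1, 1]
print(majority_vote(case_predictions))
```

A second-order classifier, as described in the abstract, would instead learn which first-order model to trust for each case; the vote is the simpler alternative.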

Development of Coil Breakage Prediction Model In Cold Rolling Mill

  • Park, Yeong-Bok;Hwang, Hwa-Won
    • Proceedings of the Institute of Control, Robotics and Systems Conference (제어로봇시스템학회:학술대회논문집) / 2005.06a / pp.1343-1346 / 2005
  • In cold rolling mills, coil breakage during the rolling process causes various problems, such as reduced productivity and equipment damage. Recent research has relied on mechanical analysis, such as analysis of roll chattering or strip inclination, and on breakage prevention by detecting cracks in the coil. However, these approaches cover only some kinds of breakage. Predicting coil breakage is very complicated, and breakage occurs rarely. We propose building effective prediction models for coil breakage in the rolling process based on data mining. We propose three prediction models for coil breakage: (1) a decision tree based model, (2) a regression based model, and (3) a neural network based model. To reduce the number of model parameters, we selected important variables related to the occurrence of coil breakage from the coil setup attributes, using decision trees, variable selection, and the judgment of domain experts. We developed these prediction models and chose the best among them using the SEMMA process provided in the SAS E-miner environment. We estimated model accuracy by scoring the prediction models with the posterior probability. We have also developed a software tool that analyzes the data and generates the proposed prediction models either automatically or in a user-driven manner. It also has an effective visualization feature based on PCA (principal component analysis).

A study for improving data mining methods for continuous response variables (연속형 반응변수를 위한 데이터마이닝 방법 성능 향상 연구)

  • Choi, Jin-Soo;Lee, Seok-Hyung;Cho, Hyung-Jun
    • Journal of the Korean Data and Information Science Society / v.21 no.5 / pp.917-926 / 2010
  • Bagging and boosting are known to improve performance in classification problems. A number of researchers have demonstrated the high performance of bagging and boosting experimentally for categorical responses, but not for continuous responses. We study whether bagging and boosting improve data mining methods for continuous responses, such as linear regression, decision trees, and neural networks. An analysis of eight real data sets empirically demonstrates the high performance of bagging and boosting.
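For a continuous response, bagging amounts to fitting the same base learner on bootstrap resamples and averaging the predictions. The sketch below illustrates this with a one-split regression tree (a "stump") on a single numeric feature. It is a generic illustration, not the setup used in the paper. The base learner, the number of models, and the toy data are all assumptions.

```python
import random
import statistics


def fit_stump(xs, ys):
    """Fit a one-split regression tree: choose the threshold minimizing squared error."""
    best = None
    for t in sorted(set(xs))[:-1]:
        left = [y for x, y in zip(xs, ys) if x <= t]
        right = [y for x, y in zip(xs, ys) if x > t]
        lm, rm = statistics.mean(left), statistics.mean(right)
        sse = sum((y - lm) ** 2 for y in left) + sum((y - rm) ** 2 for y in right)
        if best is None or sse < best[0]:
            best = (sse, t, lm, rm)
    if best is None:  # all x values identical: predict the overall mean everywhere
        m = statistics.mean(ys)
        return (float("inf"), m, m)
    return best[1:]


def predict_stump(stump, x):
    threshold, left_mean, right_mean = stump
    return left_mean if x <= threshold else right_mean


def bagged_fit(xs, ys, n_models=25, seed=0):
    """Fit one stump per bootstrap resample (sampling with replacement)."""
    rng = random.Random(seed)
    n = len(xs)
    models = []
    for _ in range(n_models):
        sample = [rng.randrange(n) for _ in range(n)]
        models.append(fit_stump([xs[i] for i in sample], [ys[i] for i in sample]))
    return models


def bagged_predict(models, x):
    """Average the predictions of all bootstrap models."""
    return statistics.mean(predict_stump(m, x) for m in models)


# Toy data (made up): a step from 0 to 10 halfway along the x axis.
xs = list(range(10))
ys = [0] * 5 + [10] * 5
models = bagged_fit(xs, ys)
print(bagged_predict(models, 0), bagged_predict(models, 9))
```

Because each stump sees a slightly different resample, the averaged prediction is smoother near the step than any single stump, which is the variance-reduction effect bagging relies on.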

Discovering Relationships between Skin Type and Life Style Using Data Mining Techniques: A Case Study of Korea

  • Kim, Taeheung;Ha, Jihyun;Lee, Jong-Seok;Oh, Younhak;Cho, Yong Ju
    • Industrial Engineering and Management Systems / v.15 no.1 / pp.110-121 / 2016
  • With the growing interest in skincare and maintenance, there are increasing numbers of studies on the classification of skin type and the factors influencing each type. This study presents a novel data mining methodology for determining the relationships between skin type, lifestyle, and patterns of cosmetic use. Eight skin-specific factors (moisture, sebum in the U-zone (both cheeks), sebum in the T-zone (forehead, nose, and chin), pores, melanin, wrinkles, acne, and hemoglobin) were measured in 1,246 subjects living in South Korea, in conjunction with a questionnaire survey of their lifestyles and patterns of cosmetic use. Using various multivariate statistical methods and data mining techniques, we classified the skin types based on the skin-specific values, determined the relationship between skin type and lifestyle, and sorted the subjects into clusters accordingly. Logistic regression analysis revealed gender-related differences in the skin; therefore, separate analyses were performed for males and females. Using the Gaussian Mixture Modeling (GMM) technique, we classified the subjects by skin type (two male and four female types). Using ANOVA and decision tree techniques, we attempted to characterize the relationship between each skin type and the subjects' lifestyles. Menstruation, eating habits, stress, and smoking were identified as the major factors affecting the skin.

Selecting the Best Prediction Model for Readmission

  • Lee, Eun-Whan
    • Journal of Preventive Medicine and Public Health / v.45 no.4 / pp.259-266 / 2012
  • Objectives: This study aims to determine the risk factors predicting rehospitalization by comparing three models and selecting the most successful model. Methods: In order to predict the risk of rehospitalization within 28 days after discharge, 11 951 inpatients were recruited into this study between January and December 2009. Predictive models were constructed with three methods, logistic regression analysis, a decision tree, and a neural network, and the models were compared and evaluated in light of their misclassification rate, root asymptotic standard error, lift chart, and receiver operating characteristic curve. Results: The decision tree was selected as the final model. The risk of rehospitalization was higher when the length of stay (LOS) was less than 2 days, route of admission was through the out-patient department (OPD), medical department was in internal medicine, 10th revision of the International Classification of Diseases code was neoplasm, LOS was relatively shorter, and the frequency of OPD visit was greater. Conclusions: When a patient is to be discharged within 2 days, the appropriateness of discharge should be considered, with special concern of undiscovered complications and co-morbidities. In particular, if the patient is admitted through the OPD, any suspected disease should be appropriately examined and prompt outcomes of tests should be secured. Moreover, for patients of internal medicine practitioners, co-morbidity and complications caused by chronic illness should be given greater attention.