• Title/Summary/Keyword: tree-based models

Development and application of prediction model of hyperlipidemia using SVM and meta-learning algorithm (SVM과 meta-learning algorithm을 이용한 고지혈증 유병 예측모형 개발과 활용)

  • Lee, Seulki;Shin, Taeksoo
    • Journal of Intelligence and Information Systems / v.24 no.2 / pp.111-124 / 2018
  • This study aims to develop a classification model for predicting the occurrence of hyperlipidemia, a chronic disease. Prior studies applying data mining techniques to disease prediction fall into two groups: studies that design models for predicting cardiovascular disease and studies that compare the results of disease prediction research. In the international literature, studies predicting cardiovascular disease predominate among those that apply data mining techniques to disease prediction. Domestic studies do not differ greatly from international ones, but they have mainly focused on hypertension and diabetes. Because hyperlipidemia is a chronic disease of high importance alongside hypertension and diabetes, this study selected hyperlipidemia as the disease to be analyzed. We developed a model for predicting hyperlipidemia using SVM and meta-learning algorithms, which are known to have excellent predictive power. To this end, we used the 2012 data set of the Korea Health Panel. The Korea Health Panel produces basic data on the level of health expenditure, health level and health behavior, and has conducted an annual survey since 2008. In this study, 1,088 patients with hyperlipidemia were randomly selected from the hospitalization, outpatient, emergency, and chronic disease data of the 2012 Korea Health Panel, and 1,088 non-patients were also randomly extracted, for a total of 2,176 subjects. Three methods were used to select input variables for predicting hyperlipidemia. First, a stepwise method was performed using logistic regression. Among the 17 variables, the categorical variables (except for length of smoking) were expressed as dummy variables, each treated as a separate variable relative to the reference group, and these variables were analyzed.
Six variables (age, BMI, education level, marital status, smoking status, gender), excluding income level and smoking period, were selected at a significance level of 0.1. Second, C4.5, a decision tree algorithm, was used; the significant input variables were age, smoking status, and education level. Finally, genetic algorithms were used. For SVM, the genetic algorithm selected six input variables (age, marital status, education level, economic activity, smoking period, and physical activity status); for the artificial neural network, it selected three (age, marital status, and education level). Based on the selected variables, we compared SVM, the meta-learning algorithms and other prediction models for hyperlipidemia patients, and compared classification performance using TP rate and precision. The main results of the analysis are as follows. First, the accuracy of the SVM was 88.4% and the accuracy of the artificial neural network was 86.7%. Second, the accuracy of classification models using the input variables selected through the stepwise method was slightly higher than that of classification models using all variables. Third, the precision of the artificial neural network was higher than that of SVM when only the three input variables selected by the decision tree were used. For classification models based on the input variables selected through the genetic algorithm, the classification accuracy of SVM was 88.5% and that of the artificial neural network was 87.9%. Finally, stacking, the meta-learning algorithm proposed in this study, performed best when the predicted outputs of SVM and MLP were used as input variables of an SVM meta classifier. The purpose of this study was to predict hyperlipidemia, one of the representative chronic diseases.
To do this, we used SVM and meta-learning algorithms, which are known to have high accuracy. As a result, the classification accuracy for hyperlipidemia of stacking as a meta learner was higher than that of the other meta-learning algorithms. However, the predictive performance of the meta-learning algorithm proposed in this study is the same as that of SVM, the best-performing single model (88.6%). The limitations of this study are as follows. First, various variable selection methods were tried, but most variables used in the study were categorical dummy variables. With a large number of categorical variables, models such as decision trees may be better suited to the data than general models such as neural networks, so the results may differ if continuous variables are used. Despite these limitations, this study is significant in predicting hyperlipidemia with hybrid models such as meta-learning algorithms, which have not been studied previously. The improvement in model accuracy obtained by applying various variable selection techniques is also meaningful. In addition, we expect the proposed model to be effective for the prevention and management of hyperlipidemia.
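
The abstract compares models by accuracy, TP rate and precision. As a reminder of how these relate, here is a minimal sketch that derives all three from confusion-matrix counts. The counts are hypothetical values chosen only to match the balanced sample of 1,088 patients and 1,088 non-patients; they are not results from the paper, and `classification_metrics` is an illustrative helper name.

```python
def classification_metrics(tp, fp, fn, tn):
    """Return (accuracy, TP rate, precision) from confusion-matrix counts."""
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total   # share of all subjects classified correctly
    tp_rate = tp / (tp + fn)       # share of actual patients that were detected
    precision = tp / (tp + fp)     # share of predicted patients that are patients
    return accuracy, tp_rate, precision


# Hypothetical counts: 950 + 138 = 1,088 patients, 120 + 968 = 1,088 non-patients.
acc, tpr, prec = classification_metrics(tp=950, fp=120, fn=138, tn=968)
print(f"accuracy={acc:.3f}, TP rate={tpr:.3f}, precision={prec:.3f}")
```

Because the sample is balanced, accuracy here is simply the average of the TP rate and the true-negative rate, which is one reason the abstract can report accuracy alongside TP rate without the two diverging sharply.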

Design and Evaluation of ANFIS-based Classification Model (ANFIS 기반 분류모형의 설계 및 성능평가)

  • Song, Hee-Seok;Kim, Jae-Kyeong
    • Journal of Intelligence and Information Systems / v.15 no.3 / pp.151-165 / 2009
  • A fuzzy neural network is an integrated model of an artificial neural network and a fuzzy system, and it has been successfully applied in control and forecasting. Recently, ANFIS (Adaptive Network-based Fuzzy Inference System) has attracted wide attention among fuzzy neural network models because of its outstanding accuracy in control and forecasting. We design a new classification model based on ANFIS and evaluate its classification accuracy. Comparing experimental results, we found that the ANFIS-based classification model has higher classification accuracy than an existing classification model, the C5.0 decision tree.

Linear Spectral Mixture Analysis of Landsat Imagery for Wetland land-Cover Classification in Paldang Reservoir and Vicinity

  • Kim, Sang-Wook;Park, Chong-Hwa
    • Korean Journal of Remote Sensing / v.20 no.3 / pp.197-205 / 2004
  • Wetlands are lands with a mixture of water, herbaceous or woody vegetation, and wet soil. Linear spectral mixture analysis (LSMA) is one of the most commonly used methods for handling the spectral mixture problem. This study tests whether LSMA is an enhanced routine for classifying wetland land cover in Paldang Reservoir and its vicinity (Paldang Reservoir) using Landsat TM and ETM+ imagery. In the LSMA process, reference endmembers were derived from scatter plots of Landsat bands 3, 4 and 5, and a series of endmember models were developed based on green vegetation (GV), soil and water endmembers, which are the main indicators of wetlands. To account for the phenological characteristics of Paldang Reservoir, the soil endmember was subdivided into bright and dark soil endmembers in spring, and the GV endmember was subdivided into GV tree and GV herbaceous endmembers in fall. We found that LSMA fractions improved the classification accuracy of wetland land cover. Four-endmember models provided better GV and soil discrimination, and the root mean squared (RMS) errors were 0.011 and 0.0039 in spring and fall, respectively. Phenologically, a fall image is more appropriate than a spring image for classifying wetland land cover. The classification using the four endmember fractions of a fall image reached 85.2 and 74.2 percent producer's and user's accuracy, respectively. This study shows that this routine will be a useful tool for identifying and monitoring the status of wetlands in Paldang Reservoir.
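
The unmixing step described above can be sketched as a least-squares problem. This is a minimal illustration, not the authors' procedure: the endmember reflectances are invented values, the `unmix` helper and its `weight` parameter are hypothetical, and the sum-to-one constraint is imposed softly through one heavily weighted extra equation. Non-negativity of the fractions is not enforced here.

```python
import numpy as np


def unmix(pixel, endmembers, weight=1e3):
    """Estimate endmember fractions for one pixel by linear least squares.

    endmembers has shape (bands, n_endmembers). An extra, heavily weighted
    row pushes the fractions to sum to one. Returns (fractions, rms), where
    rms is the root mean squared residual over the spectral bands.
    """
    n = endmembers.shape[1]
    A = np.vstack([endmembers, weight * np.ones((1, n))])
    b = np.append(pixel, weight)
    fractions, *_ = np.linalg.lstsq(A, b, rcond=None)
    residual = pixel - endmembers @ fractions
    rms = float(np.sqrt(np.mean(residual ** 2)))
    return fractions, rms


# Hypothetical reflectances: rows are bands 3, 4, 5; columns are GV, soil, water.
E = np.array([[0.05, 0.20, 0.03],
              [0.45, 0.30, 0.02],
              [0.20, 0.40, 0.01]])
true_fractions = np.array([0.5, 0.3, 0.2])
pixel = E @ true_fractions          # synthetic mixed pixel

fractions, rms = unmix(pixel, E)
print(fractions, rms)
```

On real imagery the pixel is never an exact mixture, so the RMS residual is nonzero; the per-image RMS errors quoted in the abstract are summaries of that kind of residual.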

Implementing Linear Models in Genetic Programming to Utilize Accumulated Data in Shipbuilding (조선분야의 축적된 데이터 활용을 위한 유전적프로그래밍에서의 선형(Linear) 모델 개발)

  • Lee, Kyung-Ho;Yeun, Yun-Seog;Yang, Young-Soon
    • Journal of the Society of Naval Architects of Korea / v.42 no.5 s.143 / pp.534-541 / 2005
  • Korean shipyards have accumulated a great amount of data, but they lack appropriate tools for using the data in practical work. Engineering data embodies experts' experience and know-how, so it is very useful to extract knowledge or information from accumulated data using data mining techniques. This paper treats an evolutionary computation method based on genetic programming (GP), which can be one of the components for realizing data mining. The paper deals with linear models of GP for regression or approximation problems when the given learning samples are not sufficient. The linear model, which is a function of unknown parameters, is built by extracting all possible base functions from the standard GP tree using a symbolic processing algorithm. In addition to a standard linear model consisting of mathematical functions, the paper considers a variant linear model that can be built using a low-order Taylor series and converted into the standard form of a polynomial. The suggested model can be used as a design tool to predict design parameters from a small amount of accumulated data.
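
Once a set of base functions is available, fitting a model that is linear in its unknown parameters reduces to ordinary least squares. The sketch below shows only that final fitting step. The base functions are hypothetical stand-ins for whatever would be extracted from a GP tree, the data is synthetic, and `fit_linear_model` is an illustrative name rather than anything from the paper.

```python
import numpy as np

# Hypothetical base functions, standing in for those extracted from a GP tree.
base_functions = [
    lambda x: np.ones_like(x),   # constant term
    lambda x: x,                 # linear term
    lambda x: x ** 2,            # quadratic term
]


def fit_linear_model(x, y, funcs):
    """Least-squares coefficients of y ~ sum_k c_k * funcs[k](x)."""
    design = np.column_stack([f(x) for f in funcs])
    coeffs, *_ = np.linalg.lstsq(design, y, rcond=None)
    return coeffs


# A small synthetic sample, mimicking the "few learning samples" setting.
x = np.linspace(0.0, 2.0, 8)
y = 1.0 + 2.0 * x - 0.5 * x ** 2

coeffs = fit_linear_model(x, y, base_functions)
print(coeffs)
```

Because the parameters enter linearly, the fit has a closed-form solution and needs no iterative search, which is part of the appeal when only a small sample is available.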

Finding a plan to improve recognition rate using classification analysis

  • Kim, SeungJae;Kim, SungHwan
    • International journal of advanced smart convergence / v.9 no.4 / pp.184-191 / 2020
  • With the emergence of the 4th Industrial Revolution, the core technologies that will lead it, such as AI (artificial intelligence), big data, and the Internet of Things (IoT), have become a topic of interest for the general public. In particular, there are growing attempts to present future visions by building new models through big data analysis of data collected in a specific field, and by inferring and predicting new values with those models. To obtain reliable and sophisticated statistics from big data analysis, it is necessary to analyze the meaning of each variable, the correlation between the variables, and multicollinearity. If, from the beginning, the data is classified in a way that does not match the hypothesis test, the results will be unreliable even if the analysis is performed well. In other words, prior to big data analysis, it is necessary to ensure that the data is well classified according to the purpose of the analysis. Therefore, in this study, data is classified using a decision tree technique and a random forest technique, which are classification methods in machine learning, a technology that implements AI. By evaluating the degree of classification of the data, we try to find a way to improve the classification and analysis rate of the data.
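
The abstract does not state which splitting criterion its decision trees use. As an illustration of how a decision tree partitions data, the following sketch finds the best single threshold on one numeric feature using Gini impurity, which is one common criterion and an assumption here. The data and the helper names `gini` and `best_threshold` are hypothetical.

```python
def gini(labels):
    """Gini impurity of a list of class labels (0.0 means a pure node)."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((labels.count(c) / n) ** 2 for c in set(labels))


def best_threshold(values, labels):
    """Return (threshold, weighted impurity) of the best split on one feature.

    Candidate thresholds are midpoints between consecutive distinct values.
    If no split improves on the unsplit node, the threshold is None.
    """
    best = (None, gini(labels))
    candidates = sorted(set(values))
    for low, high in zip(candidates, candidates[1:]):
        t = (low + high) / 2
        left = [y for x, y in zip(values, labels) if x <= t]
        right = [y for x, y in zip(values, labels) if x > t]
        score = (len(left) * gini(left) + len(right) * gini(right)) / len(labels)
        if score < best[1]:
            best = (t, score)
    return best


# Hypothetical toy data with two well-separated classes.
values = [1.0, 2.0, 3.0, 10.0, 11.0, 12.0]
labels = ["a", "a", "a", "b", "b", "b"]
threshold, impurity = best_threshold(values, labels)
print(threshold, impurity)
```

A full decision tree applies this search recursively over all features, and a random forest averages many such trees grown on bootstrap samples with random feature subsets.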

Identification of shear transfer mechanisms in RC beams by using machine-learning technique

  • Zhang, Wei;Lee, Deuckhang;Ju, Hyunjin;Wang, Lei
    • Computers and Concrete / v.30 no.1 / pp.43-74 / 2022
  • Machine learning techniques have recently opened new opportunities to identify the complex shear transfer mechanisms of reinforced concrete (RC) beam members. This study employed 1224 shear test specimens to train decision tree-based machine learning (ML) programs, which affirmed strong correlations between the shear capacity of RC beams and key input parameters. In addition, the shear contributions of concrete and shear reinforcement (the so-called Vc and Vs) were identified by establishing three independent ML models trained under different strategies with various combinations of datasets. Detailed parametric studies were then conducted using the well-trained ML models. It appeared that the presence of shear reinforcement can make the predicted shear contribution from concrete in RC beams larger than the pure shear contribution of concrete, owing to the intervention effect between shear reinforcement and concrete. The size effect also had a significant impact on the shear contribution of concrete (Vc), whereas the addition of shear reinforcement can effectively mitigate the size effect. It was also found that concrete tends to be the primary source of shear resistance when the shear span-depth ratio a/d<1.0, while shear reinforcement becomes the primary source of shear resistance when a/d>2.0.

Prediction of Larix kaempferi Stand Growth in Gangwon, Korea, Using Machine Learning Algorithms

  • Hyo-Bin Ji;Jin-Woo Park;Jung-Kee Choi
    • Journal of Forest and Environmental Science / v.39 no.4 / pp.195-202 / 2023
  • In this study, we compared and evaluated the accuracy and predictive performance of machine learning algorithms for estimating the growth of individual Larix kaempferi trees in Gangwon Province, Korea. We employed linear regression, random forest, XGBoost, and LightGBM algorithms to predict tree growth using monitoring data organized by thinning intensity. We compared the goodness-of-fit of these models using the coefficient of determination ($R^2$), mean absolute error (MAE), and root mean square error (RMSE). XGBoost provided the highest goodness-of-fit, with an $R^2$ value of 0.62 across all thinning intensities, and also yielded the lowest MAE and RMSE, indicating the best model fit. When predicting the growth volume of individual trees after 3 years using the XGBoost model, the agreement was exceptionally high, reaching approximately 97% for all stand sites across the different thinning intensities. Notably, in non-thinned plots, the predicted volumes were approximately 2.1 $m^3$ lower than the actual volumes; however, the agreement remained high at approximately 99.5%. These findings will contribute to the development of growth prediction models for individual trees using machine learning algorithms.
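
The three goodness-of-fit metrics named in the abstract have standard definitions, which the sketch below computes directly. The growth values are hypothetical and `regression_metrics` is an illustrative helper name; nothing here reproduces the paper's data.

```python
import math


def regression_metrics(actual, predicted):
    """Return (R^2, MAE, RMSE) for paired actual and predicted values."""
    n = len(actual)
    mean_actual = sum(actual) / n
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))  # residual sum of squares
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)           # total sum of squares
    r2 = 1.0 - ss_res / ss_tot
    mae = sum(abs(a - p) for a, p in zip(actual, predicted)) / n
    rmse = math.sqrt(ss_res / n)
    return r2, mae, rmse


# Hypothetical per-tree growth values, for illustration only.
actual = [0.10, 0.20, 0.30, 0.40]
predicted = [0.12, 0.18, 0.33, 0.37]
r2, mae, rmse = regression_metrics(actual, predicted)
print(f"R2={r2:.3f}, MAE={mae:.4f}, RMSE={rmse:.4f}")
```

RMSE is never smaller than MAE, and the gap between them widens when a few predictions are far off, so reporting both gives a rough view of how errors are distributed.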

A Detection of Novel Habitats of Abies Koreana by Using Species Distribution Models(SDMs) and Its Application for Plant Conservation (종 분포 모형을 활용한 새로운 구상나무 서식지 탐색, 그리고 식물보전 활용)

  • Kim, Nam-Shin;Han, DongUk;Cha, Jin-Yeol;Park, Yong-Su;Cho, Hyeun-Je;Kwon, Hye-Jin;Cho, Yong-Chan;Oh, Seung-Hwan;Lee, Chang-Seok
    • Journal of the Korean Society of Environmental Restoration Technology / v.18 no.6 / pp.135-149 / 2015
  • Korean fir (Abies koreana E.H.Wilson 1920), a tree species endemic to the Korean peninsula, is considered vulnerable and endangered by recent rapid environmental changes such as land use and climate change. Activities and efforts to find natural habitats of Korean fir for conservation of the species and its habitats have been limited. In this study, by applying SDMs (Species Distribution Models) based on the climate and topographic factors of Korean fir, we developed a predicted distribution model for Korean fir and explored novel natural habitats. We found two novel habitats of Korean fir, on Mt. Shinbulsan in the Youngnam region and on Mt. Songnisan; the former was the warmest ($13^{\circ}C$ in annual mean temperature), the driest (1,200mm~1,600mm in annual rainfall) and a relatively low-altitude environment among Korean fir habitats in Korea. The SDM results did not include the mountain areas of Gangwon-do, the habitats of A. nephrolepis, because the contributions of key habitat environment factors (summer rainfall, winter mean temperature and winter rainfall) differed between A. koreana and A. nephrolepis. Our results suggest that other distribution models for Korean fir need modification. The novel habitat on Mt. Shinbulsan showed the same habitat affinity of the species, for ridge and rocky sites, as other habitats in Korea. Our results also suggest potential areas for creating alternative habitats for Korean fir through species reintroduction at the landscape and ecosystem level.

Predicting Surgical Complications in Adult Patients Undergoing Anterior Cervical Discectomy and Fusion Using Machine Learning

  • Arvind, Varun;Kim, Jun S.;Oermann, Eric K.;Kaji, Deepak;Cho, Samuel K.
    • Neurospine / v.15 no.4 / pp.329-337 / 2018
  • Objective: Machine learning algorithms excel at leveraging big data to identify complex patterns that can be used to aid in clinical decision-making. The objective of this study is to demonstrate the performance of machine learning models in predicting postoperative complications following anterior cervical discectomy and fusion (ACDF). Methods: Artificial neural network (ANN), logistic regression (LR), support vector machine (SVM), and random forest decision tree (RF) models were trained on a multicenter data set of patients undergoing ACDF to predict surgical complications based on readily available patient data. Following training, these models were compared to the predictive capability of American Society of Anesthesiologists (ASA) physical status classification. Results: A total of 20,879 patients were identified as having undergone ACDF. Following exclusion criteria, patients were divided into 14,615 patients for training and 6,264 for testing data sets. ANN and LR consistently outperformed ASA physical status classification in predicting every complication (p < 0.05). The ANN outperformed LR in predicting venous thromboembolism, wound complication, and mortality (p < 0.05). The SVM and RF models were no better than random chance at predicting any of the postoperative complications (p < 0.05). Conclusion: ANN and LR algorithms outperform ASA physical status classification for predicting individual postoperative complications. Additionally, neural networks have greater sensitivity than LR when predicting mortality and wound complications. With the growing size of medical data, training machine learning models on these large datasets promises to improve risk prognostication, and their ability to learn continuously makes them excellent tools in complex clinical scenarios.

Hybrid machine learning with HHO method for estimating ultimate shear strength of both rectangular and circular RC columns

  • Quang-Viet Vu;Van-Thanh Pham;Dai-Nhan Le;Zhengyi Kong;George Papazafeiropoulos;Viet-Ngoc Pham
    • Steel and Composite Structures / v.52 no.2 / pp.145-163 / 2024
  • This paper presents six novel hybrid machine learning (ML) models that combine support vector machines (SVM), Decision Tree (DT), Random Forest (RF), Gradient Boosting (GB), extreme gradient boosting (XGB), and categorical gradient boosting (CGB) with the Harris Hawks Optimization (HHO) algorithm. These models, namely HHO-SVM, HHO-DT, HHO-RF, HHO-GB, HHO-XGB, and HHO-CGB, are designed to predict the ultimate strength of both rectangular and circular reinforced concrete (RC) columns. The prediction models are established using a comprehensive database consisting of 325 experimental data points for rectangular columns and 172 for circular columns. The ML model hyperparameters are optimized through a combination of cross-validation and HHO. The performance of the hybrid ML models is evaluated and compared using various metrics, which identify the HHO-CGB model as the top-performing model for predicting the ultimate shear strength of both rectangular and circular RC columns. The mean R-value and mean a20-index are relatively high, reaching 0.991 and 0.959, respectively, while the mean absolute error and root mean square error are low (10.302 kN and 27.954 kN, respectively). A further comparison with four existing formulas validates the efficiency of the proposed HHO-CGB model. The Shapley Additive Explanations method is applied to analyze the contribution of each variable to the output of the HHO-CGB model, providing insights into the local and global influence of variables. The analysis reveals that the depth of the column, the length of the column, and axial loading exert the most significant influence on the ultimate shear strength of RC columns. A user-friendly graphical interface tool is then developed based on HHO-CGB to facilitate practical and cost-effective usage.
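
The a20-index reported above is less familiar than MAE or RMSE. As it is commonly defined, it is the share of samples whose measured-to-predicted ratio falls within 20% of unity; the abstract does not spell out its definition, so the direction of the ratio and the inclusive bounds below are assumptions. The strength values are hypothetical and `a20_index` is an illustrative helper name.

```python
def a20_index(measured, predicted):
    """Share of samples whose measured/predicted ratio lies in [0.8, 1.2]."""
    ratios = [m / p for m, p in zip(measured, predicted)]
    within = sum(1 for r in ratios if 0.8 <= r <= 1.2)
    return within / len(ratios)


# Hypothetical shear strengths in kN, for illustration only.
measured = [100.0, 250.0, 400.0, 520.0]
predicted = [110.0, 180.0, 390.0, 500.0]
score = a20_index(measured, predicted)
print(score)
```

An index near 1 means almost every prediction is within 20% of the test value, which complements the error metrics by describing relative rather than absolute deviation.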