• Title/Summary/Keyword: random forests regression

Ensemble approach for improving prediction in kernel regression and classification

  • Han, Sunwoo;Hwang, Seongyun;Lee, Seokho
    • Communications for Statistical Applications and Methods / v.23 no.4 / pp.355-362 / 2016
  • Ensemble methods often improve the predictive ability of various models by combining multiple weak learners and reducing the variability of the final predictive model. In this work, we demonstrate that ensemble methods also enhance prediction accuracy under kernel ridge regression and kernel logistic regression classification. We apply bagging and random forests to these two kernel-based predictive models and present the procedure by which bagging and random forests can be embedded in them. Our proposals are tested on numerous synthetic and real datasets and compared with plain kernel-based predictive models and their subsampling approach. Numerical studies demonstrate that the ensemble approach outperforms plain kernel-based predictive models.
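
To make the idea of embedding bagging in kernel ridge regression concrete, here is a minimal sketch. It is not the authors' procedure: the Gaussian kernel, the ridge penalty `lam`, the bandwidth `gamma`, the number of models, and the synthetic sine data are all assumptions chosen for illustration, and the random-forests variant described in the abstract is not reproduced.

```python
import numpy as np


def rbf_kernel(A, B, gamma=1.0):
    """Gaussian (RBF) kernel matrix between the rows of A and the rows of B."""
    sq_dists = (
        np.sum(A**2, axis=1)[:, None]
        + np.sum(B**2, axis=1)[None, :]
        - 2.0 * A @ B.T
    )
    return np.exp(-gamma * np.maximum(sq_dists, 0.0))


def fit_krr(X, y, lam=0.1, gamma=1.0):
    """Kernel ridge regression: solve (K + lam * I) alpha = y."""
    K = rbf_kernel(X, X, gamma)
    alpha = np.linalg.solve(K + lam * np.eye(len(X)), y)
    return X, alpha


def predict_krr(model, X_new, gamma=1.0):
    """Predict with a fitted (X_train, alpha) pair."""
    X_train, alpha = model
    return rbf_kernel(X_new, X_train, gamma) @ alpha


def bagged_krr_predict(X, y, X_new, n_models=25, lam=0.1, gamma=1.0, seed=0):
    """Average the predictions of KRR models fitted on bootstrap resamples."""
    rng = np.random.default_rng(seed)
    preds = np.zeros((n_models, len(X_new)))
    for b in range(n_models):
        idx = rng.integers(0, len(X), size=len(X))  # rows drawn with replacement
        model = fit_krr(X[idx], y[idx], lam, gamma)
        preds[b] = predict_krr(model, X_new, gamma)
    return preds.mean(axis=0)


# Synthetic data, made up for illustration: a noisy sine curve.
rng = np.random.default_rng(1)
X_demo = rng.uniform(-3, 3, size=(80, 1))
y_demo = np.sin(X_demo[:, 0]) + 0.1 * rng.standard_normal(80)
X_grid = np.linspace(-2.5, 2.5, 50)[:, None]
y_hat = bagged_krr_predict(X_demo, y_demo, X_grid)
```

Averaging over bootstrap fits is what reduces the variability of the final predictor; a random-forests-style variant would presumably also randomize the predictors each model sees, which this sketch leaves out.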

Study on the ensemble methods with kernel ridge regression

  • Kim, Sun-Hwa;Cho, Dae-Hyeon;Seok, Kyung-Ha
    • Journal of the Korean Data and Information Science Society / v.23 no.2 / pp.375-383 / 2012
  • The purpose of ensemble methods is to increase prediction accuracy by combining many classifiers. Recent studies have shown that random forests and forward stagewise regression achieve good accuracy in classification problems. However, they have large prediction errors near the separation boundary because they use decision trees as base learners. In this study, we use kernel ridge regression instead of decision trees in random forests and boosting. The usefulness of the proposed ensemble methods is shown by simulation results on the prostate cancer and Boston housing data.

A review of tree-based Bayesian methods

  • Linero, Antonio R.
    • Communications for Statistical Applications and Methods / v.24 no.6 / pp.543-559 / 2017
  • Tree-based regression and classification ensembles form a standard part of the data-science toolkit. Many commonly used methods take an algorithmic view, proposing greedy methods for constructing decision trees; examples include the classification and regression trees algorithm, boosted decision trees, and random forests. Recent history has seen a surge of interest in Bayesian techniques for constructing decision tree ensembles, with these methods frequently outperforming their algorithmic counterparts. The goal of this article is to survey the landscape surrounding Bayesian decision tree methods, and to discuss recent modeling and computational developments. We provide connections between Bayesian tree-based methods and existing machine learning techniques, and outline several recent theoretical developments establishing frequentist consistency and rates of convergence for the posterior distribution. The methodology we present is applicable for a wide variety of statistical tasks including regression, classification, modeling of count data, and many others. We illustrate the methodology on both simulated and real datasets.

Application of Random Forests to Association Studies Using Mitochondrial Single Nucleotide Polymorphisms

  • Kim, Yoon-Hee;Kim, Ho
    • Genomics & Informatics / v.5 no.4 / pp.168-173 / 2007
  • In previous nuclear genomic association studies, Random Forests (RF), one of several up-to-date machine learning methods, has been used successfully to generate evidence of association of genetic polymorphisms with diseases or other phenotypes. Compared with traditional statistical methods, such as chi-square tests or logistic regression models, the RF method has advantages in handling large numbers of predictor variables and examining gene-gene interactions without a specific model. Here, we applied the RF method to find the association between mitochondrial single nucleotide polymorphisms (mtSNPs) and diabetes risk. The results from a chi-square test validated the use of RF for association studies using mtDNA. Variable importance indexes such as the Gini index and the mean decrease in accuracy performed well compared with chi-square tests in finding mtSNPs associated with a real disease example, type 2 diabetes.

Usage of coot optimization-based random forests analysis for determining the shallow foundation settlement

  • Yi, Han;Xingliang, Jiang;Ye, Wang;Hui, Wang
    • Geomechanics and Engineering / v.32 no.3 / pp.271-291 / 2023
  • Settlement estimation in cohesive materials is a crucial topic because the complexity of cohesive soil texture means the problem can be solved only approximately by substitute solutions. The goal of this research was to apply recently developed machine learning techniques as effective methods for predicting the settlement (Sm) of shallow foundations on cohesive soils. The models hybridize support vector regression (SVR) and random forests (RF) with the coot optimization algorithm (COM) and the black widow optimization algorithm (BWOA). The results indicate that all developed models accurately simulated Sm, with R2 values above 0.979 and 0.9765 for the training and test phases, respectively. This indicates high efficiency and a good correlation between the experimental and simulated Sm. The models' results outperformed those of ANFIS-PSO, and the COM-RF results were considerably better than those reported in the literature. Analysis of the developed models using various error criteria, Taylor diagrams, uncertainty analyses, and error distributions led to the conclusion that the recommended COM-RF was the best-performing approach for forecasting Sm of shallow foundations, while the other techniques were also reliable.

Generalized Partially Linear Additive Models for Credit Scoring

  • Shim, Ju-Hyun;Lee, Young-K.
    • The Korean Journal of Applied Statistics / v.24 no.4 / pp.587-595 / 2011
  • Credit scoring is an objective and automatic system to assess the credit risk of each customer. The logistic regression model is one of the popular methods of credit scoring to predict the default probability; however, it may not detect possible nonlinear features of predictors despite the advantages of interpretability and low computation cost. In this paper, we propose to use a generalized partially linear model as an alternative to logistic regression. We also introduce modern ensemble technologies such as bagging, boosting and random forests. We compare these methods via a simulation study and illustrate them through a German credit dataset.

Biological Feature Selection and Disease Gene Identification using New Stepwise Random Forests

  • Hwang, Wook-Yeon
    • Industrial Engineering and Management Systems / v.16 no.1 / pp.64-79 / 2017
  • Identifying disease genes from the human genome is a critical task in biomedical research. Important biological features that distinguish disease genes from non-disease genes have mainly been selected using traditional feature selection approaches. However, these approaches unnecessarily consider many unimportant biological features. As a result, although some existing classification techniques have been applied to disease gene identification, the prediction performance has not been satisfactory. A small set of the most important biological features can enhance the accuracy of disease gene identification, as well as provide potentially useful knowledge for biologists or clinicians, who can further investigate the selected biological features and the potential disease genes. In this paper, we propose a new stepwise random forests (SRF) approach for biological feature selection and disease gene identification. The SRF approach consists of two stages. In the first stage, only important biological features are iteratively selected in a forward selection manner based on one-dimensional random forest regression, where the updated residual vector is considered as the current response vector. We can then determine a small set of important biological features. In the second stage, random forests classification with regard to the selected biological features is applied to identify disease genes. Our extensive experiments show that the proposed SRF approach outperforms the existing feature selection and classification techniques in terms of biological feature selection and disease gene identification.
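
The first stage described above (forward selection on one feature at a time, with the residual becoming the next response) can be sketched roughly as follows. This is only an illustration under stated assumptions, not the SRF algorithm itself: bagged decision stumps stand in for one-dimensional random forest regression, the selection criterion (smallest residual sum of squares) is assumed, the function names are hypothetical, and the data are synthetic. The second stage (random forests classification on the selected features) is omitted.

```python
import numpy as np


def fit_stump(x, r):
    """Best single-split regression stump of response r on one feature x."""
    order = np.argsort(x)
    xs, rs = x[order], r[order]
    n = len(xs)
    csum = np.cumsum(rs)
    k = np.arange(1, n)  # size of the left group for each candidate split
    left_mean = csum[:-1] / k
    right_mean = (csum[-1] - csum[:-1]) / (n - k)
    # Maximising this quantity is equivalent to minimising the split's SSE.
    gain = k * left_mean**2 + (n - k) * right_mean**2
    gain = np.where(xs[1:] > xs[:-1], gain, -np.inf)  # no split between ties
    i = int(np.argmax(gain))
    return 0.5 * (xs[i] + xs[i + 1]), left_mean[i], right_mean[i]


def bagged_stump_predict(x, r, n_trees, rng):
    """In-sample prediction averaged over stumps fitted on bootstrap resamples."""
    n = len(x)
    pred = np.zeros(n)
    for _ in range(n_trees):
        idx = rng.integers(0, n, size=n)
        threshold, low, high = fit_stump(x[idx], r[idx])
        pred += np.where(x <= threshold, low, high)
    return pred / n_trees


def stepwise_select(X, y, n_select, n_trees=25, seed=0):
    """Forward selection: fit each unused feature to the current residual."""
    rng = np.random.default_rng(seed)
    residual = y.astype(float).copy()
    selected = []
    for _ in range(n_select):
        best = None
        for j in range(X.shape[1]):
            if j in selected:
                continue
            pred = bagged_stump_predict(X[:, j], residual, n_trees, rng)
            sse = float(np.sum((residual - pred) ** 2))
            if best is None or sse < best[0]:
                best = (sse, j, pred)
        selected.append(best[1])
        residual = residual - best[2]  # updated residual is the next response
    return selected


# Synthetic data, made up for illustration: only features 0 and 3 matter.
rng = np.random.default_rng(0)
X_demo = rng.uniform(-1, 1, size=(200, 6))
y_demo = (
    3.0 * (X_demo[:, 0] > 0)
    + 1.5 * (X_demo[:, 3] > 0)
    + 0.1 * rng.standard_normal(200)
)
selected = stepwise_select(X_demo, y_demo, n_select=2)
```

With a single predictor there is nothing to subsample at each split, so a one-dimensional random forest behaves like bagged trees; the stumps here are a further simplification to keep the sketch short.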

Application of a comparative analysis of random forest programming to predict the strength of environmentally-friendly geopolymer concrete

  • Ying Bi;Yeng Yi
    • Steel and Composite Structures / v.50 no.4 / pp.443-458 / 2024
  • The construction industry, one of the largest producers of greenhouse gas emissions, is under considerable pressure as a result of growing concerns about how climate change may affect local communities. Geopolymer concrete (GPC) has emerged as a feasible construction material in response to the environmental issues connected to cement manufacture. The findings of this study contribute to the development of machine learning methods for estimating the properties of eco-friendly concrete, which might be used in lieu of traditional concrete to reduce CO2 emissions in the building industry. In the present work, the compressive strength (fc) of GPC is estimated using the random forests regression (RFR) methodology, where natural zeolite (NZ) and silica fume (SF) replace ground granulated blast-furnace slag (GGBFS). A thorough set of experimental results on GPC samples, totaling 254 data rows, was compiled from the literature. The RFR was integrated with artificial hummingbird optimization (AHA), the black widow optimization algorithm (BWOA), and the chimp optimization algorithm (ChOA), abbreviated as ARFR, BRFR, and CRFR, respectively. The RFR models demonstrated satisfactory performance across all evaluation metrics. For the R2 metric, the CRFR model attained 0.9988 and 0.9981 on the training and test data sets, higher than BRFR (0.9982 and 0.9969), followed by ARFR (0.9971 and 0.9956). Some other error and distribution metrics showed a roughly 50% improvement for CRFR with respect to ARFR.

A Prediction Model for the Development of Cataract Using Random Forests (based on health screening data from one university hospital)

  • Han, Eun-Jeong;Song, Ki-Jun;Kim, Dong-Geon
    • The Korean Journal of Applied Statistics / v.22 no.4 / pp.771-780 / 2009
  • Cataract is the main cause of blindness and visual impairment; in particular, age-related cataract accounts for about half of the 32 million cases of blindness worldwide. As life expectancy rises and the elderly population grows, cataract cases increase as well, causing serious economic and social problems throughout the country. However, the incidence of cataract can be reduced dramatically through early diagnosis and prevention. In this study, we developed a prediction model of cataract for early diagnosis using hospital data on 3,237 subjects who first received a screening test and later visited the medical center for cataract check-ups between 1994 and 2005. To develop the prediction model, we used random forests and compared its predictive performance with that of other common discriminant models such as logistic regression, discriminant model, decision tree, naive Bayes, and two popular ensemble models, bagging and arcing. The accuracy of random forests was 67.16% and the sensitivity was 72.28%; the main factors included in the model were age, diabetes, WBC, platelet, triglyceride, and BMI, among others. The results showed that the model could predict about 70% of cataract cases from the screening test alone, without any information from a direct eye examination by an ophthalmologist. We expect that our model may contribute to diagnosing cataract and help prevent it in its early stages.

Novel two-stage hybrid paradigm combining data pre-processing approaches to predict biochemical oxygen demand concentration

  • Kim, Sungwon;Seo, Youngmin;Zakhrouf, Mousaab;Malik, Anurag
    • Journal of Korea Water Resources Association / v.54 no.spc1 / pp.1037-1051 / 2021
  • Biochemical oxygen demand (BOD) concentration, one of the important water quality indicators, is used as a measure of the ecological state of lakes and rivers. This investigation employed a novel two-stage hybrid paradigm (i.e., wavelet-based gated recurrent unit, wavelet-based generalized regression neural networks, and wavelet-based random forests) to predict BOD concentration at the Dosan and Hwangji stations, South Korea. These models were compared with the corresponding independent models (i.e., gated recurrent unit, generalized regression neural networks, and random forests). Diverse water quality and quantity indicators were used to develop the independent and two-stage hybrid models based on several input combinations (i.e., Divisions 1-5). The models were evaluated using three statistical indices: the root mean square error (RMSE), Nash-Sutcliffe efficiency (NSE), and correlation coefficient (CC). The results show that the two-stage hybrid models do not always improve the predictive accuracy of the independent models. The DWT-RF5 model (RMSE = 0.108 mg/L) provided more accurate predictions of BOD concentration than the other optimal models at Dosan station, and the DWT-GRNN4 model (RMSE = 0.132 mg/L) was the best for predicting BOD concentration at Hwangji station, South Korea.
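
For readers unfamiliar with the three indices named above, the following sketch computes them from their standard definitions. The function names and the sample values are assumptions made up for illustration; they are not taken from the study.

```python
import math


def rmse(observed, predicted):
    """Root mean square error, in the units of the observations (e.g. mg/L)."""
    n = len(observed)
    return math.sqrt(sum((o - p) ** 2 for o, p in zip(observed, predicted)) / n)


def nse(observed, predicted):
    """Nash-Sutcliffe efficiency: 1 is a perfect fit, 0 matches the observed mean."""
    mean_obs = sum(observed) / len(observed)
    residual = sum((o - p) ** 2 for o, p in zip(observed, predicted))
    spread = sum((o - mean_obs) ** 2 for o in observed)
    return 1.0 - residual / spread


def cc(observed, predicted):
    """Pearson correlation coefficient between observations and predictions."""
    n = len(observed)
    mean_obs = sum(observed) / n
    mean_pred = sum(predicted) / n
    cov = sum((o - mean_obs) * (p - mean_pred) for o, p in zip(observed, predicted))
    var_obs = sum((o - mean_obs) ** 2 for o in observed)
    var_pred = sum((p - mean_pred) ** 2 for p in predicted)
    return cov / math.sqrt(var_obs * var_pred)


# Hypothetical BOD values in mg/L, made up for illustration only.
observed = [1.0, 2.0, 3.0, 4.0]
predicted = [2.0, 3.0, 4.0, 5.0]
print(rmse(observed, predicted), nse(observed, predicted), cc(observed, predicted))
```

In this made-up case every prediction is off by a constant 1 mg/L, so CC stays at 1 while RMSE is 1 and NSE falls to about 0.2, which is one reason to report the three indices together.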