Search | Korea Science

Comparison of tree-based ensemble models for regression

Park, Sangho;Kim, Chanmin
- Communications for Statistical Applications and Methods / v.29 no.5 / pp.561-589 / 2022
Tree-based ensemble models, such as random forest (RF) and Bayesian additive regression trees (BART), are produced by combining multiple classification and regression trees. In this study, we compare the model structures and performances of these ensemble models in regression settings. RF fits each tree to a bootstrapped sample and selects the splitting variable at each node from a randomly chosen subset of predictors. The BART model is specified as a sum of trees and is fitted using the Bayesian backfitting algorithm. Through extensive simulation studies, we investigate the strengths and drawbacks of the two methods in the presence of missing data, high-dimensional data, or highly correlated data. In the presence of missing data, BART performs well in general, whereas RF provides adequate coverage. BART outperforms RF on high-dimensional, highly correlated data. However, RF has a shorter computation time in all of the scenarios considered. The performance of the two methods is also compared on two real data sets that represent the aforementioned situations, and the same conclusions are reached.
https://doi.org/10.29220/CSAM.2022.29.5.561
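
The RF mechanism described in this abstract (bootstrapped samples plus a random subset of predictors per split) can be illustrated with a toy sketch. This is not the authors' implementation: it assumes one-split trees (stumps) instead of full trees, and the names `fit_stump`, `fit_forest`, `predict`, and `mtry` are hypothetical.

```python
import random

def fit_stump(rows, targets, features):
    """Best single split (feature, threshold, left mean, right mean) by squared error."""
    best = None
    for f in features:
        values = sorted(set(r[f] for r in rows))
        for lo, hi in zip(values, values[1:]):
            thr = (lo + hi) / 2
            left = [t for r, t in zip(rows, targets) if r[f] <= thr]
            right = [t for r, t in zip(rows, targets) if r[f] > thr]
            lm, rm = sum(left) / len(left), sum(right) / len(right)
            sse = sum((t - lm) ** 2 for t in left) + sum((t - rm) ** 2 for t in right)
            if best is None or sse < best[0]:
                best = (sse, f, thr, lm, rm)
    if best is None:
        # Every candidate feature is constant in this sample: predict the mean.
        mean = sum(targets) / len(targets)
        return (0, float("inf"), mean, mean)
    return best[1:]

def fit_forest(rows, targets, n_trees=50, mtry=1, seed=0):
    """Fit stumps on bootstrap samples, each restricted to `mtry` random predictors."""
    rng = random.Random(seed)
    n, p = len(rows), len(rows[0])
    forest = []
    for _ in range(n_trees):
        idx = [rng.randrange(n) for _ in range(n)]   # bootstrap: sample rows with replacement
        feats = rng.sample(range(p), mtry)           # random subset of predictors
        forest.append(fit_stump([rows[i] for i in idx],
                                [targets[i] for i in idx], feats))
    return forest

def predict(forest, row):
    """Average the predictions of all stumps."""
    preds = [lm if row[f] <= thr else rm for f, thr, lm, rm in forest]
    return sum(preds) / len(preds)
```

A sum-of-trees model such as BART differs in that the trees are added together and fitted jointly, not averaged over independent bootstrap fits.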

ASYMPTOTIC NORMALITY OF WAVELET ESTIMATOR OF REGRESSION FUNCTION UNDER NA ASSUMPTIONS

Liang, Han-Ying;Qi, Yan-Yan
- Bulletin of the Korean Mathematical Society / v.44 no.2 / pp.247-257 / 2007
Consider the heteroscedastic regression model $Y_i = g(x_i) + \sigma_i \epsilon_i$ $(1 \leq i \leq n)$, where $\sigma_i^2 = f(u_i)$, the design points $(x_i, u_i)$ are known and nonrandom, and $g$ and $f$ are unknown functions defined on the closed interval $[0, 1]$. Assuming that the random errors $\epsilon_i$ form a sequence of negatively associated (NA) random variables, we study the asymptotic normality of wavelet estimators of $g$ when $f$ is a known or unknown function.
https://doi.org/10.4134/BKMS.2007.44.2.247

Restricted maximum likelihood estimation of a censored random effects panel regression model

Lee, Minah;Lee, Seung-Chun
- Communications for Statistical Applications and Methods / v.26 no.4 / pp.371-383 / 2019
Panel data sets have been developed in various areas, and many recent studies have analyzed panel, or longitudinal, data sets. Maximum likelihood (ML) may be the most common statistical method for analyzing panel data models; however, inference based on the ML estimate has an inflated Type I error because the ML method tends to give a downwardly biased estimate of the variance components when the sample size is small. The underestimation can be severe when the data are incomplete. This paper proposes the restricted maximum likelihood (REML) method for a random effects panel data model with a censored dependent variable. The likelihood function of the model is complex in that it includes a multidimensional integral. Many authors have proposed integral approximation methods for computing the likelihood function; however, it is well known that integral approximation methods are inadequate for high-dimensional integrals in practice. This paper proposes using the moments of a truncated multivariate normal random vector to calculate the multidimensional integral. In addition, a proper asymptotic standard error of the REML estimate is given.
https://doi.org/10.29220/CSAM.2019.26.4.371
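
The downward bias of ML variance estimates mentioned in this abstract shows up already in the simplest possible case. The sketch below assumes an i.i.d. normal sample with unknown mean, not the paper's censored panel model; there the ML estimate divides the sum of squares by n, while the REML estimate divides by n - 1. The function names are hypothetical.

```python
def ml_variance(y):
    """ML variance estimate for an i.i.d. normal sample: divides by n (biased downward)."""
    n = len(y)
    mean = sum(y) / n
    return sum((v - mean) ** 2 for v in y) / n

def reml_variance(y):
    """REML variance estimate for the same model: divides by n - 1,
    accounting for the degree of freedom spent on estimating the mean."""
    n = len(y)
    mean = sum(y) / n
    return sum((v - mean) ** 2 for v in y) / (n - 1)

sample = [1.0, 2.0, 3.0, 4.0]
print(ml_variance(sample), reml_variance(sample))
```

The ratio of the two estimates is (n - 1) / n, so the gap matters most when the sample is small, which is the situation the abstract highlights.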

Likelihood-Based Inference of Random Effects and Application in Logistic Regression (우도에 기반한 임의효과에 대한 추론과 로지스틱 회귀모형에서의 응용)

Kim, Gwangsu
- The Korean Journal of Applied Statistics / v.28 no.2 / pp.269-279 / 2015
This paper considers inferences of random effects. We show that the proposed confidence distribution (CD) performs well in logistic regression for random intercepts with small samples. Real data analyses are also done to identify the subject effects clearly.
https://doi.org/10.5351/KJAS.2015.28.2.269

Estimation of Genetic Parameters for First Lactation Monthly Test-day Milk Yields using Random Regression Test Day Model in Karan Fries Cattle

Singh, Ajay;Singh, Avtar;Singh, Manvendra;Prakash, Ved;Ambhore, G.S.;Sahoo, S.K.;Dash, Soumya
- Asian-Australasian Journal of Animal Sciences / v.29 no.6 / pp.775-781 / 2016
A single-trait linear mixed random regression test-day model was applied for the first time to analyze first-lactation monthly test-day milk yield records in Karan Fries cattle. The test-day milk yield data were modeled using a random regression model (RRM) with Legendre polynomials of different orders for the additive genetic effect (4th order) and the permanent environmental effect (5th order). Data on 1,583 lactation records spread over a period of 30 years were recorded and analyzed in the study. The variance components, heritabilities, and genetic correlations among test-day milk yields were estimated using the RRM. RRM heritability estimates of test-day milk yield varied from 0.11 to 0.22 across test-day records. The estimates of genetic correlations between different test-day milk yields ranged from 0.01 (test-day 1 [TD-1] and TD-11) to 0.99 (TD-4 and TD-5). The magnitudes of the genetic correlations between test-day milk yields decreased as the interval between test-days increased, and adjacent test-days had higher correlations. Additive genetic and permanent environmental variances were higher for test-day milk yields at both ends of lactation. The residual variance was lower than the permanent environmental variance for all test-day milk yields.
https://doi.org/10.5713/ajas.15.0643
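
In a random regression test-day model, the Legendre polynomial terms enter as covariates evaluated at each test day. The sketch below shows one common way to build them; it is an illustration, not the authors' code. The lactation window (days 5 to 305), the function name, and the reading of "order" as the highest polynomial degree are all assumptions.

```python
import math

def legendre_covariates(day, first_day=5, last_day=305, order=4, normalized=True):
    """Return Legendre covariates of degree 0..order for one test day.

    first_day and last_day are assumed bounds of the lactation window.
    """
    # Map days in milk onto [-1, 1], where Legendre polynomials are orthogonal.
    x = -1.0 + 2.0 * (day - first_day) / (last_day - first_day)
    polys = [1.0, x]
    # Bonnet recurrence: (k + 1) P_{k+1}(x) = (2k + 1) x P_k(x) - k P_{k-1}(x)
    for k in range(1, order):
        polys.append(((2 * k + 1) * x * polys[k] - k * polys[k - 1]) / (k + 1))
    polys = polys[: order + 1]
    if normalized:
        # Scale each polynomial to unit norm on [-1, 1].
        polys = [math.sqrt((2 * k + 1) / 2.0) * p for k, p in enumerate(polys)]
    return polys
```

Each animal's additive genetic and permanent environmental effects are then modeled as random coefficients on these covariates, so a higher order allows a more flexible lactation curve at the cost of more (co)variance parameters.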

Prediction of Future Milk Yield with Random Regression Model Using Test-day Records in Holstein Cows

Park, Byoungho;Lee, Deukhwan
- Asian-Australasian Journal of Animal Sciences / v.19 no.7 / pp.915-921 / 2006
Various random regression models with Legendre polynomials of different orders for the permanent environmental and genetic effects were constructed to predict the future milk yield of Holstein cows in Korea. A total of 257,908 test-day (TD) milk yield records from 28,135 cows belonging to 1,090 herds were used to estimate the (co)variances of the random covariate coefficients using an expectation-maximization REML algorithm in an animal mixed model. The variances did not change much between models with different orders of Legendre polynomial, but a decreasing trend was observed as the order increased. The R-squared value of the model increased and the residual variance decreased as the order of the Legendre polynomial increased. Therefore, a model with a 5th-order Legendre polynomial was chosen for predicting future milk yield. For prediction, 132,771 TD records from 28,135 cows were randomly selected from the above data as preceding partial TD records, and future milk yields were then estimated from the incomplete records randomly retained for each cow. The results suggested that the milk yield for the next four months could be predicted with an error deviation of 4 kg. The correlation between predicted and observed values of the next four months' milk yield was estimated to be more than 70%. Even with only 3 TD records for some cows, the average milk yield of Korean Holstein cows could be predicted with high accuracy compared with the observed milk yield. The persistency of each cow was also estimated, which might be useful for selecting cows with higher persistency. The results of the present study support the use of a 5th-order Legendre polynomial to predict the future milk yield of each cow.
https://doi.org/10.5713/ajas.2006.915

A Study on the Performance Evaluation of Machine Learning for Predicting the Number of Movie Audiences (영화 관객 수 예측을 위한 기계학습 기법의 성능 평가 연구)

Jeong, Chan-Mi;Min, Daiki
- The Journal of Society for e-Business Studies / v.25 no.2 / pp.49-63 / 2020
Accurate prediction of box office performance at an early stage is crucial for the film industry to make better managerial decisions. With the aim of improving prediction performance, this paper evaluates the use of machine learning methods. We tested both classification-based and regression-based methods, including k-NN, SVM, and Random Forest. We first evaluate the input variables, which shows that reputation-related information generated during the first two weeks after release is significant. Prediction test results show that regression-based methods provide lower prediction error, and Random Forest in particular outperforms the other machine learning methods. Regression-based methods have better predictive power when films have small box office earnings. On the other hand, classification-based methods work better for predicting large box office earnings.
https://doi.org/10.7838/jsebs.2020.25.2.049

Modeling of Flow-Accelerated Corrosion using Machine Learning: Comparison between Random Forest and Non-linear Regression (기계학습을 이용한 유동가속부식 모델링: 랜덤 포레스트와 비선형 회귀분석과의 비교)

Lee, Gyeong-Geun;Lee, Eun Hee;Kim, Sung-Woo;Kim, Kyung-Mo;Kim, Dong-Jin
- Corrosion Science and Technology / v.18 no.2 / pp.61-71 / 2019
Flow-Accelerated Corrosion (FAC) is a phenomenon in which a protective coating on a metal surface is dissolved by a flow of fluid in a metal pipe, leading to continuous wall-thinning. Recently, many countries have developed computer codes to manage FAC in power plants, and the FAC prediction model in these codes plays an important role in their predictive performance. Herein, FAC prediction models were developed by applying a machine learning method and the conventional nonlinear regression method. The random forest, a widely used machine learning technique in predictive modeling, made it easy to calculate the FAC tendency for five input variables: flow rate, temperature, pH, Cr content, and dissolved oxygen (DO) concentration. However, the model showed significant errors under some input conditions, and it was difficult to obtain proper regression results without additional data points. In contrast, nonlinear regression analysis, which assumes an empirical equation, produced robust estimates even with relatively insufficient data, and the model showed better predictive power when the interaction between DO and pH was considered. The comparative analysis in this study is believed to provide important insights for developing a more sophisticated FAC prediction model.
https://doi.org/10.14773/cst.2019.18.2.61

Generalized Partially Linear Additive Models for Credit Scoring

Shim, Ju-Hyun;Lee, Young-K.
- The Korean Journal of Applied Statistics / v.24 no.4 / pp.587-595 / 2011
Credit scoring is an objective and automatic system to assess the credit risk of each customer. The logistic regression model is one of the popular methods of credit scoring to predict the default probability; however, it may not detect possible nonlinear features of predictors despite the advantages of interpretability and low computation cost. In this paper, we propose to use a generalized partially linear model as an alternative to logistic regression. We also introduce modern ensemble technologies such as bagging, boosting and random forests. We compare these methods via a simulation study and illustrate them through a German credit dataset.
https://doi.org/10.5351/KJAS.2011.24.4.587
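
For orientation, the two model forms contrasted in this abstract can be written in generic notation. This is a sketch, not the paper's own formulation: the symbols $X$, $Z_j$, $\beta$, $\gamma$, and $f_j$ are assumptions introduced here. Logistic regression treats every predictor linearly, whereas a generalized partially linear additive model replaces some linear terms with unknown smooth functions $f_j$:

$$
\operatorname{logit} P(Y = 1 \mid X, Z) = \beta_0 + X^\top \beta + Z^\top \gamma
\qquad \text{(logistic regression)}
$$

$$
\operatorname{logit} P(Y = 1 \mid X, Z) = \beta_0 + X^\top \beta + \sum_{j=1}^{q} f_j(Z_j)
\qquad \text{(partially linear additive)}
$$

The linear part $X^\top \beta$ keeps the interpretability of logistic regression, while the $f_j$ terms can capture nonlinear effects of the predictors $Z_j$.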

A BERRY-ESSEEN TYPE BOUND OF REGRESSION ESTIMATOR BASED ON LINEAR PROCESS ERRORS

Liang, Han-Ying;Li, Yu-Yu
- Journal of the Korean Mathematical Society / v.45 no.6 / pp.1753-1767 / 2008
Consider the nonparametric regression model $Y_{ni} = g(x_{ni}) + \epsilon_{ni}$ $(1 \leq i \leq n)$, where $g(\cdot)$ is an unknown regression function, $x_{ni}$ are known fixed design points, and the correlated errors $\{\epsilon_{ni}, 1 \leq i \leq n\}$ have the same distribution as $\{V_i, 1 \leq i \leq n\}$, where $V_t = \sum_{j=-\infty}^{\infty} \psi_j e_{t-j}$ with $\sum_{j=-\infty}^{\infty} |\psi_j| < \infty$ and $\{e_t\}$ are negatively associated random variables. Under appropriate conditions, we derive a Berry-Esseen type bound for the estimator of $g(\cdot)$. As a corollary, by choice of the weights, the Berry-Esseen type bound can attain $O(n^{-1/4} (\log n)^{3/4})$.
https://doi.org/10.4134/JKMS.2008.45.6.1753
