• Title/Summary/Keyword: Multiple regression model

Clustering Observations for Detecting Multiple Outliers in Regression Models

  • Seo, Han-Son;Yoon, Min
    • The Korean Journal of Applied Statistics / v.25 no.3 / pp.503-512 / 2012
  • Outlier detection in a linear regression model can fail when similar observations are classified differently in a sequential process. In such circumstances, identifying clusters and applying a detection method to the clustered data can prevent this failure and is computationally efficient because the data are reduced. In this paper, we propose a clustering procedure for this purpose and provide examples that illustrate the procedure applied to the Hadi-Simonoff (1993) method, the reverse Hadi-Simonoff method, and the Gentleman-Wilk (1975) method.

VARIANCE ESTIMATION OF ERROR IN THE REGRESSION MODEL AT A POINT

  • Oh, Jong-Chul
    • Journal of applied mathematics & informatics / v.13 no.1_2 / pp.501-508 / 2003
  • Although estimating the regression function is important, some work has focused on estimating the variance of the error term in the regression model. Different variance estimators perform well under different conditions. In many practical situations, it is hard to assess which conditions are approximately satisfied and therefore to identify the best variance estimator for the given data. In this article, we propose the SHM estimator and compare it with the LS estimator, which is commonly used in parametric multiple regression analysis. Moreover, a combined variance estimator, VEM, is suggested. A simulation study shows that VEM performs well in practice.

A Score test for Detection of Outliers in Nonlinear Regression

  • Kahng, Myung-Wook
    • Journal of the Korean Statistical Society / v.22 no.2 / pp.201-208 / 1993
  • Given the specific mean shift outlier model, the score test for multiple outliers in nonlinear regression is discussed as an alternative to the likelihood ratio test. The geometric interpretation of the score statistic is also presented.

Is it Possible to Predict the ADI of Pesticides using the QSAR Approach?

  • Kim, Jae Hyoun
    • Journal of Environmental Health Sciences / v.38 no.6 / pp.550-560 / 2012
  • Objectives: QSAR methodology was applied to explain two different sets of acceptable daily intake (ADI) data of 74 pesticides proposed by both the USEPA and WHO in terms of setting guidelines for food and drinking water. Methods: A subset of calculated descriptors was selected from Dragon® software. QSARs were then developed utilizing a statistical technique, genetic algorithm-multiple linear regression (GA-MLR). The differences in each specific model in the prediction of the ADI of the pesticides were discussed. Results: The stepwise multiple linear regression analysis resulted in a statistically significant QSAR model with five descriptors. Resultant QSAR models were robust, showing good utility across multiple classes of pesticide compounds. The applicability domain was also defined. The proposed models were robust and satisfactory. Conclusions: The QSAR model could be a feasible and effective tool for predicting ADI and for the comparison of $\log\mathrm{ADI}_{\mathrm{EPA}}$ to $\log\mathrm{ADI}_{\mathrm{WHO}}$. The statistical results agree with the fact that USEPA focuses on more subtle endpoints than does WHO.

Prediction of movie audience numbers using hybrid model combining GLS and Bass models (GLS와 Bass 모형을 결합한 하이브리드 모형을 이용한 영화 관객 수 예측)

  • Kim, Bokyung;Lim, Changwon
    • The Korean Journal of Applied Statistics / v.31 no.4 / pp.447-461 / 2018
  • Domestic film industry sales are increasing every year. Theaters are the primary sales channel for movies, and theater attendance affects the sale of additional rights. Theater attendance is therefore an important factor directly linked to movie industry sales. In this paper we consider a hybrid model that combines a multiple linear regression model and the Bass model to predict the audience numbers for a specific day. By combining the two models, the predicted value from the regression analysis is corrected with that of the Bass model. Three films with different release dates were used in the analysis. All-subset regression is used to generate every possible combination of predictors, and 5-fold cross-validation is used to fit each model five times. The predicted value is taken from the model with the smallest root mean square error and then combined with the predicted value of the Bass model to obtain the final prediction. When past data are available, the weight of the Bass model was found to increase, adding a correction to the predicted value.
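
The selection step described above (all-subset regression scored by 5-fold cross-validation, keeping the model with the smallest RMSE) might be sketched as follows. This is a minimal illustration on synthetic data, not the authors' implementation: the function names, the interleaved fold assignment, the pooled RMSE, and the simulated predictors are all assumptions, and the Bass-model combination step is omitted.

```python
import itertools
import math
import random


def solve(a, b):
    """Solve the linear system a x = b by Gaussian elimination with pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(m[r][c]))
        m[c], m[p] = m[p], m[c]
        for r in range(c + 1, n):
            f = m[r][c] / m[c][c]
            for k in range(c, n + 1):
                m[r][k] -= f * m[c][k]
    x = [0.0] * n
    for r in reversed(range(n)):
        tail = sum(m[r][k] * x[k] for k in range(r + 1, n))
        x[r] = (m[r][n] - tail) / m[r][r]
    return x


def fit_ols(rows, y):
    """Ordinary least squares through the normal equations (X'X) b = X'y."""
    p = len(rows[0])
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(p)] for i in range(p)]
    xty = [sum(r[i] * t for r, t in zip(rows, y)) for i in range(p)]
    return solve(xtx, xty)


def cv_rmse(X, y, cols, k=5):
    """Pooled k-fold cross-validated RMSE for the predictors listed in cols."""
    n = len(y)

    def design(i):
        return [1.0] + [X[i][c] for c in cols]  # intercept plus chosen columns

    squared_error = 0.0
    for fold in range(k):
        test = list(range(fold, n, k))  # interleaved folds (an assumption)
        held_out = set(test)
        train = [i for i in range(n) if i not in held_out]
        beta = fit_ols([design(i) for i in train], [y[i] for i in train])
        for i in test:
            pred = sum(b * v for b, v in zip(beta, design(i)))
            squared_error += (y[i] - pred) ** 2
    return math.sqrt(squared_error / n)


def best_subset(X, y, k=5):
    """Score every non-empty predictor subset and return the lowest-RMSE one."""
    p = len(X[0])
    subsets = [c for r in range(1, p + 1)
               for c in itertools.combinations(range(p), r)]
    return min(subsets, key=lambda c: cv_rmse(X, y, c, k))


# Synthetic data: the response depends on columns 0 and 2 only.
rng = random.Random(0)
X = [[rng.uniform(-1, 1) for _ in range(3)] for _ in range(60)]
y = [3 * row[0] - 2 * row[2] + rng.gauss(0, 0.1) for row in X]
best = best_subset(X, y)
print(best, round(cv_rmse(X, y, best), 3))
```

The number of candidate models doubles with each added predictor, so an exhaustive search of this kind is practical only for a small predictor set.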

Comments on the regression coefficients (다중회귀에서 회귀계수 추정량의 특성)

  • Kahng, Myung-Wook
    • The Korean Journal of Applied Statistics / v.34 no.4 / pp.589-597 / 2021
  • Regression coefficients have different meanings in simple and multiple regression: not only do the estimates differ, they may even have different signs. Understanding the relative contribution of explanatory variables in a regression model is an important part of regression analysis. In a standardized regression model, the regression coefficient can be interpreted as the change in the response variable, in standard deviation units, when the explanatory variable increases by one standard deviation while the other explanatory variables are held fixed. However, the size of the standardized regression coefficient is not a proper measure of the relative importance of each explanatory variable. In this paper, the estimator of the regression coefficient in multiple regression is expressed as a function of the correlation coefficient and the coefficient of determination. Furthermore, it is considered in terms of the effect of an additional explanatory variable and the resulting increase in the coefficient of determination. We also explore the relationship between estimates of regression coefficients and correlation coefficients in various plots. These results are applied specifically to the case of two explanatory variables.
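
For the two-variable case mentioned above, the standard textbook identities for the standardized least squares estimates can be written as below. The notation is an assumption and not necessarily the paper's: $r_{y1}$ and $r_{y2}$ are the correlations of the response with each explanatory variable, and $r_{12}$ is the correlation between the two explanatory variables.

$$
\hat\beta_1^{*} = \frac{r_{y1} - r_{y2}\,r_{12}}{1 - r_{12}^{2}}, \qquad
\hat\beta_2^{*} = \frac{r_{y2} - r_{y1}\,r_{12}}{1 - r_{12}^{2}}, \qquad
R^{2} = \hat\beta_1^{*}\,r_{y1} + \hat\beta_2^{*}\,r_{y2}.
$$

As a hypothetical illustration of the sign change, take $r_{y1} = 0.2$, $r_{y2} = 0.8$, and $r_{12} = 0.6$: the standardized slope in the simple regression on the first variable is $r_{y1} = 0.2$, whereas in the multiple regression $\hat\beta_1^{*} = (0.2 - 0.48)/0.64 = -0.4375$.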

Evaluating Variable Selection Techniques for Multivariate Linear Regression (다중선형회귀모형에서의 변수선택기법 평가)

  • Ryu, Nahyeon;Kim, Hyungseok;Kang, Pilsung
    • Journal of Korean Institute of Industrial Engineers / v.42 no.5 / pp.314-326 / 2016
  • The purpose of variable selection techniques is to select a subset of relevant variables for a particular learning algorithm in order to improve the accuracy and efficiency of the prediction model. We conduct an empirical analysis to evaluate and compare seven well-known variable selection techniques for the multiple linear regression model, which is one of the most commonly used regression models in practice. The variable selection techniques we apply are forward selection, backward elimination, stepwise selection, genetic algorithm (GA), ridge regression, lasso (Least Absolute Shrinkage and Selection Operator), and elastic net. Based on experiments with 49 regression data sets, GA yielded the lowest error rates, while lasso reduced the number of variables the most. In terms of computational efficiency, forward selection, backward elimination, and lasso require less time than the other techniques.

Relationship between Stream Geomorphological Factors and the Vegetation Abundance - With a Special Reference to the Han River System - (하천의 지형학적 인자와 식생종수의 관계 -한강수계를 중심으로-)

  • 이광우;김태균;심우경
    • Journal of the Korean Institute of Landscape Architecture / v.30 no.3 / pp.73-85 / 2002
  • The purpose of this study was to develop prediction models for plant species abundance after stream restoration. Stream vegetation is generally affected by stream geomorphology, so the relationship between vegetation abundance and stream geomorphology was modeled by multiple regression analysis. The stream characteristics used in this study were longitudinal slope, transectional slope, micro-landforms along the longitudinal direction, riparian width, the geometric mean diameter and largest diameter of bed material, and the cumulative coarse and fine sand weight fractions. The Pyungchang River, with a mountainous watershed, and the Kyungan and Bokha streams, in an agricultural region, were selected, and vegetation species abundance and stream characteristics were documented at sites spaced at 2-3 km intervals from upstream to downstream. The models for predicting vegetation abundance were developed by multiple regression analysis using the SPSS statistics package. The linearity of the relationship between the dependent variable (species abundance) and the independent variables (stream characteristics) was checked graphically. Longitudinal and transectional slope had a nonlinear relationship with species abundance. In the next step, independence among the independent variables was tested, and the correlation between the independent and dependent variables was tested by the Pearson bivariate correlation test. The selected independent variables were transectional slope, riparian width, and cumulative fine sand weight fraction. From the multiple regression analysis, the $R^2$ values for the Pyungchang River, Kyungan stream, and Bokha stream were 0.651, 0.512, and 0.240, respectively. The natural stream configuration of the Pyungchang River gave the best result, and the lower $R^2$ values for the Kyungan and Bokha streams were due to human impact that disturbed the natural ecosystem. The lowest $R^2$, for the Bokha stream, was due to its shifting sandy bed. If the stream bed is unstable, the prediction model may not be valid. Using the multiple regression models, vegetation abundance after stream restoration can be predicted from stream characteristics such as transectional slope, riparian width, and cumulative fine sand weight fraction.

Multiple linear regression model-based voltage imbalance estimation for high-power series battery pack (다중선형회귀모델 기반 고출력 직렬 배터리 팩의 전압 불균형 추정)

  • Kim, Seung-Woo;Lee, Pyeong-Yeon;Han, Dong-Ho;Kim, Jong-hoon
    • Journal of IKEEE / v.23 no.1 / pp.1-8 / 2019
  • In this paper, the electrical characteristics of a high-power series battery pack composed of 18650 cylindrical nickel cobalt aluminum (NCA) lithium-ion cells are tested at various C-rates. The electrical characteristics from a discharge capacity test with a 14S1P battery pack and an electric vehicle (EV) cycle test with a 4S1P battery pack are compared and analyzed across the C-rates. Multiple linear regression is used to estimate the voltage imbalance of the 14S1P and 4S1P battery packs at various C-rates based on experimental data. The estimation accuracy is evaluated by root mean square error (RMSE) to validate the multiple linear regression. The results indicate that multiple linear regression estimates the voltage imbalance better for the discharge capacity test with the 14S1P battery pack than for the EV cycle test with the 4S1P battery pack.

Development of statistical forecast model for PM10 concentration over Seoul (서울지역 PM10 농도 예측모형 개발)

  • Sohn, Keon Tae;Kim, Dahong
    • Journal of the Korean Data and Information Science Society / v.26 no.2 / pp.289-299 / 2015
  • The objective of the present study is to develop a statistical quantitative forecast model for PM10 concentration over Seoul. We used three types of data: weather observation data from Korea, weather observation data from China collected via GTS, and air quality numerical model forecasts. To build a daily forecast system, hourly data were converted to daily data and then lagged. The potential predictors were selected based on correlation analysis and a multicollinearity check. Model validation was performed to check model stability. We applied two models (a multiple regression model and a threshold regression model) separately. The two models were compared based on scatter plots of forecasts against observations, time series plots, RMSE, and skill scores. The threshold regression model performed better than the multiple regression model in high PM10 concentration cases.