• Title/Summary/Keyword: multiple Regression

Search Result 13,405, Processing Time 0.041 seconds

MULTIPLE OUTLIER DETECTION IN LOGISTIC REGRESSION BY USING INFLUENCE MATRIX

  • Lee, Gwi-Hyun;Park, Sung-Hyun
    • Journal of the Korean Statistical Society
    • /
    • v.36 no.4
    • /
    • pp.457-469
    • /
    • 2007
  • Many procedures are available to identify a single outlier or an isolated influential point in linear regression and logistic regression. But the detection of influential points or multiple outliers is more difficult, owing to masking and swamping problems. The multiple outlier detection methods for logistic regression have not been studied from the points of direct procedure yet. In this paper we consider the direct methods for logistic regression by extending the $Pe\tilde{n}a$ and Yohai (1995) influence matrix algorithm. We define the influence matrix in logistic regression by using Cook's distance in logistic regression, and test multiple outliers by using the mean shift model. To show accuracy of the proposed multiple outlier detection algorithm, we simulate artificial data including multiple outliers with masking and swamping.

A Study on Stochastic Estimation of Monthly Runoff by Multiple Regression Analysis (다중회귀분석에 의한 하천 월 유출량의 추계학적 추정에 관한 연구)

  • 김태철;정하우
    • Magazine of the Korean Society of Agricultural Engineers
    • /
    • v.22 no.3
    • /
    • pp.75-87
    • /
    • 1980
  • Most hydro]ogic phenomena are the complex and organic products of multiple causations like climatic and hydro-geological factors. A certain significant correlation on the run-off in river basin would be expected and foreseen in advance, and the effect of each these causual and associated factors (independant variables; present-month rainfall, previous-month run-off, evapotranspiration and relative humidity etc.) upon present-month run-off(dependent variable) may be determined by multiple regression analysis. Functions between independant and dependant variables should be treated repeatedly until satisfactory and optimal combination of independant variables can be obtained. Reliability of the estimated function should be tested according to the result of statistical criterion such as analysis of variance, coefficient of determination and significance-test of regression coefficients before first estimated multiple regression model in historical sequence is determined. But some error between observed and estimated run-off is still there. The error arises because the model used is an inadequate description of the system and because the data constituting the record represent only a sample from a population of monthly discharge observation, so that estimates of model parameter will be subject to sampling errors. Since this error which is a deviation from multiple regression plane cannot be explained by first estimated multiple regression equation, it can be considered as a random error governed by law of chance in nature. This unexplained variance by multiple regression equation can be solved by stochastic approach, that is, random error can be stochastically simulated by multiplying random normal variate to standard error of estimate. Finally hybrid model on estimation of monthly run-off in nonhistorical sequence can be determined by combining the determistic component of multiple regression equation and the stochastic component of random errors. Monthly run-off in Naju station in Yong-San river basin is estimated by multiple regression model and hybrid model. And some comparisons between observed and estimated run-off and between multiple regression model and already-existing estimation methods such as Gajiyama formula, tank model and Thomas-Fiering model are done. The results are as follows. (1) The optimal function to estimate monthly run-off in historical sequence is multiple linear regression equation in overall-month unit, that is; Qn=0.788Pn+0.130Qn-1-0.273En-0.1 About 85% of total variance of monthly runoff can be explained by multiple linear regression equation and its coefficient of determination (R2) is 0.843. This means we can estimate monthly runoff in historical sequence highly significantly with short data of observation by above mentioned equation. (2) The optimal function to estimate monthly runoff in nonhistorical sequence is hybrid model combined with multiple linear regression equation in overall-month unit and stochastic component, that is; Qn=0. 788Pn+0. l30Qn-1-0. 273En-0. 10+Sy.t The rest 15% of unexplained variance of monthly runoff can be explained by addition of stochastic process and a bit more reliable results of statistical characteristics of monthly runoff in non-historical sequence are derived. This estimated monthly runoff in non-historical sequence shows up the extraordinary value (maximum, minimum value) which is not appeared in the observed runoff as a random component. (3) "Frequency best fit coefficient" (R2f) of multiple linear regression equation is 0.847 which is the same value as Gaijyama's one. This implies that multiple linear regression equation and Gajiyama formula are theoretically rather reasonable functions.

  • PDF

Application of discrete Weibull regression model with multiple imputation

  • Yoo, Hanna
    • Communications for Statistical Applications and Methods
    • /
    • v.26 no.3
    • /
    • pp.325-336
    • /
    • 2019
  • In this article we extend the discrete Weibull regression model in the presence of missing data. Discrete Weibull regression models can be adapted to various type of dispersion data however, it is not widely used. Recently Yoo (Journal of the Korean Data and Information Science Society, 30, 11-22, 2019) adapted the discrete Weibull regression model using single imputation. We extend their studies by using multiple imputation also with several various settings and compare the results. The purpose of this study is to address the merit of using multiple imputation in the presence of missing data in discrete count data. We analyzed the seventh Korean National Health and Nutrition Examination Survey (KNHANES VII), from 2016 to assess the factors influencing the variable, 1 month hospital stay, and we compared the results using discrete Weibull regression model with those of Poisson, negative Binomial and zero-inflated Poisson regression models, which are widely used in count data analyses. The results showed that the discrete Weibull regression model using multiple imputation provided the best fit. We also performed simulation studies to show the accuracy of the discrete Weibull regression using multiple imputation given both under- and over-dispersed distribution, as well as varying missing rates and sample size. Sensitivity analysis showed the influence of mis-specification and the robustness of the discrete Weibull model. Using imputation with discrete Weibull regression to analyze discrete data will increase explanatory power and is widely applicable to various types of dispersion data with a unified model.

Comparison of Genetic Parameter Estimates of Total Sperm Cells of Boars between Random Regression and Multiple Trait Animal Models

  • Oh, S.-H.;See, M.T.
    • Asian-Australasian Journal of Animal Sciences
    • /
    • v.21 no.7
    • /
    • pp.923-927
    • /
    • 2008
  • The objective of this study was to compare random regression model and multiple trait animal model estimates of the (co) variance of total sperm cells over the active lifetime of AI boars. Data were provided by Smithfield Premium Genetics (Rose Hill, NC). Total number of records and animals for the random regression model were 19,629 and 1,736, respectively. Data for multiple trait animal model analyses were edited to include only records produced at 9, 12, 15, 18, 21, 24, and 27 months of age. For the multiple trait method estimates of genetic and residual variance for total sperm cells were heterogeneous among age classifications. When comparing multiple trait method to random regression, heritability estimates were similar except for total sperm cells at 24 months of age. The multiple trait method also resulted in higher estimates of heritability of total sperm cells at every age when compared to random regression results. Random regression analysis provided more detail with regard to changes of variance components with age. Random regression methods are the most appropriate to analyze semen traits as they are longitudinal data measured over the lifetime of boars.

Inter-comparison of Prediction Skills of Multiple Linear Regression Methods Using Monthly Temperature Simulated by Multi-Regional Climate Models (다중 지역기후모델로부터 모의된 월 기온자료를 이용한 다중선형회귀모형들의 예측성능 비교)

  • Seong, Min-Gyu;Kim, Chansoo;Suh, Myoung-Seok
    • Atmosphere
    • /
    • v.25 no.4
    • /
    • pp.669-683
    • /
    • 2015
  • In this study, we investigated the prediction skills of four multiple linear regression methods for monthly air temperature over South Korea. We used simulation results from four regional climate models (RegCM4, SNURCM, WRF, and YSURSM) driven by two boundary conditions (NCEP/DOE Reanalysis 2 and ERA-Interim). We selected 15 years (1989~2003) as the training period and the last 5 years (2004~2008) as validation period. The four regression methods used in this study are as follows: 1) Homogeneous Multiple linear Regression (HMR), 2) Homogeneous Multiple linear Regression constraining the regression coefficients to be nonnegative (HMR+), 3) non-homogeneous multiple linear regression (EMOS; Ensemble Model Output Statistics), 4) EMOS with positive coefficients (EMOS+). It is same method as the third method except for constraining the coefficients to be nonnegative. The four regression methods showed similar prediction skills for the monthly air temperature over South Korea. However, the prediction skills of regression methods which don't constrain regression coefficients to be nonnegative are clearly impacted by the existence of outliers. Among the four multiple linear regression methods, HMR+ and EMOS+ methods showed the best skill during the validation period. HMR+ and EMOS+ methods showed a very similar performance in terms of the MAE and RMSE. Therefore, we recommend the HMR+ as the best method because of ease of development and applications.

Prediction of Pitting Corrosion Characteristics of AL-6XN Steel with Sensitization and Environmental Variables Using Multiple Linear Regression Method (다중선형회귀법을 활용한 예민화와 환경변수에 따른 AL-6XN강의 공식특성 예측)

  • Jung, Kwang-Hu;Kim, Seong-Jong
    • Corrosion Science and Technology
    • /
    • v.19 no.6
    • /
    • pp.302-309
    • /
    • 2020
  • This study aimed to predict the pitting corrosion characteristics of AL-6XN super-austenitic steel using multiple linear regression. The variables used in the model are degree of sensitization, temperature, and pH. Experiments were designed and cyclic polarization curve tests were conducted accordingly. The data obtained from the cyclic polarization curve tests were used as training data for the multiple linear regression model. The significance of each factor in the response (critical pitting potential, repassivation potential) was analyzed. The multiple linear regression model was validated using experimental conditions that were not included in the training data. As a result, the degree of sensitization showed a greater effect than the other variables. Multiple linear regression showed poor performance for prediction of repassivation potential. On the other hand, the model showed a considerable degree of predictive performance for critical pitting potential. The coefficient of determination (R2) was 0.7745. The possibility for pitting potential prediction was confirmed using multiple linear regression.

Multivariate Analysis for Clinicians (임상의를 위한 다변량 분석의 실제)

  • Oh, Joo Han;Chung, Seok Won
    • Clinics in Shoulder and Elbow
    • /
    • v.16 no.1
    • /
    • pp.63-72
    • /
    • 2013
  • In medical research, multivariate analysis, especially multiple regression analysis, is used to analyze the influence of multiple variables on the result. Multiple regression analysis should include variables in the model and the problem of multi-collinearity as there are many variables as well as the basic assumption of regression analysis. The multiple regression model is expressed as the coefficient of determination, $R^2$ and the influence of independent variables on result as a regression coefficient, ${\beta}$. Multiple regression analysis can be divided into multiple linear regression analysis, multiple logistic regression analysis, and Cox regression analysis according to the type of dependent variables (continuous variable, categorical variable (binary logit), and state variable, respectively), and the influence of variables on the result is evaluated by regression coefficient${\beta}$, odds ratio, and hazard ratio, respectively. The knowledge of multivariate analysis enables clinicians to analyze the result accurately and to design the further research efficiently.

Multiple Deletions in Logistic Regression Models

  • Jung, Kang-Mo
    • Communications for Statistical Applications and Methods
    • /
    • v.16 no.2
    • /
    • pp.309-315
    • /
    • 2009
  • We extended the results of Roy and Guria (2008) to multiple deletions in logistic regression models. Since single deletions may not exactly detect outliers or influential observations due to swamping effects and masking effects, it needs multiple deletions. We developed conditional deletion diagnostics which are designed to overcome problems of masking effects. We derived the closed forms for several statistics in logistic regression models. They give useful diagnostics on the statistics.

Development and Evaluation of Simple Regression Model and Multiple Regression Model for TOC Contentation Estimation in Stream Flow (하천수내 TOC 농도 추정을 위한 단순회귀모형과 다중회귀모형의 개발과 평가)

  • Jung, Jaewoon;Cho, Sohyun;Choi, Jinhee;Kim, Kapsoon;Jung, Soojung;Lim, Byungjin
    • Journal of Korean Society on Water Environment
    • /
    • v.29 no.5
    • /
    • pp.625-629
    • /
    • 2013
  • The objective of this study is to develop and evaluate simple and multiple regression models for Total Organic Carbon (TOC) concentration estimation in stream flow. For development (using water quality data in 2012) and evaluation (using water quality data in 2011) of regression models, we used water quality data from downstream of Yeongsan river basin during 2011 and 2012, and correlation analysis between TOC and water quality parameters was conducted. The concentrations of TOC were positively correlated with Chemical Oxygen Demand (COD), Biochemical Oxygen Demand (BOD), TN (Total Nitrogen), Water Temperature (WT) and Electric Conductivity (EC). From these results, simple and multiple regression models for TOC estimation were developed as follows : $TOC=0.5809{\times}BOD+3.1557$, $TOC=0.4365{\times}COD+1.3731$. As a result of the application evaluation of the developed regression models, the multiple regression model was found to estimate TOC better than simple regression models.

An Approach to Applying Multiple Linear Regression Models by Interlacing Data in Classifying Similar Software

  • Lim, Hyun-il
    • Journal of Information Processing Systems
    • /
    • v.18 no.2
    • /
    • pp.268-281
    • /
    • 2022
  • The development of information technology is bringing many changes to everyday life, and machine learning can be used as a technique to solve a wide range of real-world problems. Analysis and utilization of data are essential processes in applying machine learning to real-world problems. As a method of processing data in machine learning, we propose an approach based on applying multiple linear regression models by interlacing data to the task of classifying similar software. Linear regression is widely used in estimation problems to model the relationship between input and output data. In our approach, multiple linear regression models are generated by training on interlaced feature data. A combination of these multiple models is then used as the prediction model for classifying similar software. Experiments are performed to evaluate the proposed approach as compared to conventional linear regression, and the experimental results show that the proposed method classifies similar software more accurately than the conventional model. We anticipate the proposed approach to be applied to various kinds of classification problems to improve the accuracy of conventional linear regression.