• Title/Summary/Keyword: Multicollinearity


Estimation of S&T Knowledge Production Function Using Principal Component Regression Model (주성분 회귀모형을 이용한 과학기술 지식생산함수 추정)

  • Park, Su-Dong;Sung, Oong-Hyun
    • Journal of Korea Technology Innovation Society / v.13 no.2 / pp.231-251 / 2010
  • The numbers of SCI papers and patents in science and technology are expected to be related to the number of researchers and the knowledge stock (R&D stock, paper stock, patent stock). The results of the regression model showed that severe multicollinearity existed and that errors were made in the estimation and testing of the regression coefficients. To solve the multicollinearity problem and estimate the effects of the independent variables properly, principal component regression models were applied to three cases of S&T knowledge production. The estimated principal component regression function was transformed back into the original independent variables so that their effects could be interpreted properly. The analysis indicated that the principal component regression model was useful for estimating the effects of highly correlated production factors, and showed that the number of researchers, R&D stock, and paper or patent stock all had positive effects on the production of papers or patents.

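The back-transformation described in the abstract above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' procedure: the function name `pcr_fit`, the synthetic data, and the choice to standardize predictors and decompose the correlation matrix are all assumptions made for the example.

```python
import numpy as np

def pcr_fit(X, y, n_components):
    """Principal component regression (illustrative sketch).

    Returns (intercept, coefficients) expressed on the ORIGINAL variables.
    """
    x_mean, x_std = X.mean(axis=0), X.std(axis=0, ddof=1)
    Z = (X - x_mean) / x_std                       # standardized predictors
    # Eigen-decomposition of the correlation matrix, largest eigenvalues first
    eigvals, eigvecs = np.linalg.eigh(np.corrcoef(X, rowvar=False))
    order = np.argsort(eigvals)[::-1][:n_components]
    V = eigvecs[:, order]                          # loadings of retained components
    T = Z @ V                                      # component scores (mean zero)
    gamma, *_ = np.linalg.lstsq(T, y - y.mean(), rcond=None)
    beta_std = V @ gamma                           # coefficients on standardized variables
    beta = beta_std / x_std                        # back to original units
    intercept = y.mean() - x_mean @ beta
    return intercept, beta

# Synthetic, hypothetical data: three nearly collinear "production factors"
rng = np.random.default_rng(0)
base = rng.normal(size=200)
X = np.column_stack([base + 0.05 * rng.normal(size=200) for _ in range(3)])
y = X @ np.array([1.0, 1.0, 1.0]) + 0.1 * rng.normal(size=200)

intercept_1, beta_1 = pcr_fit(X, y, n_components=1)  # keep only the dominant component
intercept_3, beta_3 = pcr_fit(X, y, n_components=3)  # all components retained
print("1 component :", beta_1)
print("3 components:", beta_3)
```

When every component is retained, the back-transformed coefficients coincide with ordinary least squares; dropping the small-eigenvalue components is what stabilizes the estimates.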

A Study on the Derivation of the Unit Hydrograph using Multiple Regression Model (다중회귀모형으로 추정된 모수에 의한 최적단위유량도의 유도에 관한 연구)

  • 이종남;김채원;황창현
    • Water for future / v.25 no.1 / pp.93-100 / 1992
  • The purpose of this study is to derive an optimal unit hydrograph using the multiple regression model, particularly when only a small amount of data is available. The presence of multicollinearity among the input data can cause serious oscillations in the derived unit hydrograph. In this case, the oscillations in the unit hydrograph ordinates are eliminated by combining the data. The data used in this study are based upon the collection and arrangement of rainfall-runoff data (1977-1989) at the Soyang River Dam site. With the matrix $X$ formed from the rainfall series, the condition number and the reciprocal of the minimum eigenvalue of $X^TX$ are calculated by the Jacobi method and compared with the oscillation in the unit hydrograph. The optimal unit hydrograph is derived by combining the numerous rainfall-runoff data. The conclusions are as follows: 1) The oscillations in the derived unit hydrograph are reduced by combining the data from each flood event. 2) The reciprocal of the minimum eigenvalue of $X^TX$, 1/k, and the condition number CN increase when the oscillations are active in the derived unit hydrograph. 3) The parameter estimates are validated by extending the model to the Soyang River Dam site with elimination of the autocorrelation in the disturbances. Finally, this paper illustrates the application of the multiple regression model to derive an optimal unit hydrograph while dealing with the multicollinearity and autocorrelation that cause problems.

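The two diagnostics named in the abstract above, the reciprocal of the minimum eigenvalue of $X^TX$ and the condition number, can be sketched as follows. Several things here are assumptions made for illustration: the rainfall values are invented, `rainfall_matrix` and `collinearity_diagnostics` are hypothetical helper names, the condition number is taken as $\sqrt{\lambda_{max}/\lambda_{min}}$ (the abstract does not give its definition), and NumPy's symmetric eigenvalue solver stands in for the Jacobi method.

```python
import numpy as np

def rainfall_matrix(rain, n_ordinates):
    """Convolution matrix X such that runoff = X @ unit_hydrograph_ordinates."""
    rain = np.asarray(rain, dtype=float)
    m = len(rain)
    X = np.zeros((m + n_ordinates - 1, n_ordinates))
    for j in range(n_ordinates):
        X[j:j + m, j] = rain          # column j is the rainfall series delayed by j steps
    return X

def collinearity_diagnostics(X):
    """Return (1 / lambda_min, condition number) computed from the eigenvalues of X^T X."""
    eigvals = np.linalg.eigvalsh(X.T @ X)      # ascending order
    lam_min, lam_max = eigvals[0], eigvals[-1]
    return 1.0 / lam_min, np.sqrt(lam_max / lam_min)

# Hypothetical rainfall series for two flood events (not data from the paper)
event_a = rainfall_matrix([4.0, 9.0, 6.0, 2.0], n_ordinates=5)
event_b = rainfall_matrix([1.0, 3.0, 12.0, 5.0], n_ordinates=5)
combined = np.vstack([event_a, event_b])       # combining the data from both events

for name, mat in [("event A", event_a), ("event B", event_b), ("combined", combined)]:
    inv_min_eig, cond = collinearity_diagnostics(mat)
    print(f"{name}: 1/lambda_min = {inv_min_eig:.4f}, condition number = {cond:.2f}")
```

Stacking events adds their $X^TX$ matrices, so the smallest eigenvalue of the combined matrix cannot fall below that of any single event; this is one way to see why combining flood events can damp the oscillations.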

A New Deletion Criterion of Principal Components Regression with Orientations of the Parameters

  • Lee, Won-Woo
    • Journal of the Korean Statistical Society / v.16 no.2 / pp.55-70 / 1987
  • Principal components regression is one of the substitutes for the least squares method when multicollinearity exists in the multiple linear regression model. It is observed graphically that the performance of principal components regression depends strongly on the values of the parameters. Accordingly, a new deletion criterion that determines which principal components should be deleted from the analysis is developed, and its usefulness is checked by simulations.


A Graphical Method for Evaluating the Mixture Component Effects of Ridge Regression Estimator in Mixture Experiments

  • Jang, Dae-Heung
    • Communications for Statistical Applications and Methods / v.6 no.1 / pp.1-10 / 1999
  • When the component proportions in mixture experiments are restricted by lower and upper bounds, multicollinearity appears all too frequently. Ridge regression can be used to stabilize the coefficient estimates in the fitted model. I propose a graphical method for evaluating the mixture component effects of the ridge regression estimator with respect to the prediction variance and the prediction bias.


A Study on Developing a CER Using Production Cost Data in Korean Maneuver Weapon System (한국형 기동무기체계 양산비 비용추정관계식 개발에 관한 연구)

  • Lee, Doo-Hyun;Kim, Gak-Gyu
    • Journal of the Korean Operations Research and Management Science Society / v.39 no.3 / pp.51-61 / 2014
  • In this paper, we develop a cost estimation relationship (CER) for Korean maneuver weapon systems using historical production cost. To develop the CER, we collected historical production cost data for four tanks and five armored vehicles. We also analyzed the Required Operational Capability (ROC) of the weapon systems and chose cost drivers that allow their operational capabilities to be compared. We used forward selection, backward selection, stepwise regression, and $R^2$ selection to identify the cost drivers that have the greatest influence on the dependent variable. We then used Principal Component Regression, Robust Regression, and Weighted Regression to deal with multicollinearity and outliers in the data and to develop a more appropriate CER. As a result, we were able to develop a production cost CER for Korean maneuver weapon systems with the lowest cost errors. This research is meaningful in that it develops a CER based on original Korean cost data without foreign data, and these methods will contribute to developing a Korean cost analysis program in the future.

Developing an R&D CER Using Historical Defense Weapon System Data in Korea (한국 무기체계 개발 실적을 고려한 연구개발 비용추정관계식 개발)

  • Eo, Won-Jae;Lee, Yong-Bok;Kang, Sung-Jin
    • Journal of Korean Society of Industrial and Systems Engineering / v.33 no.3 / pp.55-62 / 2010
  • Cost estimation is currently very important to government acquisition programs, both to support funding decisions and to evaluate resource requirements at key decision points. Parametric cost estimating models have been used extensively to obtain appropriate cost estimates in the early acquisition phase. However, their results are of limited reliability in the Korean defense environment because the models were developed for the U.S. environment. To obtain a good R&D cost estimate, it is essential to develop our own CERs (Cost Estimation Relationships) using historical R&D data. Nevertheless, there has been little research on developing such CERs. In this research, we established a CER development process and found some cost drivers in the historical movement weapon system R&D data. The R&D CER is developed using the PCR (Principal Component Regression) method to remove multicollinearity among the data and to overcome the restriction of an insufficient number of samples. This research is meaningful at least as a first attempt at defining the CER development process and obtaining our own R&D CER based on historical data from the Korean weapon system R&D environment.

A Criterion for the Selection of Principal Components in the Robust Principal Component Regression (로버스트주성분회귀에서 최적의 주성분선정을 위한 기준)

  • Kim, Bu-Yong
    • Communications for Statistical Applications and Methods / v.18 no.6 / pp.761-770 / 2011
  • Robust principal components regression is suggested to deal with both the multicollinearity and outlier problems. A main aspect of robust principal components regression is the selection of an optimal set of principal components. Instead of the eigenvalues of the sample covariance matrix, a selection criterion is developed based on the condition index of the minimum volume ellipsoid estimator, which is highly robust against leverage points. In addition, least trimmed squares estimation is employed to cope with regression outliers. Monte Carlo simulation results indicate that the proposed criterion is superior to existing ones.

Development of Ridge Regression Model of Pollutant Load Using Runoff Weighted Value Based on Distributed Curve-Number (분포형 CN 기반 토지피복별 유출가중치를 이용한 오염부하량 능형회귀모형 개발)

  • Song, Chul Min;Kim, Jin Soo
    • Journal of The Korean Society of Agricultural Engineers / v.60 no.1 / pp.111-120 / 2018
  • The purpose of this study was to develop a ridge regression (RR) model to estimate BOD and TP loads using runoff weighted values. The concept of the runoff weighted value, based on the distributed curve number (CN), was introduced to reflect the impact of land cover on runoff. The runoff depths estimated by distributed CN were closer to the observed values than those estimated by area-weighted mean CN. RR is a technique used when the data suffer from multicollinearity. The RR model was developed for five flow duration intervals, with the daily runoff discharge of seven land covers as independent variables and the daily pollutant load as the dependent variable. The RR model was applied to the Heuk river watershed, a subwatershed of the Han river watershed. The variance inflation factors of the RR model decreased to values below 10. The RR model performed well, with Nash-Sutcliffe efficiency (NSE) of 0.73 and 0.87 and Pearson correlation coefficients of 0.88 and 0.93 for BOD and TP, respectively. The results suggest that the methods used in the study can be applied to estimate the pollutant load of watersheds with different land covers using limited data.
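How a ridge parameter lowers the variance inflation factors can be sketched as below. This is an illustration under assumptions, not the model from the paper: the data are synthetic, the names `ridge_vif` and `ridge_coefs` are hypothetical, and the ridge VIFs are taken as the diagonal of $(R+kI)^{-1}R(R+kI)^{-1}$, where $R$ is the predictor correlation matrix and $k$ the ridge parameter.

```python
import numpy as np

def ridge_vif(X, k):
    """VIFs of the ridge estimator in correlation form: diag((R+kI)^-1 R (R+kI)^-1)."""
    R = np.corrcoef(X, rowvar=False)
    inv = np.linalg.inv(R + k * np.eye(R.shape[0]))
    return np.diag(inv @ R @ inv)

def ridge_coefs(X, y, k):
    """Ridge coefficients for standardized predictors and a centered response."""
    n = len(y)
    Z = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)
    R = Z.T @ Z / (n - 1)                       # correlation matrix of the predictors
    r_xy = Z.T @ (y - y.mean()) / (n - 1)       # covariance of each predictor with y
    return np.linalg.solve(R + k * np.eye(X.shape[1]), r_xy)

# Synthetic, hypothetical predictors: four strongly correlated runoff series
rng = np.random.default_rng(1)
base = rng.normal(size=150)
X = np.column_stack([base + 0.1 * rng.normal(size=150) for _ in range(4)])
y = X.sum(axis=1) + 0.2 * rng.normal(size=150)

for k in (0.0, 0.01, 0.05, 0.1):
    print(f"k = {k:<5} max VIF = {ridge_vif(X, k).max():8.2f}  coefs = {ridge_coefs(X, y, k)}")
```

At `k = 0` the function reproduces the ordinary VIFs (the diagonal of the inverse correlation matrix), and each VIF shrinks as `k` grows, at the cost of some bias in the coefficients.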

Development of Regression Models Resolving High-Dimensional Data and Multicollinearity Problem for Heavy Rain Damage Data (호우피해자료에서의 고차원 자료 및 다중공선성 문제를 해소한 회귀모형 개발)

  • Kim, Jeonghwan;Park, Jihyun;Choi, Changhyun;Kim, Hung Soo
    • KSCE Journal of Civil and Environmental Engineering Research / v.38 no.6 / pp.801-808 / 2018
  • Fitting a linear regression model is stable under the assumption that the sample size is sufficiently larger than the number of explanatory variables and that there is no serious multicollinearity among the explanatory variables. In this study, we investigated the difficulty of model fitting when this assumption is violated by analyzing real heavy rain damage data, and we proposed using a principal component regression model or a ridge regression model after integrating the data to overcome the difficulty. We evaluated the predictive performance of the proposed models using test data independent of the training data, and confirmed that the proposed methods showed better predictive performance than the linear regression model.

Optimal Selection of Classifier Ensemble Using Genetic Algorithms (유전자 알고리즘을 이용한 분류자 앙상블의 최적 선택)

  • Kim, Myung-Jong
    • Journal of Intelligence and Information Systems / v.16 no.4 / pp.99-112 / 2010
  • Ensemble learning is a method for improving the performance of classification and prediction algorithms. It finds a highly accurate classifier on the training set by constructing and combining an ensemble of weak classifiers, each of which needs only to be moderately accurate on the training set. Ensemble learning has received considerable attention in the machine learning and artificial intelligence fields because of its remarkable performance improvement and its flexible integration with traditional learning algorithms such as decision trees (DT), neural networks (NN), and support vector machines (SVM). In that research, DT ensemble studies have consistently demonstrated impressive improvements in the generalization behavior of DT, while NN and SVM ensemble studies have not shown performance as remarkable as that of DT ensembles. Recently, several works have reported that the performance of an ensemble can be degraded when its classifiers are highly correlated with one another, which results in a multicollinearity problem. They have also proposed differentiated learning strategies to cope with this performance degradation. Hansen and Salamon (1990) argued that it is necessary and sufficient for the performance enhancement of an ensemble that the ensemble contain diverse classifiers. Breiman (1996) showed that ensemble learning can increase the performance of unstable learning algorithms, but does not bring remarkable improvement for stable learning algorithms. Unstable learning algorithms such as decision tree learners are sensitive to changes in the training data, so small changes in the training data can yield large changes in the generated classifiers. Therefore, an ensemble of unstable learning algorithms can guarantee some diversity among its classifiers.
In contrast, stable learning algorithms such as NN and SVM generate similar classifiers despite small changes in the training data, so the correlation among the resulting classifiers is very high. This high correlation results in a multicollinearity problem, which leads to performance degradation of the ensemble. Kim's work (2009) compared the performance of traditional prediction algorithms such as NN, DT, and SVM in bankruptcy prediction for Korean firms. It reports that stable learning algorithms such as NN and SVM have higher predictability than the unstable DT, whereas DT ensembles show a greater performance improvement than NN and SVM ensembles. Further analysis with the variance inflation factor (VIF) shows empirically that the performance degradation of the ensemble is due to the multicollinearity problem. It also proposes that optimization of the ensemble is needed to cope with this problem. This paper proposes a hybrid system for coverage optimization of NN ensembles (CO-NN) in order to improve the performance of NN ensembles. Coverage optimization is a technique for choosing a sub-ensemble from an original ensemble so as to guarantee the diversity of its classifiers. CO-NN uses a genetic algorithm (GA), which has been widely used for various optimization problems, to solve the coverage optimization problem. The GA chromosomes for coverage optimization are encoded as binary strings, each bit of which indicates an individual classifier. The fitness function is defined as the maximization of error reduction, and a constraint on the variance inflation factor (VIF), one of the generally used measures of multicollinearity, is added to ensure the diversity of classifiers by removing high correlation among them. We use Microsoft Excel and the GA software package Evolver.
Experiments on company failure prediction have shown that CO-NN effectively and stably enhances the performance of NN ensembles by choosing classifiers with the correlations within the ensemble taken into account. Classifiers with a potential multicollinearity problem are removed by the coverage optimization process of CO-NN, and CO-NN thereby shows higher performance than a single NN classifier and an NN ensemble at the 1% significance level, and than a DT ensemble at the 5% significance level. However, further research issues remain. First, a decision optimization process for finding the optimal combination function should be considered. Second, various learning strategies for dealing with data noise should be introduced in future research.
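The coverage optimization idea described above (binary chromosomes, a fitness to maximize, and a VIF constraint) can be sketched in Python as follows. This is a rough illustration under assumptions, not the CO-NN system itself, which the abstract says was built with Microsoft Excel and Evolver. The classifier outputs are synthetic, plain accuracy of the averaged sub-ensemble stands in for the paper's error-reduction fitness, and all function names and GA settings are hypothetical.

```python
import numpy as np

def max_vif(outputs):
    """Largest variance inflation factor among the columns of `outputs`."""
    if outputs.shape[1] < 2:
        return 1.0
    corr = np.corrcoef(outputs, rowvar=False)
    return float(np.max(np.diag(np.linalg.pinv(corr))))

def fitness(mask, outputs, labels, vif_limit):
    """Accuracy of the averaged sub-ensemble; 0 if it is empty or violates the VIF limit."""
    if mask.sum() == 0:
        return 0.0
    chosen = outputs[:, mask.astype(bool)]
    vif = max_vif(chosen)
    if not np.isfinite(vif) or vif > vif_limit:
        return 0.0
    pred = (chosen.mean(axis=1) > 0.5).astype(int)
    return float((pred == labels).mean())

def select_sub_ensemble(outputs, labels, vif_limit=10.0, pop_size=30, generations=40, seed=0):
    """Tiny genetic algorithm: each bit of a chromosome marks one classifier as selected."""
    rng = np.random.default_rng(seed)
    n = outputs.shape[1]
    pop = rng.integers(0, 2, size=(pop_size, n))
    best_mask, best_fit = None, -1.0
    for _ in range(generations):
        fits = np.array([fitness(ind, outputs, labels, vif_limit) for ind in pop])
        i = int(fits.argmax())
        if fits[i] > best_fit:
            best_fit, best_mask = float(fits[i]), pop[i].copy()
        # Binary tournament selection
        a, b = rng.integers(0, pop_size, size=(2, pop_size))
        parents = np.where((fits[a] >= fits[b])[:, None], pop[a], pop[b])
        # One-point crossover between consecutive parents
        children = parents.copy()
        for k in range(0, pop_size - 1, 2):
            cut = rng.integers(1, n)
            children[k, cut:] = parents[k + 1, cut:]
            children[k + 1, cut:] = parents[k, cut:]
        # Bit-flip mutation
        flip = rng.random(children.shape) < 0.05
        pop = np.where(flip, 1 - children, children)
    return best_mask, best_fit

# Synthetic outputs: classifiers 0-3 are near-duplicates, classifiers 4-7 are diverse
rng = np.random.default_rng(42)
labels = rng.integers(0, 2, size=300)
signal = 0.5 + 0.3 * (2 * labels - 1)
shared = rng.normal(0, 0.4, size=300)
similar = [signal + shared + rng.normal(0, 0.01, size=300) for _ in range(4)]
diverse = [signal + rng.normal(0, 0.4, size=300) for _ in range(4)]
outputs = np.column_stack(similar + diverse)

best_mask, best_fit = select_sub_ensemble(outputs, labels)
print("selected classifiers:", np.flatnonzero(best_mask), "accuracy:", best_fit)
```

Treating the VIF limit as a hard constraint (fitness zero when violated) keeps the sketch short; a penalty term would be a softer alternative.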