• 제목/요약/키워드: Bias in variable selection

검색결과 39건 처리시간 0.022초

A Study on the Bias Reduction in Split Variable Selection in CART

  • Song, Hyo-Im;Song, Eun-Tae;Song, Moon Sup
    • Communications for Statistical Applications and Methods
    • /
    • 제11권3호
    • /
    • pp.553-562
    • /
    • 2004
  • In this short communication we discuss the bias problems of CART in split variable selection and suggest a method to reduce the variable selection bias. Penalties proportional to the number of categories or distinct values are applied to the splitting criteria of CART. The results of empirical comparisons show that the proposed modification of CART reduces the bias in variable selection.

Bias Reduction in Split Variable Selection in C4.5

  • Shin, Sung-Chul;Jeong, Yeon-Joo;Song, Moon Sup
    • Communications for Statistical Applications and Methods
    • /
    • 제10권3호
    • /
    • pp.627-635
    • /
    • 2003
  • In this short communication we discuss the bias problem of C4.5 in split variable selection and suggest a method to reduce the variable selection bias among categorical predictor variables. A penalty proportional to the number of categories is applied to the splitting criterion gain of C4.5. The results of empirical comparisons show that the proposed modification of C4.5 reduces the size of classification trees.

의사결정나무에서 분리 변수 선택에 관한 연구 (A Study on Selection of Split Variable in Constructing Classification Tree)

  • 정성석;김순영;임한필
    • 응용통계연구
    • /
    • 제17권2호
    • /
    • pp.347-357
    • /
    • 2004
  • 의사결정나무에서 분리 변수를 선택하는 것은 매우 중요한 일이다. C4.5는 변수 선택에 있어 연속형 변수로의 변수 선택 편의가 심각하고, QUEST는 연속형 변수와 관련해서 정규성 가정이 위반될 경우 변수 선택력이 떨어진다. 본 논문에서는 통계적 로버스트 검정 알고리즘을 제안하고, 모의 실험을 통하여 C4.5, QUEST그러고 제안된 알고리즘의 효율성을 비교하였다. 실험 결과 제안된 알고리즘이 변수 선택 편의와 변수 선택력 측면에서 로버스트함을 알 수 있었다.

변수선택 편향이 없는 회귀나무를 만들기 위한 알고리즘 (Regression Trees with. Unbiased Variable Selection)

  • 김진흠;김민호
    • 응용통계연구
    • /
    • 제17권3호
    • /
    • pp.459-473
    • /
    • 2004
  • 본 논문에서는 Breiman 등(1984)의 전체탐색법이 갖고 있는 변수선택 편향을 극복할 수 있는 알고리즘을 제안하였다. 제안한 알고리즘은 노드의 분리 변수를 선택하는 단계와 그 선택된 변수에 대해서만 이진분리를 위한 분리점을 찾는 단계로 나뉘어져 있다. 예측변수가 연속형 일 때는 스피어만의 순위상관계수에 의한 검정을 수행하고, 범주형일 때는 크루스칼-왈리스의 통계량에 의한 검정을 수행하여 통계적으로 가장 유의한 변수를 분리변수로 선택하였고 Breiman 등(1984)의 전체탐색법을 그 변수에만 적용하여 노드의 분리기준을 정하였다 모의실험 연구를 통해 Breiman등(19히)의 CART와 제안한 알고리즘을 변수선택 편의, 변수선택력파 평균제곱오차 측면에서 서로 비교하였다. 아울러 두 알고리즘을 실제 자료에 적용하여 효율을 서로 비교하였다.

A Study on Unbiased Methods in Constructing Classification Trees

  • Lee, Yoon-Mo;Song, Moon Sup
    • Communications for Statistical Applications and Methods
    • /
    • 제9권3호
    • /
    • pp.809-824
    • /
    • 2002
  • we propose two methods which separate the variable selection step and the split-point selection step. We call these two algorithms as CHITES method and F&CHITES method. They adapted some of the best characteristics of CART, CHAID, and QUEST. In the first step the variable, which is most significant to predict the target class values, is selected. In the second step, the exhaustive search method is applied to find the splitting point based on the selected variable in the first step. We compared the proposed methods, CART, and QUEST in terms of variable selection bias and power, error rates, and training times. The proposed methods are not only unbiased in the null case, but also powerful for selecting correct variables in non-null cases.

A Study on Split Variable Selection Using Transformation of Variables in Decision Trees

  • Chung, Sung-S.;Lee, Ki-H.;Lee, Seung-S.
    • Journal of the Korean Data and Information Science Society
    • /
    • 제16권2호
    • /
    • pp.195-205
    • /
    • 2005
  • In decision tree analysis, C4.5 and CART algorithm have some problems of computational complexity and bias on variable selection. But QUEST algorithm solves these problems by dividing the step of variable selection and split point selection. When input variables are continuous, QUEST algorithm uses ANOVA F-test under the assumption of normality and homogeneity of variances. In this paper, we investigate the influence of violation of normality assumption and effect of the transformation of variables in the QUEST algorithm. In the simulation study, we obtained the empirical powers of variable selection and the empirical bias of variable selection after transformation of variables having various type of underlying distributions.

  • PDF

Robust Variable Selection in Classification Tree

  • 장정이;정광모
    • 한국통계학회:학술대회논문집
    • /
    • 한국통계학회 2001년도 추계학술발표회 논문집
    • /
    • pp.89-94
    • /
    • 2001
  • In this study we focus on variable selection in decision tree growing structure. Some of the splitting rules and variable selection algorithms are discussed. We propose a competitive variable selection method based on Kruskal-Wallis test, which is a nonparametric version of ANOVA F-test. Through a Monte Carlo study we note that CART has serious bias in variable selection towards categorical variables having many values, and also QUEST using F-test is not so powerful to select informative variables under heavy tailed distributions.

  • PDF

데이터마이닝 패키지에서 변수선택 편의에 관한 연구 (A Study on Variable Selection Bias in Data Mining Software Packages)

  • 송문섭;윤영주
    • 응용통계연구
    • /
    • 제14권2호
    • /
    • pp.475-486
    • /
    • 2001
  • 데이터마이닝 패키지에 구현된 분류나무 알고리즘 가운데 CART, CHAID, QUEST, C4.5에서 변수 선택법을 비교하였다. CART의 전체탐색법이 편의를 갖는다는 사실은 잘알려졌으며, 여기서는 상품화된 패키지들에서 이들 알고리즘의 편의와 선택력을 모의실험 연구를 통하여 비교하였다. 상용 패키지로는 CART, Enterprise Miner, AnswerTree, Clementine을 사용하였다. 본 논문의 제한된 모의실험 연구 결과에 의하면 C4.5와 CART는 모두 변수선택에서 심각한 편의를 갖고 있으며, CHAID와 QUEST는 비교적 안정된 결과를 보여주고 있었다.

  • PDF

교수 및 학습 프로그램 평가연구의 선별편향성 개선을 위한 제언 (Suggestions to Improve Selection-Bias in Teaching or Studying Programs)

  • 박경호
    • 의학교육논단
    • /
    • 제12권1호
    • /
    • pp.3-8
    • /
    • 2010
  • This study is designed to evaluate the effectiveness of teaching or studying programs, and thus to overcome the selectionbias in studies. Selection-bias derived from unobservable characteristics in the course of participants selection of the teaching or studying programs, in the case of cross-section data instrumental variable(IV) method and two stage least square estimation were suggested as an analysis tool. Panel data were analyzed by using both fixed effect in which individual effects are captured by intercept terms and random effect estimation where an unobserved effect can be characterized as being randomly drawn from a given distribution.

Variable Selection with Nonconcave Penalty Function on Reduced-Rank Regression

  • Jung, Sang Yong;Park, Chongsun
    • Communications for Statistical Applications and Methods
    • /
    • 제22권1호
    • /
    • pp.41-54
    • /
    • 2015
  • In this article, we propose nonconcave penalties on a reduced-rank regression model to select variables and estimate coefficients simultaneously. We apply HARD (hard thresholding) and SCAD (smoothly clipped absolute deviation) symmetric penalty functions with singularities at the origin, and bounded by a constant to reduce bias. In our simulation study and real data analysis, the new method is compared with an existing variable selection method using $L_1$ penalty that exhibits competitive performance in prediction and variable selection. Instead of using only one type of penalty function, we use two or three penalty functions simultaneously and take advantages of various types of penalty functions together to select relevant predictors and estimation to improve the overall performance of model fitting.