• Title/Summary/Keyword: categorical variable

Search Result 104

Effects on Regression Estimates under Misspecified Generalized Linear Mixed Models for Counts Data

  • Jeong, Kwang Mo
    • The Korean Journal of Applied Statistics / v.25 no.6 / pp.1037-1047 / 2012
  • The generalized linear mixed model (GLMM) is widely used to fit categorical responses in clustered data. In the numerical approximation of the likelihood function, normality is assumed for the random effects distribution, and commercial statistical packages routinely fit GLMMs under this assumption. We may also encounter departures from the distributional assumption on the response variable. It is of interest to investigate how misspecified distributions affect the parameter estimates; however, there has been limited research on these topics. We study the sensitivity or robustness of the maximum likelihood estimators (MLEs) of GLMMs for count data when the true underlying distribution is normal, gamma, exponential, or a mixture of two normal distributions. We also consider the effects on the MLEs when a Poisson-normal GLMM is fitted to outcomes generated from an overdispersed negative binomial distribution. Through a small-scale Monte Carlo study we examine the empirical coverage probabilities of the parameters and the biases of the MLEs.

Reliability Metrics Design and Verification for the Acquisition of Small and Mid-Sized Web Application (중소규모 웹어플리케이션 개발업체 신뢰성평가를 위한 신뢰도 메트릭의 설계 및 유효성 검증)

  • Choi, Kwoung-Hee;Rhew, Sung-Yul
    • Asia pacific journal of information systems / v.16 no.3 / pp.193-203 / 2006
  • Software reliability prediction is a statistical method for putting in place a timely software development practice that is useful for the objective assessment of bidders. The current study suggests a research method that enables reliability assessment of bidders' previous projects by studying user satisfaction and project management history. If incorporated into the existing acquisition process, the reliability assessment method will further enhance objectivity and accuracy in the bidder selection process. The GQM (Goal Question Metric) paradigm was used to identify assessment metrics for bidder evaluation, and questionnaires were collected from users to create user satisfaction indexes. In addition, 'weight of evidence', the most appropriate categorical method, was used to isolate the attributes of each variable that may contribute to reliability assessment.

Extension of the Mantel-Haenszel test to bivariate interval censored data

  • Lee, Dong-Hyun;Kim, Yang-Jin
    • Communications for Statistical Applications and Methods / v.29 no.4 / pp.403-411 / 2022
  • This article presents an independence test between pairs of interval censored failure times. The Mantel-Haenszel test is commonly applied to test the independence between two categorical variables in the presence of a stratification variable. Hsu and Prentice (1996) applied a Mantel-Haenszel test to the sequence of 2 × 2 tables formed at the grid points composed of failure times. In this article, because the failure times are unknown, suitable grid points must be determined and the failure and at-risk statuses are estimated at those grid points. We also consider a weighted test statistic to obtain a more powerful test. Simulation studies are performed to evaluate the power of the test statistics under finite samples. The method is applied to two real data sets: mastitis data from milk cows and an age-related eye disease study.
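
As an illustration of the classical Mantel-Haenszel statistic mentioned in this abstract, the following is a minimal sketch for fully observed 2 × 2 tables, one per stratum. It does not implement the interval-censored extension or the weighted statistic proposed in the article; the function name `mantel_haenszel_statistic` and the example counts are hypothetical, and no continuity correction is applied.

```python
def mantel_haenszel_statistic(tables):
    """Mantel-Haenszel chi-square statistic without continuity correction.

    tables: iterable of 2x2 tables ((a, b), (c, d)), one per stratum.
    """
    observed = expected = variance = 0.0
    for (a, b), (c, d) in tables:
        n = a + b + c + d
        if n < 2:
            continue  # a stratum with fewer than two subjects adds no information
        row1, row2 = a + b, c + d
        col1, col2 = a + c, b + d
        observed += a
        # Mean and variance of the top-left cell under independence
        # (hypergeometric distribution with fixed margins).
        expected += row1 * col1 / n
        variance += row1 * row2 * col1 * col2 / (n * n * (n - 1))
    if variance == 0:
        raise ValueError("all strata are degenerate")
    return (observed - expected) ** 2 / variance


# Hypothetical counts for two strata.
strata = [((3, 1), (1, 3)), ((10, 10), (10, 10))]
print(mantel_haenszel_statistic(strata))
```

Under the null hypothesis of independence within every stratum, the statistic is compared with a chi-square distribution with one degree of freedom.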

Categorical Variable Selection in Naïve Bayes Classification (단순 베이즈 분류에서의 범주형 변수의 선택)

  • Kim, Min-Sun;Choi, Hosik;Park, Changyi
    • The Korean Journal of Applied Statistics / v.28 no.3 / pp.407-415 / 2015
  • Naïve Bayes classification is based on input variables that are conditionally independent given the output variable. The naïve Bayes assumption is unrealistic but simplifies the problem of high-dimensional joint probability estimation into a series of univariate probability estimations. Thus the naïve Bayes classifier is often adopted in the analysis of massive data sets, such as in spam e-mail filtering and recommendation systems. In this paper, we propose a variable selection method based on the $\chi^2$ statistic between input and output variables. The proposed method retains the simplicity of the naïve Bayes classifier in terms of data processing and computation, yet it can select relevant variables. It is expected that our method can be useful in classification problems for ultra-high dimensional or big data, such as the classification of diseases based on single nucleotide polymorphisms (SNPs).
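
The idea of screening categorical inputs by their $\chi^2$ association with the output can be sketched as follows. This is only an illustration under stated assumptions: the abstract does not give the paper's exact selection rule, so the functions `chi_square` and `rank_variables` and the toy data are hypothetical, and variables are simply ranked by the raw Pearson statistic.

```python
from collections import Counter


def chi_square(x, y):
    """Pearson chi-square statistic for two equally long categorical sequences."""
    n = len(x)
    joint = Counter(zip(x, y))
    x_counts = Counter(x)
    y_counts = Counter(y)
    stat = 0.0
    for xv, nx in x_counts.items():
        for yv, ny in y_counts.items():
            expected = nx * ny / n  # expected cell count under independence
            stat += (joint[(xv, yv)] - expected) ** 2 / expected
    return stat


def rank_variables(columns, y):
    """Order variable names from strongest to weakest association with y."""
    scores = {name: chi_square(col, y) for name, col in columns.items()}
    return sorted(scores, key=scores.get, reverse=True)


# Toy data: "signal" matches the class labels, "noise" is unrelated to them.
labels = [0, 0, 1, 1]
inputs = {"noise": [0, 1, 0, 1], "signal": [0, 0, 1, 1]}
print(rank_variables(inputs, labels))
```

When inputs have different numbers of categories, their statistics have different degrees of freedom, so ranking by p-value rather than by the raw statistic may be more appropriate.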

Logistic Regressions with Sensory Evaluation Data about Hanwoo Steer Beef (한우 거세우 고기 관능평가 데이터의 로지스틱 회귀분석)

  • Lee, Hye-Jung;Kim, Jae-Hee
    • The Korean Journal of Applied Statistics / v.23 no.5 / pp.857-870 / 2010
  • This study was conducted to investigate the relationship between socio-demographic factors and Korean consumers' palatability evaluation grades, using Hanwoo sensory evaluation data collected from 2006 to 2008 by the National Institute of Animal Science. The dichotomous logistic regression model and the multinomial logistic regression model are fitted with independent variables such as the consumer's living location, age, gender, occupation, monthly income, and beef cut, with the palatability grade as the categorical dependent variable and tenderness, flavor and juiciness as the continuous dependent variables. A stepwise variable selection procedure is incorporated to find the final model, and odds ratios are calculated to find the associations between categories.

Variable Selection for Multi-Purpose Multivariate Data Analysis (다목적 다변량 자료분석을 위한 변수선택)

  • Huh, Myung-Hoe;Lim, Yong-Bin;Lee, Yong-Goo
    • The Korean Journal of Applied Statistics / v.21 no.1 / pp.141-149 / 2008
  • Multivariate data with a very large number of variables are now analyzed frequently. In such data sets, virtually duplicated variables may exist simultaneously even though they are conceptually distinguishable. Duplicated variables may cause problems such as the distortion of principal axes in principal component analysis and factor analysis, and the distortion of the distances between observations, i.e. the input for cluster analysis. In supervised learning or regression analysis, duplicated explanatory variables also often cause instability in the fitted models. Since real data analyses often serve multiple purposes, it is necessary to reduce the number of variables to a parsimonious level. The aim of this paper is to propose a practical algorithm for selecting a subset of variables from a given set of p input variables, using the criterion of the minimum trace of the partial variances of the unselected variables left unexplained by the selected variables. The usefulness of the proposed method is demonstrated in visualizing the relationship between selected and unselected variables, in building a predictive model with a very large number of independent variables, and in reducing the number of variables and purging/merging categories in categorical data.

Error cause analysis of Pearson test statistics for k-population homogeneity test (k-모집단 동질성검정에서 피어슨검정의 오차성분 분석에 관한 연구)

  • Heo, Sunyeong
    • Journal of the Korean Data and Information Science Society / v.24 no.4 / pp.815-824 / 2013
  • The traditional Pearson chi-squared test is not appropriate for data collected under a complex sample design. When the traditional Pearson chi-squared test is applied to complex sample categorical data, it may give wrong test results, and the error may arise not only from biased variance estimators but also from biased point estimators of cell proportions. In this study, a design-based consistent Wald test statistic was derived for the k-population homogeneity test, and the traditional Pearson chi-squared test statistic was partitioned into three parts according to the cause of error: the error due to the bias of the variance estimator, the error due to the bias of the cell proportion estimator, and the unseparated error due to both biases together. The relative size of each error component with respect to the Pearson chi-squared test statistic was analyzed empirically. The second-year data from the fourth Korean National Health and Nutrition Examination Survey (KNHANES IV-2) were used for the analysis. The empirical results show that the error from the bias of the variance estimator was relatively larger than the error from the bias of the cell proportion estimator, but the degree differed from variable to variable.

Categorical data analysis of sensory evaluation data with Hanwoo bull beef (한우 수소 고기 관능평가 데이터에 대한 범주형 자료 분석)

  • Lee, Hye-Jung;Cho, Soo-Hyun;Kim, Jae-Hee
    • Journal of the Korean Data and Information Science Society / v.20 no.5 / pp.819-827 / 2009
  • This study was conducted to investigate the relationship between sociodemographic factors and Korean consumers' palatability evaluation grades, using Hanwoo sensory evaluation data. The dichotomous logistic regression model and the multinomial logistic regression model are fitted with independent variables such as the consumer's living location, age, gender, occupation, monthly income, and beef cut, with the palatability grade as the dependent variable. A stepwise variable selection procedure is incorporated to find the final model, and odds ratios are calculated to find the associations between categories.

Landslide Risk Assessment in Inje Using Logistic Regression Model (로지스틱 회귀분석을 이용한 인제군 산사태지역의 위험도 평가)

  • Lee, Hwan-Gil;Kim, Gi-Hong
    • Journal of the Korean Society of Surveying, Geodesy, Photogrammetry and Cartography / v.30 no.3 / pp.313-321 / 2012
  • Korea has been continuously affected by landslides, as 70% of the land is covered by mountains and most of the annual rainfall is concentrated between June and September. Recently, abrupt climate change has increased the occurrence of landslides. The Gangwon region suffers especially from landslide damage because most of it is mountainous and steep, with shallow soil. In this study, a landslide risk assessment model was developed by applying logistic regression to data from Duksan-ri, Inje-eup, Inje-gun, Gangwon-do, which suffered massive landslides triggered by heavy rain in July 2006. The information collected from field investigation and from aerial photos taken right after the landslides in the study area was stored in a GIS database for analysis. Slope gradient was entered in two ways: as a categorical variable and as a linear variable. An error matrix was produced for each case, and the developed models showed classification accuracies of 81.4% and 81.9%, respectively.

k-Nearest Neighbor-Based Approach for the Estimation of Mutual Information (상호정보 추정을 위한 k-최근접이웃 기반방법)

  • Cha, Woon-Ock;Huh, Moon-Yul
    • Communications for Statistical Applications and Methods / v.15 no.6 / pp.977-991 / 2008
  • This study concerns a k-nearest neighbor-based approach to estimating mutual information when the type of the target variable is categorical and continuous. The results of Monte Carlo simulation and experiments with real-world data show that k=1 is preferable. In practical applications with real-world data, our study shows that jittering and bootstrapping are needed.