• Title/Abstract/Keywords: categorical data analysis

Categorical Data Analysis System in the Internet (인터넷상에서의 범주형 자료분석 시스템 개발)

  • Hong, Jong Seon;Kim, Dong Uk;O, Min Gwon
    • The Korean Journal of Applied Statistics
    • /
    • v.12 no.1
    • /
    • pp.81-81
    • /
    • 1999
  • A categorical data analysis system on the World Wide Web is proposed with an easy-to-use environment. The system is composed of four components. First, it presents several graphical displays for exploratory data analysis of categorical data. Second, it provides some measures of association, including dynamic graphics for the mosaic plots of Hartigan and Kleiner (1981) and Friendly (1994); these dynamic graphics convey useful information. Third, the system can analyze categorical data with loglinear models, so the best-fitting loglinear model can be selected interactively.

Comparing Accuracy of Imputation Methods for Categorical Incomplete Data (범주형 자료의 결측치 추정방법 성능 비교)

  • 신형원;손소영
    • The Korean Journal of Applied Statistics
    • /
    • v.15 no.1
    • /
    • pp.33-43
    • /
    • 2002
  • Various estimation methods have been developed for imputing missing categorical data, including the modal category method, logistic regression, and association rules. In this study, we propose two fusion algorithms, based on a neural network and a voting scheme, that combine the results of the individual imputation methods. A Monte Carlo simulation is used to compare the performance of these methods. Five factors are used to simulate the missing data pattern: (1) input-output function, (2) data size, (3) noise of the input-output function, (4) proportion of missing data, and (5) pattern of missing data. The experimental results indicate the following. When the data size is small and the proportion of missing data is large, the modal category method, association rule, and neural network based fusion perform better than the other methods. However, when the data size is small and the correlation between input and missing output is strong, logistic regression and the neural network based fusion algorithm appear better than the others. When the data size is large with a low proportion of missing data, large noise, and strong correlation between input and missing output, the neural network based fusion algorithm turns out to be the best choice.
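
To make two of the ideas in this abstract concrete, here is a minimal Python sketch of modal category imputation and of a majority-vote fusion of several methods' predictions. It is an illustration under assumptions, not the authors' algorithm: the function names, the use of `None` for a missing value, and the toy data are all hypothetical.

```python
from collections import Counter

def modal_category(values):
    """Most frequent observed category; ties go to the one seen first."""
    observed = [v for v in values if v is not None]
    return Counter(observed).most_common(1)[0][0]

def impute_modal(values):
    """Replace every missing entry (None) with the modal category."""
    mode = modal_category(values)
    return [mode if v is None else v for v in values]

def vote(predictions):
    """Fuse the categories predicted by several imputation methods by majority vote."""
    return Counter(predictions).most_common(1)[0][0]

# Hypothetical toy column with two missing entries.
column = ["a", "b", None, "a", None, "c"]
filled = impute_modal(column)

# Hypothetical predictions for one missing cell from three methods.
fused = vote(["a", "b", "a"])
```

A neural network based fusion, as named in the abstract, would replace `vote` with a trained model that takes the individual predictions as inputs; that part is not sketched here.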

Variable selection for latent class analysis using clustering efficiency (잠재변수 모형에서의 군집효율을 이용한 변수선택)

  • Kim, Seongkyung;Seo, Byungtae
    • The Korean Journal of Applied Statistics
    • /
    • v.31 no.6
    • /
    • pp.721-732
    • /
    • 2018
  • Latent class analysis (LCA) is an important tool to explore unseen latent groups in multivariate categorical data. In practice, it is important to select a suitable set of variables because the inclusion of too many variables in the model makes the model complicated and reduces the accuracy of the parameter estimates. Dean and Raftery (Annals of the Institute of Statistical Mathematics, 62, 11-35, 2010) proposed a headlong search algorithm based on Bayesian information criteria values to choose meaningful variables for LCA. In this paper, we propose a new variable selection procedure for LCA by utilizing posterior probabilities obtained from each fitted model. We propose a new statistic to measure the adequacy of LCA and develop a variable selection procedure. The effectiveness of the proposed method is also presented through some numerical studies.
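
The posterior probabilities mentioned in this abstract follow from Bayes' rule under the usual LCA assumption that items are independent given the class. The sketch below only computes those posteriors for one response pattern; it does not implement the proposed selection statistic, and the parameter values and names are hypothetical.

```python
def lca_posterior(class_weights, item_probs, response):
    """Posterior class membership probabilities for one response pattern.

    class_weights[c]     : prior probability of latent class c
    item_probs[c][j][v]  : P(item j takes category v | class c)
    response[j]          : observed category index of item j
    """
    joint = []
    for weight, probs in zip(class_weights, item_probs):
        likelihood = weight
        for j, v in enumerate(response):
            likelihood *= probs[j][v]  # local independence given the class
        joint.append(likelihood)
    total = sum(joint)
    return [x / total for x in joint]

# Hypothetical two-class model with two binary items.
weights = [0.5, 0.5]
items = [
    [[0.9, 0.1], [0.8, 0.2]],  # class 0
    [[0.2, 0.8], [0.3, 0.7]],  # class 1
]
posterior = lca_posterior(weights, items, response=(0, 0))
```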

An Empirical Study on the Measurement of Clustering and Trend Analysis among the Asian Container Ports Using the Variable Group Benchmarking and Categorical Variable Models (가변 그룹 벤치마킹 모형과 범주형 변수모형을 이용한 아시아 컨테이너항만의 클러스터링측정 및 추세분석에 관한 실증적 연구)

  • Park, Rokyung
    • Journal of Korea Port Economic Association
    • /
    • v.29 no.1
    • /
    • pp.143-175
    • /
    • 2013
  • The purpose of this paper is to show the clustering trend among 38 Asian ports over 9 years (2001-2009) by using the variable group benchmarking (VGB) and categorical variable (CV) models with 4 inputs (berth length, depth, total area, and number of cranes) and 1 output (container TEU). The main empirical results are as follows. First, the VGB clustering results show that the Shanghai, Qingdao, and Ningbo ports played the core role in clustering. Second, the CV analysis focusing on container throughput indicated that, apart from the Chinese ports, Singapore, Keelung, Dubai, and Kaohsiung emerged as the central ports of clustering. Third, the Aqaba, Dubai, Hong Kong, Shanghai, Guangzhou, and Ningbo ports are recommended as the efficient ports to target for clustering. Fourth, when the ports are classified by regional location, Dubai, Khor Fakkan, Shanghai, Hong Kong, Keelung, Ningbo, and Singapore are the core ports for clustering. On the whole, other ports located in Asia should be clustered to the Dubai, Khor Fakkan, Shanghai, Hong Kong, Ningbo, and Singapore ports. The policy implication is that Korean port policy planners should introduce the VGB and CV models for clustering among international ports in order to enhance the efficiency of inputs and outputs.

Variable Selection for Multi-Purpose Multivariate Data Analysis (다목적 다변량 자료분석을 위한 변수선택)

  • Huh, Myung-Hoe;Lim, Yong-Bin;Lee, Yong-Goo
    • The Korean Journal of Applied Statistics
    • /
    • v.21 no.1
    • /
    • pp.141-149
    • /
    • 2008
  • Multivariate data with a quite large number of variables are now analyzed frequently. In such data sets, virtually duplicated variables may exist simultaneously even though they are conceptually distinguishable. Duplicated variables may cause problems such as the distortion of principal axes in principal component analysis and factor analysis and the distortion of the distances between observations, i.e. the input for cluster analysis. In supervised learning or regression analysis, duplicated explanatory variables also often cause instability of the fitted models. Since real data analyses often have multiple purposes, it is necessary to reduce the number of variables to a parsimonious level. The aim of this paper is to propose a practical algorithm for selecting a subset of variables from a given set of p input variables, by the criterion of minimum trace of the partial variances of the unselected variables unexplained by the selected variables. The usefulness of the proposed method is demonstrated in visualizing the relationship between selected and unselected variables, in building a predictive model with a very large number of independent variables, and in reducing the number of variables and purging/merging categories in categorical data.
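
One way to read the minimum-trace criterion is as a greedy forward search on the covariance matrix: at each step, pick the variable whose removal by regression shrinks the trace of the remaining partial covariance the most. The sketch below assumes that greedy reading; the abstract does not spell out the search strategy, so `greedy_select`, `partial_out`, and the toy matrix are hypothetical.

```python
def partial_out(S, j):
    """Partial covariance of all variables after regressing them on variable j."""
    p = len(S)
    return [[S[a][b] - S[a][j] * S[j][b] / S[j][j] for b in range(p)]
            for a in range(p)]

def greedy_select(S, k, tol=1e-12):
    """Greedily pick k variables minimizing the trace of the partial variances
    of the unselected variables. Returns (selected indices, remaining trace)."""
    p = len(S)
    selected = []
    R = [row[:] for row in S]  # current partial covariance matrix
    for _ in range(k):
        best, best_gain = None, -1.0
        for j in range(p):
            if j in selected or R[j][j] <= tol:
                continue
            # Trace reduction obtained by partialling out variable j.
            gain = sum(R[i][j] ** 2 for i in range(p)) / R[j][j]
            if gain > best_gain:
                best, best_gain = j, gain
        if best is None:
            break
        selected.append(best)
        R = partial_out(R, best)
    remaining_trace = sum(R[i][i] for i in range(p))
    return selected, remaining_trace

# Hypothetical covariance: variables 0 and 1 are near-duplicates, 2 is unrelated.
S = [[1.0, 0.9, 0.0],
     [0.9, 1.0, 0.0],
     [0.0, 0.0, 1.0]]
selected, remaining = greedy_select(S, k=2)
```

In this toy case the search keeps one of the two near-duplicate variables plus the unrelated one, and the remaining trace is the variance of variable 1 left unexplained by variable 0.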

Bayesian Analysis of Korean Alcohol Consumption Data Using a Zero-Inflated Ordered Probit Model (영 과잉 순서적 프로빗 모형을 이용한 한국인의 음주자료에 대한 베이지안 분석)

  • Oh, Man-Suk;Oh, Hyun-Tak;Park, Se-Mi
    • The Korean Journal of Applied Statistics
    • /
    • v.25 no.2
    • /
    • pp.363-376
    • /
    • 2012
  • Excessive zeroes are often observed in ordinal categorical response variables. An ordinary ordered probit model is not appropriate for zero-inflated data, especially when there are many different sources generating zero observations. In this paper, we apply a two-stage zero-inflated ordered probit (ZIOP) model which incorporates the zero-inflated nature of the data, propose a Bayesian analysis of the ZIOP model, and apply the method to alcohol consumption data collected by the National Bureau of Statistics, Korea. In the first stage of the ZIOP model, a probit model is introduced to divide the non-drinkers into genuine non-drinkers, who do not participate in drinking due to personal beliefs or permanent health problems, and potential drinkers, who did not drink at the time of the survey but have the potential to become drinkers. In the second stage, an ordered probit model is applied to drinkers, who consist of zero-consumption potential drinkers and positive-consumption drinkers. The analysis results show that about 30% of non-drinkers are genuine non-drinkers, and hence the Korean alcohol consumption data have the feature of zero-inflated data. A study of the marginal effect of each explanatory variable shows that certain explanatory variables affect the genuine non-drinkers and the potential drinkers in opposite directions, which may not be detected by an ordered probit model.
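
The two-stage structure described above can be written out as category probabilities. The sketch below assumes the two stages have independent standard normal errors, which the abstract does not state; the function names, linear-predictor values, and cut-points are hypothetical.

```python
import math

def norm_cdf(x):
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def ziop_probs(participation_index, consumption_index, cutpoints):
    """Category probabilities of a zero-inflated ordered probit model.

    Stage 1: P(potential drinker) = Phi(participation_index).
    Stage 2: ordered probit on consumption_index with increasing cutpoints.
    Category 0 mixes genuine non-drinkers and zero-consumption potential drinkers.
    """
    p_potential = norm_cdf(participation_index)
    bounds = [-math.inf] + list(cutpoints) + [math.inf]
    cdf = [norm_cdf(b - consumption_index) for b in bounds]
    ordered = [cdf[k + 1] - cdf[k] for k in range(len(bounds) - 1)]
    probs = [p_potential * q for q in ordered]
    probs[0] += 1.0 - p_potential  # genuine non-drinkers always report zero
    return probs

# Hypothetical indices: two categories (zero vs. positive consumption).
probs = ziop_probs(participation_index=0.0, consumption_index=0.0, cutpoints=[0.0])
```

With these toy values half of the population are potential drinkers and half of those report zero, so the zero category receives 0.5 + 0.25 of the probability.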

The Marginal Model for Categorical Data Analysis of $3\times3$ Cross-Trials ($3\times3$ 교차실험을 범주형 자료 분석을 위한 주변확률모형)

  • 안주선
    • The Korean Journal of Applied Statistics
    • /
    • v.14 no.1
    • /
    • pp.25-37
    • /
    • 2001
  • A marginal model is proposed for the analysis of data with $c\ (\geq 3)$ categories in $3\times3$ cross-over trials with three periods and three treatments. It can serve as a counterpart to the joint probability model of Kenward and Jones and as a generalization of the univariate marginal logit model of Balagtas et al., which analyze treatment effects in $3\times3$ cross-over trials with binary response variables [Kenward and Jones (1991), Balagtas et al. (1995)]. The model equations for the marginal probabilities are constructed using three types of link functions. Methods for constructing the link function matrices and model matrices are given, and parameter estimation is discussed. The proposed model is applied to the analysis of Kenward and Jones' data.

A Study of Industrial Accident Case Factors Using Correlation Analysis (상관분석을 응용한 산업재해사례 요인의 고찰)

  • 홍광수;정국삼
    • Proceedings of the Korean Institute of Industrial Safety Conference
    • /
    • 1997.11a
    • /
    • pp.331-336
    • /
    • 1997
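  • This study takes industrial accident cases as its subject and examines how the various factors behind accidents are related. To identify the major factors, it applies statistical methods: correlation analysis by accident factor, assessment of the degree of influence, and analysis of the effect on the other factors when one accident factor is controlled. First, the contents of industrial accident statistics are analyzed to identify accident-related variables, and correlation coefficients are examined through a qualitative correlation analysis between the accident types caused by unsafe acts and unsafe conditions and the other variables. Second, to determine whether nominal-scale categorical variables are related to one another, a chi-square test is carried out to test the independence of the other variables with length of hospital stay (in days) as the dependent variable, and, when variables are judged to be associated, the degrees of association are compared. Third, when variables show a definite relationship, a logit model is applied to identify and compare, in the form of a regression equation, the effect of each category of a variable on the response (dependent) variable. (abridged)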
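
The chi-square test of independence and the comparison of degrees of association described here can be sketched as follows. Cramér's V is used as the association measure, which is an assumption since the abstract does not name one; the contingency table is made-up toy data, not the study's accident statistics.

```python
import math

def chi_square_independence(table):
    """Pearson chi-square statistic, degrees of freedom, and Cramér's V
    for a two-way contingency table given as a list of rows of counts."""
    n = sum(sum(row) for row in table)
    row_tot = [sum(row) for row in table]
    col_tot = [sum(col) for col in zip(*table)]
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_tot[i] * col_tot[j] / n  # count expected under independence
            stat += (observed - expected) ** 2 / expected
    df = (len(row_tot) - 1) * (len(col_tot) - 1)
    cramers_v = math.sqrt(stat / (n * min(len(row_tot) - 1, len(col_tot) - 1)))
    return stat, df, cramers_v

# Hypothetical 2x2 table: accident type (rows) by short/long hospital stay (columns).
stat, df, v = chi_square_independence([[10, 20], [20, 10]])
```

The statistic is compared with the chi-square distribution with `df` degrees of freedom to judge independence; Cramér's V lies between 0 and 1 and lets tables of different sizes be compared.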

Analysis of Large Tables (대규모 분할표 분석)

  • Choi, Hyun-Jip
    • The Korean Journal of Applied Statistics
    • /
    • v.18 no.2
    • /
    • pp.395-410
    • /
    • 2005
  • For the analysis of large tables formed by many categorical variables, we suggest a method to group the variables into several disjoint groups in which the variables are completely associated within the groups. We use a simple function of Kullback-Leibler divergence as a similarity measure to find the groups. Since the groups are complete hierarchical sets, we can identify the association structure of the large tables by the marginal log-linear models. Examples are introduced to illustrate the suggested method.
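
The abstract does not give the exact function of Kullback-Leibler divergence used as the similarity measure. One natural candidate, shown here purely as an assumption, is the divergence between the observed joint distribution of two categorical variables and the product of their marginals (their mutual information), which is zero under independence and grows with association.

```python
import math

def mutual_information(table):
    """KL divergence between the joint distribution of a two-way table
    and the product of its marginals (natural-log units)."""
    n = sum(sum(row) for row in table)
    row_tot = [sum(row) for row in table]
    col_tot = [sum(col) for col in zip(*table)]
    mi = 0.0
    for i, row in enumerate(table):
        for j, count in enumerate(row):
            if count > 0:  # empty cells contribute nothing
                mi += (count / n) * math.log(count * n / (row_tot[i] * col_tot[j]))
    return mi

# Hypothetical tables: independent variables vs. completely associated variables.
independent = mutual_information([[25, 25], [25, 25]])
associated = mutual_information([[50, 0], [0, 50]])
```

Computing this quantity for every pair of variables gives a similarity matrix that a grouping procedure could work from; how the paper forms its groups is not sketched here.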

Bayesian ordinal probit semiparametric regression models: KNHANES 2016 data analysis of the relationship between smoking behavior and coffee intake (베이지안 순서형 프로빗 준모수 회귀 모형 : 국민건강영양조사 2016 자료를 통한 흡연양태와 커피섭취 간의 관계 분석)

  • Lee, Dasom;Lee, Eunji;Jo, Seogil;Choi, Taeryeon
    • The Korean Journal of Applied Statistics
    • /
    • v.33 no.1
    • /
    • pp.25-46
    • /
    • 2020
  • This paper presents ordinal probit semiparametric regression models using the Bayesian Spectral Analysis Regression (BSAR) method. Ordinal probit regression models ordinal responses, usually with more than two categories, by linking the probability of falling into each category to a combination of available covariates through a probit link (the inverse of the normal cumulative distribution function). The Bayesian probit model facilitates posterior sampling by introducing a normally distributed latent variable; the responses are then categorized by cut-off points according to the values of the latent variable. In this paper, we extend the latent variable approach to a semiparametric model for Bayesian ordinal probit regression with nonparametric functions, using the BSAR method based on a spectral representation of Gaussian processes. The latent variable is decomposed into a parametric component and a nonparametric component, with or without a shape constraint, for modeling ordinal responses and predicting outcomes more flexibly. We illustrate the proposed methods with simulation studies in comparison with existing methods, and with a real data analysis of the Korean National Health and Nutrition Examination Survey (KNHANES) 2016 investigating the nonparametric relationship between smoking behavior and coffee intake.
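
The latent variable mechanism described above, where a normal latent value is cut into ordered categories, can be sketched in a few lines. This shows only the categorization step, not BSAR or posterior sampling; the cut-off values, the linear predictor `eta`, and the function names are hypothetical.

```python
import bisect
import random

def categorize(latent, cutpoints):
    """Ordinal category of a latent value: the number of cut-offs at or below it."""
    return bisect.bisect_right(cutpoints, latent)

def simulate_response(eta, cutpoints, rng):
    """Draw a latent value z ~ N(eta, 1) and map it to an ordinal category."""
    z = rng.gauss(eta, 1.0)
    return categorize(z, cutpoints)

cutpoints = [0.0, 1.5]  # hypothetical cut-offs defining three categories
examples = [categorize(z, cutpoints) for z in (-1.0, 0.7, 2.0)]

rng = random.Random(0)  # seeded only to make the simulation reproducible
draws = [simulate_response(0.5, cutpoints, rng) for _ in range(1000)]
```

In the semiparametric model of the abstract, `eta` would be the sum of a parametric component and a nonparametric function of a covariate rather than a fixed number.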