• 제목/요약/키워드: categorical data analysis

검색결과 195건 처리시간 0.026초

Complex Segregation Analysis of Categorical Traits in Farm Animals: Comparison of Linear and Threshold Models

  • Kadarmideen, Haja N.;Ilahi, H.
    • Asian-Australasian Journal of Animal Sciences
    • /
    • 제18권8호
    • /
    • pp.1088-1097
    • /
    • 2005
  • Main objectives of this study were to investigate accuracy, bias and power of linear and threshold model segregation analysis methods for detection of major genes in categorical traits in farm animals. Maximum Likelihood Linear Model (MLLM), Bayesian Linear Model (BALM) and Bayesian Threshold Model (BATM) were applied to simulated data on normal, categorical and binary scales as well as to disease data in pigs. Simulated data on the underlying normally distributed liability (NDL) were used to create categorical and binary data. MLLM method was applied to data on all scales (Normal, categorical and binary) and BATM method was developed and applied only to binary data. The MLLM analyses underestimated parameters for binary as well as categorical traits compared to normal traits; with the bias being very severe for binary traits. The accuracy of major gene and polygene parameter estimates was also very low for binary data compared with those for categorical data; the later gave results similar to normal data. When disease incidence (on binary scale) is close to 50%, segregation analysis has more accuracy and lesser bias, compared to diseases with rare incidences. NDL data were always better than categorical data. Under the MLLM method, the test statistics for categorical and binary data were consistently unusually very high (while the opposite is expected due to loss of information in categorical data), indicating high false discovery rates of major genes if linear models are applied to categorical traits. With Bayesian segregation analysis, 95% highest probability density regions of major gene variances were checked if they included the value of zero (boundary parameter); by nature of this difference between likelihood and Bayesian approaches, the Bayesian methods are likely to be more reliable for categorical data. The BATM segregation analysis of binary data also showed a significant advantage over MLLM in terms of higher accuracy. Based on the results, threshold models are recommended when the trait distributions are discontinuous. Further, segregation analysis could be used in an initial scan of the data for evidence of major genes before embarking on molecular genome mapping.

연관성 기반 비유사성을 활용한 범주형 자료 군집분석 (Categorical Data Clustering Analysis Using Association-based Dissimilarity)

  • 이창기;정욱
    • 품질경영학회지
    • /
    • 제47권2호
    • /
    • pp.271-281
    • /
    • 2019
  • Purpose: The purpose of this study is to suggest a more efficient distance measure taking into account the relationship between categorical variables for categorical data cluster analysis. Methods: In this study, the association-based dissimilarity was employed to calculate the distance between two categorical data observations and the distance obtained from the association-based dissimilarity was applied to the PAM cluster algorithms to verify its effectiveness. The strength of association between two different categorical variables can be calculated using a mixture of dissimilarities between the conditional probability distributions of other categorical variables, given these two categorical values. In particular, this method is suitable for datasets whose categorical variables are highly correlated. Results: The simulation results using several real life data showed that the proposed distance which considered relationships among the categorical variables generally yielded better clustering performance than the Hamming distance. In addition, as the number of correlated variables was increasing, the difference in the performance of the two clustering methods based on different distance measures became statistically more significant. Conclusion: This study revealed that the adoption of the relationship between categorical variables using our proposed method positively affected the results of cluster analysis.

인터넷상에서의 범주형 자료분석 시스템 개발 (Categorical Data Analysis System in the Internet)

  • 홍종선;김동욱;오민권
    • 응용통계연구
    • /
    • 제12권1호
    • /
    • pp.81-81
    • /
    • 1999
  • A categorical data analysis system in the World Wide Web is proposed with an easy- to-use environment . This system is composed of four components. First, this system presents several graphical displays for Exploratory Data Analysis for categorical data. Second, it provides some measures of association Including dynamic graphics for mosaic plots of Hartigan and Kleiner (1981) and Friendly (1994). Dynamic graphics for mosaic plots give some useful informations. Third, this system can analyze categorical data with loglinear models. So we can select the best fitted loglinear model interactively.

Two-stage imputation method to handle missing data for categorical response variable

  • Jong-Min Kim;Kee-Jae Lee;Seung-Joo Lee
    • Communications for Statistical Applications and Methods
    • /
    • 제30권6호
    • /
    • pp.577-587
    • /
    • 2023
  • Conventional categorical data imputation techniques, such as mode imputation, often encounter issues related to overestimation. If the variable has too many categories, multinomial logistic regression imputation method may be impossible due to computational limitations. To rectify these limitations, we propose a two-stage imputation method. During the first stage, we utilize the Boruta variable selection method on the complete dataset to identify significant variables for the target categorical variable. Then, in the second stage, we use the important variables for the target categorical variable for logistic regression to impute missing data in binary variables, polytomous regression to impute missing data in categorical variables, and predictive mean matching to impute missing data in quantitative variables. Through analysis of both asymmetric and non-normal simulated and real data, we demonstrate that the two-stage imputation method outperforms imputation methods lacking variable selection, as evidenced by accuracy measures. During the analysis of real survey data, we also demonstrate that our suggested two-stage imputation method surpasses the current imputation approach in terms of accuracy.

On the clustering of huge categorical data

  • Kim, Dae-Hak
    • Journal of the Korean Data and Information Science Society
    • /
    • 제21권6호
    • /
    • pp.1353-1359
    • /
    • 2010
  • Basic objective in cluster analysis is to discover natural groupings of items. In general, clustering is conducted based on some similarity (or dissimilarity) matrix or the original input data. Various measures of similarities between objects are developed. In this paper, we consider a clustering of huge categorical real data set which shows the aspects of time-location-activity of Korean people. Some useful similarity measure for the data set, are developed and adopted for the categorical variables. Hierarchical and nonhierarchical clustering method are applied for the considered data set which is huge and consists of many categorical variables.

Categorical Data Analysis by Means of Echelon Analysis with Spatial Scan Statistics

  • Moon, Sung-Ho
    • Journal of the Korean Data and Information Science Society
    • /
    • 제15권1호
    • /
    • pp.83-94
    • /
    • 2004
  • In this study we analyze categorical data by means of spatial statistics and echelon analysis. To do this, we first determine the hierarchical structure of a given contingency table by using echelon dendrogram then, we detect candidates of hotspots given as the top echelon in the dendrogram. Next, we evaluate spatial scan statistics for the zones of significantly high or low rates based on the likelihood ratio. Finally, we detect hotspots of any size and shape based on spatial scan statistics.

  • PDF

Unequal Size, Two-way Analysis of Variance for Categorical Data

  • Chung, Han-Yong
    • Journal of the Korean Statistical Society
    • /
    • 제5권1호
    • /
    • pp.29-34
    • /
    • 1976
  • The techniques about the analysis of variance for quantitative variables have been well-developed. But when the variable is categorical, we must switch to a completely different set of varied techniques. R.J. Light and B.H. Margolin presented one kind of techniques for categorical data in their paper, where there are G unordered experimental groups and I unordered response categories.

  • PDF

베이지안 네트워크를 이용한 다차원 범주형 분석 (Multi-dimension Categorical Data with Bayesian Network)

  • 김용철
    • 한국정보전자통신기술학회논문지
    • /
    • 제11권2호
    • /
    • pp.169-174
    • /
    • 2018
  • 일반적으로 자료의 효과 연속형인 경우 분산분석과 이산형인 경우 분할표 카이제곱 검정을 통계적 분석방법으로 사용한다. 다차원의 자료에서는 계층적 구조의 분석이 요구되어지며 자료간의 인과관계를 나타내기 위해 통계적 선형모형을 채택하여 분석한다. 선형모형의 구조에서는 자료의 정규성이 요구되어지며 일부 자료에서는 비 선형모형을 채택할 수도 있다. 특히, 설문조사 자료 구조는 문항의 특성상 이산형 자료의 형태가 많아 모형의 조건에 만족하지 않는 경우가 종종 발생한다. 자료구조의 차원이 높아질수록 인과관계, 교호작용, 연관성분석 등에 다차원 범주형 자료 분석 방법을 사용한다. 본 논문에서는 확률분포의 계산을 이용한 베이지안 네트워크 모형이 범주형 자료 분석에서 분석절차를 줄이고 교호작용 및 인과관계를 분석할 수 있다는 것을 제시하였다.

계절별 저수지 유입량의 확률예측 (Probabilistic Forecasting of Seasonal Inflow to Reservoir)

  • 강재원
    • 한국환경과학회지
    • /
    • 제22권8호
    • /
    • pp.965-977
    • /
    • 2013
  • Reliable long-term streamflow forecasting is invaluable for water resource planning and management which allocates water supply according to the demand of water users. It is necessary to get probabilistic forecasts to establish risk-based reservoir operation policies. Probabilistic forecasts may be useful for the users who assess and manage risks according to decision-making responding forecasting results. Probabilistic forecasting of seasonal inflow to Andong dam is performed and assessed using selected predictors from sea surface temperature and 500 hPa geopotential height data. Categorical probability forecast by Piechota's method and logistic regression analysis, and probability forecast by conditional probability density function are used to forecast seasonal inflow. Kernel density function is used in categorical probability forecast by Piechota's method and probability forecast by conditional probability density function. The results of categorical probability forecasts are assessed by Brier skill score. The assessment reveals that the categorical probability forecasts are better than the reference forecasts. The results of forecasts using conditional probability density function are assessed by qualitative approach and transformed categorical probability forecasts. The assessment of the forecasts which are transformed to categorical probability forecasts shows that the results of the forecasts by conditional probability density function are much better than those of the forecasts by Piechota's method and logistic regression analysis except for winter season data.

Integrated Partial Sufficient Dimension Reduction with Heavily Unbalanced Categorical Predictors

  • Yoo, Jae-Keun
    • 응용통계연구
    • /
    • 제23권5호
    • /
    • pp.977-985
    • /
    • 2010
  • In this paper, we propose an approach to conduct partial sufficient dimension reduction with heavily unbalanced categorical predictors. For this, we consider integrated categorical predictors and investigate certain conditions that the integrated categorical predictor is fully informative to partial sufficient dimension reduction. For illustration, the proposed approach is implemented on optimal partial sliced inverse regression in simulation and data analysis.