• Title/Summary/Keyword: categorical data analysis

Complex Segregation Analysis of Categorical Traits in Farm Animals: Comparison of Linear and Threshold Models

  • Kadarmideen, Haja N.;Ilahi, H.
    • Asian-Australasian Journal of Animal Sciences
    • /
    • v.18 no.8
    • /
    • pp.1088-1097
    • /
    • 2005
  • The main objectives of this study were to investigate the accuracy, bias and power of linear and threshold model segregation analysis methods for detecting major genes in categorical traits in farm animals. Maximum Likelihood Linear Model (MLLM), Bayesian Linear Model (BALM) and Bayesian Threshold Model (BATM) were applied to simulated data on normal, categorical and binary scales as well as to disease data in pigs. Simulated data on the underlying normally distributed liability (NDL) were used to create categorical and binary data. The MLLM method was applied to data on all scales (normal, categorical and binary), and the BATM method was developed and applied only to binary data. The MLLM analyses underestimated parameters for binary as well as categorical traits compared to normal traits, with the bias being very severe for binary traits. The accuracy of major gene and polygene parameter estimates was also very low for binary data compared with that for categorical data; the latter gave results similar to normal data. When disease incidence (on the binary scale) is close to 50%, segregation analysis is more accurate and less biased than for diseases with rare incidence. NDL data were always better than categorical data. Under the MLLM method, the test statistics for categorical and binary data were consistently and unusually high (while the opposite is expected due to loss of information in categorical data), indicating high false discovery rates of major genes if linear models are applied to categorical traits. With Bayesian segregation analysis, the 95% highest probability density regions of major gene variances were checked to see whether they included the value of zero (boundary parameter); by nature of this difference between likelihood and Bayesian approaches, the Bayesian methods are likely to be more reliable for categorical data. The BATM segregation analysis of binary data also showed a significant advantage over MLLM in terms of higher accuracy. Based on the results, threshold models are recommended when the trait distributions are discontinuous. Further, segregation analysis could be used in an initial scan of the data for evidence of major genes before embarking on molecular genome mapping.
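
The step in which a normally distributed liability is converted to binary or categorical scores can be sketched as follows. This is a minimal illustration of the threshold idea only, not the authors' simulation: the function names, the sample size, the incidences and the category cut points are all assumptions chosen for the example, and no major gene or polygenic effect is simulated.

```python
import random
from statistics import NormalDist

def liability_to_binary(liabilities, incidence):
    """Score 1 when the liability exceeds the threshold implied by the incidence."""
    # Threshold on a standard normal scale so that P(liability > threshold) = incidence.
    threshold = NormalDist().inv_cdf(1.0 - incidence)
    return [1 if x > threshold else 0 for x in liabilities]

def liability_to_categories(liabilities, thresholds):
    """Map each liability to an ordered category index 0..len(thresholds)."""
    cuts = sorted(thresholds)
    return [sum(x > t for t in cuts) for x in liabilities]

# Hypothetical data: standard normal liabilities (assumed, not from the paper).
rng = random.Random(1)
liab = [rng.gauss(0.0, 1.0) for _ in range(20000)]

binary_rare = liability_to_binary(liab, incidence=0.05)   # rare disease
binary_half = liability_to_binary(liab, incidence=0.50)   # incidence near 50%
categories = liability_to_categories(liab, thresholds=[-0.5, 0.5, 1.5])
```

Each conversion discards information about where an observation sits on the liability scale, and the binary score with rare incidence discards the most, which is consistent with the loss of accuracy the abstract reports for binary traits.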

Categorical Data Clustering Analysis Using Association-based Dissimilarity (연관성 기반 비유사성을 활용한 범주형 자료 군집분석)

  • Lee, Changki;Jung, Uk
    • Journal of Korean Society for Quality Management
    • /
    • v.47 no.2
    • /
    • pp.271-281
    • /
    • 2019
  • Purpose: The purpose of this study is to suggest a more efficient distance measure for categorical data cluster analysis that takes into account the relationships between categorical variables. Methods: In this study, an association-based dissimilarity was employed to calculate the distance between two categorical observations, and the resulting distance was applied to the PAM clustering algorithm to verify its effectiveness. The strength of association between two different categorical values can be calculated using a mixture of dissimilarities between the conditional probability distributions of other categorical variables, given these two categorical values. In particular, this method is suitable for datasets whose categorical variables are highly correlated. Results: The simulation results using several real-life datasets showed that the proposed distance, which considers relationships among the categorical variables, generally yielded better clustering performance than the Hamming distance. In addition, as the number of correlated variables increased, the difference in performance between the two clustering methods based on the different distance measures became statistically more significant. Conclusion: This study revealed that incorporating the relationships between categorical variables through the proposed method positively affected the results of cluster analysis.
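
The contrast between the Hamming distance and a dissimilarity that uses conditional distributions can be illustrated with a small sketch. The code below is a simplified stand-in, not the measure proposed in the paper: it compares two values of one variable by the total variation distance between the conditional distributions of a single other variable, and the toy data and function names are assumptions.

```python
from collections import Counter, defaultdict

def hamming(a, b):
    """Number of positions at which two categorical observations differ."""
    return sum(x != y for x, y in zip(a, b))

def conditional_dist(rows, target, context):
    """Estimate P(context value | target value) from counts."""
    counts = defaultdict(Counter)
    for r in rows:
        counts[r[target]][r[context]] += 1
    return {v: {c: n / sum(cnt.values()) for c, n in cnt.items()}
            for v, cnt in counts.items()}

def value_dissimilarity(rows, target, context, u, v):
    """Total variation distance between P(context | target=u) and P(context | target=v)."""
    dist = conditional_dist(rows, target, context)
    keys = set(dist[u]) | set(dist[v])
    return 0.5 * sum(abs(dist[u].get(k, 0.0) - dist[v].get(k, 0.0)) for k in keys)

# Hypothetical two-variable data set: (colour, size).
rows = [
    ("red", "small"), ("red", "small"), ("red", "small"),
    ("pink", "small"), ("pink", "small"),
    ("blue", "large"), ("blue", "large"), ("blue", "small"),
]

# Hamming treats red/pink and red/blue as equally different in the colour variable.
h_red_pink = hamming(("red", "small"), ("pink", "small"))
h_red_blue = hamming(("red", "small"), ("blue", "small"))

# The conditional-distribution view separates them.
d_red_pink = value_dissimilarity(rows, 0, 1, "red", "pink")
d_red_blue = value_dissimilarity(rows, 0, 1, "red", "blue")
```

In this toy data "red" and "pink" co-occur with the same sizes, so their dissimilarity is zero, while "red" and "blue" are far apart; the Hamming distance cannot see that difference.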

Categorical Data Analysis System in the Internet (인터넷상에서의 범주형 자료분석 시스템 개발)

  • Hong, Jong Seon;Kim, Dong Uk;O, Min Gwon
    • The Korean Journal of Applied Statistics
    • /
    • v.12 no.1
    • /
    • pp.81-81
    • /
    • 1999
  • A categorical data analysis system on the World Wide Web is proposed with an easy-to-use environment. This system is composed of four components. First, the system presents several graphical displays for exploratory data analysis of categorical data. Second, it provides some measures of association, including dynamic graphics for the mosaic plots of Hartigan and Kleiner (1981) and Friendly (1994). Dynamic graphics for mosaic plots give useful information. Third, the system can analyze categorical data with loglinear models, so that the best-fitting loglinear model can be selected interactively.
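
As a rough illustration of what fitting a loglinear model involves, the sketch below fits the independence model to a two-way contingency table and reports the likelihood-ratio statistic G². It does not reproduce the web system described above; the function name and the example tables are assumptions.

```python
import math

def independence_fit(table):
    """Fit the independence loglinear model to a two-way table of counts.

    Returns the expected counts, the likelihood-ratio statistic G^2
    and its degrees of freedom.
    """
    row_tot = [sum(row) for row in table]
    col_tot = [sum(col) for col in zip(*table)]
    n = sum(row_tot)
    # Under independence the fitted count is (row total * column total) / n.
    expected = [[ri * cj / n for cj in col_tot] for ri in row_tot]
    g2 = 2.0 * sum(
        o * math.log(o / e)
        for obs_row, exp_row in zip(table, expected)
        for o, e in zip(obs_row, exp_row)
        if o > 0  # empty cells contribute zero
    )
    df = (len(row_tot) - 1) * (len(col_tot) - 1)
    return expected, g2, df

# Hypothetical tables: the first is exactly independent, the second is not.
_, g2_indep, df_indep = independence_fit([[10, 20], [20, 40]])
_, g2_assoc, df_assoc = independence_fit([[30, 10], [10, 30]])
```

A small G² relative to its degrees of freedom suggests the independence model is adequate; a large value points toward a model with an interaction term.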

Two-stage imputation method to handle missing data for categorical response variable

  • Jong-Min Kim;Kee-Jae Lee;Seung-Joo Lee
    • Communications for Statistical Applications and Methods
    • /
    • v.30 no.6
    • /
    • pp.577-587
    • /
    • 2023
  • Conventional categorical data imputation techniques, such as mode imputation, often encounter issues related to overestimation. If the variable has too many categories, the multinomial logistic regression imputation method may be infeasible due to computational limitations. To address these limitations, we propose a two-stage imputation method. In the first stage, we apply the Boruta variable selection method to the complete dataset to identify significant variables for the target categorical variable. In the second stage, we use the selected variables in logistic regression to impute missing data in binary variables, in polytomous regression to impute missing data in categorical variables, and in predictive mean matching to impute missing data in quantitative variables. Through analysis of both asymmetric and non-normal simulated and real data, we demonstrate that the two-stage imputation method outperforms imputation methods lacking variable selection, as evidenced by accuracy measures. In the analysis of real survey data, we also demonstrate that the suggested two-stage imputation method surpasses the current imputation approach in terms of accuracy.
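
One way to read the overestimation issue with mode imputation is that every missing value is assigned to the most frequent category, which inflates that category's share. The sketch below shows only this baseline effect on made-up data; it is an interpretation of the abstract, and it does not implement the two-stage method, Boruta, or any regression-based imputation.

```python
from collections import Counter

def mode_impute(values):
    """Replace each missing value (None) with the most frequent observed category."""
    observed = [v for v in values if v is not None]
    mode = Counter(observed).most_common(1)[0][0]
    return [mode if v is None else v for v in values]

def share(values, category):
    """Proportion of non-missing values equal to the given category."""
    observed = [v for v in values if v is not None]
    return sum(v == category for v in observed) / len(observed)

# Hypothetical variable: 6 "a", 4 "b" and 10 missing values.
raw = ["a"] * 6 + ["b"] * 4 + [None] * 10
imputed = mode_impute(raw)

share_before = share(raw, "a")      # share of "a" among observed values
share_after = share(imputed, "a")   # share of "a" after mode imputation
```

Here the share of category "a" rises from 0.6 among the observed values to 0.8 after imputation, even though nothing in the data suggests the missing cases differ from the observed ones.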

On the clustering of huge categorical data

  • Kim, Dae-Hak
    • Journal of the Korean Data and Information Science Society
    • /
    • v.21 no.6
    • /
    • pp.1353-1359
    • /
    • 2010
  • The basic objective of cluster analysis is to discover natural groupings of items. In general, clustering is conducted based on a similarity (or dissimilarity) matrix or on the original input data, and various measures of similarity between objects have been developed. In this paper, we consider the clustering of a huge real categorical data set that describes the time-location-activity patterns of Korean people. Some useful similarity measures for the data set are developed and adopted for the categorical variables. Hierarchical and nonhierarchical clustering methods are applied to this data set, which is huge and consists of many categorical variables.

Categorical Data Analysis by Means of Echelon Analysis with Spatial Scan Statistics

  • Moon, Sung-Ho
    • Journal of the Korean Data and Information Science Society
    • /
    • v.15 no.1
    • /
    • pp.83-94
    • /
    • 2004
  • In this study we analyze categorical data by means of spatial statistics and echelon analysis. To do this, we first determine the hierarchical structure of a given contingency table by using an echelon dendrogram; then we detect hotspot candidates, given as the top echelon in the dendrogram. Next, we evaluate spatial scan statistics for the zones of significantly high or low rates based on the likelihood ratio. Finally, we detect hotspots of any size and shape based on spatial scan statistics.
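
The likelihood-ratio evaluation of candidate zones can be sketched as follows. The abstract does not state which probability model is used, so the code assumes the Poisson form of Kulldorff's spatial scan statistic and tests only for high rates; the zones, counts and populations are invented, and a full analysis would also score unions of regions taken from the echelon hierarchy and assess significance by Monte Carlo replication.

```python
import math

def poisson_scan_llr(cases_in, expected_in, total_cases):
    """Log likelihood ratio for a zone with an elevated rate (assumed Poisson model)."""
    c, e, total = cases_in, expected_in, total_cases
    if c <= e:
        return 0.0  # no excess of cases inside the zone
    inside = c * math.log(c / e)
    outside = (total - c) * math.log((total - c) / (total - e)) if total > c else 0.0
    return inside + outside

# Hypothetical regions: (name, observed cases, population).
zones = [("A", 30, 1000), ("B", 35, 4500), ("C", 35, 4500)]
total_cases = sum(c for _, c, _ in zones)
total_pop = sum(p for _, _, p in zones)

# Expected cases are proportional to population under the null hypothesis.
scores = {
    name: poisson_scan_llr(c, total_cases * p / total_pop, total_cases)
    for name, c, p in zones
}
hotspot = max(scores, key=scores.get)
```

The zone with the largest log likelihood ratio is the most likely hotspot candidate; in this toy example that is region "A", which holds 30% of the cases but only 10% of the population.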

Unequal Size, Two-way Analysis of Variance for Categorical Data

  • Chung, Han-Yong
    • Journal of the Korean Statistical Society
    • /
    • v.5 no.1
    • /
    • pp.29-34
    • /
    • 1976
  • Techniques for the analysis of variance of quantitative variables are well developed. When the variable is categorical, however, we must switch to a completely different set of techniques. R.J. Light and B.H. Margolin presented one such technique for categorical data in their paper, where there are G unordered experimental groups and I unordered response categories.
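
For orientation, the one-way variation decomposition of Light and Margolin for an I x G table can be sketched as below. This is written from the standard form of their measure (total, within-group and between-group variation, with the statistic C = (n - 1)(I - 1) BSS/TSS referred to a chi-square distribution with (I - 1)(G - 1) degrees of freedom); it is not the two-way, unequal-size extension that this paper addresses, and the function name and example tables are assumptions.

```python
def catanova(table):
    """One-way analysis of variation for categorical data.

    table[i][j] is the count of response category i in experimental group j.
    Returns (TSS, WSS, BSS, R2, C, df).
    """
    n_cat, n_grp = len(table), len(table[0])
    n = sum(sum(row) for row in table)
    row_tot = [sum(row) for row in table]
    columns = list(zip(*table))
    # Total variation: n/2 - (1/2n) * sum of squared category totals.
    tss = n / 2 - sum(r * r for r in row_tot) / (2 * n)
    # Within-group variation: the same quantity computed inside each group.
    wss = sum(sum(col) / 2 - sum(x * x for x in col) / (2 * sum(col))
              for col in columns)
    bss = tss - wss
    r2 = bss / tss
    c_stat = (n - 1) * (n_cat - 1) * r2
    df = (n_cat - 1) * (n_grp - 1)
    return tss, wss, bss, r2, c_stat, df

# Hypothetical tables with unequal group sizes (20 and 40, then 10 and 20).
same = catanova([[10, 20], [10, 20]])       # identical response distributions
separated = catanova([[10, 0], [0, 20]])    # groups fully determine the response
```

When the groups share the same response distribution the between-group variation is zero, and when the group fully determines the response all of the variation is between groups.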

Multi-dimension Categorical Data with Bayesian Network (베이지안 네트워크를 이용한 다차원 범주형 분석)

  • Kim, Yong-Chul
    • The Journal of Korea Institute of Information, Electronics, and Communication Technology
    • /
    • v.11 no.2
    • /
    • pp.169-174
    • /
    • 2018
  • In general, analysis of variance (ANOVA) for continuous data and the chi-square test for discrete data are used for statistical analysis of effects and associations. For multidimensional data, analysis of the hierarchical structure is required and a statistical linear model is adopted. The structure of the linear model requires normality of the data. Multidimensional categorical data analysis methods are used for analyzing causal relations, interactions, and correlations. In this paper, a Bayesian network model using probability distributions is proposed to reduce the analysis procedure and to analyze interactions and causal relationships in categorical data analysis.

Probabilistic Forecasting of Seasonal Inflow to Reservoir (계절별 저수지 유입량의 확률예측)

  • Kang, Jaewon
    • Journal of Environmental Science International
    • /
    • v.22 no.8
    • /
    • pp.965-977
    • /
    • 2013
  • Reliable long-term streamflow forecasting is invaluable for water resource planning and management, which allocates water supply according to the demand of water users. Probabilistic forecasts are necessary to establish risk-based reservoir operation policies, and they may be useful for users who assess and manage risks when making decisions in response to forecast results. Probabilistic forecasting of seasonal inflow to Andong dam is performed and assessed using predictors selected from sea surface temperature and 500 hPa geopotential height data. Categorical probability forecasts by Piechota's method and by logistic regression analysis, and probability forecasts by conditional probability density function, are used to forecast seasonal inflow. A kernel density function is used in the categorical probability forecast by Piechota's method and in the probability forecast by conditional probability density function. The results of the categorical probability forecasts are assessed by the Brier skill score. The assessment reveals that the categorical probability forecasts are better than the reference forecasts. The results of the forecasts using the conditional probability density function are assessed by a qualitative approach and by transforming them into categorical probability forecasts. The assessment of the transformed forecasts shows that the forecasts by conditional probability density function are much better than those by Piechota's method and logistic regression analysis, except for winter season data.
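
The Brier skill score used in the assessment compares the Brier score of a forecast with that of a reference forecast. The sketch below uses the multi-category form of the Brier score; the three inflow categories, the forecast probabilities and the equal-probability reference are assumptions made for the example and are not taken from the paper.

```python
def brier_score(forecasts, observed):
    """Mean over cases of the squared error summed across categories.

    forecasts: list of probability vectors, one per case.
    observed: list of indices of the category that occurred.
    """
    total = 0.0
    for probs, obs in zip(forecasts, observed):
        total += sum((p - (1.0 if k == obs else 0.0)) ** 2
                     for k, p in enumerate(probs))
    return total / len(forecasts)

def brier_skill_score(forecasts, reference, observed):
    """1 - BS / BS_ref: positive when the forecast beats the reference."""
    return 1.0 - brier_score(forecasts, observed) / brier_score(reference, observed)

# Hypothetical seasons with categories 0 = below normal, 1 = normal, 2 = above normal.
forecasts = [[0.6, 0.3, 0.1], [0.2, 0.5, 0.3], [0.1, 0.2, 0.7]]
observed = [0, 1, 2]
reference = [[1 / 3, 1 / 3, 1 / 3]] * len(observed)  # assumed equal-odds reference

bs = brier_score(forecasts, observed)
bss = brier_skill_score(forecasts, reference, observed)
```

A score of 1 indicates a perfect forecast, 0 indicates no improvement over the reference, and negative values indicate a forecast worse than the reference.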

Integrated Partial Sufficient Dimension Reduction with Heavily Unbalanced Categorical Predictors

  • Yoo, Jae-Keun
    • The Korean Journal of Applied Statistics
    • /
    • v.23 no.5
    • /
    • pp.977-985
    • /
    • 2010
  • In this paper, we propose an approach to conducting partial sufficient dimension reduction with heavily unbalanced categorical predictors. For this, we consider integrated categorical predictors and investigate conditions under which the integrated categorical predictor is fully informative for partial sufficient dimension reduction. For illustration, the proposed approach is implemented with optimal partial sliced inverse regression in simulation and data analysis.