• Title/Summary/Keyword: Categorical data

Comparing Accuracy of Imputation Methods for Incomplete Categorical Data

  • Shin, Hyung-Won;Sohn, So-Young
    • Proceedings of the Korean Statistical Society Conference / 2003.05a / pp.237-242 / 2003
  • Various estimation methods have been developed for imputing missing categorical data, including the modal category method, logistic regression, and association rules. In this study, we propose two imputation methods (neural network fusion and voting fusion) that combine the results of individual imputation methods. A Monte Carlo simulation is used to compare the performance of these methods. The five factors used to simulate the missing data are (1) the true model for the data, (2) data size, (3) noise size, (4) percentage of missing data, and (5) missing pattern. Overall, neural network fusion performed best; voting fusion outperformed the individual imputation methods but was inferior to neural network fusion. An additional analysis of real data confirms the simulation results.
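
The abstract does not spell out the voting rule, so the following is only a minimal sketch of one plausible reading of voting fusion: each individual imputer proposes a category for a missing cell and the most frequent proposal wins. The function name `voting_fusion`, the tie-breaking rule, and the sample values are assumptions made for illustration, not details from the paper.

```python
from collections import Counter


def voting_fusion(predictions):
    """Return the category proposed most often by the individual imputers.

    `predictions` holds one candidate category per imputation method for a
    single missing cell. Ties go to the earliest method in the list; that
    rule is an assumption of this sketch.
    """
    counts = Counter(predictions)
    best = max(counts.values())
    for category in predictions:
        if counts[category] == best:
            return category


# Hypothetical proposals from a modal-category imputer, a logistic
# regression imputer, and an association-rule imputer.
candidates = ["B", "A", "B"]
fused = voting_fusion(candidates)
```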

Contour Plot to Explore the Structure of Categorical Data

  • Kim, Hyun Chul;Huh, Moon Yul;Chung, Hee Suk
    • Communications for Statistical Applications and Methods / v.10 no.2 / pp.371-385 / 2003
  • In this paper, the contour plot is considered as a method for exploring the structure of categorical data. For this purpose, the paper suggests a method for sorting a two-way contingency table with respect to the expected marginals. The suggested plot provides valuable information about the underlying data structure. First, we can investigate independence between the categories by examining the differences between the expected frequency contours and the observed frequency contours. Second, the plot lets us visually investigate the existence of outliers in the data. These properties of the suggested contour plot are demonstrated on several real data sets.
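
The comparison of expected and observed frequencies can be illustrated with the standard independence calculation, in which the expected count of a cell is (row total × column total) / grand total. This is a minimal sketch of that textbook formula only; the sorting step and the contour drawing described in the abstract are not reproduced, and the function name and the sample table are hypothetical.

```python
def expected_counts(table):
    """Expected cell counts of a two-way table under independence."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    grand_total = sum(row_totals)
    return [[r * c / grand_total for c in col_totals] for r in row_totals]


# Hypothetical 2x2 contingency table of observed frequencies.
observed = [[10, 20], [30, 40]]
expected = expected_counts(observed)

# Cell-wise gaps between observed and expected frequencies; large gaps
# point to departures from independence or to unusual cells.
gaps = [[o - e for o, e in zip(o_row, e_row)]
        for o_row, e_row in zip(observed, expected)]
```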

Bayesian pooling for contingency tables from small areas

  • Jo, Aejung;Kim, Dal Ho
    • Journal of the Korean Data and Information Science Society / v.27 no.6 / pp.1621-1629 / 2016
  • This paper studies Bayesian pooling for the analysis of categorical data from small areas. Many surveys consist of categorical data collected on a contingency table in each area. Statistical inference for small areas requires considerable care because the subpopulation sample sizes are usually very small. Typically, a hierarchical Bayesian model is used to pool subpopulation data. However, the customary hierarchical Bayesian models may specify more exchangeability than is warranted. We therefore investigate the effects of pooling in hierarchical Bayesian modeling for contingency tables from small areas. Specifically, this paper focuses on methods for direct or indirect pooling, through Dirichlet priors, of categorical data collected on a contingency table in each area. We compare the pooling effects of the hierarchical Bayesian models by fitting them to simulated data. The analysis is carried out using Markov chain Monte Carlo methods.
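
To show why a Dirichlet prior pools information, the sketch below uses the conjugate result that multinomial counts n with a Dirichlet(α) prior give posterior mean (n_i + α_i) / (N + Σα). It is not the hierarchical model of the paper and uses no MCMC: the prior is simply centred on the proportions pooled over all areas, and the concentration `tau`, the function name, and the counts are all assumptions chosen for illustration.

```python
def dirichlet_posterior_mean(counts, alpha):
    """Posterior mean of cell probabilities for multinomial counts
    under a Dirichlet(alpha) prior (standard conjugate result)."""
    total = sum(counts) + sum(alpha)
    return [(n + a) / total for n, a in zip(counts, alpha)]


# Hypothetical cell counts for a small area and a large area.
areas = [[2, 1, 0], [30, 50, 20]]

# Centre the prior on the pooled proportions; tau (an assumed value)
# controls how strongly the small area is shrunk toward them.
pooled = [sum(col) for col in zip(*areas)]
pooled_total = sum(pooled)
tau = 5.0
alpha = [tau * c / pooled_total for c in pooled]

small_area_estimate = dirichlet_posterior_mean(areas[0], alpha)
```

With only three observations, the raw proportion for the third cell is zero, while the pooled estimate stays strictly positive.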

A Fuzzy Clustering Algorithm for Clustering Categorical Data (범주형 데이터의 분류를 위한 퍼지 군집화 기법)

  • Kim, Dae-Won;Lee, Kwang-H.
    • Journal of the Korean Institute of Intelligent Systems / v.13 no.6 / pp.661-666 / 2003
  • In this paper, the conventional k-modes and fuzzy k-modes algorithms for clustering categorical data are extended by representing the clusters with fuzzy centroids instead of the hard-type centroids used in the original algorithms. The hard-type centroids of the traditional algorithms have difficulty dealing with ambiguous boundary data, which may be misclassified and lead to local optima. The use of fuzzy centroids makes it possible to fully exploit the power of fuzzy sets in representing the uncertainty in the classification of categorical data. The distance measure between data and fuzzy centroids is more precise and effective than those of k-modes and fuzzy k-modes. To test the proposed approach, the proposed algorithm and the two conventional algorithms were used to cluster three categorical data sets. The proposed method was found to give markedly better clustering results.
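
The contrast between hard and fuzzy centroids can be sketched as follows. Here a fuzzy centroid stores, for each attribute, a weight for every category (weights summing to 1), and the distance to a record adds up, per attribute, the weight placed on categories other than the record's own. This is one plausible reading of the idea and an assumption of the sketch; the abstract does not give the exact formula, and the names and values are hypothetical.

```python
def hard_distance(record, mode):
    """Simple matching distance used with hard (single-category) centroids."""
    return sum(x != m for x, m in zip(record, mode))


def fuzzy_distance(record, fuzzy_centroid):
    """Distance to a fuzzy centroid: per attribute, the total weight the
    centroid places on categories that differ from the record's value."""
    return sum(1.0 - weights.get(value, 0.0)
               for value, weights in zip(record, fuzzy_centroid))


# Hypothetical centroid over two attributes (colour, size).
centroid = [{"red": 0.7, "blue": 0.3}, {"s": 0.5, "m": 0.5}]
record = ("red", "m")

d_fuzzy = fuzzy_distance(record, centroid)        # graded value
d_hard = hard_distance(record, ("red", "s"))      # integer mismatch count
```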

The imitation patterns of adults and children on f0 intervals in North Kyungsang Korean

  • Kim, Jungsun
    • Phonetics and Speech Sciences / v.11 no.2 / pp.23-31 / 2019
  • The present study examines whether pitch range variation in North Kyungsang Korean shows a categorical or continuous function. Specifically, the study focuses on data imitated by adults and children in the North Kyungsang region. To investigate pitch range variation, the log-produced f0 intervals were measured and statistically analyzed. The results of the study are as follows. First, both the adults' and the children's imitations were more categorical than continuous, especially for the HL-LH patterns. For the other pitch accent patterns, such as HH-HL and HH-LH, the curves were continuous or flat for most of the speakers. Second, the children's imitations were poorer than those of the adults. That is, the children's imitative responses appeared as continuous or flat curves more often than as categorical ones. For the children, the HL-LH pattern showed a categorical function at the midpoint of the curves, though the shifts were not as distinct as in the adults' data. This implies that the imitative responses of children follow the perceptual and productive trace of adults' speech behavior.

Latent class model for mixed variables with applications to text data (혼합모드 잠재범주모형을 통한 텍스트 자료의 분석)

  • Shin, Hyun Soo;Seo, Byungtae
    • The Korean Journal of Applied Statistics / v.32 no.6 / pp.837-849 / 2019
  • Latent class models (LCM) are useful tools for drawing hidden information from categorical data. An LCM can also be interpreted as a mixture model with multinomial component distributions. In some cases, however, an available dataset may contain both categorical data and count or continuous data. For such cases, we can extend the LCM to a mixture model with both multinomial and other component distributions, such as normal and Poisson distributions. In this paper, we consider an LCM for data containing both categorical and count data in order to analyze the Drug Review dataset, which contains categorical responses and text reviews. From this data analysis, we show that we can obtain more specific hidden information than is obtained from an LCM with categorical responses only.

Bayesian approach for categorical Table with Nonignorable Nonresponse

  • Choi, Bo-Seung;Park, You-Sung
    • Proceedings of the Korean Statistical Society Conference / 2005.11a / pp.59-65 / 2005
  • We propose five Bayesian methods to estimate the cell expectations in an incomplete multi-way categorical table with a nonignorable nonresponse mechanism. We study three Bayesian methods that were previously applied to one-way categorical tables and extend them to multi-way tables. In addition, we develop two new Bayesian methods for multi-way categorical tables. These five methods are distinguished by different priors on the cell probabilities: two of them have priors determined only by information from respondents; one has a constant prior; and the remaining two have priors reflecting the difference in the response mechanisms between respondents and non-respondents. We also compare the five Bayesian methods using categorical data from a prospective study of pregnant women.

Predictive Spatial Data Fusion Using Fuzzy Object Representation and Integration: Application to Landslide Hazard Assessment

  • Park, No-Wook;Chi, Kwang-Hoon;Chung, Chang-Jo;Kwon, Byung-Doo
    • Korean Journal of Remote Sensing / v.19 no.3 / pp.233-246 / 2003
  • This paper presents a methodology to account for the partial or gradual changes of environmental phenomena in categorical map information for the fusion/integration of multiple spatial data. A fuzzy set based spatial data fusion scheme is applied in order to account for the fuzziness of boundaries in categorical information showing partial or gradual environmental impacts. The fuzziness or uncertainty of a boundary is represented by two kinds of fuzzy membership functions based on the fuzzy object concept, and their effects are quantitatively evaluated with the help of a cross-validation procedure. A case study of landslide hazard assessment demonstrates the better performance of this scheme compared to the traditional crisp boundary representation.

A Study on Comparison with the Methods of Ordered Categorical Data of Analysis (순서 범주형 자료해석법의 비교 연구)

  • 김홍준;송서일
    • Journal of Korean Society of Industrial and Systems Engineering / v.20 no.44 / pp.207-215 / 1997
  • This paper compares Taguchi's accumulation analysis method with the Nair test on ordered categorical data from an industrial experiment for quality improvement. Taguchi's accumulation analysis is shown to have reasonable power for detecting location effects, while the Nair test identifies the location and dispersion effects separately. Accordingly, Taguchi's accumulation analysis needs to be extended with methods for detecting dispersion effects as well as location effects. In addition, this paper recommends models for analyzing ordered categorical data, for example the cumulative logit model and the mean response model. Finally, simple and reasonable methods that practitioners are more likely to use should be introduced.

Parallel k-Modes Algorithm for Spark Framework (스파크 프레임워크를 위한 병렬적 k-Modes 알고리즘)

  • Chung, Jaehwa
    • KIPS Transactions on Software and Data Engineering / v.6 no.10 / pp.487-492 / 2017
  • Clustering is a technique used to measure similarities between data in the fields of big data analysis and data mining. Among the various clustering methods, the k-Modes algorithm is the representative choice for categorical data. For iterative tasks such as k-Modes, Spark, a distributed and concurrent framework, has recently received great attention because it overcomes the limitations of Hadoop. Spark provides an environment that can process large amounts of data in main memory using abstract objects called RDDs. Spark provides MLlib, a dedicated library for machine learning, but MLlib includes only k-means, which can process only continuous data, so categorical data cannot be clustered with it. In this paper, we design RDDs for the k-Modes algorithm for categorical data clustering in the Spark environment and implement an algorithm that operates effectively. Experiments show that the performance of the proposed algorithm scales linearly in the Spark environment.
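
For readers unfamiliar with k-Modes, one iteration can be sketched on a single machine as below: records are assigned to the nearest mode by simple matching distance, and each mode is then replaced by the per-attribute most frequent category of its cluster. This is plain Python, not the RDD-based Spark design of the paper, and the function names and sample records are hypothetical.

```python
from collections import Counter


def matching_distance(a, b):
    """Number of attributes on which two categorical records differ."""
    return sum(x != y for x, y in zip(a, b))


def cluster_mode(records):
    """Per-attribute most frequent category of a non-empty cluster."""
    return tuple(Counter(col).most_common(1)[0][0] for col in zip(*records))


def k_modes_step(records, modes):
    """One assignment-and-update pass; empty clusters keep their old mode."""
    clusters = [[] for _ in modes]
    for record in records:
        nearest = min(range(len(modes)),
                      key=lambda j: matching_distance(record, modes[j]))
        clusters[nearest].append(record)
    return [cluster_mode(c) if c else m for c, m in zip(clusters, modes)]


# Hypothetical records with two categorical attributes (colour, size).
records = [("red", "s"), ("red", "s"), ("red", "m"),
           ("blue", "l"), ("blue", "l")]
initial_modes = [("red", "m"), ("blue", "l")]
new_modes = k_modes_step(records, initial_modes)
```

In a distributed setting the assignment step is the natural candidate for parallel execution, since each record is handled independently of the others.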