• Title/Summary/Keyword: categorical data analysis

Bayesian approach for categorical Table with Nonignorable Nonresponse

  • Choi, Bo-Seung;Park, You-Sung
    • Proceedings of the Korean Statistical Society Conference / 2005.11a / pp.59-65 / 2005
  • We propose five Bayesian methods to estimate the cell expectations in an incomplete multi-way categorical table with a nonignorable nonresponse mechanism. We study three Bayesian methods previously applied to one-way categorical tables and extend them to multi-way tables; in addition, we develop two new Bayesian methods for multi-way categorical tables. The five methods are distinguished by their priors on the cell probabilities: two have priors determined only by information from respondents; one has a constant prior; and the remaining two have priors reflecting the difference in response mechanisms between respondents and non-respondents. We also compare the five Bayesian methods using categorical data from a prospective study of pregnant women.

A Bayesian uncertainty analysis for nonignorable nonresponse in two-way contingency table

  • Woo, Namkyo;Kim, Dal Ho
    • Journal of the Korean Data and Information Science Society / v.26 no.6 / pp.1547-1555 / 2015
  • We study the problem of nonignorable nonresponse in a two-way contingency table in which one or two categories may be missing. We describe a nonignorable nonresponse model for the analysis of two-way categorical tables. One approach to analyzing these data is to construct several tables (one complete and the others incomplete). The incomplete tables contain nonidentifiable parameters. We describe a hierarchical Bayesian model to analyze two-way categorical data. Instead of a sensitivity analysis for the nonidentifiable parameters, we use a nonignorable nonresponse model with a Bayesian uncertainty analysis that places priors on those parameters. To reduce the effects of the nonidentifiable parameters, we project the parameters to a lower-dimensional space and allow the reduced set of parameters to share a common distribution. We use the griddy Gibbs sampler to fit our models and compute DIC and BPP for model diagnostics. We illustrate our method using NHANES III data to obtain the finite population proportions.
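
The griddy Gibbs sampler mentioned in this abstract handles conditional distributions that are known only up to a normalizing constant by evaluating them on a grid. The sketch below shows a single such draw in its simplest form (sampling a grid point with probability proportional to the density there). The function name `griddy_draw`, the grid, and the Beta-shaped example density are illustrative assumptions, not the authors' model.

```python
import math
import random
from bisect import bisect_left
from itertools import accumulate


def griddy_draw(log_density, grid, rng):
    """Draw one grid point with probability proportional to exp(log_density)."""
    log_vals = [log_density(x) for x in grid]
    top = max(log_vals)  # subtract the maximum for numerical stability
    weights = [math.exp(v - top) for v in log_vals]
    cdf = list(accumulate(weights))  # unnormalized discrete CDF over the grid
    u = rng.random() * cdf[-1]  # uniform draw on the CDF's range
    idx = bisect_left(cdf, u)  # inverse-CDF lookup
    return grid[min(idx, len(grid) - 1)]


# Hypothetical conditional: proportional to x^2 * (1 - x)^4 on (0, 1).
def example_log_density(x):
    return 2.0 * math.log(x) + 4.0 * math.log(1.0 - x)


rng = random.Random(0)
grid = [(i + 0.5) / 200 for i in range(200)]  # midpoints avoid log(0)
sample = [griddy_draw(example_log_density, grid, rng) for _ in range(5000)]
print(sum(sample) / len(sample))  # sample mean of the draws
```

In a full Gibbs sampler, a draw like this would be repeated for each parameter in turn, with `log_density` set to that parameter's conditional given the current values of the others.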

A Bayesian model for two-way contingency tables with nonignorable nonresponse from small areas

  • Woo, Namkyo;Kim, Dal Ho
    • Journal of the Korean Data and Information Science Society / v.27 no.1 / pp.245-254 / 2016
  • Many surveys provide categorical data in which one or more categories may be missing. We describe a nonignorable nonresponse model for the analysis of two-way contingency tables from small areas. There are both item and unit nonresponse. One approach to analyzing these data is to construct several tables corresponding to the missing categories. We describe a hierarchical Bayesian model to analyze two-way categorical data from different areas. This allows a "borrowing of strength" from the data of larger areas to improve the reliability of the estimates of the model parameters for the small areas. Instead of a sensitivity analysis for the nonidentifiable parameters, we use a nonignorable nonresponse model with a Bayesian uncertainty analysis that places priors on those parameters. We use the griddy Gibbs sampler to fit our models and compute DIC and BPP for model diagnostics. We illustrate our method using NHANES III data on thirteen states to obtain the finite population proportions.

Evaluation Method of Quality of Service in Telecommunications Using Logit Model (로짓모형을 이용한 통신 서비스품질 평가방법)

  • Cho, Jae-Gyeun;Ahn, Hae-Sook
    • IE interfaces / v.15 no.2 / pp.209-217 / 2002
  • Quality of Service (QoS) in telecommunications can be evaluated by analyzing opinion data, which are obtained by surveying respondents and quantify subjective satisfaction with the QoS from the customers' viewpoint. To analyze opinion data, the MOS (mean opinion score) method and the Cumulative Probability Curve method are often used. Both methods are based on scoring and therefore have an intrinsic deficiency: the scores are assigned arbitrarily. In this paper, we propose a method for analyzing opinion data using logit models, which can analyze ordinal categorical data without assigning arbitrary scores to customers' opinions, and we develop an analysis procedure that uses procedures provided by the SAS (Statistical Analysis System) statistical package. With the proposed method, we can estimate the relationship between customer satisfaction and network performance parameters and provide guidelines for network planning. In addition, the proposed method is compared with the Cumulative Probability Curve method with respect to prediction errors.
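
The abstract does not say which logit model is used. One common choice for ordinal responses is the cumulative logit (proportional odds) model, logit P(Y <= j | x) = a_j - b*x, which needs only the ordering of the opinion categories and not numeric scores. The sketch below computes category probabilities under that assumed model; the thresholds, slope, and predictor values are hypothetical.

```python
import math


def logistic(z):
    """Inverse of the logit function."""
    return 1.0 / (1.0 + math.exp(-z))


def category_probs(thresholds, beta, x):
    """Return P(Y = j | x) for each ordered category j.

    Assumes logit P(Y <= j | x) = thresholds[j] - beta * x, with
    thresholds strictly increasing (one fewer than the number of categories).
    """
    eta = beta * x
    cumulative = [logistic(t - eta) for t in thresholds] + [1.0]
    probs = [cumulative[0]]
    for j in range(1, len(cumulative)):
        # Each category's probability is the difference of adjacent
        # cumulative probabilities.
        probs.append(cumulative[j] - cumulative[j - 1])
    return probs


# Hypothetical 5-point opinion scale and one network performance parameter.
thresholds = [-2.0, -1.0, 0.0, 1.0]
for x in (0.0, 1.0, 2.0):
    print(x, [round(p, 3) for p in category_probs(thresholds, 0.8, x)])
```

With a positive slope, larger values of the predictor shift probability mass toward the higher categories, which is how such a model can relate a performance parameter to the distribution of opinions.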

Nonlinear Canonical Correlation Analysis for Paralysis Disease Data

  • Shin, Yang-Kyu
    • Journal of the Korean Data and Information Science Society / v.15 no.3 / pp.515-521 / 2004
  • Categorical data are common in oriental medical research. Nonlinear canonical correlation analysis does not assume an interval level of measurement. In this paper, we apply nonlinear canonical correlation analysis for quantification and explain how similar the sets of variables are to one another in paralysis disease data.

Public Sentiment Analysis of Korean Top-10 Companies: Big Data Approach Using Multi-categorical Sentiment Lexicon (국내 주요 10대 기업에 대한 국민 감성 분석: 다범주 감성사전을 활용한 빅 데이터 접근법)

  • Kim, Seo In;Kim, Dong Sung;Kim, Jong Woo
    • Journal of Intelligence and Information Systems / v.22 no.3 / pp.45-69 / 2016
  • Recently, sentiment analysis using open Internet data has been actively performed for various purposes. As online communication channels become popular, companies try to capture public sentiment about themselves from open online information sources. This research analyzes public sentiment toward the Korean Top-10 companies using a multi-categorical sentiment lexicon. Whereas existing big-data research on measuring public sentiment classifies sentiment into dimensions, this research classifies public sentiment into multiple categories. The dimensional sentiment structure has been commonly applied in sentiment analysis for various applications because it is academically established and has the clear advantage of capturing the degree of sentiment and the interrelation of the dimensions. However, the dimensional structure is not effective for measuring public sentiment because human sentiment is too complex to be divided into a few dimensions. In addition, ordinary people need special training to express their feelings in a dimensional structure. People do not divide their sentiment into dimensions, nor do they need psychological training in order to feel. They do not express their feelings along dimensions such as positive/negative or active/passive; rather, they express them as categorical sentiments such as sadness, rage, and happiness. That is, the categorical approach to sentiment analysis is more natural than the dimensional approach. Accordingly, this research suggests a multi-categorical sentiment structure as an alternative way to measure social sentiment from the public's point of view. The multi-categorical sentiment structure classifies sentiments the way ordinary people do, although it may contain some subjectivity.
In this research, nine categories ('Sadness', 'Anger', 'Happiness', 'Disgust', 'Surprise', 'Fear', 'Interest', 'Boredom', and 'Pain') are used as the multi-categorical sentiment structure. To capture public sentiment toward the Korean Top-10 companies, Internet news data on the companies were collected over the past 25 months from a representative Korean portal site. Based on sentiment words extracted from previous research, we created a sentiment lexicon and analyzed the frequency of those words in the news data. The frequency of each sentiment category was calculated as a ratio of the total sentiment words to rank the distributions. A sentiment comparison among the top four companies, 'Samsung', 'Hyundai', 'SK', and 'LG', was visualized separately. As a next step, the research tested hypotheses to demonstrate the usefulness of the multi-categorical sentiment lexicon: it tested how effectively categorical sentiment can be used as a relative comparison index in cross-sectional and time-series analysis. To test the effectiveness of the sentiment lexicon as a cross-sectional comparison index, pairwise t-tests and the Duncan test were conducted. Two pairs of companies, 'Samsung' and 'Hanjin' and 'SK' and 'Hanjin', were chosen to examine with pairwise t-tests whether each categorical sentiment differs significantly. Since the category 'Sadness' has the largest vocabulary, it was chosen to determine with the Duncan test whether the subgroups of companies differ significantly. Five sentiment categories differed significantly between Samsung and Hanjin, and four between SK and Hanjin. In the category 'Sadness', six subgroups were found to be significantly different. To test the effectiveness of the sentiment lexicon as a time-series comparison index, the 'nut rage' incident of Hanjin was selected as an example case.
The term frequency of sentiment words in the month when the incident happened was compared with the term frequency in the month before the event. The sentiment categories were regrouped into positive and negative sentiment to determine whether the event actually had a negative impact on public sentiment toward the company. The difference in each category was visualized, and the variation in the word list for the sentiment 'Rage' was shown for concreteness. As a result, there was a large before-and-after difference in the sentiment that ordinary people feel toward the company. Both hypotheses were supported at a statistically significant level, and therefore sentiment analysis in the business area using multi-categorical sentiment lexicons is persuasive. This research implies that categorical sentiment analysis can be used as an alternative method to supplement dimensional sentiment analysis when assessing public sentiment in a business environment.
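
The frequency step described in this abstract (counting lexicon words in the news text and expressing each category as a share of all sentiment words) can be sketched as follows. The tiny English lexicon and token list are hypothetical stand-ins; the study's actual lexicon and tokenization are not given in the abstract.

```python
from collections import Counter


def category_ratios(tokens, lexicon):
    """Return each category's share of all sentiment words found in tokens.

    lexicon maps a sentiment word to its category; tokens not in the
    lexicon are ignored.
    """
    counts = Counter(lexicon[t] for t in tokens if t in lexicon)
    total = sum(counts.values())
    if total == 0:
        return {}  # no sentiment words matched
    return {category: n / total for category, n in counts.items()}


# Hypothetical lexicon and tokenized news text.
lexicon = {"sad": "Sadness", "grief": "Sadness", "angry": "Anger", "glad": "Happiness"}
tokens = ["the", "angry", "crowd", "felt", "sad", "and", "grief", "then", "glad"]

ratios = category_ratios(tokens, lexicon)
# Rank categories by their share, largest first.
for category, share in sorted(ratios.items(), key=lambda kv: kv[1], reverse=True):
    print(category, share)
```

Computing these ratios separately for two months would give the kind of before-and-after comparison the abstract describes for the time-series case.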

Understanding of the Misuse Cases of Quantitative and Qualitative Regression Analysis (정량적, 정성적 회귀분석의 오적용과 이해)

  • Choe, Seong-Un
    • Proceedings of the Safety Management and Science Conference / 2011.11a / pp.213-217 / 2011
  • This research presents misuse cases of quantitative regression analysis in QC circle activities and the Six Sigma movement, and gives guidelines for correct use by quality practitioners. Additionally, qualitative regression analysis, in which the response variable y is a nonconforming ratio, is reviewed on the basis of misuse cases so that practitioners in the field can use it properly. In most cases, there are frequent errors involving correlation analysis or ANOVA, regardless of whether quantitative regression analysis is used. In addition, when the dependent variable is a nonconforming ratio from discrete, categorical data, quantitative regression is often applied in place of qualitative regression analysis, resulting in ineffective quality improvement.

A Method for Reduction of Categorical Variables Based on a Concept of Pseudo-Correlation Coefficient (유사상관계수의 개념을 도입한 범주형 변수의 축약에 관한 연구)

  • Kwon, Cheol-Shin;Hong, Soon-Wook
    • IE interfaces / v.14 no.1 / pp.79-83 / 2001
  • In this paper, we propose a simple method for reducing a set of categorical variables to a smaller but still meaningful number, and demonstrate how the method can be applied to the variable-reduction problem that empirical research often faces during data processing. For this purpose, we introduce the concept of a pseudo-correlation coefficient, which makes it possible to use factor analysis (FA) as a tool for reducing variables. The main idea is to treat measures of association between categorical variables in the same way as Pearson's correlation coefficient in order to meet the input requirement of FA. After examining existing measures that could serve as pseudo-correlation coefficients, we select Cramér's V coefficient as giving the best result among them. To show the detailed procedure of the proposed method, a demonstration is presented using data from 329 R&D projects conducted in 18 private laboratories in the electric and electronics industry.
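
As an illustration of the pseudo-correlation idea, the sketch below computes Cramér's V for every pair of categorical variables and arranges the values in a symmetric matrix with ones on the diagonal, the shape of input that factor analysis expects. The helper names and toy data are assumptions, and the sketch assumes every variable has at least two observed levels; it is not the authors' exact procedure.

```python
import math
from collections import Counter


def crosstab(xs, ys):
    """Build a contingency table (list of rows) from two categorical columns."""
    rows, cols = sorted(set(xs)), sorted(set(ys))
    counts = Counter(zip(xs, ys))
    return [[counts[(r, c)] for c in cols] for r in rows]


def cramers_v(table):
    """Cramér's V = sqrt(chi2 / (n * (min(rows, cols) - 1)))."""
    n = sum(sum(row) for row in table)
    row_tot = [sum(row) for row in table]
    col_tot = [sum(col) for col in zip(*table)]
    chi2 = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_tot[i] * col_tot[j] / n  # count under independence
            chi2 += (observed - expected) ** 2 / expected
    k = min(len(row_tot), len(col_tot)) - 1
    return math.sqrt(chi2 / (n * k))


def pseudo_correlation_matrix(columns):
    """Pairwise Cramér's V between columns, with 1.0 on the diagonal."""
    m = len(columns)
    return [[1.0 if i == j else cramers_v(crosstab(columns[i], columns[j]))
             for j in range(m)] for i in range(m)]


# Hypothetical data: three categorical variables observed on four units.
a = ["x", "x", "y", "y"]
b = ["p", "p", "q", "q"]  # determined by a
c = ["p", "q", "p", "q"]  # unrelated to a
for row in pseudo_correlation_matrix([a, b, c]):
    print(row)
```

Unlike Pearson's correlation, Cramér's V ranges from 0 to 1 and carries no sign, which is one reason the abstract calls the resulting measure a pseudo-correlation.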

Parallel k-Modes Algorithm for Spark Framework (스파크 프레임워크를 위한 병렬적 k-Modes 알고리즘)

  • Chung, Jaehwa
    • KIPS Transactions on Software and Data Engineering / v.6 no.10 / pp.487-492 / 2017
  • Clustering is a technique used to measure similarities between data in the fields of big data analysis and data mining. Among the various clustering methods, the k-Modes algorithm is typically used for categorical data. To improve the performance of iteration-centric tasks such as k-Modes, Spark, a distributed and concurrent framework, has recently received great attention because it overcomes the limitations of Hadoop. Spark provides an environment that can process large amounts of data in main memory using abstract objects called RDDs. Spark provides MLlib, a dedicated library for machine learning, but MLlib includes only k-means, which can process only continuous data, so categorical data cannot be processed. In this paper, we design RDDs for the k-Modes algorithm for categorical data clustering in the Spark environment and implement an algorithm that operates effectively. Experiments show that the performance of the proposed algorithm increases linearly in the Spark environment.
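
For readers unfamiliar with k-Modes, the following single-machine sketch shows the standard iteration: assign each record to the nearest mode by simple matching (Hamming) distance, then recompute each mode as the most frequent value per attribute. It does not reproduce the paper's RDD design or use Spark; the function names and toy records are illustrative assumptions.

```python
from collections import Counter


def hamming(a, b):
    """Number of attributes on which two categorical records differ."""
    return sum(x != y for x, y in zip(a, b))


def assign(records, modes):
    """Index of the nearest mode for each record (ties go to the lower index)."""
    return [min(range(len(modes)), key=lambda c: hamming(r, modes[c]))
            for r in records]


def k_modes(records, initial_modes, max_iter=20):
    """Run k-modes from the given initial modes; return (modes, labels)."""
    modes = [tuple(m) for m in initial_modes]
    for _ in range(max_iter):
        labels = assign(records, modes)
        new_modes = []
        for c, old in enumerate(modes):
            members = [r for r, lab in zip(records, labels) if lab == c]
            if not members:
                new_modes.append(old)  # keep the mode of an empty cluster
                continue
            # New mode: most frequent value in each attribute column.
            new_modes.append(tuple(Counter(col).most_common(1)[0][0]
                                   for col in zip(*members)))
        if new_modes == modes:
            break  # converged
        modes = new_modes
    return modes, assign(records, modes)


# Hypothetical records with three categorical attributes each.
records = [("a", "x", "p"), ("a", "x", "q"), ("a", "y", "p"),
           ("b", "z", "r"), ("b", "z", "s"), ("c", "z", "r")]
modes, labels = k_modes(records, [records[0], records[3]])
print(modes)
print(labels)
```

The assignment step treats each record independently, which is the part of the iteration that most naturally lends itself to distribution across workers.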

A multivariate latent class profile analysis for longitudinal data with a latent group variable

  • Lee, Jung Wun;Chung, Hwan
    • Communications for Statistical Applications and Methods / v.27 no.1 / pp.15-35 / 2020
  • In behavioral research, significant attention has been paid to the stage-sequential process for multiple latent class variables. We explore the stage-sequential process of multiple latent class variables using multivariate latent class profile analysis (MLCPA). A latent profile variable, representing the stage-sequential process in MLCPA, is formed by a set of repeatedly measured categorical response variables. This paper proposes an extended MLCPA to explain the association between the latent profile variable and the latent group variable in the form of a two-dimensional contingency table. We applied the extended MLCPA to the National Longitudinal Survey of Youth 1997 (NLSY97) data to investigate the association between the developmental progression of depression and substance use behaviors among adolescents who experienced authoritarian parental styles in their youth.