• Title/Abstract/Keywords: categorical data analysis

195 search results (processing time: 0.028 seconds)

Locality-Sensitive Hashing for Data with Categorical and Numerical Attributes Using Dual Hashing

  • Lee, Keon Myung
    • International Journal of Fuzzy Logic and Intelligent Systems
    • /
    • Vol. 14, No. 2
    • /
    • pp.98-104
    • /
    • 2014
  • Locality-sensitive hashing techniques have been developed to efficiently handle nearest neighbor searches and similar pair identification problems for large volumes of high-dimensional data. This study proposes a locality-sensitive hashing method that can be applied to nearest neighbor search problems for data sets containing both numerical and categorical attributes. The proposed method uses dual hashing functions, where one function is dedicated to numerical attributes and the other to categorical attributes. The method consists of creating an indexing structure for each of the dual hashing functions, gathering and combining the candidate sets, and thoroughly examining them to determine the nearest ones. The proposed method is evaluated on a few synthetic data sets, and the results show that it improves performance for large amounts of data with both numerical and categorical attributes.
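
The steps named in this abstract (one index per hash function, combining candidate sets, then exact examination) can be sketched as follows. This is a minimal illustration, not the paper's method: the random-hyperplane hash for numerical attributes, the attribute-sampling hash for categorical attributes, the union used to combine candidates, and the Euclidean-plus-Hamming distance are all assumptions, and the names `DualLSH`, `make_numeric_hash`, `make_categorical_hash`, and `mixed_distance` are hypothetical.

```python
import random
from collections import defaultdict


def make_numeric_hash(dim, n_planes, rng):
    """Random-hyperplane hash (assumed): one sign bit per hyperplane."""
    planes = [[rng.gauss(0, 1) for _ in range(dim)] for _ in range(n_planes)]

    def h(x):
        return tuple(
            int(sum(p_i * x_i for p_i, x_i in zip(p, x)) >= 0) for p in planes
        )

    return h


def make_categorical_hash(n_attrs, n_sampled, rng):
    """Attribute-sampling hash (assumed): values at a few random positions."""
    positions = sorted(rng.sample(range(n_attrs), n_sampled))

    def h(c):
        return tuple(c[i] for i in positions)

    return h


def mixed_distance(a, b):
    """Assumed distance: Euclidean on numeric part plus Hamming on categorical part."""
    (xa, ca), (xb, cb) = a, b
    euclid = sum((u - v) ** 2 for u, v in zip(xa, xb)) ** 0.5
    hamming = sum(u != v for u, v in zip(ca, cb))
    return euclid + hamming


class DualLSH:
    def __init__(self, dim, n_attrs, n_planes=4, n_sampled=1, seed=0):
        rng = random.Random(seed)
        self.h_num = make_numeric_hash(dim, n_planes, rng)
        self.h_cat = make_categorical_hash(n_attrs, n_sampled, rng)
        self.num_index = defaultdict(list)  # numeric bucket -> point ids
        self.cat_index = defaultdict(list)  # categorical bucket -> point ids
        self.points = []

    def add(self, x, c):
        """Store a point and register it in both indexes."""
        idx = len(self.points)
        self.points.append((x, c))
        self.num_index[self.h_num(x)].append(idx)
        self.cat_index[self.h_cat(c)].append(idx)

    def query(self, x, c, k=1):
        """Union the two candidate buckets, then rank candidates exactly."""
        candidates = set(self.num_index.get(self.h_num(x), []))
        candidates |= set(self.cat_index.get(self.h_cat(c), []))
        ranked = sorted(
            candidates, key=lambda i: mixed_distance(self.points[i], (x, c))
        )
        return ranked[:k]
```

Querying with a stored point always retrieves that point first, because it falls in its own buckets and has distance zero. A practical index would normally use several hash tables per attribute type to raise recall.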

Bayesian pooling for contingency tables from small areas

  • Jo, Aejung;Kim, Dal Ho
    • Journal of the Korean Data and Information Science Society
    • /
    • Vol. 27, No. 6
    • /
    • pp.1621-1629
    • /
    • 2016
  • This paper studies Bayesian pooling for the analysis of categorical data from small areas. Many surveys consist of categorical data collected on a contingency table in each area. Statistical inference for small areas requires considerable care because the subpopulation sample sizes are usually very small. Typically, a hierarchical Bayesian model is used to pool subpopulation data. However, the customary hierarchical Bayesian models may specify more exchangeability than warranted. We therefore investigate the effects of pooling in hierarchical Bayesian modeling for contingency tables from small areas. Specifically, this paper focuses on methods of direct or indirect pooling, through Dirichlet priors, of categorical data collected on a contingency table in each area. We compare the pooling effects of the hierarchical Bayesian models by fitting simulated data. The analysis is carried out using Markov chain Monte Carlo methods.
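
To illustrate what pooling through a Dirichlet prior does to a small-area table, the sketch below shrinks each area's cell proportions toward the proportions pooled over all areas. It is a simplified conjugate calculation under stated assumptions, not the paper's hierarchical model: the prior strength `tau` is fixed by hand rather than given a hyperprior and sampled by MCMC, each table is flattened to a list of cell counts, and the function names are hypothetical.

```python
def pooled_proportions(tables):
    """Cell proportions after summing the flattened tables over all areas."""
    totals = [sum(cells) for cells in zip(*tables)]
    grand_total = sum(totals)
    return [t / grand_total for t in totals]


def dirichlet_posterior_mean(counts, prior_mean, tau):
    """Posterior mean of multinomial cell probabilities.

    Prior: Dirichlet(tau * prior_mean). With observed counts n_j and
    total n, the posterior mean of cell j is (n_j + tau * m_j) / (n + tau).
    tau = 0 returns the raw proportions; large tau approaches prior_mean.
    """
    n = sum(counts)
    return [(c + tau * m) / (n + tau) for c, m in zip(counts, prior_mean)]


# Two areas, each a 2x1 table flattened to two cells (made-up counts).
areas = [[3, 1], [10, 10]]
prior = pooled_proportions(areas)
shrunk = [dirichlet_posterior_mean(a, prior, tau=4.0) for a in areas]
```

The small first area moves further toward the pooled proportions than the larger second area, which is the borrowing-of-strength effect that pooling is meant to provide.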

A Study on Comparison with the Methods of Ordered Categorical Data of Analysis

  • 김홍준;송서일
    • Journal of the Society of Korea Industrial and Systems Engineering
    • /
    • Vol. 20, No. 44
    • /
    • pp.207-215
    • /
    • 1997
  • This paper compares Taguchi's accumulation analysis method with the Nair test on ordered categorical data from an industrial experiment for quality improvement. Taguchi's accumulation analysis is shown to have reasonable power for detecting location effects, while the Nair test identifies location and dispersion effects separately. Accordingly, Taguchi's accumulation analysis needs to be extended with methods for detecting dispersion effects as well as location effects. In addition, this paper recommends models for analyzing ordered categorical data, for example the cumulative logit model and the mean response model. Simple, reasonable methods that practitioners are more likely to use should also be introduced.

Development of Analysis Method of Ordered Categorical Data for Optimal Parameter Design

  • 전태준;박호일;홍남표;최성조
    • Journal of Korean Institute of Industrial Engineers
    • /
    • Vol. 20, No. 1
    • /
    • pp.27-38
    • /
    • 1994
  • Accumulation analysis is difficult to apply to ordered categorical data except in smaller-the-better type problems. The purpose of this paper is to develop a statistic and a method that can be easily applied to general types of problems, including nominal-the-best type problems. Experimental data from a contact window process are analyzed, and the new procedure is compared with accumulation analysis.

Sensitivity Analysis for Ordered Categorical Data

  • Cho, Il-Hyun;Park, Taesung
    • Communications for Statistical Applications and Methods
    • /
    • Vol. 6, No. 2
    • /
    • pp.375-382
    • /
    • 1999
  • Linear-by-linear association models are commonly used to analyze ordered categorical data. To fit these models, appropriate scores need to be chosen. In this paper, we perform sensitivity analyses in two-way contingency tables to investigate the effect of the scores on goodness of fit and on tests of significance. In addition, we show that the score that yields the best fit to the data can be selected based on the sensitivity analysis results.
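
For reference, the standard textbook form of the linear-by-linear association model for an I x J table is written out below. The abstract itself gives no formula, so the notation is an assumption: expected cell counts \mu_{ij}, row scores u_i, column scores v_j, and association parameter \beta.

```latex
\log \mu_{ij} = \lambda + \lambda_i^{X} + \lambda_j^{Y} + \beta\, u_i v_j ,
\qquad
\log \theta_{ij}
  = \log \frac{\mu_{ij}\,\mu_{i+1,j+1}}{\mu_{i,j+1}\,\mu_{i+1,j}}
  = \beta\,(u_{i+1} - u_i)(v_{j+1} - v_j)
```

The second expression shows why the choice of scores matters: the local log odds ratios depend on the scores only through the spacings between adjacent scores, so changing those spacings changes both the fit and the test of \beta = 0.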

Comparing Accuracy of Imputation Methods for Incomplete Categorical Data

  • Shin, Hyung-Won;Sohn, So-Young
    • Korean Statistical Society: Conference Proceedings
    • /
    • Proceedings of the Korean Statistical Society 2003 Spring Conference
    • /
    • pp.237-242
    • /
    • 2003
  • Various estimation methods have been developed for imputing missing categorical data. They include the modal category method, logistic regression, and association rules. In this study, we propose two imputation methods (neural network fusion and voting fusion) that combine the results of individual imputation methods. A Monte Carlo simulation is used to compare the performance of these methods. The five factors used to simulate the missing data are (1) the true model for the data, (2) data size, (3) noise size, (4) percentage of missing data, and (5) missing pattern. Overall, neural network fusion performed best, while voting fusion was better than the individual imputation methods but inferior to neural network fusion. The result of an additional real data analysis confirms the simulation result.
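
The idea of voting fusion can be sketched as a majority vote over the categories proposed by the individual imputation methods. This is an illustrative reading of the abstract, not the authors' procedure: the tie-breaking rule (the earliest-listed method wins) and the function names are assumptions, and the individual predictions are passed in as plain values rather than produced by fitted models.

```python
from collections import Counter


def modal_category(observed):
    """Modal category imputation: the most frequent observed value."""
    return Counter(observed).most_common(1)[0][0]


def voting_fusion(predictions):
    """Majority vote over the categories proposed by individual imputers.

    `predictions` is ordered by method priority; on a tie, the category
    proposed by the earliest method is returned (an assumed rule).
    """
    counts = Counter(predictions)
    best = max(counts.values())
    for category in predictions:
        if counts[category] == best:
            return category


# One missing cell, three hypothetical imputers proposing a category each.
observed_column = ["low", "high", "high", "medium"]
votes = [modal_category(observed_column), "medium", "high"]
imputed = voting_fusion(votes)  # "high" receives two of the three votes
```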

Latent class model for mixed variables with applications to text data

  • 신현수;서병태
    • The Korean Journal of Applied Statistics
    • /
    • Vol. 32, No. 6
    • /
    • pp.837-849
    • /
    • 2019
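  • The latent class model, which can be viewed as a kind of mixture of multinomial distributions, is a useful tool for extracting important information that is not directly observed in categorical data. However, the model cannot be applied directly when the data contain continuous or count variables in addition to categorical variables. In this paper, for the case where categorical and count variables appear together, we use a mixed-mode latent class model to analyze drug review data that contain both text reviews and categorical response items. The analysis confirms that the mixed-mode latent class model, which also uses the text data, yields more detailed information about the latent classes than the ordinary latent class model based on categorical responses alone.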

On the Categorical Variable Clustering

  • Kim, Dae-Hak
    • Journal of the Korean Data and Information Science Society
    • /
    • Vol. 7, No. 2
    • /
    • pp.219-226
    • /
    • 1996
  • The basic objective of cluster analysis is to discover natural groupings of items or variables. In general, variable clustering has been based on similarity measures between variables with binary characteristics. We propose a variable clustering method for variables that have more than two categories, ordered in some sense. We also consider some measures of association as similarities between variables. A numerical example is included.

A Study on the Analysis of Causal Relation about Categorical Data

  • 노형진
    • Journal of the Korea Society of Computer and Information
    • /
    • Vol. 5, No. 2
    • /
    • pp.143-151
    • /
    • 2000
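  • This paper introduces the basic concepts and algorithms of Quantification Theory Types I and II, the methods for causal analysis within quantification theory, which enables statistical analysis by quantifying qualitative data. It also presents a way to carry out both techniques in Excel, suggesting their practical applicability.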

A Study on the Correlation Analysis about Categorical Multivariate Data(II)

  • 노형진
    • Journal of the Korea Society of Computer and Information
    • /
    • Vol. 5, No. 3
    • /
    • pp.142-150
    • /
    • 2000
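  • Techniques developed for correlation analysis of categorical multivariate data, such as Quantification Theory Type III and correspondence analysis, are very useful for explaining the association between two sets of elements in terms of distances between points in a multidimensional space. This study explains the characteristics of correspondence analysis for correlation analysis in comparison with Quantification Theory Type III and discusses its usefulness. The technique is expected to be widely used for correlation analysis in the social sciences.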
