• Title/Abstract/Keywords: categorical data analysis

195 results (processing time: 0.022 s)

Clustering of Categorical Data using Rough Entropy

  • 박인규
    • 한국인터넷방송통신학회논문지 / Vol. 13, No. 5 / pp. 183-188 / 2013
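  • Cluster analysis, which groups objects on the basis of similar features, is essential in data mining. For the categorical data found in many databases, however, existing partitioning approaches are limited in how they handle uncertainty between objects. When categorical data are partitioned, the uncertainty of the equivalence classes induced by indiscernibility has been treated only through the algebraic logic of rough sets, which lowers the stability and efficiency of the algorithms. To account for the dependency among attributes in categorical data, this paper defines a rough entropy based on an information-theoretic measure and proposes an algorithm called MMMR to extract the partitioning attribute. The proposed method is compared, on the ZOO data only, with K-means, a fuzzy-based method, and an existing method based on standard deviation. The comparison with these existing categorical algorithms is used to verify the efficiency of the proposed algorithm.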
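The equivalence classes induced by indiscernibility, mentioned in the abstract above, can be illustrated with a minimal sketch. It only groups objects that agree on a chosen set of attributes; it is not the paper's rough entropy or the MMMR algorithm, neither of which is specified in the abstract. The function name, the object ids, and the toy table are hypothetical.

```python
from collections import defaultdict

def indiscernibility_classes(table, attributes):
    """Group object ids whose values agree on every attribute in `attributes`."""
    classes = defaultdict(list)
    for obj_id, row in table.items():
        key = tuple(row[a] for a in attributes)
        classes[key].append(obj_id)
    return sorted(classes.values())

# Toy table (hypothetical values, not the ZOO data set).
toy = {
    "o1": {"hair": 1, "eggs": 0},
    "o2": {"hair": 1, "eggs": 0},
    "o3": {"hair": 0, "eggs": 1},
    "o4": {"hair": 1, "eggs": 1},
}

by_hair = indiscernibility_classes(toy, ["hair"])
by_both = indiscernibility_classes(toy, ["hair", "eggs"])
```

Adding attributes can only split classes further, which is why the choice of partitioning attribute matters in this family of methods.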

Empirical Bayesian Misclassification Analysis on Categorical Data

  • 임한승;홍종선;서문섭
    • 응용통계연구 / Vol. 14, No. 1 / pp. 39-57 / 2001
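  • Misclassification in categorical data can arise while the data are being collected. If misclassified data are analyzed as though they were correct, the estimates become biased and the power of tests is weakened; conversely, if correctly classified data are judged to be misclassified, time and money are wasted on unnecessary corrections. Deciding whether a sample is correctly classified or misclassified is therefore an important step that must come before the analysis. This paper considers categorical data given as an $I \times J$ contingency table in which only one of the two variables is subject to misclassification. To test for misclassification, the marginal totals of the variable that cannot be misclassified are held fixed, and the marginal totals of the variable that may be misclassified are reclassified. The reclassification extends the Bound concept proposed by Sebastiani and Ramoni (1997), the Collapse concept expressed through external information, and Bayesian methods, under various choices of a model suited to the data and of prior parameters reflecting prior information. To obtain information about misclassification, a new test statistic is proposed using the double sampling scheme studied by Tenenbein (1970), and the distribution of the proposed statistic is examined through a range of simulation experiments.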


Effects of Uncertain Spatial Data Representation on Multi-source Data Fusion: A Case Study for Landslide Hazard Mapping

  • Park No-Wook;Chi Kwang-Hoon;Kwon Byung-Doo
    • 대한원격탐사학회지 / Vol. 21, No. 5 / pp. 393-404 / 2005
  • As multi-source spatial data fusion mainly deals with various types of spatial data, which are specific representations of the real world with unequal reliability and incomplete knowledge, proper data representation and uncertainty analysis become more important. To address this problem, this paper presents and applies an advanced data representation methodology for different types of spatial data, such as categorical and continuous data. To account for the uncertainties of categorical and continuous data, fuzzy boundary representation and smoothed kernel density estimation within a fuzzy logic framework are adopted, respectively. To investigate the effects of these data representations on the final fusion results, a case study for landslide hazard mapping was carried out on multi-source spatial data sets from Jangheung, Korea. The results obtained from the proposed schemes were compared with those obtained by traditional crisp boundary representation and categorized continuous data representation methods. The proposed scheme showed higher prediction rates than the traditional methods, and different representation settings produced different prediction rates.

An Identification of Outlying Cells in Contingency Table via Correspondence Analysis Map

  • Hong, Chong Sun;Lee, Jong Cheol
    • Communications for Statistical Applications and Methods
    • / Vol. 8, No. 1 / pp. 39-49 / 2001
  • When an appropriate model is fitted to explain categorical data, detecting outlying cells plays a very important role in reducing the lack of fit. Many statistical methods exist for identifying outlying cells in a contingency table. In this paper, correspondence analysis is applied to identify one or two outlying cells. By exploring the correspondence between the row and column categories, we find that outlying cells can be identified via the correspondence analysis map.


Comparison of Parameter Estimation Methods in the Analysis of Multivariate Categorical Data with Logit Models

  • Song, Hae-Hiang
    • Journal of the Korean Statistical Society
    • / Vol. 12, No. 1 / pp. 24-35 / 1983
  • In fitting models to data, selecting the most desirable estimation method and determining the adequacy of the fitted model are the central issues. This paper compares the maximum likelihood estimators and the minimum logit chi-square estimators, both of which are best asymptotically normal, when logit models are fitted to infant mortality data. The chi-square goodness-of-fit test and the likelihood ratio test are also compared. The analysis of the infant mortality data shows that outlying observations do not necessarily have the same impact on the different goodness-of-fit measures.
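The two goodness-of-fit measures compared in the abstract above are commonly computed as the Pearson chi-square statistic and the likelihood ratio statistic G². The sketch below uses these textbook definitions on hypothetical counts; it does not reproduce the paper's logit models or its infant mortality data, and the function names are illustrative.

```python
import math

def pearson_chi2(observed, expected):
    """Pearson statistic: sum of (O - E)^2 / E over all cells."""
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))

def likelihood_ratio_g2(observed, expected):
    """Likelihood ratio statistic: 2 * sum of O * ln(O / E); empty cells add 0."""
    return 2.0 * sum(o * math.log(o / e)
                     for o, e in zip(observed, expected) if o > 0)

# Hypothetical observed and fitted counts for two cells.
obs = [30, 20]
fit = [25, 25]
x2 = pearson_chi2(obs, fit)
g2 = likelihood_ratio_g2(obs, fit)
```

The two statistics weight a deviating cell differently, so the same cell can move one more than the other.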


A Big Data Analysis by Between-Cluster Information using k-Modes Clustering Algorithm

  • 박인규
    • 디지털융복합연구 / Vol. 13, No. 11 / pp. 157-164 / 2015
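  • This paper deals with subspace clustering of categorical data for convergence applications. Because categorical data are not restricted to numerical values, existing measures for categorical data are limited by the absence of ordering and by the high dimensionality and sparsity of the data. A conditional entropy measure is therefore proposed that more closely captures the mutual similarity among the categorical attributes within each cluster. A new objective function is also proposed that optimizes the clusters by minimizing within-cluster dispersion and increasing the independence between clusters. To compare the proposed algorithm with four other algorithms, experiments were carried out on five data sets. The evaluation measures are accuracy, the F-measure, and the adjusted Rand index. In the experiments, the proposed method outperformed the existing methods on these measures.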
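As a rough illustration of the kind of quantity involved, the sketch below computes the standard Shannon conditional entropy H(Y | X) between two categorical attributes. This is the textbook definition, assumed here for illustration only; the abstract does not give the paper's own measure or objective function, and the function name and sample values are hypothetical.

```python
import math
from collections import Counter

def conditional_entropy(xs, ys):
    """Shannon conditional entropy H(Y | X) in bits for paired categorical values."""
    n = len(xs)
    joint = Counter(zip(xs, ys))   # counts of (x, y) pairs
    marginal = Counter(xs)         # counts of x alone
    return -sum(c / n * math.log2(c / marginal[x])
                for (x, _), c in joint.items())

# Y is fully determined by X: no remaining uncertainty.
h_dependent = conditional_entropy(["a", "a", "b", "b"], ["u", "u", "v", "v"])
# X carries no information about Y: one full bit remains.
h_unrelated = conditional_entropy(["a", "a", "a", "a"], ["u", "u", "v", "v"])
```

A low value means that knowing one attribute nearly fixes the other, which is one way to express similarity between attributes without imposing an order on their categories.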

Simultaneous Approach to Fuzzy Clustering and Quantification of Categorical Data with Missing Values

  • Honda, Katsuhiro;Nakamura, Yoshihito;Ichihashi, Hidetomo
    • 한국지능시스템학회:학술대회논문집 / 한국퍼지및지능시스템학회, ISIS 2003 / pp. 36-39 / 2003
  • This paper proposes a simultaneous application of homogeneity analysis and fuzzy clustering to incomplete data. Taking into account the similarity between the loss of homogeneity in homogeneity analysis and the least squares criterion in principal component analysis, a new objective function is defined in a formulation similar to that of linear fuzzy clustering with missing values. A numerical experiment shows the characteristic properties of the proposed method.


The Duality of Organizational Status and Temporary Employment: The Impact of Evaluated Status and Categorical Status on Temporary Employment in Korean Universities

  • 정대훈
    • 아태비즈니스연구 / Vol. 14, No. 3 / pp. 89-101 / 2023
  • Purpose - This paper discusses the impact of status on an organization's temporary employment. Status not only offers various opportunities to an organization but also places constraints on it. From this perspective, we propose that an organization's temporary employment differs depending on its status. Design/methodology/approach - We predict that an organization's evaluated status has a U-shaped relationship with temporary employment because organizational social insecurity varies with status. Moreover, we predict that an organization's categorical status has a positive effect on temporary employment, since organizational legitimacy varies with status, and that this effect is enhanced by an organizational niche. To verify these predictions, we conducted a regression analysis using panel data on temporary employment in Korean universities. Findings - The regression results show a U-shaped relationship between universities' evaluated status and temporary employment. This implies that middle-status universities are likely to minimize temporary employment because of conformity pressures. In addition, the results show that a university's categorical status has a positive effect on temporary employment and that the effect is enhanced by the university's market concentration. This suggests that categorical status has a strong impact on specialist universities. Research implications or Originality - This paper contributes to the development of temporary employment theory by applying the duality of organizational status, and it identifies the organizational determinants of temporary employment in Korean universities.

A Continuation-Ratio Logits Mixed Model for Structured Polytomous Data

  • Choi, Jae-Sung
    • Journal of the Korean Data and Information Science Society
    • / Vol. 17, No. 1 / pp. 187-193 / 2006
  • This paper shows how to use continuation-ratio logits for the analysis of structured polytomous data. Here, the response categories are considered to have a nested binary structure, so conditionally nested binary random variables can be defined at each step. Two types of factors are considered as independent variables affecting the response probabilities. A continuation-ratio mixed model is suggested for analyzing categorical data with nested binary structures. The estimation procedure for the unknown parameters of the suggested model is also discussed in detail through an example.
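The nested binary structure described above can be sketched with sample continuation-ratio logits, using one common definition: at step j, category j is contrasted with all later categories, giving log(n_j / (n_{j+1} + ... + n_J)). This computes empirical logits only and does not fit the mixed model the paper suggests; the function name and the counts are hypothetical.

```python
import math

def continuation_ratio_logits(counts):
    """Return log(n_j / (n_{j+1} + ... + n_J)) for j = 1, ..., J-1."""
    logits = []
    for j in range(len(counts) - 1):
        later = sum(counts[j + 1:])   # total count in all later categories
        logits.append(math.log(counts[j] / later))
    return logits

# Hypothetical counts for a three-category response.
crl = continuation_ratio_logits([10, 20, 20])
```

Each logit conditions on the response having reached at least category j, which is what makes the successive binary splits nested.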


Curriculum Mining Analysis Using Clustering-Based Process Mining

  • 주우민;최진영
    • 산업경영시스템학회지 / Vol. 38, No. 4 / pp. 45-55 / 2015
  • In this paper, we consider curriculum mining as an application of process mining in the domain of education. The basic objective of curriculum mining is to construct a registration pattern model from logs of registration data. However, students' subject registration patterns are very unstructured and complicated, forming a so-called spaghetti model, because they contain many different cases and highly diverse behaviors. As a result, registration patterns are typically difficult to develop and analyze. Earlier work in the literature handled this issue by clustering on the features and behaviors of students, but such data are not easy to obtain in general because they are private and qualitative. Therefore, we propose a new curriculum mining framework that applies K-means clustering based on subject attributes to solve the problems caused by the unstructured process model. Specifically, we divide the subjects' attribute data into two parts: categorical and numerical. The categorical attributes are subject name, class classification, and research field, while the numerical attributes are ABEEK goal and semester information. For the categorical attributes, we suggest quantifying them by binarization. To choose the number of clusters for K-means clustering, we applied the elbow method using the R-squared value, which represents the proportion of variance explained by a given number of clusters. The performance of the suggested method was verified on a log of student registration data from 'A university' in terms of simplicity and fitness, which are typical performance measures for a process model obtained in process mining.
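The binarization and R-squared steps mentioned in the abstract above can be sketched as follows. This is a minimal illustration under stated assumptions: binarization is taken to mean one-hot encoding, R-squared is taken to be 1 - WSS/TSS (within-cluster over total sum of squares), and the cluster labels are supplied by hand rather than produced by K-means. The function names, attribute names, and sample subjects are hypothetical.

```python
def binarize(records, columns):
    """One-hot encode the named categorical columns of a list of dicts."""
    levels = {c: sorted({r[c] for r in records}) for c in columns}
    rows = [[1.0 if r[c] == v else 0.0 for c in columns for v in levels[c]]
            for r in records]
    return rows, levels

def r_squared(rows, labels):
    """Share of total variance explained by a cluster assignment (1 - WSS/TSS)."""
    def centroid(points):
        return [sum(col) / len(points) for col in zip(*points)]

    def sq_dist(p, q):
        return sum((a - b) ** 2 for a, b in zip(p, q))

    grand = centroid(rows)
    tss = sum(sq_dist(p, grand) for p in rows)  # assumed non-zero here
    wss = 0.0
    for k in set(labels):
        members = [p for p, lab in zip(rows, labels) if lab == k]
        center = centroid(members)
        wss += sum(sq_dist(p, center) for p in members)
    return 1.0 - wss / tss

# Hypothetical subjects with two categorical attributes.
subjects = [
    {"type": "major", "field": "OR"},
    {"type": "major", "field": "OR"},
    {"type": "elective", "field": "HCI"},
    {"type": "elective", "field": "HCI"},
]
rows, levels = binarize(subjects, ["type", "field"])
r2_two_clusters = r_squared(rows, [0, 0, 1, 1])
r2_one_cluster = r_squared(rows, [0, 0, 0, 0])
```

In an elbow analysis, R-squared would be computed for each candidate number of clusters, and the point where further clusters add little explained variance would be chosen.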