• Title/Summary/Keyword: categorical data analysis

A Study on Development Environments for Machine Learning (머신러닝 자동화를 위한 개발 환경에 관한 연구)

  • Kim, Dong Gil;Park, Yong-Soon;Park, Lae-Jeong;Chung, Tae-Yun
    • IEMEK Journal of Embedded Systems and Applications / v.15 no.6 / pp.307-316 / 2020
  • The performance of a machine learning model is highly affected by its data. Preprocessing is needed to enable analysis of various types of data, such as letters, numbers, and special characters. This paper proposes a development environment that processes categorical and continuous data according to the type of missing values in stage 1, selects the best-performing algorithm in stage 2, and automates the checking of model performance in stage 3. Using this environment, machine learning models can be created without prior knowledge of data preprocessing.
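The staged workflow in this abstract can be illustrated with a minimal sketch. It is not the authors' environment: the function names `impute` and `select_best`, the mean/mode fill rules, and the toy threshold classifiers are all assumptions chosen for illustration.

```python
from statistics import mean, mode

def impute(rows):
    """Stage 1 (assumed rule): fill missing values (None) column by column.

    Columns whose observed values are all numeric are treated as continuous
    and filled with the column mean; other columns are treated as categorical
    and filled with the most frequent category. Each column is assumed to
    contain at least one observed value.
    """
    filled_columns = []
    for col in zip(*rows):
        observed = [v for v in col if v is not None]
        if all(isinstance(v, (int, float)) for v in observed):
            fill = mean(observed)
        else:
            fill = mode(observed)
        filled_columns.append([fill if v is None else v for v in col])
    return [list(r) for r in zip(*filled_columns)]

def accuracy(model, rows, labels):
    """Stage 3: share of rows for which the model predicts the right label."""
    hits = sum(1 for r, y in zip(rows, labels) if model(r) == y)
    return hits / len(labels)

def select_best(candidates, rows, labels):
    """Stage 2: return (name, score) of the candidate with the highest score."""
    scored = [(name, accuracy(m, rows, labels)) for name, m in candidates.items()]
    return max(scored, key=lambda pair: pair[1])

# Hypothetical toy data: one continuous and one categorical column.
raw = [[1.0, "a"], [None, "b"], [3.0, None], [5.0, "b"]]
labels = [0, 0, 0, 1]
clean = impute(raw)

# Hypothetical candidate models: simple threshold rules on the first column.
candidates = {
    "threshold_2": lambda r: int(r[0] > 2),
    "threshold_4": lambda r: int(r[0] > 4),
}
best_name, best_score = select_best(candidates, clean, labels)
print(clean)
print(best_name, best_score)
```

In practice the candidates would be trained models scored on held-out data; the toy rules here only keep the sketch self-contained.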

Challenging a Single-Factor Analysis of Case Drop in Korean

  • Chung, Eun Seon
    • Language and Information / v.19 no.1 / pp.1-18 / 2015
  • Korean marks case for subjects and objects, but it is well known that case markers can be dropped in certain contexts. Kwon and Zribi-Hertz (2008) base the phenomenon of Korean case drop on a single factor, f(ocus)-structure visibility, and claim that both subject and object case drop fall under a single linguistic generalization about information structure. However, the supporting data are not empirically substantiated, and the tenability of the f-structure analysis is still in question. In this paper, an experiment was conducted to show that the specific claims of Kwon and Zribi-Hertz's analysis, which places exclusive importance on information structure, cannot be adequately supported by empirical evidence. In addition, the present study examines H. Lee's (2006a, 2006c) multi-factor analysis of object case drop and investigates whether this approach can subsume both subject and object case drop under a unified analysis. The present findings indicate that the multi-factor analysis, which involves the interaction of independent factors (Focus, Animacy, and Definiteness), is also compatible with subject case drop, and that judgments on case drop are not categorical but form gradient statistical preferences.

An Analysis of Panel Count Data from Multiple random processes

  • Park, You-Sung;Kim, Hee-Young
    • Proceedings of the Korean Statistical Society Conference / 2002.11a / pp.265-272 / 2002
  • An integer-valued autoregressive integrated (INARI) model is introduced to eliminate stochastic trend and seasonality from time series of count data. The INARI model extends the previous integer-valued ARMA model. We show that it is stationary and ergodic in order to establish asymptotic normality of the conditional least squares estimator. Optimal estimating equations are used to incorporate into estimation the categorical and serial correlations arising from panel count data and the variations arising from the three random processes involved in obtaining the observations. Under regularity conditions for martingale sequences, we show asymptotic normality of the estimators obtained from the estimating equations. Using cancer mortality data provided by the U.S. National Center for Health Statistics (NCHS), we apply our results to estimate the probabilities of cells classified by 4 causes of death and 6 age groups and to forecast the death count of each cell. We also investigate the impact of the three random processes on estimation.

FUZZY REGRESSION TOWARDS A GENERAL INSURANCE APPLICATION

  • Kim, Joseph H.T.;Kim, Joocheol
    • Journal of applied mathematics & informatics / v.32 no.3_4 / pp.343-357 / 2014
  • In many non-life insurance applications, past data are given in a form known as the run-off triangle. Smoothing such data using parametric crisp regression models has long served as the basis for estimating future claim amounts and the reserves set aside to protect the insurer from future losses. In this article, a fuzzy counterpart of the Hoerl curve, a well-known claim reserving regression model, is proposed to analyze past claim data and to determine the reserves. The fuzzy Hoerl curve is more flexible and general than the one considered in the previous fuzzy literature in that it includes a categorical variable together with multiple explanatory variables, which requires the development of the fuzzy analysis of covariance, or fuzzy ANCOVA. Using actual insurance run-off claim data, we show that the suggested fuzzy Hoerl curve based on the fuzzy ANCOVA gives reasonable claim reserves without the stringent assumptions needed for the traditional regression approach in claim reserving.
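For orientation, the crisp (non-fuzzy) Hoerl curve that this abstract builds on can be sketched as follows. This shows only the standard functional form, with the log of the expected claim linear in the development period and its logarithm. The parameter names `a`, `b`, `c` are illustrative, and the paper's fuzzy coefficients and ANCOVA structure are not reproduced here.

```python
from math import exp, log

def hoerl_curve(j, a, b, c):
    """Crisp Hoerl curve: exp(a + b*log(j) + c*j) for development period j >= 1.

    With b > 0 and c < 0 the curve rises, peaks near j = -b/c, then decays,
    which is the typical shape of incremental claims over development time.
    """
    return exp(a + b * log(j) + c * j)

# Illustrative parameters (not from the paper): peak expected near j = 4.
curve = [hoerl_curve(j, a=0.0, b=2.0, c=-0.5) for j in range(1, 9)]
peak_period = 1 + curve.index(max(curve))
print(peak_period)
```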

A Study on Extraction of Useful Information from Big dataset of Multi-attributes - Focus on Single Household in Seoul - (다속성 빅데이터로부터 유용한 정보 추출에 관한 연구 - 서울시 1인 가구를 중심으로 -)

  • Choi, Jung-Min;Kim, Kun-Woo
    • Journal of the Korean housing association / v.25 no.4 / pp.59-72 / 2014
  • This study proposes a data-mining method for examining multi-attribute big data, one considered well suited to social science, that applies Correspondence Analysis to variables obtained by AIC model selection. The proposed method was applied to the Seoul Survey from 2005 to 2010 in order to extract interesting rules or patterns in the characteristics of single households. The results are as follows. First, the proposed method can be applied efficiently to a big dataset with a large number of categorical multi-attribute variables. Second, analysis of the Seoul Survey shows that the more dissatisfied single households are with their residential environment, the higher their tendency toward residential mobility. Third, there are three types of single households based on their demographic characteristics, and the three types differ in their perception of home and in their counselling partners. Fourth, eight significant variables that are highly correlated with the proportion of single households in the 25 districts of Seoul were extracted from a spatially aggregated dataset. Finally, the relation between the spatial distribution of single households and their demographic statistics was investigated based on six groups obtained by Cluster Analysis.

Segmentation of Movie Consumption : An Application of Latent Class Analysis to Korean Film Industry (잠재계층분석기법(Latent Class Analysis)을 활용한 영화 소비자 세분화에 관한 연구)

  • Koo, Kay-Ryung;Lee, Jang-Hyuk
    • Journal of the Korean Operations Research and Management Science Society / v.36 no.4 / pp.161-184 / 2011
  • As movie demand becomes more and more diversified, movie-related firms need to segment a heterogeneous market into a number of small homogeneous markets in order to identify the specific needs of consumer groups. Relevant market segmentation helps them develop valuable offers for target segments through effective marketing planning. In this article, we introduce various segmentation methods and compare their advantages and disadvantages. In particular, we analyze "2009~2010 consumer survey data of Korean Film Industry" using Latent Class Analysis (LCA), a statistical segmentation method that identifies a set of mutually exclusive latent classes based on consumers' responses to observed categorical and numerical variables. We apply PROC LCA, a new SAS procedure for conducting LCA, and obtain 11 distinctive clusters showing unique characteristics in their buying behaviors.

Bayesian Analysis of Korean Alcohol Consumption Data Using a Zero-Inflated Ordered Probit Model (영 과잉 순서적 프로빗 모형을 이용한 한국인의 음주자료에 대한 베이지안 분석)

  • Oh, Man-Suk;Oh, Hyun-Tak;Park, Se-Mi
    • The Korean Journal of Applied Statistics / v.25 no.2 / pp.363-376 / 2012
  • Excessive zeros are often observed in ordinal categorical response variables. An ordinary ordered probit model is not appropriate for zero-inflated data, especially when the zero observations come from several different sources. In this paper, we apply a two-stage zero-inflated ordered probit (ZIOP) model, which incorporates the zero-inflated nature of the data, propose a Bayesian analysis of the ZIOP model, and apply the method to alcohol consumption data collected by the National Bureau of Statistics, Korea. In the first stage of the ZIOP model, a probit model is introduced to divide the non-drinkers into genuine non-drinkers, who do not participate in drinking due to personal beliefs or permanent health problems, and potential drinkers, who did not drink at the time of the survey but have the potential to become drinkers. In the second stage, an ordered probit model is applied to drinkers, who consist of zero-consumption potential drinkers and positive-consumption drinkers. The analysis results show that about 30% of non-drinkers are genuine non-drinkers, and hence the Korean alcohol consumption data have the feature of zero-inflated data. A study of the marginal effect of each explanatory variable shows that certain explanatory variables affect the genuine non-drinkers and the potential drinkers in opposite directions, which may not be detected by an ordered probit model.
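The two-stage structure described in this abstract can be made concrete with a small sketch of the category probabilities. It assumes the textbook ZIOP form with independent standard normal errors in the two stages. The paper's exact specification, priors, and covariates are not given in the abstract, and the names `participation_index`, `consumption_index`, and `cutpoints` are illustrative.

```python
from math import erf, sqrt

def norm_cdf(x):
    """Standard normal CDF computed from the error function."""
    return 0.5 * (1.0 + erf(x / sqrt(2.0)))

def ziop_probs(participation_index, consumption_index, cutpoints):
    """Category probabilities P(y = 0), ..., P(y = J) under a ZIOP model.

    Stage 1: a probit with linear index `participation_index` gives the
    probability of being a (potential) drinker.
    Stage 2: an ordered probit with linear index `consumption_index` and
    increasing `cutpoints` splits drinkers into consumption levels 0..J.
    Independence of the two error terms is an assumption of this sketch.
    """
    p_drinker = norm_cdf(participation_index)
    cdf = [norm_cdf(c - consumption_index) for c in cutpoints] + [1.0]
    ordered = [cdf[0]] + [cdf[j] - cdf[j - 1] for j in range(1, len(cdf))]
    probs = [p_drinker * p for p in ordered]
    probs[0] += 1.0 - p_drinker  # zeros contributed by genuine non-drinkers
    return probs

def genuine_share_of_zeros(participation_index, probs):
    """Share of zero observations attributable to genuine non-drinkers."""
    return (1.0 - norm_cdf(participation_index)) / probs[0]

# Illustrative values only (not estimates from the paper).
probs = ziop_probs(participation_index=0.0, consumption_index=0.0,
                   cutpoints=[0.0, 1.0])
share = genuine_share_of_zeros(0.0, probs)
print(probs, share)
```

The second function corresponds to the kind of quantity the abstract reports when it says a share of non-drinkers are genuine non-drinkers.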

A Data-Mining-based Methodology for Military Occupational Specialty Assignment (데이터 마이닝 기반의 군사특기 분류 방법론 연구)

  • 민규식;정지원;최인찬
    • Journal of the military operations research society of Korea / v.30 no.1 / pp.1-14 / 2004
  • In this paper, we propose a new data-mining-based methodology for military occupational specialty assignment. The proposed methodology consists of two phases, feature selection and man-power assignment. In the first phase, the k-means partitioning algorithm and the optimal variable weighting algorithm are used to determine attribute weights. We address limitations of the optimal variable weighting algorithm and suggest a quadratic programming model that can handle categorical variables and non-contributory trivial variables. In the second phase, we present an integer programming model to deal with a man-power assignment problem. In the model, constraints on demand-supply requirements and training capacity are considered. Moreover, the attribute weights obtained in the first phase for each specialty are used to measure dissimilarity. Results of a computational experiment using real-world data are provided along with some analysis.

The Transform of Multidimensional Categorical Data and its Applications (다차원 범주형 자료의 변환과 그의 응용)

  • Ahn, Ju-Sun
    • The Korean Journal of Applied Statistics / v.20 no.3 / pp.585-595 / 2007
  • The squared Euclidean distance between values transformed by the P-matrix of Ahn et al. (2003) is proportional to the squared Euclidean distance between the cells' relative frequencies in two contingency tables. We propose a method that uses the PP-values for the analysis of modern poems and questionnaire data.

Error cause analysis of Pearson test statistics for k-population homogeneity test (k-모집단 동질성검정에서 피어슨검정의 오차성분 분석에 관한 연구)

  • Heo, Sunyeong
    • Journal of the Korean Data and Information Science Society / v.24 no.4 / pp.815-824 / 2013
  • The traditional Pearson chi-squared test is not appropriate for data collected under a complex sample design. Applied to complex-sample categorical data, it may give wrong test results, and the error may arise not only from biased variance estimators but also from biased point estimators of cell proportions. In this study, a design-based consistent Wald test statistic was derived for the k-population homogeneity test, and the traditional Pearson chi-squared test statistic was partitioned into three parts according to the cause of error: the error due to the bias of the variance estimator, the error due to the bias of the cell proportion estimator, and the inseparable error due to both biases. The relative size of each error component with respect to the Pearson chi-squared test statistic was analyzed empirically, using the second-year data from the fourth Korean National Health and Nutrition Examination Survey (KNHANES IV-2). The empirical results show that the error from the bias of the variance estimator was relatively larger than the error from the bias of the cell proportion estimator, although the degree differed from variable to variable.
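As a reference point for the statistic being decomposed, the traditional Pearson chi-squared statistic for a k-population homogeneity test can be sketched as below. The sketch assumes simple random sampling within each population, which is exactly the assumption the abstract says fails under complex designs. The design-based Wald statistic and the three-part error decomposition are not reproduced, and the function name is illustrative.

```python
def pearson_homogeneity(table):
    """Pearson chi-squared statistic and degrees of freedom.

    `table` holds k rows (populations) by c columns (categories) of counts.
    Expected counts use the pooled category proportions, as in the
    traditional test that ignores the sample design.
    """
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / n
            stat += (observed - expected) ** 2 / expected
    dof = (len(table) - 1) * (len(col_totals) - 1)
    return stat, dof

# Hypothetical counts for two populations and two categories.
stat, dof = pearson_homogeneity([[10, 20], [20, 10]])
print(stat, dof)
```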