• Title/Abstract/Keywords: categorical data analysis

196 search results, processing time 0.027 s

A Study on Development Environments for Machine Learning

  • 김동길;박용순;박래정;정태윤
    • 대한임베디드공학회논문지
    • /
    • Vol. 15, No. 6
    • /
    • pp.307-316
    • /
    • 2020
  • The performance of a machine learning model is highly affected by its data. Preprocessing is needed to enable analysis of various types of data, such as letters, numbers, and special characters. This paper proposes a development environment that processes categorical and continuous data according to the type of missing values in stage 1, selects the best-performing algorithm in stage 2, and automates the checking of model performance in stage 3. Using this environment, machine learning models can be created without prior knowledge of data preprocessing.
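
The stage-1 idea in the abstract above (treating missing values differently for categorical and continuous columns) can be sketched as follows. This is only an illustration under stated assumptions: the abstract does not say which imputation rules the paper uses, so mean filling for numeric columns and mode filling for categorical columns are assumed here, and the names `impute_column` and `impute_table` are hypothetical. Stages 2 and 3 are not shown.

```python
from collections import Counter
from statistics import mean


def impute_column(values):
    """Fill None entries: mean for numeric columns, mode for categorical ones.

    Assumption: the paper's actual imputation rules are not given in the
    abstract; mean/mode filling is a common default used for illustration.
    """
    present = [v for v in values if v is not None]
    if not present:
        return list(values)  # nothing to learn from, leave the column as is
    is_numeric = all(
        isinstance(v, (int, float)) and not isinstance(v, bool) for v in present
    )
    if is_numeric:
        fill = mean(present)  # continuous column: use the mean
    else:
        fill = Counter(present).most_common(1)[0][0]  # categorical: use the mode
    return [fill if v is None else v for v in values]


def impute_table(table):
    """Apply impute_column to every column of a dict-of-lists table."""
    return {name: impute_column(col) for name, col in table.items()}


# Hypothetical toy data with one missing value per column.
table = {
    "age": [30, None, 50, 40],
    "city": ["Seoul", None, "Seoul", "Busan"],
}
filled = impute_table(table)
```

Dispatching on the column type keeps the caller free of preprocessing decisions, which matches the stated goal of building models without prior preprocessing knowledge.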

Challenging a Single-Factor Analysis of Case Drop in Korean

  • Chung, Eun Seon
    • 한국언어정보학회지:언어와정보
    • /
    • Vol. 19, No. 1
    • /
    • pp.1-18
    • /
    • 2015
  • Korean marks case for subjects and objects, but it is well known that case markers can be dropped in certain contexts. Kwon and Zribi-Hertz (2008) base the phenomenon of Korean case drop on a single factor, f(ocus)-structure visibility, and claim that both subject and object case drop fall under a single linguistic generalization about information structure. However, the supporting data are not empirically substantiated, and the tenability of the f-structure analysis is still in question. In this paper, an experiment was conducted to show that the specific claims of Kwon and Zribi-Hertz's analysis, which places exclusive importance on information structure, cannot be adequately supported by empirical evidence. In addition, the present study examines H. Lee's (2006a, 2006c) multi-factor analysis of object case drop and investigates whether this approach can subsume both subject and object case drop under a unified analysis. The present findings indicate that the multi-factor analysis, which involves the interaction of independent factors (Focus, Animacy, and Definiteness), is also compatible with subject case drop, and that judgments on case drop are not categorical but form gradient statistical preferences.

An Analysis of Panel Count Data from Multiple random processes

  • 박유성;김희영
    • 한국통계학회:학술대회논문집
    • /
    • Proceedings of the Korean Statistical Society 2002 Autumn Conference
    • /
    • pp.265-272
    • /
    • 2002
  • An integer-valued autoregressive integrated (INARI) model is introduced to eliminate stochastic trend and seasonality from time series of count data. The INARI model extends the previous integer-valued ARMA model. We show that it is stationary and ergodic in order to establish asymptotic normality of the conditional least squares estimator. Optimal estimating equations are used so that estimation reflects the categorical and serial correlations arising from panel count data and the variation arising from the three random processes involved in obtaining observations. Under regularity conditions for martingale sequences, we show asymptotic normality of the estimators obtained from the estimating equations. Using cancer mortality data provided by the U.S. National Center for Health Statistics (NCHS), we apply our results to estimate the probabilities of cells classified by 4 causes of death and 6 age groups and to forecast the death count of each cell. We also investigate the impact of the three random processes on estimation.

FUZZY REGRESSION TOWARDS A GENERAL INSURANCE APPLICATION

  • Kim, Joseph H.T.;Kim, Joocheol
    • Journal of applied mathematics & informatics
    • /
    • Vol. 32, No. 3-4
    • /
    • pp.343-357
    • /
    • 2014
  • In many non-life insurance applications, past data are given in a form known as the run-off triangle. Smoothing such data using parametric crisp regression models has long served as the basis for estimating future claim amounts and the reserves set aside to protect the insurer from future losses. In this article, a fuzzy counterpart of the Hoerl curve, a well-known claim reserving regression model, is proposed to analyze past claim data and to determine the reserves. The fuzzy Hoerl curve is more flexible and general than the one considered in the previous fuzzy literature in that it includes a categorical variable alongside multiple explanatory variables, which requires the development of a fuzzy analysis of covariance, or fuzzy ANCOVA. Using actual insurance run-off claim data, we show that the suggested fuzzy Hoerl curve based on fuzzy ANCOVA gives reasonable claim reserves without the stringent assumptions needed for the traditional regression approach to claim reserving.

A Study on Extraction of Useful Information from Big dataset of Multi-attributes - Focus on Single Household in Seoul -

  • 최정민;김건우
    • 한국주거학회논문집
    • /
    • Vol. 25, No. 4
    • /
    • pp.59-72
    • /
    • 2014
  • This study proposes a data-mining method for examining multi-attribute big data, considered well suited to social science, that applies correspondence analysis to variables chosen by AIC model selection. The method was applied to the Seoul Survey from 2005 to 2010 to extract interesting rules or patterns in the characteristics of single-person households. The results are as follows. First, the proposed method can be applied efficiently to a large dataset with many categorical attributes. Second, the Seoul Survey analysis shows that the more dissatisfied single-person households are with their residential environment, the higher their tendency toward residential mobility. Third, single-person households fall into three types according to their demographic characteristics, and perceptions of home and the choice of counselling partner differ across the three types. Fourth, eight significant variables that are highly correlated with the share of single-person households in the 25 municipalities of Seoul were extracted from a spatially aggregated dataset. Finally, the relation between the spatial distribution of single-person households and their demographic statistics was investigated using six groups obtained by cluster analysis.

Segmentation of Movie Consumption: An Application of Latent Class Analysis to Korean Film Industry

  • 구교령;이장혁
    • 한국경영과학회지
    • /
    • Vol. 36, No. 4
    • /
    • pp.161-184
    • /
    • 2011
  • As movie demand becomes more and more diversified, movie-related firms need to segment a heterogeneous market into a number of small homogeneous markets in order to identify the specific needs of consumer groups. Relevant market segmentation helps them develop valuable offers for target segments through effective marketing planning. In this article, we introduce various segmentation methods and compare their advantages and disadvantages. In particular, we analyze the "2009~2010 consumer survey data of Korean Film Industry" using Latent Class Analysis (LCA), a statistical segmentation method that identifies a mutually exclusive set of latent classes based on consumers' responses to observed categorical and numerical variables. We apply PROC LCA, a new SAS procedure for conducting LCA, and obtain 11 distinctive clusters showing unique characteristics in their buying behaviors.

Bayesian Analysis of Korean Alcohol Consumption Data Using a Zero-Inflated Ordered Probit Model

  • 오만숙;오현탁;박세미
    • 응용통계연구
    • /
    • Vol. 25, No. 2
    • /
    • pp.363-376
    • /
    • 2012
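  • Ordered multinomial response variables often exhibit zero inflation, in which an excessive number of observations fall in the zero category. When several factors generate the zero category in such zero-inflated data, the ordinary ordered probit model is limited in its ability to explain the data. This paper presents a Bayesian analysis of a two-stage zero-inflated ordered probit model that reflects zero inflation and applies it to Korean alcohol consumption data surveyed by Statistics Korea in 2008. In the first stage, a probit model is applied to divide the non-drinkers in the zero category, who reported no alcohol consumption at all, into genuine non-drinkers (non-participants), who do not drink regardless of circumstances because of beliefs or permanent health problems, and zero-consumption potential drinkers, who currently consume nothing but may become drinkers depending on circumstances. In the second stage, the potential drinkers and the actual drinkers in categories 1 and above are combined into a drinker group, to which an ordered probit model is applied. The analysis shows that about 30% of non-drinkers are genuine non-drinkers, so the alcohol data exhibit pronounced zero inflation compared with ordinary ordered data. Analysis of the marginal effects of each variable revealed that the same explanatory variable can have opposite effects on genuine non-drinkers and potential drinkers, showing that the proposed zero-inflated ordered probit model is useful for Korean alcohol consumption data.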
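
The two-stage structure described above can be illustrated by computing category probabilities for a single respondent. The abstract gives no formulas, so this sketch assumes a common two-part formulation with independent standard normal errors in the two stages: the zero category collects genuine non-drinkers plus potential drinkers whose ordered outcome is zero. The function names, argument names, and example numbers are hypothetical, and the Bayesian estimation itself is not shown.

```python
from math import erf, sqrt


def norm_cdf(x):
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + erf(x / sqrt(2.0)))


def ziop_probs(participation_index, consumption_index, cutpoints):
    """Return [P(y=0), ..., P(y=J)] under an assumed two-part model.

    participation_index: stage-1 probit linear predictor (drinker group or not)
    consumption_index:   stage-2 ordered probit linear predictor
    cutpoints:           increasing thresholds separating the J+1 categories
    Assumption: the two stages have independent standard normal errors.
    """
    p_participate = norm_cdf(participation_index)
    # Cumulative ordered-probit probabilities at each threshold.
    cumulative = [0.0] + [norm_cdf(c - consumption_index) for c in cutpoints] + [1.0]
    ordered = [hi - lo for lo, hi in zip(cumulative, cumulative[1:])]
    probs = [p_participate * q for q in ordered]
    # Genuine non-drinkers add their mass to the zero category.
    probs[0] += 1.0 - p_participate
    return probs


# Hypothetical linear predictors and thresholds for one respondent.
probs = ziop_probs(participation_index=0.5, consumption_index=0.2,
                   cutpoints=[0.0, 1.0, 2.0])
```

Because a covariate can enter both linear predictors with different signs, its effect on the two kinds of zero responses can differ, which is the kind of opposite marginal effect the abstract reports.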

A Data-Mining-based Methodology for Military Occupational Specialty Assignment

  • 민규식;정지원;최인찬
    • 한국국방경영분석학회지
    • /
    • Vol. 30, No. 1
    • /
    • pp.1-14
    • /
    • 2004
  • In this paper, we propose a new data-mining-based methodology for military occupational specialty assignment. The proposed methodology consists of two phases, feature selection and man-power assignment. In the first phase, the k-means partitioning algorithm and the optimal variable weighting algorithm are used to determine attribute weights. We address limitations of the optimal variable weighting algorithm and suggest a quadratic programming model that can handle categorical variables and non-contributory trivial variables. In the second phase, we present an integer programming model to deal with a man-power assignment problem. In the model, constraints on demand-supply requirements and training capacity are considered. Moreover, the attribute weights obtained in the first phase for each specialty are used to measure dissimilarity. Results of a computational experiment using real-world data are provided along with some analysis.

The Transform of Multidimensional Categorical Data and its Applications

  • 안주선
    • 응용통계연구
    • /
    • Vol. 20, No. 3
    • /
    • pp.585-595
    • /
    • 2007
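  • We show that the squared Euclidean distance between the transformed data of two $c^d$ contingency tables, obtained with the P-matrix of Ahn et al. (2003), is proportional to the squared Euclidean distance between the cell relative frequency vectors of the two tables, and we propose a method of using plots of PP-data for the analysis of modern poetry and the exploration of survey data.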

Error cause analysis of Pearson test statistics for k-population homogeneity test

  • 허순영
    • Journal of the Korean Data and Information Science Society
    • /
    • Vol. 24, No. 4
    • /
    • pp.815-824
    • /
    • 2013
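  • In large-scale sample surveys such as national surveys, complex sample designs that combine stratification, clustering, systematic sampling, and unequal probability sampling are commonly used to secure the representativeness of the sample. In categorical data analysis based on such complex sample designs, the traditional Pearson test, which assumes independence of the data and a multinomial distribution, can yield distorted test results. For the k-population homogeneity test of categorical survey data from a complex sample design, this study derives the Wald test statistic, a design-based consistent statistic, and decomposes the error that can arise from using the traditional Pearson test statistic into separate components: the effect of bias in the variance, the effect of bias in the estimator, and the remaining effect in which the two biases are confounded. To confirm empirically the relative size of each component's effect on the Pearson chi-square test statistic, an empirical analysis was carried out using data from the second year of the fourth Korea National Health and Nutrition Examination Survey. The results clearly showed that, although it differs by variable, the effect of bias in the variance is generally larger than the effect of bias in the estimator.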
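
For reference, the traditional Pearson statistic that the abstract above takes as its starting point can be computed as below. This sketch covers only the textbook statistic, which assumes independent observations and multinomial sampling within each population. It does not implement the design-based Wald statistic or the error decomposition derived in the paper. The function name and the toy counts are hypothetical, and every category is assumed to have a nonzero total.

```python
def pearson_homogeneity(table):
    """Pearson chi-square statistic for homogeneity of k populations.

    table: list of k rows (one per population), each a list of c category counts.
    Returns (statistic, degrees_of_freedom).
    Assumption: simple random sampling within each population, which is
    exactly the condition that complex survey designs violate.
    """
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    statistic = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Expected count if all populations shared one category distribution.
            expected = row_totals[i] * col_totals[j] / n
            statistic += (observed - expected) ** 2 / expected
    degrees_of_freedom = (len(table) - 1) * (len(col_totals) - 1)
    return statistic, degrees_of_freedom


# Hypothetical counts: two populations, two response categories.
stat, df = pearson_homogeneity([[10, 20], [20, 10]])
```

Under the stated assumptions the statistic is compared with a chi-square distribution with (k-1)(c-1) degrees of freedom. With clustered or unequally weighted data that reference distribution no longer applies, which is the distortion the study quantifies.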