• Title/Summary/Keyword: categorical variable


Ordering Variables and Categories on the Mosaic Plot (모자이크 플롯에서 변수와 범주의 순서화)

  • Lee, Moon-Joo;Huh, Myung-Hoe
    • The Korean Journal of Applied Statistics
    • /
    • v.21 no.5
    • /
    • pp.875-888
    • /
    • 2008
  • Mosaic plots, proposed by Hartigan and Kleiner (1981, 1984), are very useful for visualizing categorical data. In a mosaic plot, multi-way classified cell frequencies are represented by rectangles of proportional area. The plot is easy to understand while preserving the information contained in the data. Its appearance, however, changes substantially depending on the order of the variables and the order of the categories within each variable. In this study, we propose algorithms for ordering the variables and categories of categorical data to be explored via mosaic plots. We demonstrate our methods on three well-known datasets: Titanic, Housing and PreSex.
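
    The area rule described in this abstract can be illustrated with a minimal sketch. It is not the authors' ordering algorithm; it only computes tile rectangles for a two-way table in the unit square, so that each tile's area equals the cell's share of the total. The function name `mosaic_tiles` and the counts are made up for illustration.

```python
def mosaic_tiles(table):
    """Return {(first, second): (x, y, width, height)} inside the unit square.

    `table` maps each category of the first variable to a dict of
    {category of the second variable: count}. The first variable splits
    the x-axis; the second variable splits each vertical strip.
    """
    total = sum(sum(cells.values()) for cells in table.values())
    tiles = {}
    x = 0.0
    for first, cells in table.items():
        strip_total = sum(cells.values())
        width = strip_total / total          # strip width = marginal share
        y = 0.0
        for second, count in cells.items():
            height = count / strip_total if strip_total else 0.0  # conditional share
            tiles[(first, second)] = (x, y, width, height)
            y += height
        x += width
    return tiles


# Hypothetical counts, not taken from any dataset named in the abstract.
counts = {"A": {"yes": 30, "no": 10}, "B": {"yes": 20, "no": 40}}
tiles = mosaic_tiles(counts)

# Same cells with the variable order swapped: areas stay the same,
# but the tile shapes (and hence the plot's appearance) change.
swapped = {"yes": {"A": 30, "B": 20}, "no": {"A": 10, "B": 40}}
tiles_swapped = mosaic_tiles(swapped)
```

    In this toy table the cell (A, yes) has area 0.3 under both orderings, but it is drawn as a 0.4 by 0.75 tile in one case and a 0.5 by 0.6 tile in the other, which is the sensitivity to ordering that the abstract refers to.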

Tree-structured Clustering for Mixed Data (혼합형 데이터에 대한 나무형 군집화)

  • Yang Kyung-Sook;Huh Myung-Hoe
    • The Korean Journal of Applied Statistics
    • /
    • v.19 no.2
    • /
    • pp.271-282
    • /
    • 2006
  • The aim of this study is to propose a tree-structured clustering method for mixed data. We suggest a scaling method to reduce the variable selection bias among categorical variables. In numerical examples using credit data and German credit data, we note several differences between tree-structured clustering and K-means clustering.

A Data-Mining-based Methodology for Military Occupational Specialty Assignment (데이터 마이닝 기반의 군사특기 분류 방법론 연구)

  • 민규식;정지원;최인찬
    • Journal of the military operations research society of Korea
    • /
    • v.30 no.1
    • /
    • pp.1-14
    • /
    • 2004
  • In this paper, we propose a new data-mining-based methodology for military occupational specialty assignment. The proposed methodology consists of two phases, feature selection and man-power assignment. In the first phase, the k-means partitioning algorithm and the optimal variable weighting algorithm are used to determine attribute weights. We address limitations of the optimal variable weighting algorithm and suggest a quadratic programming model that can handle categorical variables and non-contributory trivial variables. In the second phase, we present an integer programming model to deal with a man-power assignment problem. In the model, constraints on demand-supply requirements and training capacity are considered. Moreover, the attribute weights obtained in the first phase for each specialty are used to measure dissimilarity. Results of a computational experiment using real-world data are provided along with some analysis.

A numerical study on group quantile regression models

  • Kim, Doyoen;Jung, Yoonsuh
    • Communications for Statistical Applications and Methods
    • /
    • v.26 no.4
    • /
    • pp.359-370
    • /
    • 2019
  • Grouping structures in covariates are often ignored in regression models. Recent statistical developments that take grouping structure into account show clear advantages; however, reflecting the grouping structure in the quantile regression model has been relatively rare in the literature. Grouping structure is usually handled by employing a group penalty. In this work, we apply the idea of the group penalty to quantile regression models. The grouping structure is assumed to be known, which is true in some common cases. For example, the set of dummy variables derived from one categorical variable can be regarded as one group of covariates. We examine the group quantile regression models via two real data analyses and simulation studies, which reveal the advantage of group quantile regression models over their non-group counterparts when grouping structures exist among the variables.
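
    A minimal sketch of the idea in this abstract, under stated assumptions: a categorical variable is expanded into dummy variables, those dummies form one group, and a group penalty is added to the quantile (check) loss. The helper names and the sqrt(group size) weighting are illustrative assumptions (a common group-lasso convention), not necessarily the penalty used in the paper.

```python
import math


def one_hot(values, drop_first=True):
    """Dummy-code a categorical variable; returns (kept levels, rows)."""
    levels = sorted(set(values))
    kept = levels[1:] if drop_first else levels  # first level is the baseline
    rows = [[1.0 if v == level else 0.0 for level in kept] for v in values]
    return kept, rows


def check_loss(residual, tau):
    """Quantile check loss: rho_tau(u) = u * (tau - 1[u < 0])."""
    return residual * (tau - (1.0 if residual < 0 else 0.0))


def group_penalty(beta, groups, lam=1.0):
    """lam * sum over groups of sqrt(group size) * Euclidean norm of beta_g."""
    total = 0.0
    for idx in groups:
        norm = math.sqrt(sum(beta[j] ** 2 for j in idx))
        total += math.sqrt(len(idx)) * norm
    return lam * total


# Hypothetical example: a 3-level categorical variable gives 2 dummies,
# which are treated as one group; a third coefficient forms its own group.
levels, dummies = one_hot(["red", "green", "blue", "green"])
beta = [3.0, 4.0, 0.5]          # two dummy coefficients + one other covariate
groups = [[0, 1], [2]]          # indices of beta belonging to each group
penalty = group_penalty(beta, groups)
```

    Because the penalty acts on the norm of the whole group, the dummy coefficients of one categorical variable tend to be kept or dropped together rather than one level at a time.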

The Effect of Air Pollution on Professional Sports in South Korea

  • LEE, Seomgyun;OH, Taeyeon
    • Journal of Sport and Applied Science
    • /
    • v.4 no.4
    • /
    • pp.27-32
    • /
    • 2020
  • Purpose: This study sought to explore the effects of air pollution on professional sports in South Korea. Research design, data, and methodology: The dependent variable, attendance, comprised 2013-2017 K-league, 2015-2017 KBO, and 2014-2017 KBL regular season games, resulting in 1,063, 2,121, and 810 individual match-level observations, respectively. With the actual data collected from each place across the country, we created a categorical variable that identifies the air quality index, divided into four categories by K-eco (i.e., good, moderate, unhealthy, hazardous). To analyze the data, ANOVA was employed. Results: First, there was a significant group effect on K-league attendance. Second, there was a significant group effect on KBO attendance. Lastly, there was a significant group effect on KBL attendance. Conclusions: In summary, attendance in each professional sports league differed significantly depending on the level of air pollution. Implications were also discussed. Keywords: air pollution, sport spectatorship, professional sports.

An introductory study on the urban functions using CHAID technique (CHAID 技法에 의한 都市機能의 試論的 硏究)

  • ;Yang, Soon-Jeong
    • Journal of the Korean Geographical Society
    • /
    • v.29 no.3
    • /
    • pp.360-368
    • /
    • 1994
  • To this day, a number of quantitative analytical methods have been employed to clarify regional characteristics in the discipline of geography. As an application of such quantitative analyses, this paper attempts to clarify urban functions, and consequently urban characteristics, statistically by adopting the newly introduced CHAID, a kind of discriminant analysis technique. The data were processed in two phases. First, the urban functions were classified by designating twenty cities, each with a population of 250,000 or more, as the predictor variable and four major urban functions (administration, marketing, finance and production) as the response variable. Then, the preeminent functions of each region were discriminated and classified by treating the remaining traffic, education, medicare, culture and transportation functions as the predictor variable and the following five regions as the response variable: Metropolitan Seoul area, Pusan region, Taegu region, Kwangju region and Chungcheong region. According to the result of the first analysis, marketing and administration emerged as meaningful functions in Seoul and Taegu, respectively. As for the finance function, only Pusan and Pucheon can be discriminated. Seoul, Pusan and Seongnam reveal their dominance in the production function. In the result of the latter analysis, the Metropolitan Seoul area shows, among other functions, strong traffic and finance functions. In the Pusan region, the administration, education and finance functions are recorded as the leading ones, and the Taegu region is strong in the education, medicare and transportation functions. In the Kwangju region, the administration, production and education functions are discriminated from the other functions. The Chungcheong region shows a similar pattern, with only the traffic function replacing the production function of the Kwangju region. Based on the aforementioned analysis, it can be said that the CHAID technique, which is capable of processing large amounts of categorical data and facilitates interpretation by presenting its outcome in the form of a dendrogram, is an effective and meaningful means to classify and discriminate geographical regions and their characteristics.


An educational tool for regression models with dummy variables using Excel VBA (엑셀 VBA을 이용한 가변수 회귀모형 교육도구 개발)

  • Choi, Hyun Seok;Park, Cheolyong
    • Journal of the Korean Data and Information Science Society
    • /
    • v.24 no.3
    • /
    • pp.593-601
    • /
    • 2013
  • We often need to include categorical variables as explanatory variables in regression models. The categorical variables in regression models can be quantified through dummy variables. In this study, we provide an educational tool using Excel VBA for displaying regression lines along with test results for regression models with a continuous explanatory variable and one or two categorical explanatory variables. The regression lines with test results are provided step by step for the model(s) with interaction(s), the model(s) without interaction(s) but with dummy variables, and the model without dummy variable(s). With this tool, we can easily understand the meaning of dummy variables and interaction effects through graphics and further decide which model is better suited to the data at hand.

Latent class analysis with multiple latent group variables

  • Lee, Jung Wun;Chung, Hwan
    • Communications for Statistical Applications and Methods
    • /
    • v.24 no.2
    • /
    • pp.173-191
    • /
    • 2017
  • This study develops a new type of latent class analysis (LCA) in order to explain the associations between one latent variable and several other categorical latent variables. Our model postulates that the prevalence of the latent variable of interest is affected by another latent variable composed of several other latent variables. For parameter estimation, we propose deterministic annealing EM (DAEM) to deal with the local maxima problem in the proposed model. We perform a simulation study to demonstrate how DAEM can find the set of parameter estimates at the global maximum of the likelihood over repeated samples. We apply the proposed LCA model to investigate the joint patterns of drug-using behavior and their effect on violent behavior among US high school male students using data from the Youth Risk Behavior Surveillance System 2015. Considering the age of male adolescents as a covariate influencing violent behavior, we identified three classes of violent behavior and three classes of drug-using behavior. We also discovered that the prevalence of violent behavior is affected by the type of drug used in drug-using behavior.

Prognostic Evaluation of Categorical Platelet-based Indices Using Clustering Methods Based on the Monte Carlo Comparison for Hepatocellular Carcinoma

  • Guo, Pi;Shen, Shun-Li;Zhang, Qin;Zeng, Fang-Fang;Zhang, Wang-Jian;Hu, Xiao-Min;Zhang, Ding-Mei;Peng, Bao-Gang;Hao, Yuan-Tao
    • Asian Pacific Journal of Cancer Prevention
    • /
    • v.15 no.14
    • /
    • pp.5721-5727
    • /
    • 2014
  • Objectives: To evaluate the performance of clustering methods used in the prognostic assessment of categorical clinical data for hepatocellular carcinoma (HCC) patients in China, and to establish a predictive prognostic nomogram for clinical decisions. Materials and Methods: A total of 332 newly diagnosed HCC patients treated with hepatic resection during 2006-2009 were enrolled. Patients were regularly followed up at outpatient clinics. Clustering methods including average linkage, k-modes, fuzzy k-modes, PAM, CLARA, protocluster, and ROCK were compared by Monte Carlo simulation, and the optimal method was applied to investigate the clustering pattern of the indices including platelet count, platelet/lymphocyte ratio (PLR) and serum aspartate aminotransferase activity/platelet count ratio index (APRI). Then the clustering variable, age group, tumor size, number of tumors and vascular invasion were studied in a multivariable Cox regression model. A prognostic nomogram was constructed for clinical decisions. Results: ROCK performed best in both the overlapping and non-overlapping cases used to assess the prognostic value of platelet-based indices. Patients with categorical platelet-based indices split significantly across two clusters, and those with high values had a high risk of HCC recurrence (hazard ratio [HR] 1.42, 95% CI 1.09-1.86; p<0.01). Tumor size, number of tumors and blood vessel invasion were also associated with a high risk of HCC recurrence (all p<0.01). The nomogram predicted HCC patient survival well at 3 and 5 years. Conclusions: A cluster of platelet-based indices combined with other clinical covariates could be used for prognosis evaluation in HCC.

Estimating Average Causal Effect in Latent Class Analysis (잠재범주분석을 이용한 원인적 영향력 추론에 관한 연구)

  • Park, Gayoung;Chung, Hwan
    • The Korean Journal of Applied Statistics
    • /
    • v.27 no.7
    • /
    • pp.1077-1095
    • /
    • 2014
  • Unlike randomized trials, observational studies require statistical strategies for inferring unbiased causal relationships. Recently, new methods for causal inference in observational studies have been proposed, such as matching on the propensity score or inverse probability of treatment weighting. They have focused on how to control for confounders and how to evaluate the effect of the treatment on the result variable. However, these conventional methods are valid only when the treatment variable is categorical and both the treatment and the result variables are directly observable. Research on causal inference can be challenging in part because it may not be possible to directly observe the treatment and/or the result variable. To address this difficulty, we propose a method for estimating the average causal effect when both the treatment and the result variables are latent. Latent class analysis is applied to calculate the propensity score for the latent treatment variable in order to estimate the causal effect on the latent result variable. In this work, we investigate the causal effect of adolescents' delinquency on their substance use using data from the 'National Longitudinal Study of Adolescent Health'.
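
    For orientation, the conventional inverse probability of treatment weighting that this abstract contrasts with its latent-variable method can be sketched as follows. This is the standard observed-variable estimator with propensity scores assumed to be already known; it is not the paper's latent class procedure, and the records and function name are invented for illustration.

```python
def iptw_average_causal_effect(records):
    """Unnormalized IPTW estimate of the average causal effect.

    Each record is (treated, outcome, propensity), where `treated` is 0 or 1
    and `propensity` is the assumed probability of treatment given covariates.
    Estimate = mean(T * Y / e) - mean((1 - T) * Y / (1 - e)).
    """
    n = len(records)
    treated_term = sum(t * y / e for t, y, e in records) / n
    control_term = sum((1 - t) * y / (1 - e) for t, y, e in records) / n
    return treated_term - control_term


# Made-up records: (treatment indicator, outcome, propensity score).
toy_records = [
    (1, 3.0, 0.5),
    (0, 1.0, 0.5),
    (1, 4.0, 0.8),
    (0, 2.0, 0.2),
]
ace = iptw_average_causal_effect(toy_records)
```

    Each unit is weighted by the inverse of the probability of the treatment it actually received, so units that are under-represented in their treatment arm count for more; a normalized variant that divides by the sum of the weights is also commonly used.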