• Title/Summary/Keyword: categorical variable

Classification of High Dimensionality Data through Feature Selection Using Markov Blanket

  • Lee, Junghye;Jun, Chi-Hyuck
    • Industrial Engineering and Management Systems
    • /
    • v.14 no.2
    • /
    • pp.210-219
    • /
    • 2015
  • A classification task requires an exponentially growing amount of computation time and number of observations as the variable dimensionality increases. Thus, reducing the dimensionality of the data is essential when the number of observations is limited. Often, dimensionality reduction or feature selection leads to better classification performance than using the full set of features. In this paper, we study the possibility of utilizing the Markov blanket discovery algorithm as a new feature selection method. The Markov blanket of a target variable is the minimal set of variables that explains the target variable, based on the conditional independence relations among all the variables connected in a Bayesian network. We apply several Markov blanket discovery algorithms to some high-dimensional categorical and continuous data sets, and compare their classification performance with other feature selection methods using well-known classifiers.
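
To make the Markov blanket definition concrete, the following minimal sketch reads the blanket off a Bayesian network whose structure is already known: the blanket of a target consists of its parents, its children, and the other parents of those children. The toy graph and the function name `markov_blanket` are assumptions made for illustration. The discovery algorithms studied in the paper instead estimate the blanket from data, which this sketch does not attempt.

```python
def markov_blanket(parents, target):
    """Return the Markov blanket of `target` in a DAG.

    `parents` maps each node to the set of its parent nodes.
    """
    # Children are the nodes that list the target among their parents.
    children = {node for node, ps in parents.items() if target in ps}
    # Spouses are the other parents of those children.
    spouses = set()
    for child in children:
        spouses |= parents[child]
    blanket = set(parents.get(target, set())) | children | spouses
    blanket.discard(target)
    return blanket


# Hypothetical toy network: A -> T, T -> C, S -> C, C -> D, E isolated.
toy_parents = {
    "A": set(),
    "S": set(),
    "E": set(),
    "T": {"A"},
    "C": {"T", "S"},
    "D": {"C"},
}

print(sorted(markov_blanket(toy_parents, "T")))  # ['A', 'C', 'S']
```

In this toy graph only A, C, and S would be kept as features for predicting T; D and E are conditionally independent of T given the blanket.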

Multidimensional scaling of categorical data using the partition method (분할법을 활용한 범주형자료의 다차원척도법)

  • Shin, Sang Min;Chun, Sun-Kyung;Choi, Yong-Seok
    • The Korean Journal of Applied Statistics
    • /
    • v.31 no.1
    • /
    • pp.67-75
    • /
    • 2018
  • Multidimensional scaling (MDS) is an exploratory analysis of multivariate data to represent the dissimilarity among objects in the geometric low-dimensional space. However, a general MDS map only shows the information of objects without any information about variables. In this study, we used MDS based on the algorithm of Torgerson (Theory and Methods of Scaling, Wiley, 1958) to visualize some clusters of objects in categorical data. For this, we convert given data into a multiple indicator matrix. Additionally, we added the information of levels for each categorical variable on the MDS map by applying the partition method of Shin et al. (Korean Journal of Applied Statistics, 28, 1171-1180, 2015). Therefore, we can find information on the similarity among objects as well as find associations among categorical variables using the proposed MDS map.
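
As a rough illustration of the first two steps described above, the sketch below converts categorical records into a multiple indicator matrix and applies Torgerson's classical MDS to the Euclidean distances between its rows. The data, the function names, and the choice of Euclidean distance are assumptions for illustration. The partition method that places category levels on the map is not reproduced here.

```python
import numpy as np


def indicator_matrix(rows):
    """One-hot encode categorical rows into a multiple indicator matrix."""
    blocks = []
    for col in zip(*rows):
        levels = sorted(set(col))
        blocks.append(np.array([[1.0 if v == lev else 0.0 for lev in levels] for v in col]))
    return np.hstack(blocks)


def classical_mds(X, k=2):
    """Torgerson's classical MDS from Euclidean distances between rows of X."""
    sq = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=2)  # squared distances
    n = sq.shape[0]
    J = np.eye(n) - np.ones((n, n)) / n       # centering matrix
    B = -0.5 * J @ sq @ J                     # double-centered Gram matrix
    vals, vecs = np.linalg.eigh(B)            # eigenvalues in ascending order
    order = np.argsort(vals)[::-1][:k]        # keep the k largest
    vals_k = np.clip(vals[order], 0.0, None)  # guard against tiny negative values
    return vecs[:, order] * np.sqrt(vals_k)


# Hypothetical records with two categorical variables (color, size).
data = [("red", "small"), ("red", "small"), ("blue", "large"), ("blue", "small")]
Z = indicator_matrix(data)
coords = classical_mds(Z, k=2)
print(coords.round(3))
```

Objects with identical category profiles land on the same point, so clusters of objects appear as overlapping or nearby coordinates on the map.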

Input Variable Importance in Supervised Learning Models

  • Huh, Myung-Hoe;Lee, Yong Goo
    • Communications for Statistical Applications and Methods
    • /
    • v.10 no.1
    • /
    • pp.239-246
    • /
    • 2003
  • Statisticians and data miners are often asked to assess the importance of input variables in a given supervised learning model. For this purpose, one may rely on separate ad hoc measures that depend on the model type, such as linear regression, neural networks, or trees. Consequently, input variable importance measures lack conceptual consistency, so they cannot be used directly to compare different types of models, which is often done in data mining processes. In this short communication, we propose a unified approach to measuring the importance of input variables. Our method uses sensitivity analysis, which begins by perturbing the values of input variables and monitors the change in output. The research scope is limited to models for continuous output, although it is not difficult to extend the method to supervised learning models for categorical outcomes.
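
The perturb-and-monitor idea can be sketched in a model-agnostic way, as below. This is only an assumed variant: each input is shifted by one standard deviation and the mean absolute change in the prediction is recorded. The function name, the shift size, and the toy model are illustrative choices, and the paper's actual perturbation scheme may differ.

```python
import numpy as np


def sensitivity_importance(predict, X, scale=1.0):
    """Mean absolute output change when each input is shifted by scale * its std."""
    X = np.asarray(X, dtype=float)
    base = predict(X)
    scores = []
    for j in range(X.shape[1]):
        Xp = X.copy()
        Xp[:, j] += scale * X[:, j].std()  # perturb one input, hold the rest fixed
        scores.append(np.mean(np.abs(predict(Xp) - base)))
    return np.array(scores)


# Hypothetical fitted model: any callable mapping an (n, p) array to n outputs.
def toy_model(X):
    return 3.0 * X[:, 0] + 0.5 * X[:, 1] ** 2 + 0.0 * X[:, 2]


rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
scores = sensitivity_importance(toy_model, X)
print(scores.round(3))
```

Because the function only needs a `predict` callable, the same scores can be compared across a regression, a neural network, or a tree fitted to the same data.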

Sample-spacing Approach for the Estimation of Mutual Information (SAMPLE-SPACING 방법에 의한 상호정보의 추정)

  • Huh, Moon-Yul;Cha, Woon-Ock
    • The Korean Journal of Applied Statistics
    • /
    • v.21 no.2
    • /
    • pp.301-312
    • /
    • 2008
  • Mutual information measures the association between an explanatory variable and the target variable it is used to predict. It is used for variable ranking and variable subset selection. This study examines the sample-spacing approach, which can estimate mutual information from data consisting of continuous explanatory variables and a categorical target variable without estimating a joint probability density function. The results of Monte Carlo simulation and experiments with real-world data show that m = 1 is preferable when using sample-spacing.
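
One way such an estimate could be computed is sketched below, assuming a Vasicek-style m-spacing entropy estimator and the decomposition I(X; C) = H(X) - Σ_c p(c) H(X | C = c). Both the estimator form and the function names are assumptions for illustration and may not match the paper's exact formulation. No density is estimated at any point: only gaps between order statistics are used.

```python
import numpy as np


def spacing_entropy(x, m=1):
    """Vasicek-style m-spacing estimate of differential entropy (in nats)."""
    x = np.sort(np.asarray(x, dtype=float))
    n = x.size
    idx = np.arange(n)
    upper = x[np.minimum(idx + m, n - 1)]  # clamp indices at the boundaries
    lower = x[np.maximum(idx - m, 0)]
    return np.mean(np.log(n / (2.0 * m) * (upper - lower)))


def spacing_mutual_information(x, labels, m=1):
    """Estimate I(X; C) = H(X) - sum_c p(c) * H(X | C = c)."""
    x = np.asarray(x, dtype=float)
    labels = np.asarray(labels)
    conditional = 0.0
    for c in np.unique(labels):
        xc = x[labels == c]
        conditional += xc.size / x.size * spacing_entropy(xc, m)
    return spacing_entropy(x, m) - conditional


# Simulated data: one variable whose mean depends on the class, one that does not.
rng = np.random.default_rng(0)
labels = np.repeat([0, 1], 500)
x_dependent = rng.normal(loc=5.0 * labels, scale=1.0)
x_independent = rng.normal(size=labels.size)
print(spacing_mutual_information(x_dependent, labels))
print(spacing_mutual_information(x_independent, labels))
```

The sketch assumes continuous data without ties, since a zero spacing would make the logarithm undefined.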

A Sequence of Models for Categorical Data with Compound Scales (복합척도의 범주형 자료에 대한 연속 모형)

  • 최재성
    • The Korean Journal of Applied Statistics
    • /
    • v.14 no.1
    • /
    • pp.103-110
    • /
    • 2001
  • This paper considers a multistage experiment. Response scales can be the same or different from stage to stage. When variables have a nested structure, the response variable at each stage can be defined conditionally. For analysing such data with compound scales, this paper suggests a sequence of dependence models and shows how to set up a sequence of models for driver's license test data.

Suppression and Collapsibility for Log-linear Models

  • Sun, Hong-Chong
    • Communications for Statistical Applications and Methods
    • /
    • v.11 no.3
    • /
    • pp.519-527
    • /
    • 2004
  • The relationship between the partial likelihood ratio statistics for logistic models and the partial goodness-of-fit statistics for the corresponding log-linear models is discussed. This paper shows how definitions of suppression in the logistic model can be adapted for the log-linear model and how they are related to confounding in terms of collapsibility for categorical data. Several $2 \times 2 \times 2$ contingency tables are used as illustrations.
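
Collapsibility can be illustrated on a $2 \times 2 \times 2$ table by comparing the conditional odds ratios within each layer of the third variable with the marginal odds ratio obtained after summing over it. The counts below are hypothetical and chosen only to show a non-collapsible case. They are not taken from the paper, and the sketch does not implement its suppression definitions.

```python
import numpy as np


def odds_ratio(table):
    """Odds ratio of a 2x2 table [[a, b], [c, d]], computed as (a * d) / (b * c)."""
    (a, b), (c, d) = np.asarray(table, dtype=float)
    return (a * d) / (b * c)


# Hypothetical counts indexed as counts[z, x, y].
counts = np.array([
    [[10, 20], [30, 60]],  # layer Z = 0
    [[60, 30], [20, 10]],  # layer Z = 1
])

conditional = [odds_ratio(layer) for layer in counts]
marginal = odds_ratio(counts.sum(axis=0))  # collapse the table over Z
print(conditional, marginal)
```

Here X and Y are unassociated within each layer, yet the collapsed table shows an association, so the X-Y odds ratio is not collapsible over Z in this example.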

A Prediction of Work-life Balance Using Machine Learning

  • Youngkeun Choi
    • Asia pacific journal of information systems
    • /
    • v.34 no.1
    • /
    • pp.209-225
    • /
    • 2024
  • This research aims to use machine learning technology in human resource management to predict employees' work-life balance. The study utilized a dataset from IBM Watson Analytics in the IBM Community for the machine learning analysis. Multinomial dependent variables concerning workers' work-life balance were examined, categorized into continuous and categorical types using the Generalized Linear Model. The complexity of assessing variable roles and their varied impact based on the type of model used was highlighted. The study's outcomes are academically and practically relevant, showcasing how machine learning can offer further understanding of psychological variables like work-life balance through analyzing employee profiles.

Empirical Bayesian Misclassification Analysis on Categorical Data (범주형 자료에서 경험적 베이지안 오분류 분석)

  • 임한승;홍종선;서문섭
    • The Korean Journal of Applied Statistics
    • /
    • v.14 no.1
    • /
    • pp.39-57
    • /
    • 2001
  • Categorical data sometimes contain misclassification errors. If such data are analyzed, the estimated cell probabilities can be biased and the standard Pearson $X^2$ tests may have inflated true type I error rates. On the other hand, if we treat well-classified data as misclassified, we may spend considerable cost and time adjusting for misclassification. Asking whether categorical data are misclassified is therefore a necessary and important step before analysis. In this paper, we consider a two-dimensional contingency table in which one of the two variables is misclassified and the marginal sums of the well-classified variable are fixed. We explore how to partition the marginal sums into the cells via the Bound and Collapse concepts of Sebastiani and Ramoni (1997). The double sampling scheme (Tenenbein 1970) is used to obtain information on misclassification. We propose test statistics to address misclassification problems and examine the behavior of the statistics through simulation studies.

The Correlational Study of Health Promotion Lifestyle and Body Composition in a University Students (일개 대학생의 건강증진 생활양식과 신체조성간의 관계 연구)

  • Park, Yeon-Suk;Lee, Hye-Gyeong
    • Journal of the Korean Society of School Health
    • /
    • v.19 no.1
    • /
    • pp.67-78
    • /
    • 2006
  • Purpose: The purpose of this study was to examine the relationship between a health-promoting lifestyle and body composition in university students. The study subjects were 194 university students who attended K-university located in Chungnam. Methods: The data were collected between March 2 and May 31, 2004. The instrument used for this study was the modified Health Promoting Lifestyle Profile (HPLP) developed by Walker, Sechrist, & Pender (1987). Body composition was measured by In Body 3.0, a Bioelectrical Impedance Analyzer. The data were analyzed using the SPSS/WIN 10.0 program by t-test, ANOVA and Pearson correlation coefficients. Results: The results of this study are as follows: 1) The scores of the Health Promoting Lifestyle (HPL) ranged from 79 to 170, with a mean score of 110 (±15.8). The mean scores of sub-categorical HPL were self-actualization 31.8 (±4.9), health responsibility 17.0 (±4.0), exercise 8.3 (±3.2), nutrition 15.4 (±3.7), interpersonal relationships 20.3 (±3.5) and stress management 17.2 (±3.4). 2) The HPL according to the subjects' general characteristics had a significant correlation with exercise amount (F=8.09, p<.01), drinking amount (F=6.56, p<.01), perceived health status (F=19.2, p<.01) and perceived health knowledge (F=15.9, p<.01). 3) The total HPL did not significantly correlate with any categories in body composition. The exercise area of sub-categorical HPL had a significant positive correlation with height (r=.199, p<.01), weight (r=.181, p<.05) and soft lean mass (r=.257, p<.01), and a negative correlation with percent body fat (r=-.255, p<.01) in body composition. Conclusion: The results suggest that the exercise area of sub-categorical HPL is an important variable in developing exercise programs, such as nursing interventions, for the health promotion of university students.

Information Theory and Data Visualization Approach to Poll Analysis (정보이론과 시각화 방법에 의한 여론조사 분석의 새로운 접근방법)

  • Huh, Moon-Yul;Cha, Woon-Ock
    • The Korean Journal of Applied Statistics
    • /
    • v.20 no.1
    • /
    • pp.61-78
    • /
    • 2007
  • A method for poll analysis using information theory and data visualization is proposed in this paper. The questions of an opinion poll consist of a target variable and many explanatory variables. The explanatory variables are either numerical or categorical. In this study, explanatory variables of mixed types have been ranked according to the magnitude of their effect on the target variable by using mutual information. Likewise, the order of the explanatory variables has been evaluated using data visualization. This is the first study to quantify the impact of a specific explanatory variable on the related target variable.
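
For the categorical case, ranking by mutual information can be sketched with a simple plug-in estimate from observed frequencies, as below. The poll responses and variable names are hypothetical. Numerical questions, which the paper also handles, would need a different estimator and are not covered by this sketch.

```python
from collections import Counter
from math import log


def mutual_information(xs, ys):
    """Plug-in mutual information (nats) between two categorical sequences."""
    n = len(xs)
    joint = Counter(zip(xs, ys))
    px = Counter(xs)
    py = Counter(ys)
    return sum(
        (nxy / n) * log(nxy * n / (px[x] * py[y]))
        for (x, y), nxy in joint.items()
    )


# Hypothetical poll responses: the target and two explanatory questions.
target = ["yes", "yes", "no", "no", "yes", "no", "yes", "no"]
answers = {
    "region": ["n", "n", "s", "s", "n", "s", "n", "s"],  # matches target exactly
    "gender": ["m", "f", "m", "f", "m", "f", "f", "m"],  # unrelated to target
}
ranking = sorted(
    answers,
    key=lambda name: mutual_information(answers[name], target),
    reverse=True,
)
print(ranking)  # ['region', 'gender']
```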