• Title/Summary/Keyword: categorical data analysis

Candidate Marker Identification from Gene Expression Data with Attribute Value Discretization and Negation (속성값 이산화 및 부정값 허용을 하는 의사결정트리 기반의 유전자 발현 데이터의 마커 후보 식별)

  • Lee, Kyung-Mi;Lee, Keon-Myung
    • Journal of the Korean Institute of Intelligent Systems / v.21 no.5 / pp.575-580 / 2011
  • With growing expectations for personalized medicine, it is becoming increasingly important to analyze medical information from a molecular biology perspective. Gene expression data are a representative type of data showing the microscopic phenomena of biological activity. In gene expression data analysis, a major concern is identifying markers that can be used to predict disease occurrence, progression, or recurrence at the molecular level. Existing candidate marker identification methods depend mainly on statistical hypothesis tests. This paper proposes a search method based on decision tree induction to identify candidate markers that consist of multiple genes. The proposed method discretizes numeric expression levels into three categorical values and allows the genes in a candidate marker to be expressed by the negation of a categorical value as well as by the value itself. It is desirable for a marker to include a certain number of genes, so the method is designed to search for candidate markers with a restricted number of genes.
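
The discretization and negation steps described in this abstract can be illustrated with a minimal Python sketch. The cut points (mean plus or minus `k` standard deviations), the level names `low`/`mid`/`high`, and the helper names `discretize` and `literals` are assumptions made for illustration; the abstract does not state the paper's actual thresholds or data structures.

```python
from statistics import mean, stdev

def discretize(values, k=0.5):
    """Map numeric expression levels to 'low' / 'mid' / 'high'.

    Assumption: cut points are mean -/+ k sample standard deviations.
    The paper's real thresholds are not given in the abstract.
    """
    m, s = mean(values), stdev(values)
    lo, hi = m - k * s, m + k * s
    return ["low" if v < lo else "high" if v > hi else "mid" for v in values]

def literals(levels=("low", "mid", "high")):
    """Candidate tests on one gene: each categorical value and its negation."""
    tests = {}
    for lv in levels:
        tests[f"== {lv}"] = lambda x, lv=lv: x == lv  # plain categorical test
        tests[f"!= {lv}"] = lambda x, lv=lv: x != lv  # negated test
    return tests

# Hypothetical expression levels of one gene across five samples.
expr = [1.0, 2.0, 3.0, 4.0, 10.0]
cats = discretize(expr)
tests = literals()
# Which samples satisfy the negated literal "gene is not high"?
not_high = [tests["!= high"](c) for c in cats]
```

Allowing negated literals doubles the number of tests a decision tree can split on per gene (six instead of three here), which is what lets a marker say "gene A is not highly expressed" in a single condition.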

A comparison of imputation methods using machine learning models

  • Heajung Suh;Jongwoo Song
    • Communications for Statistical Applications and Methods / v.30 no.3 / pp.331-341 / 2023
  • Handling missing values in data analysis is essential for constructing a good prediction model. The easiest way to handle missing values is to use only complete cases, but this can lead to information loss and invalid conclusions. Imputation is a technique that replaces missing data with alternative values obtained from information in the dataset. Conventional imputation methods include K-nearest-neighbor imputation and multiple imputation. Recent methods include missForest, missRanger, and mixgb, all of which use machine learning algorithms. This paper compares imputation techniques for datasets with mixed data types under various conditions, such as data size, missing ratio, and missing mechanism. To evaluate the performance of each method on mixed datasets, we propose a new imputation performance measure (IPM), a unified measure applicable to both numerical and categorical variables. We believe this metric can help find the best imputation method. Finally, we summarize the comparison results in terms of imputation performance and computational time.
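
To make the contrast between complete-case analysis and imputation concrete, here is a minimal Python sketch for a mixed-type dataset. It uses the simplest possible baseline (column mean for numeric variables, column mode for categorical ones), which is not one of the methods compared in the paper; the helper names and the toy records are hypothetical.

```python
from statistics import mean, mode

def complete_cases(rows):
    """Keep only the records that have no missing (None) value."""
    return [r for r in rows if None not in r.values()]

def simple_impute(rows):
    """Fill missing values: mean for numeric columns, mode for categorical.

    Assumption: a baseline for illustration only, not KNN, multiple
    imputation, missForest, missRanger, or mixgb.
    """
    cols = list(rows[0].keys())
    fill = {}
    for c in cols:
        observed = [r[c] for r in rows if r[c] is not None]
        numeric = all(isinstance(v, (int, float)) for v in observed)
        fill[c] = mean(observed) if numeric else mode(observed)
    return [{c: (fill[c] if r[c] is None else r[c]) for c in cols} for r in rows]

# Hypothetical records with one numeric and one categorical variable.
rows = [
    {"age": 30, "group": "a"},
    {"age": None, "group": "a"},
    {"age": 50, "group": None},
    {"age": 40, "group": "b"},
]
kept = complete_cases(rows)    # drops the two incomplete records
imputed = simple_impute(rows)  # keeps all four records
```

Complete-case analysis halves this toy dataset, while imputation keeps every record at the cost of introducing estimated values.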

A New Similarity Measure for Categorical Attribute-Based Clustering (범주형 속성 기반 군집화를 위한 새로운 유사 측도)

  • Kim, Min;Jeon, Joo-Hyuk;Woo, Kyung-Gu;Kim, Myoung-Ho
    • Journal of KIISE:Databases / v.37 no.2 / pp.71-81 / 2010
  • Cluster finding is widely used in numerous applications, such as pattern recognition, image analysis, and market analysis. The important factors that determine cluster quality are the similarity measure and the number of attributes. Similarity measures should be defined with respect to the data types. Existing similarity measures work well for numerical attribute values. However, they do not work well when the data are described by categorical attributes, that is, when there is no inherent similarity measure between values. In high-dimensional spaces, conventional clustering algorithms tend to break down because of the sparsity of data points. To overcome this difficulty, subspace clustering has been proposed, based on the observation that different clusters may exist in different subspaces. In this paper, we propose a new similarity measure for clustering high-dimensional categorical data. The measure is based on the idea that, in a good clustering, each cluster should carry information that distinguishes it from the other clusters. We also try to capture attribute dependencies. This study is meaningful because no previous method has used both. Experimental results on real datasets show that the clusters obtained with the proposed similarity measure are good with respect to clustering accuracy.
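
For context on what a similarity measure over categorical attributes looks like, the sketch below implements the simple matching (overlap) coefficient in Python. This is a common baseline and is not the measure proposed in the paper, whose definition is not given in the abstract; the function name and the example records are hypothetical.

```python
def overlap_similarity(a, b):
    """Fraction of attributes on which two categorical records agree."""
    if len(a) != len(b):
        raise ValueError("records must have the same number of attributes")
    return sum(x == y for x, y in zip(a, b)) / len(a)

# Two hypothetical records described by three categorical attributes.
r1 = ("red", "small", "round")
r2 = ("red", "large", "round")
sim = overlap_similarity(r1, r2)  # agree on 2 of 3 attributes
```

The overlap coefficient treats every mismatch as equally severe and every attribute as independent, which is the kind of limitation that motivates measures taking cluster-distinguishing information and attribute dependencies into account.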

The Application of Machine Learning Algorithm In The Analysis of Tissue Microarray; for the Prediction of Clinical Status

  • Cho, Sung-Bum;Kim, Woo-Ho;Kim, Ju-Han
    • Proceedings of the Korean Society for Bioinformatics Conference / 2005.09a / pp.366-370 / 2005
  • Tissue microarray is one of the high-throughput technologies of the post-genomic era. Using tissue microarrays, researchers can investigate the expression of a large number of genes at the level of DNA, RNA, and protein. An important aspect of tissue microarray is its ability to assess many biomarkers that have been used in clinical practice. To handle the categorical data of tissue microarrays, we applied a Bayesian network classifier algorithm. We found that the Bayesian network classifier could analyze tissue microarray data and that integrating prior knowledge about gastric cancer could achieve better performance. The results showed that relevant integration of prior knowledge improved the accuracy of predicting survival status from immunohistochemical tissue microarray data of 18 tumor suppressor genes. In conclusion, the Bayesian network classifier appears appropriate for analyzing tissue microarray data with clinical information.

A Machine Learning Model Learning and Utilization Education Curriculum for Non-majors (비전공자 대상 머신러닝 모델 학습 및 활용교육 커리큘럼)

  • Kyeong Hur
    • Journal of Practical Engineering Education / v.15 no.1 / pp.31-38 / 2023
  • This paper proposes a basic curriculum for teaching non-majors how to train and use machine learning models, together with a teaching method based on Orange, a tool for training and analyzing machine learning models. Orange is an open-source machine learning and data visualization tool that builds machine learning models from data through visual widgets, without complex programming. It is widely used by everyone from non-major undergraduates to expert groups. The paper presents the curriculum and the weekly practice contents for one semester. To demonstrate the practicality of the practice contents, the Orange tool was used to train machine learning models on categorical and numerical data samples and to apply the trained models. Use cases for predicting outcomes for the population are also presented. Finally, the educational satisfaction of non-majors with this curriculum is surveyed and analyzed.

Outlying Cell Identification Method Using Interaction Estimates of Log-linear Models

  • Hong, Chong Sun;Jung, Min Jung
    • Communications for Statistical Applications and Methods / v.10 no.2 / pp.291-303 / 2003
  • This work proposes an alternative method for identifying outlying cells, which is one of the important issues in categorical data analysis. There is a strong relationship between the location of an outlying cell and the corresponding parameter estimates of a well-fitted log-linear model. Among the parameters of a log-linear model, an outlying cell is affected by the interaction terms rather than the main effect terms. Hence an outlying cell can be identified by examining the parameter estimates of an appropriate log-linear model.
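
The link between an outlying cell and interaction estimates can be sketched in Python for the simplest case: a two-way table under the saturated log-linear model with sum-to-zero constraints, where each interaction estimate is the doubly centered log count. This is a simplification chosen for illustration (the paper works with a well-fitted model, not necessarily the saturated one), and the table values and the name `interaction_estimates` are hypothetical. All counts must be positive.

```python
from math import log

def interaction_estimates(table):
    """Interaction terms of the saturated log-linear model for a two-way table.

    With sum-to-zero constraints, the estimate for cell (i, j) is
    log n_ij - (row mean of logs) - (column mean of logs) + (grand mean).
    """
    logs = [[log(n) for n in row] for row in table]
    n_rows, n_cols = len(logs), len(logs[0])
    row_mean = [sum(r) / n_cols for r in logs]
    col_mean = [sum(logs[i][j] for i in range(n_rows)) / n_rows
                for j in range(n_cols)]
    grand = sum(row_mean) / n_rows
    return [[logs[i][j] - row_mean[i] - col_mean[j] + grand
             for j in range(n_cols)] for i in range(n_rows)]

# Hypothetical 3x3 table: uniform counts except one inflated cell.
table = [[20, 20, 20],
         [20, 20, 20],
         [20, 20, 80]]
gamma = interaction_estimates(table)

# The cell with the largest absolute interaction estimate is the suspect.
cells = [(i, j) for i in range(3) for j in range(3)]
suspect = max(cells, key=lambda ij: abs(gamma[ij[0]][ij[1]]))
```

In this toy table the inflated cell in the last row and last column has the largest absolute interaction estimate, while the other cells in its row and column absorb smaller compensating values.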

A Prediction of Work-life Balance Using Machine Learning

  • Youngkeun Choi
    • Asia pacific journal of information systems / v.34 no.1 / pp.209-225 / 2024
  • This research aims to use machine learning technology in human resource management to predict employees' work-life balance. The study utilized a dataset from IBM Watson Analytics in the IBM Community for the machine learning analysis. Multinomial dependent variables concerning workers' work-life balance were examined, categorized into continuous and categorical types using the Generalized Linear Model. The complexity of assessing variable roles and their varied impact based on the type of model used was highlighted. The study's outcomes are academically and practically relevant, showcasing how machine learning can offer further understanding of psychological variables like work-life balance through analyzing employee profiles.

Experiences of convergence external appraisal of competency in core basic nursing skills in final year nursing students (졸업학년 간호학생의 핵심기본간호술 역량에 대한 융합적 외부평가 경험)

  • Hong, Eunhee;Kim, Myo-Gyeong
    • Journal of the Korea Convergence Society / v.8 no.9 / pp.93-104 / 2017
  • This qualitative study explores final-year nursing students' experiences of convergence external appraisal of competency in core basic nursing skills. Eight nursing students who had experienced the evaluation were purposively sampled. Data were collected through focus group interviews and analyzed using constant comparative and categorical content analysis. The results showed that the nursing students sublimated mental stress, such as the burden and pressure of the evaluation, into a positive experience in which they improved their nursing skills through mind control and cooperation with peers. This study therefore helps us understand the experience of students receiving external appraisal. In addition, intervention studies to alleviate the mental stress of convergence external appraisal are needed.

Variable selection for latent class analysis using clustering efficiency (잠재변수 모형에서의 군집효율을 이용한 변수선택)

  • Kim, Seongkyung;Seo, Byungtae
    • The Korean Journal of Applied Statistics / v.31 no.6 / pp.721-732 / 2018
  • Latent class analysis (LCA) is an important tool for exploring unseen latent groups in multivariate categorical data. In practice, it is important to select a suitable set of variables, because including too many variables complicates the model and reduces the accuracy of the parameter estimates. Dean and Raftery (Annals of the Institute of Statistical Mathematics, 62, 11-35, 2010) proposed a headlong search algorithm based on Bayesian information criterion (BIC) values to choose meaningful variables for LCA. In this paper, we propose a new variable selection procedure for LCA that uses the posterior probabilities obtained from each fitted model. We propose a new statistic to measure the adequacy of LCA and develop a variable selection procedure based on it. The effectiveness of the proposed method is demonstrated through numerical studies.
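
The posterior probabilities that the proposed procedure builds on can be computed as in the Python sketch below, which applies Bayes' rule under the usual local independence assumption of LCA. The parameter values are made up for illustration rather than fitted, and the function name `posterior` is hypothetical; the paper's clustering efficiency statistic itself is not defined in the abstract and is not reproduced here.

```python
def posterior(x, class_probs, item_probs):
    """Posterior class membership P(class | x) for one observation.

    class_probs[c]      : prior probability of latent class c
    item_probs[c][j][v] : P(variable j takes value v | class c)
    Variables are assumed independent given the class (local independence).
    """
    joint = []
    for pi_c, theta_c in zip(class_probs, item_probs):
        p = pi_c
        for j, v in enumerate(x):
            p *= theta_c[j][v]
        joint.append(p)
    total = sum(joint)
    return [p / total for p in joint]

# Hypothetical two-class model with two binary variables.
class_probs = [0.5, 0.5]
item_probs = [
    [[0.9, 0.1], [0.8, 0.2]],  # class 0: both variables mostly take value 0
    [[0.2, 0.8], [0.3, 0.7]],  # class 1: both variables mostly take value 1
]
post = posterior((0, 0), class_probs, item_probs)
```

When the posteriors for most observations are close to 0 or 1, the classes are well separated, which is the intuition behind using posterior probabilities to judge how well a set of variables clusters the data.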

Segmentation of Cooperatives' Mutuality Bank for Effective Risk Management using Factor Analysis and Cluster Analysis

  • Cho, Yong-Jun;Ko, Seoung-Gon
    • Journal of the Korean Data and Information Science Society / v.19 no.3 / pp.831-844 / 2008
  • Because cooperatives differ widely in management environment and characteristics, similar cooperatives need to be grouped into a few segments for effective risk management of the cooperatives' mutuality bank. This paper is preliminary research toward guidance on effective risk management for cooperatives with different management strategies. For this purpose, we propose a way to group the members of the cooperatives' mutuality bank. Thirty continuous variables related to the cooperatives' management status are considered, and six factors are extracted from them through factor analysis, with empirical considerations applied to avoid wrong grouping and to improve practical interpretation. Based on the six extracted factors and three additional categorical variables, six representative groups are derived by two-step cluster analysis. These findings are useful for carrying out differentiated risk management and other management strategies for a mutuality bank and others.
