• Title/Summary/Keyword: Selection Methods


Performance Comparison of Classification Methods with the Combinations of the Imputation and Gene Selection Methods

  • Kim, Dong-Uk;Nam, Jin-Hyun;Hong, Kyung-Ha
    • The Korean Journal of Applied Statistics
    • /
    • v.24 no.6
    • /
    • pp.1103-1113
    • /
    • 2011
  • Gene expression data are obtained through many stages of experimentation, and errors produced during the process may cause missing values. Because of the 'small n, large p' nature of the data, genes must be selected before statistical analyses such as classification can be applied. For these reasons, imputation and gene selection are both important in microarray data analysis. In the literature, imputation, gene selection, and classification have mostly been studied separately, even though in practice they form a sequential pipeline. From this perspective, we compare the performance of classification methods after imputation and gene selection methods are applied to microarray data. Numerical simulations are carried out to evaluate the classification methods under various combinations of imputation and gene selection methods.
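The imputation → gene selection → classification pipeline the abstract describes can be sketched in a few lines of NumPy. The mean imputer, F-like gene score, and nearest-centroid classifier below are illustrative stand-ins, not the specific methods compared in the paper, and the data are synthetic.

```python
import numpy as np

def mean_impute(X):
    """Replace each NaN with the observed mean of its gene (column)."""
    X = X.copy()
    col_means = np.nanmean(X, axis=0)
    rows, cols = np.where(np.isnan(X))
    X[rows, cols] = col_means[cols]
    return X

def gene_scores(X, y):
    """F-like score per gene: squared class-mean difference over pooled variance."""
    a, b = X[y == 0], X[y == 1]
    return (a.mean(0) - b.mean(0)) ** 2 / (a.var(0) + b.var(0) + 1e-12)

def nearest_centroid(X_train, y_train, X_test):
    c0, c1 = X_train[y_train == 0].mean(0), X_train[y_train == 1].mean(0)
    return (np.linalg.norm(X_test - c1, axis=1)
            < np.linalg.norm(X_test - c0, axis=1)).astype(int)

# "Small n, large p" toy data: 8 samples, 20 genes, only gene 0 informative.
rng = np.random.default_rng(0)
X = rng.normal(size=(8, 20))
y = np.array([0, 0, 0, 0, 1, 1, 1, 1])
X[y == 1, 0] += 8.0          # gene 0 separates the classes
X[0, 3] = np.nan             # one missing value from a faulty experiment

Xi = mean_impute(X)                                     # step 1: imputation
genes = np.argsort(gene_scores(Xi, y))[::-1][:3]        # step 2: gene selection
pred = nearest_centroid(Xi[:, genes], y, Xi[:, genes])  # step 3: classification
```

The point of the comparison in the paper is that the choice made at each of the three steps interacts with the others, which is why the steps are evaluated in combination rather than in isolation.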

Ensemble Gene Selection Method Based on Multiple Tree Models

  • Mingzhu Lou
    • Journal of Information Processing Systems
    • /
    • v.19 no.5
    • /
    • pp.652-662
    • /
    • 2023
  • Identifying highly discriminating genes is a critical step in tumor recognition tasks based on microarray gene expression profile data and machine learning. Gene selection based on tree models has been the subject of several studies. However, these methods rely on a single-tree model, which is often not robust to ultra-high-dimensional microarray datasets, resulting in the loss of useful information and unsatisfactory classification accuracy. Motivated by the limitations of single-tree-based gene selection, in this study, ensemble gene selection methods based on multiple-tree models were studied to improve the classification performance of tumor identification. Specifically, we selected the three most representative tree models: ID3, random forest, and gradient boosting decision tree. Each tree model selects the top-n genes from the microarray dataset based on its intrinsic mechanism. Subsequently, three ensemble gene selection methods, namely multiple-tree model intersection, multiple-tree model union, and multiple-tree model cross-union, were investigated. Experimental results on five benchmark public microarray gene expression datasets showed that the multiple-tree model union is significantly superior, in classification accuracy, to gene selection based on a single tree model and to other competitive gene selection methods.
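The set operations behind the three ensemble schemes can be sketched as follows. The importance vectors are hand-made stand-ins for the ID3 information gain and the random forest and GBDT feature importances, and the reading of "cross-union" as the union of pairwise intersections is an assumption, not confirmed by the abstract.

```python
import numpy as np

def top_n(scores, n):
    """Indices of the n highest-scoring genes, as a set."""
    return set(np.argsort(scores)[::-1][:n].tolist())

def ensemble_select(score_lists, n, mode):
    """Combine the top-n gene sets of several tree models.
    mode: 'intersection', 'union', or 'cross-union'
    (here taken to mean: union of all pairwise intersections)."""
    sets = [top_n(s, n) for s in score_lists]
    if mode == "intersection":
        out = set.intersection(*sets)
    elif mode == "union":
        out = set.union(*sets)
    else:  # cross-union
        out = set()
        for i in range(len(sets)):
            for j in range(i + 1, len(sets)):
                out |= sets[i] & sets[j]
    return sorted(out)

# Stand-in importance vectors of three tree models over 10 genes:
# all agree on genes 0 and 4, and each has one private favorite.
s1 = np.array([9, 8, 1, 1, 7, 0, 0, 0, 0, 0], float)
s2 = np.array([9, 1, 8, 1, 7, 0, 0, 0, 0, 0], float)
s3 = np.array([9, 1, 1, 8, 7, 0, 0, 0, 0, 0], float)

inter = ensemble_select([s1, s2, s3], n=3, mode="intersection")
union = ensemble_select([s1, s2, s3], n=3, mode="union")
cross = ensemble_select([s1, s2, s3], n=3, mode="cross-union")
```

Intersection keeps only the consensus genes, union also keeps each model's private picks, and cross-union sits in between, which matches the intuition for why the union-style combinations retain more useful information than any single tree model.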

Simulation Optimization with Statistical Selection Method

  • Kim, Ju-Mi
    • Management Science and Financial Engineering
    • /
    • v.13 no.1
    • /
    • pp.1-24
    • /
    • 2007
  • I propose new combined randomized methods for global optimization problems. These methods are based on the Nested Partitions (NP) method, a useful simulation-optimization method that guarantees a globally optimal solution but has several shortcomings. To overcome these shortcomings, I employed various statistical selection methods and combined them with the NP method. I first explain the NP method and the statistical selection methods, then give a detailed description of the proposed combined methods and show the results of an application. I also show how these combined methods can be applied under a computing-budget limit.
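A minimal NP loop for a one-dimensional minimization problem might look like the sketch below. The "selection" step here is the simplest possible rule (pick the region with the best sampled value), which is exactly the step the paper replaces with formal statistical selection procedures; the objective function and all constants are illustrative.

```python
import random

def nested_partitions(f, lo=0.0, hi=1.0, iters=40, samples=20, seed=1):
    """Minimal Nested Partitions sketch for minimizing f on [lo, hi].
    Each iteration splits the current most-promising region in two,
    samples both halves plus the surrounding region, and either
    descends into the winning half or backtracks to the whole space."""
    rng = random.Random(seed)
    a, b = lo, hi
    for _ in range(iters):
        mid = (a + b) / 2.0
        best_val, best_region = float("inf"), (a, b)
        for (ra, rb) in [(a, mid), (mid, b)]:
            for _ in range(samples):
                x = rng.uniform(ra, rb)
                if f(x) < best_val:
                    best_val, best_region = f(x), (ra, rb)
        # sample the surrounding region; if it wins, backtrack
        if a > lo or b < hi:
            for _ in range(samples):
                x = rng.uniform(lo, hi)
                if not (a <= x <= b) and f(x) < best_val:
                    best_val, best_region = f(x), (lo, hi)
        a, b = best_region
    return (a + b) / 2.0

x_star = nested_partitions(lambda x: (x - 0.7) ** 2)
```

Because each iteration spends a fixed sampling budget on the candidate regions, it is natural to ask how to allocate that budget optimally, which is the computing-budget question the abstract mentions.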

Selection probability of multivariate regularization to identify pleiotropic variants in genetic association studies

  • Kim, Kipoong;Sun, Hokeun
    • Communications for Statistical Applications and Methods
    • /
    • v.27 no.5
    • /
    • pp.535-546
    • /
    • 2020
  • In genetic association studies, pleiotropy is the phenomenon in which a variant or a genetic region affects multiple traits or diseases. Many studies have identified cross-phenotype genetic associations, but most statistical approaches for detecting pleiotropy are based on individual tests, where the association of a single variant with multiple traits is tested one at a time. These approaches fail to account for relations among correlated variants. Recently, multivariate regularization methods have been proposed to detect pleiotropy in the analysis of high-dimensional genomic data. However, they suffer from the problem of tuning-parameter selection, which often results in either too many false positives or too few true positives. In this article, we applied selection probability to multivariate regularization methods in order to identify pleiotropic variants associated with multiple phenotypes. Selection probability was applied to the individual elastic-net, unified elastic-net, and multi-response elastic-net regularization methods. In simulation studies, the selection performance of the three multivariate regularization methods was evaluated as the total number of phenotypes, the number of phenotypes associated with a variant, and the correlations among phenotypes were varied. We also applied the regularization methods to a wild bean dataset consisting of 169,028 variants and 17 phenotypes.
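Selection probability in the stability-selection sense can be sketched as below: fit a sparse selector on many random half-samples and record how often each variant is selected, which sidesteps committing to a single tuning-parameter value. A top-k correlation ranking stands in for the elastic-net fits used in the paper, and the data are synthetic.

```python
import numpy as np

def selection_probability(X, Y, n_sub=100, frac=0.5, top_k=3, seed=0):
    """Fraction of random half-samples in which each variant is selected.
    The selector here is a stand-in: the top-k variants by summed
    absolute association with all phenotypes (the paper fits
    elastic-net variants instead)."""
    rng = np.random.default_rng(seed)
    n, p = X.shape
    counts = np.zeros(p)
    for _ in range(n_sub):
        idx = rng.choice(n, size=int(n * frac), replace=False)
        Xs, Ys = X[idx], Y[idx]
        Xc, Yc = Xs - Xs.mean(0), Ys - Ys.mean(0)
        score = np.abs(Xc.T @ Yc).sum(axis=1)   # association summed over phenotypes
        counts[np.argsort(score)[::-1][:top_k]] += 1
    return counts / n_sub

# Synthetic data: variant 0 is pleiotropic (affects all 3 phenotypes).
rng = np.random.default_rng(1)
n, p, q = 100, 30, 3
X = rng.normal(size=(n, p))
B = np.zeros((p, q))
B[0, :] = 1.0
Y = X @ B + 0.5 * rng.normal(size=(n, q))
prob = selection_probability(X, Y)
```

Variants with selection probability close to 1 across subsamples are declared pleiotropic, so the final decision depends on a stable frequency rather than on one arbitrarily tuned penalty.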

A Study on Unbiased Methods in Constructing Classification Trees

  • Lee, Yoon-Mo;Song, Moon Sup
    • Communications for Statistical Applications and Methods
    • /
    • v.9 no.3
    • /
    • pp.809-824
    • /
    • 2002
  • We propose two methods that separate the variable selection step from the split-point selection step, which we call the CHITES method and the F&CHITES method. They adopt some of the best characteristics of CART, CHAID, and QUEST. In the first step, the variable that is most significant for predicting the target class values is selected. In the second step, an exhaustive search is applied to find the splitting point for the variable selected in the first step. We compared the proposed methods with CART and QUEST in terms of variable selection bias and power, error rates, and training times. The proposed methods are not only unbiased in the null case but also powerful for selecting the correct variables in non-null cases.
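The two-step idea, significance-based variable selection followed by an exhaustive split-point search, can be sketched as follows. An ANOVA F statistic and Gini impurity are stand-ins here; the actual CHITES/F&CHITES algorithms use chi-square-based criteria, which the abstract does not spell out.

```python
import numpy as np

def f_statistic(x, y):
    """One-way ANOVA F statistic of feature x across the classes in y
    (a stand-in for the significance tests used in CHITES/F&CHITES)."""
    classes = np.unique(y)
    grand = x.mean()
    ssb = sum(len(x[y == c]) * (x[y == c].mean() - grand) ** 2 for c in classes)
    ssw = sum(((x[y == c] - x[y == c].mean()) ** 2).sum() for c in classes)
    dfb, dfw = len(classes) - 1, len(x) - len(classes)
    return (ssb / dfb) / (ssw / dfw + 1e-12)

def gini(y):
    _, counts = np.unique(y, return_counts=True)
    pr = counts / counts.sum()
    return 1.0 - (pr ** 2).sum()

def two_step_split(X, y):
    # Step 1: variable selection -- most significant feature only.
    var = int(np.argmax([f_statistic(X[:, j], y) for j in range(X.shape[1])]))
    # Step 2: exhaustive split-point search on that one variable.
    xs = np.sort(np.unique(X[:, var]))
    best_cut, best_imp = None, np.inf
    for cut in (xs[:-1] + xs[1:]) / 2:
        left, right = y[X[:, var] <= cut], y[X[:, var] > cut]
        imp = (len(left) * gini(left) + len(right) * gini(right)) / len(y)
        if imp < best_imp:
            best_imp, best_cut = imp, float(cut)
    return var, best_cut

X = np.array([[1., 5.], [2., 5.], [3., 5.], [10., 5.], [11., 5.], [12., 5.]])
y = np.array([0, 0, 0, 1, 1, 1])
var, cut = two_step_split(X, y)
```

Separating the two steps is what removes the selection bias: variables with many distinct values no longer get an advantage merely because they offer more candidate split points in the exhaustive search.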

Evaluation of Attribute Selection Methods and Prior Discretization in Supervised Learning

  • Cha, Woon Ock;Huh, Moon Yul
    • Communications for Statistical Applications and Methods
    • /
    • v.10 no.3
    • /
    • pp.879-894
    • /
    • 2003
  • We evaluated the efficiency of applying attribute selection methods and prior discretization to supervised learning, modelled with C4.5 and Naive Bayes. Three databases were obtained from the UCI data archive, each consisting of continuous attributes except for one decision attribute. Four methods were used for attribute selection: MDI, ReliefF, Gain Ratio, and a consistency-based method. MDI and ReliefF can be used for both continuous and discrete attributes, but the other two methods can be used only for discrete attributes. Discretization was performed using the Fayyad and Irani method. To investigate the effect of noise, noise was introduced into the data sets at levels of 10% or 20%, and both the noisy and the noise-free data were processed through the steps of attribute selection, discretization, and classification. The results of this study indicate that classifying the data based on selected attributes yields higher accuracy than classifying the full data set, and that prior discretization does not lower the accuracy.
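The core of the Fayyad and Irani discretization mentioned above is an entropy-minimizing cut point. The sketch below shows that single step only, on made-up data; the full method applies it recursively and stops via an MDL criterion, which is omitted here.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy of a class-label sequence, in bits."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def best_entropy_cut(values, labels):
    """Boundary that minimizes the weighted class entropy of the two
    resulting intervals -- the core step of Fayyad-Irani discretization."""
    pairs = sorted(zip(values, labels))
    xs = [v for v, _ in pairs]
    ys = [l for _, l in pairs]
    n = len(ys)
    best_cut, best_e = None, float("inf")
    for i in range(1, n):
        if xs[i] == xs[i - 1]:
            continue  # no boundary between identical values
        e = (i / n) * entropy(ys[:i]) + ((n - i) / n) * entropy(ys[i:])
        if e < best_e:
            best_e, best_cut = e, (xs[i] + xs[i - 1]) / 2
    return best_cut, best_e

values = [0.2, 0.5, 0.9, 3.1, 3.4, 3.9]
labels = ['a', 'a', 'a', 'b', 'b', 'b']
cut, e = best_entropy_cut(values, labels)
```

On this toy attribute the best cut falls between the two class clusters and yields zero entropy, i.e. a perfectly class-pure discretization, which is the behavior that lets Gain Ratio and the consistency-based method work on the discretized attributes.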

Two variations of cross-distance selection algorithm in hybrid sufficient dimension reduction

  • Jae Keun Yoo
    • Communications for Statistical Applications and Methods
    • /
    • v.30 no.2
    • /
    • pp.179-189
    • /
    • 2023
  • Hybrid sufficient dimension reduction (SDR) methods that combine the kernel matrices of two different SDR methods through a weighted mean (Ye and Weiss, 2003) require heavy computation and long run times due to bootstrapping. To avoid this, Park et al. (2022) recently developed the so-called cross-distance selection (CDS) algorithm. In this paper, two variations of the original CDS algorithm are proposed, differing in how fully and equally covk-SAVE is treated in the selection procedure. In one variation, called the larger CDS algorithm, covk-SAVE is utilized equally and fairly alongside the other two candidates, SIR-SAVE and covk-DR, but a random selection is then necessary for the final choice. In the other, called the smaller CDS algorithm, SIR-SAVE and covk-DR are utilized with covk-SAVE ruled out completely. Numerical studies confirm that the original CDS algorithm is better than, or competes quite well with, the two proposed variations. A real data example is presented to compare and interpret the decisions made by the three CDS algorithms in practice.

An Empirical Study on Improving the Performance of Text Categorization Considering the Relationships between Feature Selection Criteria and Weighting Methods (자질 선정 기준과 가중치 할당 방식간의 관계를 고려한 문서 자동분류의 개선에 대한 연구)

  • Lee Jae-Yun
    • Journal of the Korean Society for Library and Information Science
    • /
    • v.39 no.2
    • /
    • pp.123-146
    • /
    • 2005
  • This study aims to find consistent strategies for feature selection and feature weighting that can improve the effectiveness and efficiency of a kNN text classifier. Feature selection criteria and feature weighting methods are as important as classification algorithms for achieving good performance in text categorization systems. Most former studies chose conflicting strategies for feature selection criteria and weighting methods. In this study, the performance of several feature selection criteria is measured, taking into account the storage space for inverted index records and the classification time. The classification experiments examine the performance of IDF as a feature selection criterion and the performance of conventional feature selection criteria, e.g., mutual information, as feature weighting methods. The results of these experiments suggest that by using measures that prefer low-frequency features both as the feature selection criterion and as the feature weighting method, classification speed can be increased three- to five-fold without losing classification accuracy.
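The strategy the abstract arrives at, a single low-frequency-preferring measure (here IDF) used both to select and to weight features for a kNN classifier, can be sketched on a toy corpus. The documents, the DF threshold, and the choice of cosine similarity are all illustrative.

```python
import math
from collections import Counter

docs = [["the", "cheap", "pills", "offer"], ["the", "offer", "cheap", "deal"],
        ["the", "meeting", "agenda", "notes"], ["the", "agenda", "project", "notes"]]
labels = ["spam", "spam", "ham", "ham"]

# Document frequency and IDF over the training corpus.
df = Counter(t for d in docs for t in set(d))
N = len(docs)
idf = {t: math.log(N / df[t]) for t in df}

# Feature selection with the SAME low-frequency-preferring measure:
# drop high-DF (low-IDF) terms such as "the".
vocab = sorted(t for t in df if df[t] <= 2)

def vectorize(doc):
    """TF-IDF vector over the selected vocabulary; IDF is the weight."""
    tf = Counter(doc)
    return [tf[t] * idf[t] for t in vocab]

def cosine(u, v):
    num = sum(a * b for a, b in zip(u, v))
    du = math.sqrt(sum(a * a for a in u))
    dv = math.sqrt(sum(b * b for b in v))
    return num / (du * dv) if du and dv else 0.0

def knn_predict(doc, k=1):
    q = vectorize(doc)
    sims = sorted(((cosine(q, vectorize(d)), l)
                   for d, l in zip(docs, labels)), reverse=True)
    return Counter(l for _, l in sims[:k]).most_common(1)[0][0]

pred = knn_predict(["cheap", "offer", "now"])
```

Using one measure for both roles shrinks the inverted index (high-DF terms are gone) and keeps selection and weighting consistent, which is where the reported speedup without accuracy loss comes from.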

Efficient variable selection method using conditional mutual information (조건부 상호정보를 이용한 분류분석에서의 변수선택)

  • Ahn, Chi Kyung;Kim, Donguk
    • Journal of the Korean Data and Information Science Society
    • /
    • v.25 no.5
    • /
    • pp.1079-1094
    • /
    • 2014
  • In this paper, we study efficient gene selection methods using conditional mutual information. We suggest gene selection methods based on conditional mutual information computed with semiparametric methods that utilize the multivariate normal distribution and the Edgeworth approximation. We compare the suggested methods with other methods, such as the mutual information filter, SVM-RFE, and the gene selection method of Cai et al. (2009) (MIGS-original), in SVM classification. These experiments show that gene selection methods using conditional mutual information based on semiparametric methods perform better than the mutual information filter. Furthermore, they take far less computing time than the method of Cai et al. (2009) while achieving similar performance.
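Under the multivariate-normal assumption mentioned in the abstract, conditional mutual information has a closed form in covariance-matrix determinants, which makes a greedy forward selection easy to sketch. The synthetic data and the greedy rule below are illustrative, not the paper's exact procedure.

```python
import numpy as np

def gaussian_cmi(data, x, y, z):
    """I(X; Y | Z) under a multivariate normal assumption:
    0.5 * (log|S_xz| + log|S_yz| - log|S_z| - log|S_xyz|),
    where S_* are covariance submatrices over the listed columns."""
    def logdet(cols):
        if not cols:
            return 0.0
        k = len(cols)
        S = np.cov(data[:, cols], rowvar=False).reshape(k, k)
        return np.linalg.slogdet(S)[1]
    return 0.5 * (logdet([x] + z) + logdet([y] + z)
                  - logdet(z) - logdet([x, y] + z))

def greedy_select(data, target, candidates, k):
    """Forward selection: repeatedly add the candidate gene with the
    largest CMI with the target, conditional on the genes already chosen."""
    chosen = []
    for _ in range(k):
        best = max((c for c in candidates if c not in chosen),
                   key=lambda c: gaussian_cmi(data, c, target, chosen))
        chosen.append(best)
    return chosen

# Synthetic example: gene 0 drives the response, gene 1 is a noisy copy
# of gene 0 (redundant), gene 2 carries independent signal.
rng = np.random.default_rng(2)
n = 2000
x0 = rng.normal(size=n)
x1 = x0 + 0.5 * rng.normal(size=n)
x2 = rng.normal(size=n)
y = x0 + 0.8 * x2 + 0.5 * rng.normal(size=n)
data = np.column_stack([x0, x1, x2, y])

order = greedy_select(data, target=3, candidates=[0, 1, 2], k=2)
```

Conditioning is what separates this from a plain mutual information filter: after gene 0 is chosen, its redundant copy contributes almost no conditional information, so the independent gene 2 is picked next.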

Effects of selection index coefficients that ignore reliability on economic weights and selection responses during practical selection

  • Togashi, Kenji;Adachi, Kazunori;Yasumori, Takanori;Kurogi, Kazuhito;Nozaki, Takayoshi;Onogi, Akio;Atagi, Yamato;Takahashi, Tsutomu
    • Asian-Australasian Journal of Animal Sciences
    • /
    • v.31 no.1
    • /
    • pp.19-25
    • /
    • 2018
  • Objective: In practical breeding, selection is often performed by ignoring the accuracy of evaluations and applying economic weights directly to the selection index coefficients of genetically standardized traits. The denominator of the standardized component trait of estimated genetic evaluations in practical selection varies with its reliability. Whereas theoretical methods for calculating the selection index coefficients of genetically standardized traits account for this variation, practical selection ignores reliability and assumes that it is equal to unity for each trait. The purpose of this study was to clarify the effects of ignoring the accuracy of the standardized component trait in selection criteria on selection responses and economic weights in retrospect. Methods: Theoretical methods were presented accounting for reliability of estimated genetic evaluations for the selection index composed of genetically standardized traits. Results: Selection responses and economic weights in retrospect resulting from practical selection were greater than those resulting from theoretical selection accounting for reliability when the accuracy of the estimated breeding value (EBV) or genomically enhanced breeding value (GEBV) was lower than those of the other traits in the index, but the opposite occurred when the accuracy of the EBV or GEBV was greater than those of the other traits. This trend was more conspicuous for traits with low economic weights than for those with high weights. Conclusion: Failure of the practical index to account for reliability yielded economic weights in retrospect that differed from those obtained with the theoretical index. Our results indicated that practical indices that ignore reliability delay genetic improvement. Therefore, selection practices need to account for reliability, especially when the reliabilities of the traits included in the index vary widely.
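The standardization issue the abstract describes can be illustrated with a few lines of arithmetic: the variance of an EBV is the reliability times the genetic variance, so standardizing by the genetic SD (reliability assumed to be unity) rescales the intended coefficient by the square root of the reliability. The trait values below are illustrative numbers, not from the paper.

```python
import math

# Illustrative genetic SDs, reliabilities, and economic weights
# for three traits in the index.
sigma_g = [1.0, 1.0, 1.0]
rel     = [0.9, 0.5, 0.9]
w       = [1.0, 1.0, 1.0]

# Var(EBV) = reliability * Var(g), so SD(EBV) = sqrt(r) * sigma_g.
sigma_ebv = [math.sqrt(r) * s for r, s in zip(rel, sigma_g)]

# Coefficient effectively applied to the raw EBV:
#   practical index divides by sigma_g      (reliability assumed 1),
#   theoretical index divides by sigma_ebv  (reliability accounted for).
practical   = [wi / s for wi, s in zip(w, sigma_g)]
theoretical = [wi / s for wi, s in zip(w, sigma_ebv)]

# The practical coefficient understates the theoretical one by sqrt(r_i),
# so the trait with the lowest reliability is distorted the most.
ratio = [p / t for p, t in zip(practical, theoretical)]
```

The uneven rescaling across traits is why the abstract concludes that ignoring reliability shifts the economic weights in retrospect, most visibly when reliabilities vary widely across the traits in the index.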