• Title/Summary/Keyword: Gene Selection

Ensemble Gene Selection Method Based on Multiple Tree Models

  • Mingzhu Lou
    • Journal of Information Processing Systems
    • /
    • v.19 no.5
    • /
    • pp.652-662
    • /
    • 2023
  • Identifying highly discriminating genes is a critical step in tumor recognition tasks based on microarray gene expression profile data and machine learning. Gene selection based on tree models has been the subject of several studies. However, these methods are based on a single-tree model and are often not robust to ultra-high-dimensional microarray datasets, resulting in the loss of useful information and unsatisfactory classification accuracy. Motivated by the limitations of single-tree-based gene selection, this study investigated ensemble gene selection methods based on multiple-tree models to improve the classification performance of tumor identification. Specifically, we selected the three most representative tree models: ID3, random forest, and gradient boosting decision tree. Each tree model selects the top-n genes from the microarray dataset based on its intrinsic mechanism. Subsequently, three ensemble gene selection methods were investigated, namely multiple-tree model intersection, multiple-tree module union, and multiple-tree module cross-union. Experimental results on five benchmark public microarray gene expression datasets showed that the multiple-tree module union is significantly superior to gene selection based on a single tree model and to other competitive gene selection methods in classification accuracy.
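The intersection and union ensembles described in this abstract can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the importance scores, the gene names `g1` to `g4`, and the helper names `top_n`, `intersection`, and `union` are all hypothetical, and the cross-union variant is omitted because the abstract does not define it.

```python
def top_n(scores, n):
    """Return the n gene names with the highest importance score."""
    ranked = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
    return [gene for gene, _ in ranked[:n]]


def intersection(selections):
    """Keep only genes chosen by every model."""
    result = set(selections[0])
    for selected in selections[1:]:
        result &= set(selected)
    return result


def union(selections):
    """Keep genes chosen by at least one model."""
    result = set()
    for selected in selections:
        result |= set(selected)
    return result


# Hypothetical per-model importance scores (assumed, not from the paper).
id3_scores = {"g1": 0.9, "g2": 0.7, "g3": 0.1, "g4": 0.3}
rf_scores = {"g1": 0.8, "g2": 0.2, "g3": 0.6, "g4": 0.1}
gbdt_scores = {"g1": 0.7, "g2": 0.6, "g3": 0.1, "g4": 0.5}

selections = [top_n(s, 2) for s in (id3_scores, rf_scores, gbdt_scores)]
common = intersection(selections)
pooled = union(selections)

print(sorted(common))  # ['g1']
print(sorted(pooled))  # ['g1', 'g2', 'g3']
```

The intersection is the stricter of the two: a gene survives only if all three models rank it in their top n, whereas the union keeps any gene that at least one model considers important.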

Informative Gene Selection Method in Tumor Classification

  • Lee, Hyosoo;Park, Jong Hoon
    • Genomics & Informatics
    • /
    • v.2 no.1
    • /
    • pp.19-29
    • /
    • 2004
  • Gene expression profiles may offer more information than morphology and provide an alternative to morphology-based tumor classification systems. Informative gene selection means finding gene subsets that can discriminate between tumor types and that may have a clear biological interpretation. Gene selection is a fundamental issue in gene expression-based tumor classification. In this report, techniques for selecting informative genes are illustrated, and supervised shaving is introduced as a gene selection method in place of a clustering algorithm. The supervised shaving method showed good performance in gene selection and classification, even though it is a clustering algorithm. Almost all of the selected genes are related to leukemia. The expression profiles of 3051 genes were analyzed in 27 acute lymphoblastic leukemia and 11 myeloid leukemia samples. Through these examples, the supervised shaving method has been shown to produce biologically significant genes with more than 94% classification accuracy. In this report, SVM has also been shown to be a practicable method for gene expression-based classification.

Performance Comparison of Classification Methods with the Combinations of the Imputation and Gene Selection Methods

  • Kim, Dong-Uk;Nam, Jin-Hyun;Hong, Kyung-Ha
    • The Korean Journal of Applied Statistics
    • /
    • v.24 no.6
    • /
    • pp.1103-1113
    • /
    • 2011
  • Gene expression data are obtained through many experimental stages, and errors produced during the process may cause missing values. Because the data have the so-called 'small n, large p' property, genes have to be selected before statistical analyses such as classification. For this reason, imputation and gene selection are important in microarray data analysis. In the literature, imputation, gene selection, and classification analysis have been studied separately. In practice, however, they are applied as sequential steps. We therefore compare the performance of classification methods after imputation and gene selection methods have been applied to microarray data. Numerical simulations are carried out to evaluate the classification methods under various combinations of imputation and gene selection methods.
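The sequential order described here (impute first, then select genes, then classify) can be sketched as below. The abstract does not name the specific methods compared, so everything in this sketch is an assumption chosen for brevity: per-gene mean imputation, ranking by the absolute difference in class means, and the function names `impute_gene_means` and `select_genes`.

```python
from statistics import mean


def impute_gene_means(X):
    """Replace each missing value (None) with the mean of that gene's observed values."""
    n_genes = len(X[0])
    col_means = []
    for j in range(n_genes):
        observed = [row[j] for row in X if row[j] is not None]
        col_means.append(mean(observed))
    return [[col_means[j] if v is None else v for j, v in enumerate(row)]
            for row in X]


def select_genes(X, y, k):
    """Return the indices of the k genes with the largest gap between class means."""
    scores = []
    for j in range(len(X[0])):
        m0 = mean(row[j] for row, label in zip(X, y) if label == 0)
        m1 = mean(row[j] for row, label in zip(X, y) if label == 1)
        scores.append((abs(m1 - m0), j))
    top = sorted(scores, reverse=True)[:k]
    return sorted(j for _, j in top)


# Toy data: 4 samples x 3 genes, with two missing entries (hypothetical values).
X = [[1.0, 5.0, None],
     [1.2, None, 3.0],
     [3.0, 5.1, 3.1],
     [3.2, 4.9, 2.9]]
y = [0, 0, 1, 1]

X_imputed = impute_gene_means(X)          # step 1: imputation
selected = select_genes(X_imputed, y, 1)  # step 2: gene selection
print(selected)  # [0]
# Step 3 would train a classifier on the selected columns only.
```

Because selection runs on the imputed matrix, the choice of imputation method can change which genes are kept, which is why the combinations are worth comparing jointly.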

A review of gene selection methods based on machine learning approaches (기계학습 접근법에 기반한 유전자 선택 방법들에 대한 리뷰)

  • Lee, Hajoung;Kim, Jaejik
    • The Korean Journal of Applied Statistics
    • /
    • v.35 no.5
    • /
    • pp.667-684
    • /
    • 2022
  • Gene expression data present the level of mRNA abundance of each gene, and analyses of gene expression have provided key ideas for understanding disease mechanisms and developing new drugs and therapies. High-throughput technologies such as DNA microarray and RNA-sequencing now enable the simultaneous measurement of thousands of gene expressions, giving rise to a characteristic of gene expression data known as high dimensionality. Because of this high dimensionality, learning models for gene expression data are prone to overfitting, and dimension reduction or feature selection techniques are commonly used as a preprocessing step to address this issue. In particular, gene selection methods can remove irrelevant and redundant genes and identify important genes in the preprocessing step. Various gene selection methods have been developed in the context of machine learning so far. In this paper, we intensively review recent works on gene selection methods using machine learning approaches. In addition, the underlying difficulties with current gene selection methods as well as future research directions are discussed.

Biological Feature Selection and Disease Gene Identification using New Stepwise Random Forests

  • Hwang, Wook-Yeon
    • Industrial Engineering and Management Systems
    • /
    • v.16 no.1
    • /
    • pp.64-79
    • /
    • 2017
  • Identifying disease genes from the human genome is a critical task in biomedical research. Important biological features to distinguish the disease genes from the non-disease genes have been mainly selected based on traditional feature selection approaches. However, the traditional feature selection approaches unnecessarily consider many unimportant biological features. As a result, although some of the existing classification techniques have been applied to disease gene identification, the prediction performance was not satisfactory. A small set of the most important biological features can enhance the accuracy of disease gene identification, as well as provide potentially useful knowledge for biologists or clinicians, who can further investigate the selected biological features as well as the potential disease genes. In this paper, we propose a new stepwise random forests (SRF) approach for biological feature selection and disease gene identification. The SRF approach consists of two stages. In the first stage, only important biological features are iteratively selected in a forward selection manner based on one-dimensional random forest regression, where the updated residual vector is considered as the current response vector. We can then determine a small set of important biological features. In the second stage, random forests classification with regard to the selected biological features is applied to identify disease genes. Our extensive experiments show that the proposed SRF approach outperforms the existing feature selection and classification techniques in terms of biological feature selection and disease gene identification.
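The first-stage loop (fit one feature at a time, pick the best, then treat the residual as the new response) can be illustrated with the sketch below. To keep it dependency-free, a one-dimensional least-squares line stands in for the one-dimensional random forest regression named in the abstract, so this shows only the forward-selection-on-residuals structure, not SRF itself. The names `fit_1d`, `stepwise_select`, and the toy features `f1` to `f3` are hypothetical.

```python
def fit_1d(x, r):
    """Least-squares fit r ~ a + b*x; returns (a, b, sum of squared errors)."""
    n = len(x)
    mx = sum(x) / n
    mr = sum(r) / n
    sxx = sum((xi - mx) ** 2 for xi in x)
    sxr = sum((xi - mx) * (ri - mr) for xi, ri in zip(x, r))
    b = 0.0 if sxx == 0 else sxr / sxx
    a = mr - b * mx
    sse = sum((ri - (a + b * xi)) ** 2 for xi, ri in zip(x, r))
    return a, b, sse


def stepwise_select(columns, y, n_features):
    """Forward selection: each round fits every unused feature to the current residual."""
    residual = list(y)
    selected = []
    for _ in range(n_features):
        best = None
        for name, x in columns.items():
            if name in selected:
                continue
            a, b, sse = fit_1d(x, residual)
            if best is None or sse < best[3]:
                best = (name, a, b, sse)
        name, a, b, _ = best
        selected.append(name)
        # The updated residual becomes the response for the next round.
        residual = [ri - (a + b * xi)
                    for xi, ri in zip(columns[name], residual)]
    return selected


# Toy data (hypothetical): the response depends on f1 and f2, not on f3.
columns = {
    "f1": [0.0, 1.0, 2.0, 3.0],
    "f2": [1.0, 0.0, 1.0, 0.0],
    "f3": [2.0, 2.0, 1.0, 3.0],
}
y = [2 * a + b for a, b in zip(columns["f1"], columns["f2"])]

chosen = stepwise_select(columns, y, 2)
print(chosen)  # ['f1', 'f2']
```

In the second stage described in the abstract, a classifier would then be trained on the chosen features only.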

Marker-Assisted Foreground and Background Selection of Near Isogenic Lines for Bacterial Leaf Pustule Resistant Gene in Soybean

  • Kim, Kil-Hyun;Kim, Moon-Young;Van, Kyu-Jung;Moon, Jung-Kyung;Kim, Dong-Hyun;Lee, Suk-Ha
    • Journal of Crop Science and Biotechnology
    • /
    • v.11 no.4
    • /
    • pp.263-268
    • /
    • 2008
  • Bacterial leaf pustule (BLP), caused by Xanthomonas axonopodis pv. glycines, is a serious disease that produces pustules and chlorotic haloes in soybean [Glycine max (L.) Merr.]. While the inheritance mode and map position of the BLP resistance gene, rxp, are known, no sequence information for the gene has been reported. In this study, we developed five near isogenic lines (NILs) from separate backcrosses (BCs) of BLP-susceptible Hwangkeumkong × BLP-resistant SS2-2 (HS) and BLP-susceptible Taekwangkong × SS2-2 (TS) through foreground and background selection based on the four-stage selection strategy. First, 15 BC individuals were selected through foreground selection using the simple sequence repeat (SSR) markers Satt486 and Satt372 flanking the rxp gene. Among them, 11 BC plants showed the BLP-resistant response. The HS and TS lines chosen in foreground selection were then screened by background selection using 118 and 90 SSR markers across all chromosomes, respectively. Eventually, five individuals showing greater than 90% recurrent parent genome content were selected in both HS and TS lines. These NILs will be a unique biological material for characterizing the rxp gene.

Rank-based Multiclass Gene Selection for Cancer Classification with Naive Bayes Classifiers based on Gene Expression Profiles (나이브 베이스 분류기를 이용한 유전발현 데이타기반 암 분류를 위한 순위기반 다중클래스 유전자 선택)

  • Hong, Jin-Hyuk;Cho, Sung-Bae
    • Journal of KIISE:Computer Systems and Theory
    • /
    • v.35 no.8
    • /
    • pp.372-377
    • /
    • 2008
  • Multiclass cancer classification based on gene expression profiles has been actively investigated; it determines the type of cancer by analyzing the large amount of gene expression data collected with DNA microarray technology. Since gene expression data include many genes not related to a target cancer, informative genes must be selected in order to obtain highly accurate classification. Conventional rank-based gene selection methods often use ideal marker genes devised for binary classification, so it is difficult to apply them directly to multiclass classification. In this paper, we propose a novel method for multiclass gene selection, which does not use ideal marker genes but directly analyzes the distribution of gene expression. It measures class discriminability by discretizing gene expression levels into several regions and analyzing the frequency of training samples in each region, and then classifies samples using the naive Bayes classifier. We have demonstrated the usefulness of the proposed method on various representative benchmark datasets for multiclass cancer classification.
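The discretize-then-count idea can be sketched as follows: expression levels are cut into regions, per-class sample frequencies are tallied for each region, and a naive Bayes rule classifies a new sample from those frequencies. This is only an illustration under stated assumptions. The abstract does not specify how regions are formed or how class discriminability is scored, so equal-width bins and add-one smoothing are assumed here, and the names `make_binner`, `train`, and `predict` are hypothetical.

```python
import math
from collections import Counter


def make_binner(values, n_bins):
    """Build a function mapping an expression level to an equal-width region index."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins or 1.0  # avoid zero width for constant genes

    def to_bin(v):
        return min(max(int((v - lo) / width), 0), n_bins - 1)

    return to_bin


def train(X, y, n_bins=3):
    """Count, per class and gene, how many training samples fall in each region."""
    n_genes = len(X[0])
    binners = [make_binner([row[j] for row in X], n_bins)
               for j in range(n_genes)]
    class_counts = Counter(y)
    freq = {c: [[0] * n_bins for _ in range(n_genes)] for c in class_counts}
    for row, c in zip(X, y):
        for j, v in enumerate(row):
            freq[c][j][binners[j](v)] += 1
    return binners, class_counts, freq, n_bins


def predict(model, row):
    """Naive Bayes decision using the region frequencies (add-one smoothing)."""
    binners, class_counts, freq, n_bins = model
    total = sum(class_counts.values())
    best_class, best_score = None, -math.inf
    for c, n_c in class_counts.items():
        score = math.log(n_c / total)
        for j, v in enumerate(row):
            count = freq[c][j][binners[j](v)]
            score += math.log((count + 1) / (n_c + n_bins))
        if score > best_score:
            best_class, best_score = c, score
    return best_class


# Toy three-class data (hypothetical): gene 0 separates classes, gene 1 does not.
X = [[0.1, 5.0], [0.2, 5.1], [1.0, 5.0], [1.1, 5.1], [2.0, 5.0], [2.1, 5.1]]
y = ["A", "A", "B", "B", "C", "C"]

model = train(X, y)
print(predict(model, [0.15, 5.0]))  # A
print(predict(model, [1.05, 5.1]))  # B
```

In this toy data, gene 0 places each class in its own region while gene 1 spreads every class evenly, which is the kind of contrast a frequency-based discriminability score would be expected to capture.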

Changes in Reproductive Traits of Large White Pigs after Estrogen Receptor Gene-based Selection in Slovakia: Preliminary Results

  • Chvojkova, Zuzana;Hraska, S.
    • Asian-Australasian Journal of Animal Sciences
    • /
    • v.21 no.3
    • /
    • pp.320-324
    • /
    • 2008
  • We investigated the effect of ESR gene-based selection on the improvement of litter size in herds under real (non-experimental) conditions. The pigs were selected for three years. In the tested population, the pigs were mated according to a breeding scheme in which individuals with at least one ESR-B allele were preferred in the selection. In the control group (CP; n = 140), the pigs were mated according to a breeding scheme without knowledge of the ESR genotype. We observed a significant increase in litter size (total number born, number born alive, and number of weaned piglets per litter) in the final tested ESR-selected population (LP; n = 184) and an insignificant increase in CP as compared with the original population (OP; n = 155). After the selection, we observed a significant increase in the frequency of allele B in LP. The frequency of the genotypes AB and BB increased in both LP and CP; the distribution of the genotypes changed significantly only in LP. An association analysis of the ESR gene effects on reproductive traits in LP showed no significant differences between the genotypes. The results of our study suggest that ESR gene-based selection can be successful even in small herds, under real (non-experimental) conditions, with respect for general breeding principles and limitations, and over a short period. An examination of a larger sample population as well as an analysis of the consequences of selection on other traits (meat and carcass quality) could bring a more conclusive evaluation of ESR-based selection. Nevertheless, the results are encouraging, especially for small breeding farms seeking to improve litter size.

Efficient variable selection method using conditional mutual information (조건부 상호정보를 이용한 분류분석에서의 변수선택)

  • Ahn, Chi Kyung;Kim, Donguk
    • Journal of the Korean Data and Information Science Society
    • /
    • v.25 no.5
    • /
    • pp.1079-1094
    • /
    • 2014
  • In this paper, we study efficient gene selection methods using conditional mutual information. We suggest gene selection methods using conditional mutual information based on semiparametric methods that utilize the multivariate normal distribution and the Edgeworth approximation. We compare our suggested methods with other methods, such as the mutual information filter, SVM-RFE, and the gene selection method of Cai et al. (2009) (MIGS-original), in SVM classification. Through these experiments, we show that gene selection methods using conditional mutual information based on semiparametric methods perform better than the mutual information filter. Furthermore, we show that they take far less computing time than the gene selection method of Cai et al. (2009) while achieving similar performance.
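For readers unfamiliar with the quantity, conditional mutual information I(X; Y | Z) measures how much a candidate gene X tells us about the class Y beyond what an already selected gene Z provides. The sketch below uses a simple plug-in estimator for discrete data; it is not the semiparametric (multivariate normal or Edgeworth) estimator studied in the paper, and the function name and toy vectors are hypothetical.

```python
import math
from collections import Counter


def conditional_mutual_information(xs, ys, zs):
    """Plug-in estimate of I(X; Y | Z) in nats for discrete sequences."""
    n = len(xs)
    n_xyz = Counter(zip(xs, ys, zs))
    n_xz = Counter(zip(xs, zs))
    n_yz = Counter(zip(ys, zs))
    n_z = Counter(zs)
    cmi = 0.0
    for (x, y, z), c in n_xyz.items():
        # p(z) p(x,y,z) / (p(x,z) p(y,z)) simplifies to count ratios.
        ratio = (n_z[z] * c) / (n_xz[(x, z)] * n_yz[(y, z)])
        cmi += (c / n) * math.log(ratio)
    return cmi


label = [0, 1, 0, 1]
selected_gene = [0, 0, 1, 1]

# A candidate that determines the label carries log(2) nats of new information.
informative = conditional_mutual_information([0, 1, 0, 1], label, selected_gene)

# A candidate identical to the already selected gene adds nothing.
redundant = conditional_mutual_information(selected_gene, label, selected_gene)

print(round(informative, 4))  # 0.6931
print(redundant)              # 0.0
```

The second case shows why conditioning matters for gene selection: a gene that merely duplicates one already chosen scores zero, however relevant it might look on its own.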

Feature Selection via Embedded Learning Based on Tangent Space Alignment for Microarray Data

  • Ye, Xiucai;Sakurai, Tetsuya
    • Journal of Computing Science and Engineering
    • /
    • v.11 no.4
    • /
    • pp.121-129
    • /
    • 2017
  • Feature selection has been widely established as an efficient technique for microarray data analysis. Feature selection aims to search for the most important feature/gene subset of a given dataset according to its relevance to the current target. Unsupervised feature selection is considered to be challenging due to the lack of label information. In this paper, we propose a novel method for unsupervised feature selection, which incorporates embedded learning and $\ell_{2,1}$-norm sparse regression into a framework to select genes in microarray data analysis. Local tangent space alignment is applied during embedded learning to preserve the local data structure. The $\ell_{2,1}$-norm sparse regression acts as a constraint to aid in learning the gene weights correlatively, by which the proposed method optimizes for selecting the informative genes which better capture the interesting natural classes of samples. We provide an effective algorithm to solve the optimization problem in our method. Finally, to validate the efficacy of the proposed method, we evaluate the proposed method on real microarray gene expression datasets. The experimental results demonstrate that the proposed method obtains quite promising performance.
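The ℓ2,1-norm of a weight matrix is the sum of the Euclidean norms of its rows, so penalizing it pushes whole rows toward zero. With one row per gene, genes can then be ranked by row norm. The sketch below shows only that norm and ranking step, assuming a hypothetical, already learned weight matrix `W`; the tangent space alignment and the optimization algorithm from the paper are not reproduced, and the helper names are illustrative.

```python
import math


def row_norms(W):
    """Euclidean norm of each row (one row per gene)."""
    return [math.sqrt(sum(w * w for w in row)) for row in W]


def l21_norm(W):
    """Sum of the row-wise Euclidean norms."""
    return sum(row_norms(W))


def top_genes(W, k):
    """Indices of the k genes with the largest row norms."""
    norms = row_norms(W)
    return sorted(range(len(W)), key=lambda i: -norms[i])[:k]


# Hypothetical weight matrix: 3 genes x 2 embedding dimensions.
W = [[3.0, 4.0],
     [0.0, 0.0],
     [1.0, 0.0]]

print(row_norms(W))     # [5.0, 0.0, 1.0]
print(l21_norm(W))      # 6.0
print(top_genes(W, 2))  # [0, 2]
```

A gene whose row is entirely zero, like the second one here, contributes nothing to any embedding dimension and is the natural candidate for removal.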