• Title/Abstract/Keywords: gene tree

407 search results, processing time 0.021 seconds

Ensemble Gene Selection Method Based on Multiple Tree Models

  • Mingzhu Lou
    • Journal of Information Processing Systems
    • /
    • Vol. 19, No. 5
    • /
    • pp.652-662
    • /
    • 2023
  • Identifying highly discriminating genes is a critical step in tumor recognition tasks based on microarray gene expression profile data and machine learning. Gene selection based on tree models has been the subject of several studies. However, these methods are based on a single-tree model and are often not robust to ultra-high-dimensional microarray datasets, resulting in the loss of useful information and unsatisfactory classification accuracy. Motivated by the limitations of single-tree-based gene selection, this study examined ensemble gene selection methods based on multiple-tree models to improve the classification performance of tumor identification. Specifically, we selected the three most representative tree models: ID3, random forest, and gradient boosting decision tree. Each tree model selects the top-n genes from the microarray dataset based on its intrinsic mechanism. Subsequently, three ensemble gene selection methods were investigated, namely multiple-tree model intersection, multiple-tree module union, and multiple-tree module cross-union. Experimental results on five benchmark public microarray gene expression datasets showed that the multiple-tree module union is significantly superior to gene selection based on a single tree model and other competitive gene selection methods in classification accuracy.
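The intersection and union ensembles described in this abstract can be illustrated with a minimal sketch. The importance scores, gene names, and the helper functions `top_n` and `ensemble_select` below are hypothetical and are not taken from the paper; the cross-union variant is omitted because the abstract does not define it.

```python
from typing import Dict, List, Set


def top_n(scores: Dict[str, float], n: int) -> List[str]:
    """Return the n genes with the highest importance scores (ties broken by name)."""
    return sorted(scores, key=lambda g: (-scores[g], g))[:n]


def ensemble_select(rankings: List[List[str]]) -> Dict[str, Set[str]]:
    """Combine per-model top-n gene lists by set intersection and set union."""
    sets = [set(r) for r in rankings]
    return {
        "intersection": set.intersection(*sets),
        "union": set.union(*sets),
    }


# Hypothetical importance scores, standing in for what ID3, random forest,
# and a gradient boosting decision tree might each assign to four genes.
id3 = {"g1": 0.9, "g2": 0.5, "g3": 0.1, "g4": 0.4}
rf = {"g1": 0.7, "g2": 0.1, "g3": 0.6, "g4": 0.2}
gbdt = {"g1": 0.8, "g2": 0.3, "g3": 0.2, "g4": 0.6}

selected = ensemble_select([top_n(s, 2) for s in (id3, rf, gbdt)])
```

With these toy scores, the intersection keeps only the gene that every model ranks in its top two, while the union keeps every gene that any model ranks there.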

Classification of Archaebacteria and Bacteria using a Gene Content Tree Approach

  • 이동근;김수호;이상현;김철민;김상진;이재화
    • KSBB Journal
    • /
    • Vol. 18, No. 1
    • /
    • pp.39-44
    • /
    • 2003
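  • Phylogenetic trees based on gene content (the presence or absence of genes) and on 16S rRNA were compared for 33 microorganisms with completely sequenced genomes, using the neighbor-joining method and the bootstrap method (n=1,000). The ratio of the COGs commonly conserved within each taxonomic group to the number of orthologs held by each microorganism ranged from 4.60% in Mesorhizobium loti to 56.57% in Mycoplasma genitalium. This shows that the extent to which common genes are retained differs among microorganisms, which suggests that unique genes can be searched for. Even within the same species, more than 20% of the orthologs were found to be independent of each other. For Archaebacteria, Proteobacteria, and Firmicutes alike, the gene content tree and the 16S rRNA tree agreed in some parts and disagreed in others. These results were attributed either to the inclusion of genes that are less conserved than 16S rDNA or to effects such as horizontal gene transfer. The COG-based gene content tree occupies an intermediate position between classification based on biochemical experiments and classification based on nucleotide sequences, and could be used to search for useful genes.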
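A gene content tree starts from pairwise distances between genomes computed from which gene families each genome carries. The sketch below is a minimal illustration of that first step, assuming a Jaccard distance over sets of COG identifiers; the abstract does not state which distance the authors used, and the genome names and COG sets are hypothetical. The resulting matrix is the kind of input a neighbor-joining implementation would take.

```python
from itertools import combinations
from typing import Dict, Set, Tuple


def jaccard_distance(a: Set[str], b: Set[str]) -> float:
    """1 - |A ∩ B| / |A ∪ B|; two empty sets are treated as identical."""
    union = a | b
    if not union:
        return 0.0
    return 1.0 - len(a & b) / len(union)


def distance_matrix(content: Dict[str, Set[str]]) -> Dict[Tuple[str, str], float]:
    """Symmetric pairwise distance matrix keyed by (genome, genome)."""
    names = sorted(content)
    dist = {(n, n): 0.0 for n in names}
    for x, y in combinations(names, 2):
        d = jaccard_distance(content[x], content[y])
        dist[(x, y)] = dist[(y, x)] = d
    return dist


# Hypothetical presence/absence data: genome -> set of COG identifiers.
content = {
    "genome_A": {"COG0001", "COG0002", "COG0003"},
    "genome_B": {"COG0001", "COG0002", "COG0004"},
    "genome_C": {"COG0005"},
}
dist = distance_matrix(content)
```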

Gene Sequences Clustering for the Prediction of Functional Domain

  • 한상일;이성근;허보경;변윤섭;황규석
    • Journal of Institute of Control, Robotics and Systems
    • /
    • Vol. 12, No. 10
    • /
    • pp.1044-1049
    • /
    • 2006
  • Multiple sequence alignment is a method to compare two or more DNA or protein sequences. Most multiple sequence alignment tools rely on pairwise alignment and the Smith-Waterman algorithm to generate an alignment hierarchy. Therefore, in the existing multiple alignment method, the runtime increases exponentially as the number of sequences increases. To remedy this problem, we adopted a parallel processing suffix tree algorithm that can search for common subsequences at one time without pairwise alignment. However, cross-matching subsequences that trigger inexact matching may be produced among the common subsequences found, so a cross-matching masking process is suggested in this paper. To identify the function of the clusters generated by suffix tree clustering, BLAST and CDD (Conserved Domain Database) search were combined with a clustering tool. Our clustering and annotating tool consists of constructing a suffix tree, overlapping common subsequences, clustering gene sequences, and annotating gene clusters by BLAST and CDD search. The system was successfully evaluated with 36 gene sequences in the pentose phosphate pathway, producing 10 clusters, finding representative common subsequences, and finally identifying functional domains by searching the CDD database.
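The idea of grouping sequences by shared common subsequences, without pairwise alignment, can be sketched as follows. This is not the paper's parallel suffix tree algorithm: it swaps in a naive k-mer index and union-find grouping purely for illustration, and the sequences, the value of `k`, and the function names are hypothetical.

```python
from collections import defaultdict
from typing import Dict, List, Set


def shared_kmers(seqs: Dict[str, str], k: int) -> Dict[str, Set[str]]:
    """Map each length-k substring to the ids of the sequences containing it,
    keeping only substrings found in at least two sequences."""
    index = defaultdict(set)
    for sid, s in seqs.items():
        for i in range(len(s) - k + 1):
            index[s[i:i + k]].add(sid)
    return {kmer: ids for kmer, ids in index.items() if len(ids) >= 2}


def cluster(seqs: Dict[str, str], k: int) -> List[List[str]]:
    """Group sequences that are linked by at least one shared k-mer."""
    parent = {sid: sid for sid in seqs}

    def find(x: str) -> str:
        # Follow parent links to the root, compressing the path on the way.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for ids in shared_kmers(seqs, k).values():
        first, *rest = sorted(ids)
        for other in rest:
            parent[find(other)] = find(first)

    groups = defaultdict(set)
    for sid in seqs:
        groups[find(sid)].add(sid)
    return sorted(sorted(g) for g in groups.values())


# Hypothetical toy sequences.
seqs = {"s1": "ATGGCGTACG", "s2": "TTGGCGTAAA", "s3": "CCCCCCCCCC"}
clusters = cluster(seqs, 6)
```

A real suffix tree finds common substrings of every length in one pass; the fixed `k` here is the main simplification.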

A Study on Clustering and Identifying Gene Sequences using Suffix Tree Clustering Method and BLAST

  • 한상일;이성근;김경훈;이주영;김영한;황규석
    • Journal of Institute of Control, Robotics and Systems
    • /
    • Vol. 11, No. 10
    • /
    • pp.851-856
    • /
    • 2005
  • DNA and protein data of diverse species are discovered daily and deposited in public archives according to each established format. Database systems in the public archives provide not only an easy-to-use, flexible interface to the public, but also in silico analysis tools for unidentified sequence data. Among such in silico analysis tools, multiple sequence alignment [1] methods relying on pairwise alignment and the Smith-Waterman algorithm [2] enable us to identify unknown DNA or protein sequences, or the phylogenetic relation among several species. However, in the existing multiple alignment method, the runtime increases exponentially as the number of sequences increases. To remedy this problem, we adopted a parallel processing suffix tree algorithm that can search for common subsequences at one time without pairwise alignment. However, cross-matching subsequences that trigger inexact matching may be produced among the common subsequences found, so a cross-matching masking process is suggested in this paper. To identify the function of the clusters generated by suffix tree clustering, BLAST was combined with a clustering tool. Our clustering and annotating tool is summarized in the following steps: (1) construction of the suffix tree; (2) masking of cross-matching pairs; (3) clustering of gene sequences; and (4) annotating gene clusters by BLAST search. The system was successfully evaluated with 22 gene sequences in the pyruvate pathway of bacteria, producing 7 clusters and finding representative common subsequences of each cluster.

Tree-Dependent Components of Gene Expression Data for Clustering

  • 김종경;최승진
    • Korean Institute of Information Scientists and Engineers (KIISE): Conference Proceedings
    • /
    • Proceedings of the KIISE 2006 Korea Computer Congress, Vol.33 No.1 (A)
    • /
    • pp.4-6
    • /
    • 2006
  • Tree-dependent component analysis (TCA) is a generalization of independent component analysis (ICA), the goal of which is to model multivariate data by a linear transformation of latent variables, while the latent variables are fit by a tree-structured graphical model. In contrast to ICA, TCA allows dependent structure among the latent variables and also considers non-spanning trees (forests). In this paper, we present a TCA-based method of clustering gene expression data. An empirical study with yeast cell cycle-related data, yeast metabolic shift data, and yeast sporulation data shows that TCA is more suitable for gene clustering than principal component analysis (PCA) or ICA.

Comparative Genome-Scale Expression Analysis of Growth Phase-dependent Genes in Wild Type and rpoS Mutant of Escherichia coli

  • Oh, Tae-Jeong;Jung, Il-Lae;Woo, Sook-Kyung;Kim, Myung-Soon;Lee, Sun-Woo;Kim, Keun-Ha;Kim, In-Gyu;An, Sung-Whan
    • Korean Society for Microbiology and Biotechnology: Conference Proceedings
    • /
    • Korean Society for Microbiology and Biotechnology, 2004 Annual Meeting BioExibition International Symposium
    • /
    • pp.258-265
    • /
    • 2004
  • Numerous genes of Escherichia coli have been shown to exhibit growth phase-dependent expression. We examined the global patterns of growth phase-dependent gene expression of E. coli throughout growth using oligonucleotide microarrays containing a nearly complete set of 4,289 annotated open reading frames. To determine the change of gene expression throughout growth, we compared RNAs taken from time courses with a common reference RNA composed of equal amounts of RNA pooled from each time point. Hierarchical clustering of the conditions according to time-course expression revealed that the growth phases clustered into four classes, consistent with known physiological growth status. We analyzed the differences in expression levels at the genome level in both exponential and stationary growth phase cultures. Statistical analysis showed that 213 genes exhibit growth phase-dependent expression. We also analyzed the expression of 256 known operons and 208 regulatory genes. To assess the global impact of RpoS, we identified 193 genes coregulated with rpoS and examined their expression levels in the isogenic rpoS mutant. The results revealed that 99 of the 193 were novel RpoS-dependent stationary phase-induced genes, the majority of which are functionally unknown. Our data indicate that global changes and adjustments of gene expression are coordinately regulated by growth transition in E. coli.

Genetic Characteristics of Porcine Epidemic Diarrhea Virus Isolated in Korea

  • 지영철;권혁무;정현규;한정희
    • Korean Journal of Veterinary Research
    • /
    • Vol. 43, No. 2
    • /
    • pp.219-230
    • /
    • 2003
  • Porcine epidemic diarrhea virus (PEDV), a member of Coronaviridae, is the etiological agent of enteropathogenic diarrhea in swine. The purpose of this study was to investigate the genetic characteristics of PEDV isolated in Korea. The nucleocapsid (N) gene and membrane (M) gene of recent Korean PEDV strains isolated in 2001 were amplified, cloned, sequenced, and analyzed. The N gene of seven Korean PEDV field isolates had 94.5% to 99.4% nucleotide and 92.4% to 99.4% amino acid sequence homology with each other. Nucleotide and amino acid sequences of Korean field PEDVs were different from published foreign PEDVs, showing 95.1% to 98.0% nucleotide and 93.5% to 97.6% amino acid sequence homology. By phylogenetic tree analysis based on nucleotide sequences, PEDVs were clustered into four groups. By phylogenetic tree analysis based on amino acid sequences, PEDVs were clustered into five groups. The M gene of our Korean PEDV field isolates had 99.6% to 100% nucleotide and 98.7% to 100% amino acid sequence homology with each other. Nucleotide and amino acid sequences of Korean field PEDVs were different from published foreign PEDVs, showing 98.5% to 98.8% nucleotide and 97.3% to 97.8% amino acid sequence homology. By phylogenetic tree analysis based on nucleotide and amino acid sequences, PEDVs were clustered into two groups: a Korean PEDV isolate group and a foreign PEDV isolate group.
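The percentage figures in this abstract are pairwise sequence similarity values. As a minimal sketch of how such a figure can be computed, the function below returns percent identity for two sequences that are already aligned to the same length. The abstract does not say which software or formula the authors used, so the gap handling (gapped columns are skipped) and the example sequences are assumptions.

```python
def percent_identity(a: str, b: str) -> float:
    """Percent of identical positions between two pre-aligned sequences.

    Columns where either sequence has a gap ('-') are skipped
    (an assumption; other conventions count gaps as mismatches).
    """
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    compared = 0
    matches = 0
    for x, y in zip(a, b):
        if x == "-" or y == "-":
            continue
        compared += 1
        if x == y:
            matches += 1
    if compared == 0:
        raise ValueError("no ungapped columns to compare")
    return 100.0 * matches / compared


# Hypothetical aligned fragments differing at one of ten positions.
pid = percent_identity("ATGGCTAGCT", "ATGGCTAGAT")
```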

Complete Chloroplast Genome Assembly and Annotation of Milk Thistle (Silybum marianum) and Phylogenetic Analysis

  • Hwajin Jung;Yedomon Ange Bovys Zoclanclounon;Jeongwoo Lee;Taeho Lee;Jeonggu Kim;Guhwang Park;Keunpyo Lee;Kwanghoon An;Jeehyoung Shim;Joonghyoun Chin;Suyoung Hong
    • Korean Society of Crop Science: Conference Proceedings
    • /
    • Korean Society of Crop Science, 2022 Fall Conference
    • /
    • pp.210-210
    • /
    • 2022
  • Silybum marianum is an annual or biennial plant from the Asteraceae family. It can grow in low-nutrient soil and drought conditions, making it easy to cultivate. The seed produces a specialized plant metabolite called silymarin (a flavonolignan complex), which is known to alleviate liver damage from hepatitis and toxins. To infer the phylogenetic placement of a Korean milk thistle, we conducted a chloroplast assembly and annotation followed by a comparison with the existing Chinese reference genome (NC_028027). The chloroplast genome structure was highly similar, with assembly sizes of 152,642 bp and 153,202 bp for Korean and Chinese milk thistle, respectively. Moreover, there were similarities at the gene level: coding sequences (n = 82), transfer RNA (n = 31), and ribosomal RNA (n = 4). Phylogenetic tree inference from the full set of coding sequences placed the Korean cultivar in the milk thistle clade, corroborating the expected tree. Moreover, a tree based only on the ycf1 gene confirmed the same topology, suggesting that the ycf1 gene is a potential marker for DNA barcoding and population diversity studies in the milk thistle genus. Overall, the provided data represent a valuable resource for population genomics and species-level determination, since several species have been reported in the Silybum genus.

Genomic Tree of Gene Contents Based on Functional Groups of KEGG Orthology

  • Kim Jin-Sik;Lee Sang-Yup
    • Journal of Microbiology and Biotechnology
    • /
    • Vol. 16, No. 5
    • /
    • pp.748-756
    • /
    • 2006
  • We propose a genome-scale clustering approach to identify whole-genome relationships using the functional groups given by the Kyoto Encyclopedia of Genes and Genomes Orthology (KO) database. The metabolic capabilities of each organism were defined by the number of genes in each functional category. The archaeal, bacterial, and eukaryotic genomes were compared by applying a two-step clustering method, comprising a self-organizing tree algorithm followed by unsupervised hierarchical clustering. The clustering results were consistent with various phenotypic characteristics of the organisms analyzed and, additionally, showed a different aspect of the relationships between genomes from those previously established through rRNA-based comparisons. The proposed approach of collecting and clustering the metabolic functional capabilities of organisms should be a useful tool for predicting relationships among organisms.