• Title/Summary/Keyword: genome annotation

Standard-based Integration of Heterogeneous Large-scale DNA Microarray Data for Improving Reusability

  • Jung, Yong;Seo, Hwa-Jeong;Park, Yu-Rang;Kim, Ji-Hun;Bien, Sang Jay;Kim, Ju-Han
    • Genomics & Informatics
    • /
    • v.9 no.1
    • /
    • pp.19-27
    • /
    • 2011
  • Gene Expression Omnibus (GEO) holds the largest collection of gene-expression microarray data, and that collection has grown exponentially. Microarray data in GEO have been generated in many different formats and often lack standardized annotation and documentation. It is hard to know whether preprocessing has been applied to a dataset and, if so, in what way. Standard-based integration of heterogeneous data formats and metadata is necessary for comprehensive data query, analysis and mining. We attempted to integrate the heterogeneous microarray data in GEO based on the Minimum Information About a Microarray Experiment (MIAME) standard. We unified the data fields of the GEO Data table and mapped the attributes of GEO metadata onto MIAME elements. We also distinguished non-preprocessed raw datasets from processed and other datasets by using a two-step classification method. Most of the procedures were developed as semi-automated algorithms with some degree of text mining. We localized 2,967 Platforms, 4,867 Series and 103,590 Samples covering 279 organisms, integrated them into a standard-based relational schema and developed a comprehensive query interface for extracting them. Our tool, GEOQuest, is available at http://www.snubi.org/software/GEOQuest/.

Constructing Proteome Reference Map of the Porcine Jejunal Cell Line (IPEC-J2) by Label-Free Mass Spectrometry

  • Kim, Sang Hoon;Pajarillo, Edward Alain B.;Balolong, Marilen P.;Lee, Ji Yoon;Kang, Dae-Kyung
    • Journal of Microbiology and Biotechnology
    • /
    • v.26 no.6
    • /
    • pp.1124-1131
    • /
    • 2016
  • In this study, the global proteome of the IPEC-J2 cell line was evaluated using ultra-high performance liquid chromatography coupled to a quadrupole Q Exactive Orbitrap mass spectrometer. Proteins were isolated from highly confluent IPEC-J2 cells in biological replicates and analyzed by label-free mass spectrometry prior to matching against a porcine genomic dataset. The results identified 1,517 proteins, accounting for 7.35% of all genes in the porcine genome. The highly abundant proteins detected, such as actin, annexin A2, and AHNAK nucleoprotein, are involved in structural integrity, signaling mechanisms, and cellular homeostasis. The high abundance of heat shock proteins indicated their significance in cellular defenses, barrier function, and gut homeostasis. Pathway analysis and annotation using the Kyoto Encyclopedia of Genes and Genomes database resulted in a putative protein network map of the regulation of immunological responses and structural integrity in the cell line. The comprehensive proteome analysis of IPEC-J2 cells provides fundamental insights into overall protein expression and pathway dynamics that might be useful in cell adhesion studies and immunological applications.

Computational Detection of Prokaryotic Core Promoters in Genomic Sequences

  • Kim Ki-Bong;Sim Jeong Seop
    • Journal of Microbiology
    • /
    • v.43 no.5
    • /
    • pp.411-416
    • /
    • 2005
  • The high-throughput sequencing of microbial genomes has resulted in the relatively rapid accumulation of an enormous amount of genomic sequence data. In this context, the problem of detecting promoters in genomic DNA sequences via computational methods has attracted considerable research attention in recent years. This paper addresses the development of a predictive model, known as the dependence decomposition weight matrix model (DDWMM), which was designed to detect the core promoter region, including the -10 region and the transcription start sites (TSSs), in prokaryotic genomic DNA sequences. This is an issue of some importance with regard to genome annotation efforts. Our predictive model captures the most significant dependencies between positions (allowing for non-adjacent as well as adjacent dependencies) via the maximal dependence decomposition (MDD) procedure, which iteratively decomposes data sets into subsets based on the significant dependence between positions in the promoter region to be modeled. Such dependencies may be intimately related to biological and structural concerns, since promoter elements are present in a variety of combinations, which are separated by various distances. In this respect, the DDWMM may prove to be appropriate for the detection of core promoter regions and TSSs in long microbial genomic contigs. In order to demonstrate the effectiveness of our predictive model, we performed 10-fold cross-validation experiments on 607 experimentally verified promoter sequences, which showed good performance in terms of sensitivity.
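
The weight matrix idea that the DDWMM builds on can be illustrated with a minimal sketch. This is not the authors' model: it shows only a single log-odds weight matrix with a uniform background, whereas the abstract describes using MDD to split the training set into subsets according to dependencies between positions. The toy sites are made-up variants of the well-known -10 consensus TATAAT, and all function names are hypothetical.

```python
import math

BASES = "ACGT"

def build_pwm(sites, pseudocount=1.0):
    """Build a log-odds weight matrix from aligned, equal-length sites."""
    length = len(sites[0])
    pwm = []
    for pos in range(length):
        counts = {base: pseudocount for base in BASES}
        for site in sites:
            counts[site[pos]] += 1
        total = sum(counts.values())
        # Log-odds against a uniform background of 0.25 per base (an assumption).
        pwm.append({base: math.log2((counts[base] / total) / 0.25) for base in BASES})
    return pwm

def score_window(pwm, window):
    """Sum the per-position log-odds for one window of the same width as the matrix."""
    return sum(column[base] for column, base in zip(pwm, window))

def best_hit(pwm, sequence):
    """Slide the matrix along the sequence and return (best score, start index)."""
    width = len(pwm)
    return max((score_window(pwm, sequence[i:i + width]), i)
               for i in range(len(sequence) - width + 1))

# Made-up training sites for illustration only.
TOY_SITES = ["TATAAT", "TATGAT", "TACAAT", "TATACT"]
TOY_PWM = build_pwm(TOY_SITES)

score, position = best_hit(TOY_PWM, "GGCCTATAATGCGC")
print(position, round(score, 2))
```

A single matrix like this treats positions as independent; an MDD-style model would instead train one such matrix per subset of the training data.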

OryzaGP 2021 update: a rice gene and protein dataset for named-entity recognition

  • Larmande, Pierre;Liu, Yusha;Yao, Xinzhi;Xia, Jingbo
    • Genomics & Informatics
    • /
    • v.19 no.3
    • /
    • pp.27.1-27.4
    • /
    • 2021
  • Due to the rapid evolution of high-throughput technologies, a tremendous amount of data is being produced in the biological domain, which poses a challenging task for information extraction and natural language understanding. Biological named entity recognition (NER) and named entity normalisation (NEN) are two common tasks aiming at identifying and linking biologically important entities such as genes or gene products mentioned in the literature to biological databases. In this paper, we present an updated version of OryzaGP, a gene and protein dataset for rice species created to help natural language processing (NLP) tools in processing NER and NEN tasks. To create the dataset, we selected more than 15,000 abstracts associated with articles previously curated for rice genes. We developed four dictionaries of gene and protein names associated with database identifiers. We used these dictionaries to annotate the dataset. We also annotated the dataset using pretrained NLP models. Finally, we analysed the annotation results and discussed how to improve OryzaGP.
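
The dictionary-based annotation step can be sketched as follows. This is a minimal illustration and not the OryzaGP pipeline: the dictionary contents, the identifiers, and the function name `annotate` are hypothetical, and real gene-name matching usually needs synonym handling and disambiguation that are omitted here.

```python
import re

# Hypothetical dictionary: gene or protein name -> database identifier.
# The identifiers are made up for illustration.
GENE_DICT = {
    "OsPIP1;3": "GENE:0001",
    "Hd3a": "GENE:0002",
}

def annotate(text, dictionary):
    """Return (start, end, mention, identifier) for each dictionary name found in text."""
    # Try longer names first so that a long name wins over a shorter prefix.
    names = sorted(dictionary, key=len, reverse=True)
    # Lookarounds keep a name from matching inside a longer alphanumeric token.
    pattern = re.compile(
        r"(?<!\w)(?:" + "|".join(re.escape(name) for name in names) + r")(?!\w)"
    )
    return [
        (m.start(), m.end(), m.group(), dictionary[m.group()])
        for m in pattern.finditer(text)
    ]

hits = annotate("Expression of OsPIP1;3 increased under cold stress.", GENE_DICT)
print(hits)
```

The character offsets make the output usable as stand-off annotation, which is the usual form for NER and NEN datasets.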

Improving classification of low-resource COVID-19 literature by using Named Entity Recognition

  • Lithgow-Serrano, Oscar;Cornelius, Joseph;Kanjirangat, Vani;Mendez-Cruz, Carlos-Francisco;Rinaldi, Fabio
    • Genomics & Informatics
    • /
    • v.19 no.3
    • /
    • pp.22.1-22.5
    • /
    • 2021
  • Automatic document classification for highly interrelated classes is a demanding task that becomes more challenging when there is little labeled data for training. Such is the case of the coronavirus disease 2019 (COVID-19) clinical repository (a repository of classified and translated academic articles related to COVID-19 and relevant to clinical practice), where a 3-way classification scheme is being applied to COVID-19 literature. During the 7th Biomedical Linked Annotation Hackathon (BLAH7), we performed experiments to explore the use of named-entity recognition (NER) to improve the classification. We processed the literature with OntoGene's Biomedical Entity Recogniser (OGER) and used the resulting identified Named Entities (NE) and their links to major biological databases as extra input features for the classifier. We compared the results with a baseline model without the OGER-extracted features. In these proof-of-concept experiments, we observed a clear gain in COVID-19 literature classification. In particular, the NE's origin was useful for classifying document types and the NE's type for classifying clinical specialties. Due to the limitations of the small dataset, we can only conclude that our results suggest that NER would benefit this classification task. In order to accurately estimate this benefit, further experiments with a larger dataset would be needed.

Tag-SNP selection and online database construction for haplotype-based marker development in tomato

  • Jeong, Hye-ri;Lee, Bo-Mi;Lee, Bong-Woo;Oh, Jae-Eun;Lee, Jeong-Hee;Kim, Ji-Eun;Jo, Sung-Hwan
    • Journal of Plant Biotechnology
    • /
    • v.47 no.3
    • /
    • pp.218-226
    • /
    • 2020
  • This report describes methods for selecting informative single nucleotide polymorphisms (SNPs) and the development of an online Solanaceae genome database, using 234 tomato resequencing data entries deposited in the NCBI SRA database. The analysis included 126 accessions of Solanum lycopersicum, 68 accessions of Solanum lycopersicum var. cerasiforme, and 33 accessions of Solanum pimpinellifolium, which are frequently used for breeding, as well as some wild-species tomato accessions. To select tag-SNPs, we identified 29,504,960 SNPs in the 234 tomatoes and then separated them into genic and intergenic SNPs according to gene annotation. Tag-SNPs were selected from the non-synonymous SNPs present in each gene region, which yielded tag-SNPs for 13,845 genes. When a gene had no non-synonymous SNPs, its tag-SNPs were selected from synonymous SNPs. The total number of tag-SNPs selected was 27,539. To increase the usefulness of the information, a Solanaceae genome database website, TGsol (http://tgsol.seeders.co.kr/), was constructed to allow users to search for detailed information on resources, SNPs, haplotypes, and tag-SNPs. Users can retrieve the tag-SNPs and flanking sequences for each gene by searching for a gene name or gene position through the genome browser. This website can be used to efficiently search for genes related to traits or to develop molecular markers.
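
The priority rule described above (non-synonymous SNPs first, synonymous SNPs only as a fallback) can be sketched as follows. This covers only the choice of the candidate pool per gene; how a tag-SNP is then picked from that pool to represent a haplotype is not stated in the abstract and is not modeled. The record layout, the gene names, and the function name are hypothetical.

```python
from collections import defaultdict

def tag_snp_candidates(snps):
    """Group genic SNPs by gene and apply the non-synonymous-first rule."""
    by_gene = defaultdict(list)
    for snp in snps:
        if snp["gene"] is not None:  # intergenic SNPs are set aside
            by_gene[snp["gene"]].append(snp)

    pools = {}
    for gene, gene_snps in by_gene.items():
        nonsyn = [s["id"] for s in gene_snps if s["effect"] == "non-synonymous"]
        syn = [s["id"] for s in gene_snps if s["effect"] == "synonymous"]
        # Fall back to synonymous SNPs only when no non-synonymous SNP exists.
        pools[gene] = nonsyn if nonsyn else syn
    return pools

# Made-up records for illustration only.
TOY_SNPS = [
    {"id": "snp1", "gene": "geneA", "effect": "non-synonymous"},
    {"id": "snp2", "gene": "geneA", "effect": "synonymous"},
    {"id": "snp3", "gene": "geneB", "effect": "synonymous"},
    {"id": "snp4", "gene": None, "effect": "intergenic"},
]

print(tag_snp_candidates(TOY_SNPS))
```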

TEST DB: The intelligent data management system for Toxicogenomics

  • Lee, Wan-Seon;Jeon, Ki-Seon;Um, Chan-Hwi;Hwang, Seung-Young;Jung, Jin-Wook;Kim, Seung-Jun;Kang, Kyung-Sun;Park, Joon-Suk;Hwang, Jae-Woong;Kang, Jong-Soo;Lee, Gyoung-Jae;Chon, Kum-Jin;Kim, Yang-Suk
    • Proceedings of the Korean Society for Bioinformatics Conference
    • /
    • 2003.10a
    • /
    • pp.66-72
    • /
    • 2003
  • Toxicogenomics is emerging as one of the most important genomics applications because toxicity tests based on gene expression profiles are expected to be more precise and efficient than the current histopathological approach in the pre-clinical phase. One of the challenges in Toxicogenomics is the construction of an intelligent database management system that can deal with very heterogeneous and complex data from many different experimental and information sources. Here we present a new Toxicogenomics database developed as part of the 'Toxicogenomics for Efficient Safety Test (TEST) project'. The TEST database focuses especially on the connectivity of heterogeneous data and on an intelligent query system that helps users draw insight from complex data sets. The database deals with four kinds of information: compound information, histopathological information, gene expression information, and annotation information. Currently, the TEST database holds Toxicogenomics information for 12 molecules in 4 efficacy classes: anticancer, antibiotic, hypotension, and gastric ulcer. Users can easily access all kinds of detailed information about these compounds and, at the same time, check the confidence of the retrieved information by browsing the quality of the experimental data and the toxicity grade of each gene generated by our toxicology annotation system. The intelligent query system is designed for multiple comparisons of experimental data, because comparing experimental data by histopathological toxicity, compound, efficacy, and individual variation is crucial for finding common genetic characteristics. The system can be a good information source for studying toxicological mechanisms at the genome-wide level and can also be used for the design of toxicity test chips.

A Eukaryotic Gene Structure Prediction Program Using Duration HMM

  • Tae, Hong-Seok;Park, Gi-Jeong
    • Korean Journal of Microbiology
    • /
    • v.39 no.4
    • /
    • pp.207-215
    • /
    • 2003
  • Gene structure prediction, which predicts the protein-coding regions in a given nucleotide sequence, is the most important step in annotating genes and greatly affects gene analysis and genome annotation. As eukaryotic genes have more complicated structures in DNA sequences than prokaryotic genes, programs for eukaryotic gene structure prediction use more diverse and more complicated computational models. We have developed EGSP, a eukaryotic gene structure prediction program, using a duration hidden Markov model. The program consists of two major processes: a training process that produces parameter values from training data sets, and a prediction process that predicts protein-coding regions based on those parameter values. The program predicts multiple genes rather than a single gene from a DNA sequence. Several computational models were implemented to detect signal patterns, and their scanning efficiency was tested. Prediction performance was calculated and compared, on several criteria, with that of a few commonly used programs: GenScan, GeneID and Morgan. The results show that the program can be used in practice both as a stand-alone program and as a module in a larger system. For gene prediction in eukaryotic microbial genomes, training and prediction analyses were carried out with Saccharomyces chromosomes, and the results show that the program is applicable in practice to real eukaryotic microbial genomes.
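
The decoding idea behind a duration hidden Markov model can be sketched as a Viterbi search over segments rather than single positions. This is a generic illustration under stated assumptions, not the EGSP implementation: the two states, the two-letter toy alphabet, and every probability below are made up, and real gene models add signal sensors (for example, splice sites) that are omitted here.

```python
import math

def segment(seq, states, start, trans, emit, dur):
    """Most probable segmentation of seq under an explicit-duration HMM.

    start[s]    probability that the first segment is in state s
    trans[r][s] probability of switching from state r to state s (r != s)
    emit[s][x]  probability that state s emits symbol x
    dur[s][d]   probability that a segment in state s lasts d symbols
    """
    def log(p):
        return math.log(p) if p > 0 else float("-inf")

    n = len(seq)
    # best[t][s]: best log-probability of seq[:t] whose last segment is in s and ends at t.
    best = [{s: float("-inf") for s in states} for _ in range(n + 1)]
    back = [{s: None for s in states} for _ in range(n + 1)]

    for t in range(1, n + 1):
        for s in states:
            emit_sum = 0.0
            for d in range(1, t + 1):
                # Extend the candidate segment one symbol to the left.
                emit_sum += log(emit[s].get(seq[t - d], 0.0))
                if d not in dur[s]:
                    continue
                seg = emit_sum + log(dur[s][d])
                if d == t:  # the segment covers the whole prefix
                    options = [(log(start[s]) + seg, None)]
                else:
                    options = [(best[t - d][r] + log(trans[r].get(s, 0.0)) + seg, r)
                               for r in states if r != s]
                for cand, prev in options:
                    if cand > best[t][s]:
                        best[t][s] = cand
                        back[t][s] = (d, prev)

    state = max(states, key=lambda s: best[n][s])
    if back[n][state] is None:
        raise ValueError("no valid segmentation under the given parameters")
    segments, t = [], n
    while t > 0:
        d, prev = back[t][state]
        segments.append((state, t - d, t))
        t, state = t - d, prev
    return segments[::-1]

# Toy parameters, made up for illustration only.
STATES = ["noncoding", "coding"]
START = {"noncoding": 0.5, "coding": 0.5}
TRANS = {"noncoding": {"coding": 1.0}, "coding": {"noncoding": 1.0}}
EMIT = {"noncoding": {"A": 0.9, "G": 0.1}, "coding": {"A": 0.1, "G": 0.9}}
DUR = {s: {1: 0.25, 2: 0.25, 3: 0.25, 4: 0.25} for s in STATES}

print(segment("AAAAGGGG", STATES, START, TRANS, EMIT, DUR))
```

Unlike a standard HMM, where segment lengths are implicitly geometric, the `dur` table lets each state carry its own length distribution, which is the property that makes this family of models attractive for exons and introns.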

Locating QTLs controlling overwintering seedling rate in perennial glutinous rice 89-1 (Oryza sativa L.)

  • Deng, Xiaoshu;Gan, Lu;Liu, Yan;Luo, Ancai;Jin, Liang;Chen, Jiao;Tang, Ruyu;Lei, Lixia;Tang, Jianghong;Zhang, Jiani;Zhao, Zhengwu
    • Genes and Genomics
    • /
    • v.40 no.12
    • /
    • pp.1351-1361
    • /
    • 2018
  • A new cold-tolerant germplasm resource named glutinous rice 89-1 (Gr89-1, Oryza sativa L.) can overwinter using axillary buds, with these buds being ratooned the following year. The overwintering seedling rate (OSR) is an important factor for evaluating cold tolerance. Many quantitative trait loci (QTLs) controlling cold tolerance at different growth stages in rice have been identified, and some of these QTLs have been successfully cloned. However, no QTLs conferring the OSR trait have been located in perennial O. sativa L. To identify QTLs associated with OSR and to evaluate cold tolerance, 286 $F_{12}$ recombinant inbred lines (RILs) derived from a cross between the cold-tolerant variety Gr89-1 and the cold-sensitive variety Shuhui527 (SH527) were used. A total of 198 polymorphic simple sequence repeat (SSR) markers distributed uniformly on 12 chromosomes were used to construct the linkage map. The gene ontology (GO) annotation of the major QTL was performed through the rice genome annotation project system. Three main-effect QTLs (qOSR2, qOSR3, and qOSR8) were detected and mapped on chromosomes 2, 3, and 8, respectively. These QTLs were located in the intervals RM14208 (35,160,202 base pairs (bp))-RM208 (35,520,147 bp), RM218 (8,375,236 bp)-RM232 (9,755,778 bp), and RM5891 (24,626,930 bp)-RM23608 (25,355,519 bp), and explained 19.6%, 9.3%, and 11.8% of the phenotypic variation, respectively. The qOSR2 QTL displayed the largest effect, with a logarithm of odds (LOD) score of 5.5. A total of 47 candidate genes at the qOSR2 locus were associated with 219 GO terms. Among these candidate genes, 11 were related to the cell membrane, 7 were associated with cold stress, and 3 were involved in response to stress and biotic stimulus. OsPIP1;3 was the only candidate gene that was related to stress, biotic stimulus, and cold stress and that also encoded a cell membrane protein.
After QTL mapping, a total of three main-effect QTLs (qOSR2, qOSR3, and qOSR8) were detected on chromosomes 2, 3, and 8, respectively. Among these, qOSR2 explained the highest phenotypic variance. The favorable alleles at all three QTLs came from the cold-resistant parent Gr89-1. OsPIP1;3 might be a candidate gene of qOSR2.
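
The interval sizes implied by the marker positions above can be worked out directly. The sketch below only restates figures from the abstract; the combined explained variance is a naive sum that assumes the three QTLs act additively and independently, which the abstract does not claim, so it should be read as a rough figure. The dictionary layout and function name are hypothetical.

```python
# Marker positions (bp) and phenotypic variance explained (%), as quoted in the abstract.
QTLS = {
    "qOSR2": {"left": 35_160_202, "right": 35_520_147, "pve": 19.6},
    "qOSR3": {"left": 8_375_236, "right": 9_755_778, "pve": 9.3},
    "qOSR8": {"left": 24_626_930, "right": 25_355_519, "pve": 11.8},
}

def interval_bp(qtl):
    """Physical length of the marker interval in base pairs."""
    return qtl["right"] - qtl["left"]

for name, qtl in QTLS.items():
    print(f"{name}: {interval_bp(qtl):,} bp ({interval_bp(qtl) / 1e6:.2f} Mb), "
          f"{qtl['pve']}% of phenotypic variation")

# Naive additive total; an assumption, not a result reported in the abstract.
total_pve = round(sum(q["pve"] for q in QTLS.values()), 1)
print(f"combined (naive sum): {total_pve}%")
```

On these figures qOSR2 has both the largest effect and the narrowest interval (about 0.36 Mb), which is consistent with it being the locus chosen for candidate gene annotation.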

Current status and prospects of molecular marker development for systematic breeding program in citrus

  • Kim, Ho Bang;Kim, Jae Joon;Oh, Chang Jae;Yun, Su-Hyun;Song, Kwan Jeong
    • Journal of Plant Biotechnology
    • /
    • v.43 no.3
    • /
    • pp.261-271
    • /
    • 2016
  • Citrus is an economically important fruit crop grown worldwide. However, citrus production largely depends on natural hybrid selection and bud sport mutation. Unique botanical features, including long juvenility, polyembryony, and QTL control of major agronomic traits, can hinder the development of superior varieties by conventional breeding. Diverse factors, including drastic changes in the citrus production environment due to global warming and changes in market trends, require a systematic molecular breeding program for early selection of elite candidates with target traits, sustainable production of high-quality fruits, cultivar diversification, and cost-effective breeding. Since the construction of the first genetic linkage map using isozymes, citrus scientists have constructed linkage maps using various DNA-based markers, developed molecular markers related to biotic and abiotic stresses, polyembryony, fruit coloration, seedlessness, male sterility, acidlessness, morphology, fruit quality, seed number, yield, and early fruit setting traits, and mapped QTLs on genetic maps. Genes closely related to CTV resistance and flesh color have been cloned. SSR markers for identifying zygotic and nucellar individuals will contribute to cost-effective breeding. The two high-quality citrus reference genomes recently released are being efficiently used for genomics-based molecular breeding, such as construction of reference linkage/physical maps and comparative genome mapping. In the near future, the development of DNA molecular markers tightly linked to various agronomic traits and the cloning of useful and/or variant genes will be accelerated through comparative genome analysis using citrus core collections and genome-wide approaches such as genotyping-by-sequencing and genome-wide association studies.