• Title/Summary/Keyword: protein sequences


A Management Technique for Protein Version Information based on Local Sequence Alignment and Trigger (로컬 서열 정렬과 트리거 기반의 단백질 버전 정보 관리 기법)

  • Jung Kwang-Su;Park Sung-Hee;Ryu Keun-Ho
    • The KIPS Transactions:PartD / v.12D no.1 s.97 / pp.51-62 / 2005
  • Once the function of an amino acid sequence is known, the function of other sequences with a similar composition can be inferred. A protein of known function can also be altered into a useful protein by genetic engineering. In this process, one original amino acid sequence gives rise to various protein sequences with different compositions, so a systematic technique is needed to manage these version sequences and their reference data. In this paper, we propose a technique for managing protein version sequences based on local sequence alignment, and a technique for managing historical protein reference data using triggers. The method automatically determines the similarity between the original sequence and each version sequence as the version sequences are stored in the database. It also reduces the storage space needed for protein sequences. By storing the historical information of a protein and analyzing the changes in its sequence, we expect that new useful proteins and drugs can be discovered from the analysis of version sequences.
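
The similarity check described in this abstract can be illustrated with a minimal sketch of local alignment scoring (Smith-Waterman with a linear gap penalty). The scoring values, the function names, and the normalisation into a 0..1 similarity are assumptions made for illustration; the abstract does not state which scoring scheme the authors used.

```python
# Illustrative scoring values; the paper's actual parameters are not given.
MATCH, MISMATCH, GAP = 2, -1, -2


def local_alignment_score(a: str, b: str) -> int:
    """Best Smith-Waterman local alignment score between a and b."""
    prev = [0] * (len(b) + 1)  # previous row of the DP matrix
    best = 0
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (MATCH if a[i - 1] == b[j - 1] else MISMATCH)
            # A local alignment may restart anywhere, hence the floor at 0.
            curr[j] = max(0, diag, prev[j] + GAP, curr[j - 1] + GAP)
            best = max(best, curr[j])
        prev = curr
    return best


def similarity(original: str, version: str) -> float:
    """Score divided by the best score the shorter sequence could reach."""
    shorter = min(len(original), len(version))
    if shorter == 0:
        return 0.0
    return local_alignment_score(original, version) / (MATCH * shorter)


print(similarity("MKTAYIAK", "MKTAHIAK"))  # one substituted residue
```

In a database setting, a trigger fired on insert could call a routine like `similarity` to record how far each version has drifted from its original; that wiring is not shown here.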

Protein Sequence Search based on N-gram Indexing

  • Hwang, Mi-Nyeong;Kim, Jin-Suk
    • Bioinformatics and Biosystems / v.1 no.1 / pp.46-50 / 2006
  • With advances in experimental techniques in molecular biology, genomic and protein sequence databases are growing exponentially in size, and mean sequence lengths are also increasing. As these databases grow, it becomes difficult to search them for sequences with significant homology to a query sequence. In this paper, we present an N-gram indexing method that retrieves similar sequences quickly and precisely, with accuracy comparable to existing tools. The method regards a protein sequence as text written in a language of 20 amino acid codes and uses fixed-length N-gram tokens as its indexing scheme for sequence strings. Once these tokens are indexed for all sequences in the database, sequences can be searched with information retrieval algorithms. Using this method, we developed a protein sequence search system named ProSeS (PROtein Sequence Search). ProSeS is a protein sequence analysis system that provides overall analysis results such as similar sequences with significant homology, predicted subcellular locations of the query sequence, and major keywords extracted from the annotations of similar sequences. We show experimentally that the N-gram indexing approach significantly reduces retrieval time and is as accurate as the popular search tool BLAST.
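
The indexing idea in this abstract can be sketched as a small inverted index over fixed-length N-grams. This is an illustration only, not the ProSeS implementation: the token length of 3, the function names, and the ranking by number of shared tokens are all assumptions.

```python
from collections import Counter, defaultdict


def ngrams(seq: str, n: int = 3) -> list[str]:
    """All overlapping fixed-length tokens of a sequence."""
    return [seq[i:i + n] for i in range(len(seq) - n + 1)]


def build_index(db: dict[str, str], n: int = 3) -> dict[str, set[str]]:
    """Inverted index: N-gram token -> ids of sequences containing it."""
    index: dict[str, set[str]] = defaultdict(set)
    for seq_id, seq in db.items():
        for token in ngrams(seq, n):
            index[token].add(seq_id)
    return index


def search(index: dict[str, set[str]], query: str, n: int = 3) -> list[tuple[str, int]]:
    """Rank database sequences by how many distinct query tokens they share."""
    hits: Counter[str] = Counter()
    for token in set(ngrams(query, n)):
        for seq_id in index.get(token, ()):
            hits[seq_id] += 1
    return hits.most_common()


# Toy database with made-up sequences.
db = {"p1": "MKTAYIAKQR", "p2": "MKTAHIAKQR", "p3": "GGGGSGGGGS"}
index = build_index(db)
print(search(index, "MKTAYIAK"))
```

Because the index is built once, each query only touches the posting lists of its own tokens rather than scanning every sequence, which is where the retrieval-time saving comes from.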


Prediction of Protein Secondary Structure Using the Weighted Combination of Homology Information of Protein Sequences (단백질 서열의 상동 관계를 가중 조합한 단백질 이차 구조 예측)

  • Chi, Sang-mun
    • Journal of the Korea Institute of Information and Communication Engineering / v.20 no.9 / pp.1816-1821 / 2016
  • Protein secondary structure is important for the study of protein evolution, structure, and function, since proteins play crucial roles in most biological processes. This paper aims to extract secondary structure information effectively from a large protein structure database in order to predict the secondary structure of a query protein sequence. To find more remote homologs of the query sequence in the protein database, we used PSI-BLAST, which performs gapped iterative searches and uses profiles built from homologous protein sequences of the query. The secondary structures of the homologous sequences are combined into the prediction with weights according to their relative degree of similarity to the query sequence. When the homologous sequences were used together with a neural network predictor, the accuracies were higher than those of current state-of-the-art techniques, reaching a Q3 accuracy of 92.28% and a Q8 accuracy of 88.79%.
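
The weighted combination step can be sketched as a per-position weighted vote. This is a simplified illustration that leaves out PSI-BLAST and the neural network predictor. It assumes the homolog structures have already been aligned to the query, with `-` marking positions that have no aligned residue, and it falls back to coil (`C`) where no homolog covers a position; the function names and the toy weights are hypothetical.

```python
from collections import defaultdict


def weighted_vote(aligned: list[tuple[float, str]], length: int) -> str:
    """Combine homolog secondary structures, weighted by similarity to the query."""
    prediction = []
    for pos in range(length):
        scores: dict[str, float] = defaultdict(float)
        for weight, structure in aligned:
            state = structure[pos]
            if state != "-":  # skip positions the homolog does not cover
                scores[state] += weight
        # Assumed fallback: predict coil when no homolog covers the position.
        prediction.append(max(scores, key=scores.get) if scores else "C")
    return "".join(prediction)


def q3_accuracy(predicted: str, observed: str) -> float:
    """Percentage of residues whose three-state label is predicted correctly."""
    correct = sum(p == o for p, o in zip(predicted, observed))
    return 100.0 * correct / len(observed)


# Toy input: (similarity weight, aligned three-state structure string).
homologs = [(0.9, "HHHEC"), (0.5, "HHCEC"), (0.3, "-ECEE")]
predicted = weighted_vote(homologs, 5)
print(predicted, q3_accuracy(predicted, "HHHCC"))
```

Q8 accuracy is computed the same way over the eight-state alphabet.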

Retrieving Protein Domain Encoding DNA Sequences Automatically Through Database Cross-referencing

  • Choi, Yoon-Sup;Yang, Jae-Seong;Ryu, Sung-Ho;Kim, Sang-Uk
    • Bioinformatics and Biosystems / v.1 no.2 / pp.95-98 / 2006
  • Recent proteomic studies of protein domains require high-throughput, systematic approaches. Since most experiments using protein domains, the modules of protein-protein interactions, require gene cloning, the first experimental step is to retrieve the DNA sequences of domain-encoding regions from databases. For large-scale proteomic research, however, manually extracting a large number of domain sequences from several interlinked databases is laborious. We present a new methodology for retrieving the DNA sequences of domain-encoding regions through automatic database cross-referencing. To extract protein domain-encoding regions, it traverses several interconnected databases with a validation process. We applied this method to retrieve all the EGF domain-encoding DNA sequences of Homo sapiens. The algorithm was implemented using the Python library PAMIE, which enables automatic cross-referencing across distinct databases.
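
One step that any such retrieval needs is the mapping from a domain's amino acid coordinates to the nucleotides that encode it. The sketch below shows only that coordinate arithmetic. It does not reproduce the authors' database traversal or use PAMIE, and it assumes an intron-free, in-frame coding sequence with 1-based inclusive domain positions; the function name is hypothetical.

```python
def domain_encoding_region(cds: str, domain_start: int, domain_end: int) -> str:
    """Return the nucleotides encoding residues domain_start..domain_end.

    Positions are 1-based and inclusive, as domain annotations usually are.
    """
    if len(cds) % 3 != 0:
        raise ValueError("CDS length is not a multiple of three")
    if domain_start < 1 or domain_end < domain_start:
        raise ValueError("invalid domain coordinates")
    nt_start = 3 * (domain_start - 1)  # 0-based offset of the first codon
    nt_end = 3 * domain_end            # exclusive end offset of the last codon
    if nt_end > len(cds):
        raise ValueError("domain extends beyond the CDS")
    return cds[nt_start:nt_end]


# Toy CDS: ATG AAA ACC GCT TAT TAA (M K T A Y stop).
print(domain_encoding_region("ATGAAAACCGCTTATTAA", 2, 4))
```

The range checks stand in for the validation the abstract mentions; a fuller check would translate the extracted region and compare it with the annotated domain sequence.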


Identification of Viral Taxon-Specific Genes (VTSG): Application to Caliciviridae

  • Kang, Shinduck;Kim, Young-Chang
    • Genomics & Informatics / v.16 no.4 / pp.23.1-23.5 / 2018
  • Virus taxonomy was initially determined by clinical experiments based on phenotype. With the development of sequence analysis methods, genotype-based classification was also applied, and with advances in genome sequence analysis technology there is an increasing demand for virus taxonomy to be extended from in vivo and in vitro to in silico. In this study, we verified the consistency of the current International Committee on Taxonomy of Viruses taxonomy using an in silico approach, aiming to identify the specific sequence for each virus. We applied this approach to norovirus in Caliciviridae, which causes 90% of gastroenteritis cases worldwide. Based on the dogma "protein structure determines its function," we hypothesized that a specific sequence can be identified by its specific structure. First, we extracted the coding regions (CDS). Second, the CDS protein sequences of each genus were annotated by a conserved domain database (CDD) search. Finally, the conserved domains of each genus in Caliciviridae were classified by RPS-BLAST with CDD. The analysis shows that Caliciviridae sequences share RNA helicase in common. In Norovirus, the Calicivirus coat protein C terminal and viral polyprotein N-terminal appear as specific domains that are not found in the other genera of Caliciviridae. Specific conserved domains detected with this method can be used as classification keywords based on protein functional structure. Once the specific protein domains are determined, their sequences can be converted to gene sequences, which could then be reused as viral biomarkers.
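
Once each genus has a set of annotated conserved domains, the final comparison reduces to set operations: domains shared by every genus versus domains found in only one. The sketch below assumes the domain annotation has already been done. The Norovirus entry uses the domain names reported in the abstract, while `genus_B`, `genus_C`, and `placeholder_domain` are hypothetical stand-ins, not real annotation results.

```python
def common_and_specific(
    domains_by_genus: dict[str, set[str]],
) -> tuple[set[str], dict[str, set[str]]]:
    """Split domain annotations into family-wide and genus-specific sets."""
    common = set.intersection(*domains_by_genus.values())
    specific = {}
    for genus, domains in domains_by_genus.items():
        # Union of the domains seen in every other genus.
        others = set().union(
            *(d for g, d in domains_by_genus.items() if g != genus)
        )
        specific[genus] = domains - others
    return common, specific


annotations = {
    "Norovirus": {
        "RNA helicase",
        "Calicivirus coat protein C terminal",
        "viral polyprotein N-terminal",
    },
    "genus_B": {"RNA helicase", "placeholder_domain"},  # hypothetical
    "genus_C": {"RNA helicase"},                        # hypothetical
}
common, specific = common_and_specific(annotations)
print(common)
print(specific["Norovirus"])
```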

Nucleotide and protein researches on anaerobic fungi during four decades

  • Chang, Jongsoo;Park, Hyunjin
    • Journal of Animal Science and Technology / v.62 no.2 / pp.121-140 / 2020
  • Anaerobic fungi inhabit the gastrointestinal tract of foregut and hindgut fermenters and degrade fibrous plant biomass through hydrolysis by a wide variety of cellulolytic enzymes and through physical penetration of the fiber matrix by their rhizoids. To date, seventeen genera have been described in family Neocallimasticaceae, class Neocallimastigomycetes, phylum Neocallimastigomycota, and one genus has been described in phylum Neocallimastigomycota. In the National Center for Biotechnology Information (NCBI) database (DB), 23,830 nucleotide sequences and 59,512 protein sequences have been deposited, most of them originating from Piromyces, Neocallimastix and Anaeromyces. Most of the protein sequences (44,025) were acquired with the PacBio next-generation sequencing system. The whole genome sequences of Anaeromyces robustus, Neocallimastix californiae, Pecoramyces ruminantium, Piromyces finnis and Piromyces sp. E2 are available in the Joint Genome Institute (JGI) database. According to the protein prediction results, average isoelectric points (pIs) ranged from 5.88 (Anaeromyces) to 6.57 (Piromyces), and average molecular weights ranged from 38.7 kDa (Orpinomyces) to 56.6 kDa (Piromyces). In the Carbohydrate-Active enZYmes (CAZY) database, glycoside hydrolases (36), carbohydrate-binding modules (11), carbohydrate esterases (8), glycosyltransferases (5) and polysaccharide lyases (3) from anaerobic fungi are registered. Over four decades, 1,031 research articles about anaerobic fungi were published, and 444 and 719 articles are available in the PubMed (PM) and PubMed Central (PMC) DBs, respectively.

Identification of Salmonella pullorum Genomic Sequences Using Suppression Subtractive Hybridization

  • Li, Qiuchun;Xu, Yaohui;Jiao, Xinan
    • Journal of Microbiology and Biotechnology / v.19 no.9 / pp.898-903 / 2009
  • Pullorum disease affecting poultry is caused by Salmonella enterica serovar Pullorum and results in severe economic loss every year, especially in countries with a developing poultry industry. The pathogenesis of S. Pullorum is not yet well defined, as the specific virulence factors still need to be identified. Thus, to isolate specific DNA fragments belonging to S. Pullorum, this study used suppression subtractive hybridization. As such, the genome of the S. Pullorum C79-13 strain was subtracted from the genome of Salmonella enterica serovar Gallinarum 9 and Salmonella enterica serovar Enteritidis CMCC(B) 50041, respectively, resulting in the identification of 20 subtracted fragments. A sequence homology analysis then revealed three types of fragment: phage sequences, plasmid sequences, and sequences with an unknown function. As a result, several important virulence-related genes encoding the IpaJ protein, colicin Y, tailspike protein, excisionase, and Rhs protein were identified that may play a role in the pathogenesis of S. Pullorum.

Bioinformatics Analysis of Hsp20 Sequences in Proteobacteria

  • Heine, Michelle;Chandra, Sathees B.C.
    • Genomics & Informatics / v.7 no.1 / pp.26-31 / 2009
  • Heat shock proteins are a class of molecular chaperones found in nearly all organisms in the Bacteria, Archaea and Eukarya domains. Their transcription increases during periods of heat-induced osmotic stress, and they are involved in protein disaggregation and refolding as part of a cell's danger signaling cascade. The heat shock protein Hsp20 is a small molecular chaperone of approximately 20 kDa that is hypothesized to prevent aggregation and denaturation. Hsp20 is found in several strains of Proteobacteria, the largest phylum of the Bacteria domain, which also contains several medically significant bacterial strains. Genomic analyses were performed to determine a common evolutionary pattern among Hsp20 sequences in Proteobacteria. Hsp20 was found to share a common ancestor within and among the five subclasses of Proteobacteria. This is readily apparent from the degree of sequence similarity within and between Hsp20 protein sequences, as well as from phylogenetic analysis of sequences from proteobacterial and non-proteobacterial species.

2-D graphical representation of protein sequences and its application to coronavirus phylogeny

  • Li, Chun;Xing, Lili;Wang, Xin
    • BMB Reports / v.41 no.3 / pp.217-222 / 2008
  • Based on a five-letter model of the 20 amino acids, we propose a new 2-D graphical representation of protein sequences. We then transform this representation into a numerical characterization that facilitates quantitative comparison of protein sequences. As an application, we construct the phylogenetic tree of 56 coronavirus spike proteins. The resulting tree agrees well with the established taxonomic groups.

Proteomics Data Analysis using Representative Database

  • Kwon, Kyung-Hoon;Park, Gun-Wook;Kim, Jin-Young;Park, Young-Mok;Yoo, Jong-Shin
    • Bioinformatics and Biosystems / v.2 no.2 / pp.46-51 / 2007
  • In proteomics research using mass spectrometry, a protein database search yields protein information from the peptide sequences that best match the tandem mass spectra. The protein sequence database has been a powerful knowledge base for this protein identification. However, as protein sequence information accumulates, the database becomes very large, and considering every protein sequence in a search consumes too much computing time. For high-throughput proteome analysis, a refined non-redundant database, such as the IPI human database of the European Bioinformatics Institute, has usually been used. While a non-redundant database returns search results quickly, it misses the variation among protein sequences. In this study, we examined proteomics data from the viewpoint of protein similarity and used a network analysis tool to build a new analysis method. This method should save computing time in the database search while retaining the sequence variation needed to detect modified peptides.
