• Title/Summary/Keyword: Protein Sequence Prediction

Search Result 85, Processing Time 0.032 seconds

A New Approach to Find Orthologous Proteins Using Sequence and Protein-Protein Interaction Similarity

  • Kim, Min-Kyung;Seol, Young-Joo;Park, Hyun-Seok;Jang, Seung-Hwan;Shin, Hang-Cheol;Cho, Kwang-Hwi
    • Genomics & Informatics
    • /
    • v.7 no.3
    • /
    • pp.141-147
    • /
    • 2009
  • Developed proteome-scale ortholog and paralog prediction methods are mainly based on sequence similarity. However, it is known that even the closest BLAST hit often does not mean the closest neighbor. For this reason, we added conserved interaction information to find orthologs. We propose a genome-scale, automated ortholog prediction method, named OrthoInterBlast. The method is based on both sequence and interaction similarity. When we applied this method to fly and yeast, 17% of the ortholog candidates were different compared with the results of Inparanoid. By adding protein-protein interaction information, proteins that have low sequence similarity still can be selected as orthologs, which can not be easily detected by sequence homology alone.

Comparison of External Information Performance Predicting Subcellular Localization of Proteins (단백질의 세포내 위치를 예측하기 위한 외부정보의 성능 비교)

  • Chi, Sang-Mun
    • Journal of KIISE:Software and Applications
    • /
    • v.37 no.11
    • /
    • pp.803-811
    • /
    • 2010
  • Since protein subcellular location and biological function are highly correlated, the prediction of protein subcellular localization can provide information about the function of a protein. In order to enhance the prediction performance, external information other than amino acids sequence information is actively exploited in many researches. This paper compares the prediction capabilities resided in amino acid sequence similarity, protein profile, gene ontology, motif, and textual information. In the experiments using PLOC dataset which has proteins less than 80% sequence similarity, sequence similarity information and gene ontology are effective information, achieving a classification accuracy of 94.8%. In the experiments using BaCelLo IDS dataset with low sequence similarity less than 30%, using gene ontology gives the best prediction accuracies, 93.2% for animals and 86.6% for fungi.

The Grammatical Structure of Protein Sequences

  • Bystroff, Chris
    • Proceedings of the Korean Society for Bioinformatics Conference
    • /
    • 2000.11a
    • /
    • pp.28-31
    • /
    • 2000
  • We describe a hidden Markov model, HMMTIR, for general protein sequence based on the I-sites library of sequence-structure motifs. Unlike the linear HMMs used to model individual protein families, HMMSTR has a highly branched topology and captures recurrent local features of protein sequences and structures that transcend protein family boundaries. The model extends the I-sites library by describing the adjacencies of different sequence-structure motifs as observed in the database, and achieves a great reduction in parameters by representing overlapping motifs in a much more compact form. The HMM attributes a considerably higher probability to coding sequence than does an equivalent dipeptide model, predicts secondary structure with an accuracy of 74.6% and backbone torsion angles better than any previously reported method, and predicts the structural context of beta strands and turns with an accuracy that should be useful for tertiary structure prediction. HMMSTR has been incorporated into a public, fully-automated protein structure prediction server.

  • PDF

Sequence driven features for prediction of subcellular localization of proteins

  • Kim, Jong-Kyoung;Bang, Sung-Yang;Choi, Seung-Jin
    • Proceedings of the Korean Society for Bioinformatics Conference
    • /
    • 2005.09a
    • /
    • pp.237-242
    • /
    • 2005
  • Predicting the cellular location of an unknown protein gives a valuable information for inferring the possible function of the protein. For more accurate prediction system, we need a good feature extraction method that transforms the raw sequence data into the numerical feature vector, minimizing information loss. In this paper, we propose new methods of extracting underlying features only from the sequence data by computing pairwise sequence alignment scores. In addition, we use composition based features to improve prediction accuracy. To construct an SVM ensemble from separately trained SVM classifiers, we propose specificity based weighted majority voting. The overall prediction accuracy evaluated by the 5-fold cross-validation reached 88.53% for the eukaryotic animal data set. By comparing the prediction accuracy of various feature extraction methods, we could get the biological insight on the location of targeting information. Our numerical experiments confirm that our new feature extraction methods are very useful for predicting subcellular localization of proteins.

  • PDF

A Protein Sequence Prediction Method by Mining Sequence Data (서열 데이타마이닝을 통한 단백질 서열 예측기법)

  • Cho, Sun-I;Lee, Do-Heon;Cho, Kwang-Hwi;Won, Yong-Gwan;Kim, Byoung-Ki
    • The KIPS Transactions:PartD
    • /
    • v.10D no.2
    • /
    • pp.261-266
    • /
    • 2003
  • A protein, which is a linear polymer of amino acids, is one of the most important bio-molecules composing biological structures and regulating bio-chemical reactions. Since the characteristics and functions of proteins are determined by their amino acid sequences in principle, protein sequence determination is the starting point of protein function study. This paper proposes a protein sequence prediction method based on data mining techniques, which can overcome the limitation of previous bio-chemical sequencing methods. After applying multiple proteases to acquire overlapped protein fragments, we can identify candidate fragment sequences by comparing fragment mass values with peptide databases. We propose a method to construct multi-partite graph and search maximal paths to determine the protein sequence by assembling proper candidate sequences. In addition, experimental results based on the SWISS-PROT database showing the validity of the proposed method is presented.

NOGSEC: A NOnparametric method for Genome SEquence Clustering (녹섹(NOGSEC): A NOnparametric method for Genome SEquence Clustering)

  • 이영복;김판규;조환규
    • Korean Journal of Microbiology
    • /
    • v.39 no.2
    • /
    • pp.67-75
    • /
    • 2003
  • One large topic in comparative genomics is to predict functional annotation by classifying protein sequences. Computational approaches for function prediction include protein structure prediction, sequence alignment and domain prediction or binding site prediction. This paper is on another computational approach searching for sets of homologous sequences from sequence similarity graph. Methods based on similarity graph do not need previous knowledges about sequences, but largely depend on the researcher's subjective threshold settings. In this paper, we propose a genome sequence clustering method of iterative testing and graph decomposition, and a simple method to calculate a strict threshold having biochemical meaning. Proposed method was applied to known bacterial genome sequences and the result was shown with the BAG algorithm's. Result clusters are lacking some completeness, but the confidence level is very high and the method does not need user-defined thresholds.

In Silico Functional Assessment of Sequence Variations: Predicting Phenotypic Functions of Novel Variations

  • Won, Hong-Hee;Kim, Jong-Won
    • Genomics & Informatics
    • /
    • v.6 no.4
    • /
    • pp.166-172
    • /
    • 2008
  • A multitude of protein-coding sequence variations (CVs) in the human genome have been revealed as a result of major initiatives, including the Human Variome Project, the 1000 Genomes Project, and the International Cancer Genome Consortium. This naturally has led to debate over how to accurately assess the functional consequences of CVs, because predicting the functional effects of CVs and their relevance to disease phenotypes is becoming increasingly important. This article surveys and compares variation databases and in silico prediction programs that assess the effects of CVs on protein function. We also introduce a combinatorial approach that uses machine learning algorithms to improve prediction performance.

Genome Scale Protein Secondary Structure Prediction Using a Data Distribution on a Grid Computing

  • Cho, Min-Kyu;Lee, Soojin;Jung, Jin-Won;Kim, Jai-Hoon;Lee, Weontae
    • Proceedings of the Korean Biophysical Society Conference
    • /
    • 2003.06a
    • /
    • pp.65-65
    • /
    • 2003
  • After many genome projects, algorithms and software to process explosively growing biological information have been developed. To process huge amount of biological information, high performance computing equipments are essential. If we use the remote resources such as computing power, storages etc., through a Grid to share the resources in the Internet environment, we will be able to obtain great efficiency to process data at a low cost. Here we present the performance improvement of the protein secondary structure prediction (PSIPred) by using the Grid platform, distributing protein sequence data on the Grid where each computer node analyzes its own part of protein sequence data to speed up the structure prediction. On the Grid, genome scale secondary structure prediction for Mycoplasma genitalium, Escherichia coli, Helicobacter pylori, Saccharomyces cerevisiae and Caenorhabditis slogans were performed and analyzed by a statistical way to show the protein structural deviation and comparison between the genomes. Experimental results show that the Grid is a viable platform to speed up the protein structure prediction and from the predicted structures.

  • PDF

AllEC: An Implementation of Application for EC Numbers Prediction based on AEC Algorithm

  • Park, Juyeon;Park, Mingyu;Han, Sora;Kim, Jeongdong;Oh, Taejin;Lee, Hyun
    • International Journal of Advanced Culture Technology
    • /
    • v.10 no.2
    • /
    • pp.201-212
    • /
    • 2022
  • With the development of sequencing technology, there is a need for technology to predict the function of the protein sequence. Enzyme Commission (EC) numbers are becoming markers that distinguish the function of the sequence. In particular, many researchers are researching various methods of predicting the EC numbers of protein sequences based on deep learning. However, as studies using various methods exist, a problem arises, in which the exact prediction result of the sequence is unknown. To solve this problem, this paper proposes an All Enzyme Commission (AEC) algorithm. The proposed AEC is an algorithm that executes various prediction methods and integrates the results when predicting sequences. This algorithm uses duplicates to give more weights when duplicate values are obtained from multiple methods. The largest value, among the final prediction result values for each method to which the weight is applied, is the final prediction result. Moreover, for the convenience of researchers, the proposed algorithm is provided through the AllEC web services. They can use the algorithms regardless of the operating systems, installation, or operating environment.

Sequence driven features for prediction of subcellular localization of proteins (단백질의 세포내 소 기관별 분포 예측을 위한 서열 기반의 특징 추출 방법)

  • Kim, Jong-Kyoung;Choi, Seung-Jin
    • Proceedings of the Korean Information Science Society Conference
    • /
    • 2005.07b
    • /
    • pp.226-228
    • /
    • 2005
  • Predicting the cellular location of an unknown protein gives valuable information for inferring the possible function of the protein. For more accurate Prediction system, we need a good feature extraction method that transforms the raw sequence data into the numerical feature vector, minimizing information loss. In this paper we propose new methods of extracting underlying features only from the sequence data by computing pairwise sequence alignment scores. In addition, we use composition based features to improve prediction accuracy. To construct an SVM ensemble from separately trained SVM classifiers, we propose specificity based weighted majority voting . The overall prediction accuracy evaluated by the 5-fold cross-validation reached $88.53\%$ for the eukaryotic animal data set. By comparing the prediction accuracy of various feature extraction methods, we could get the biological insight on the location of targeting information. Our numerical experiments confirm that our new feature extraction methods are very useful forpredicting subcellular localization of proteins.

  • PDF