• Title/Summary/Keyword: sequence alignment


Linear-Time Korean Morphological Analysis Using an Action-based Local Monotonic Attention Mechanism

  • Hwang, Hyunsun;Lee, Changki
    • ETRI Journal / v.42 no.1 / pp.101-107 / 2020
  • For Korean language processing, morphological analysis is a critical component that requires extensive work. With a sequence-to-sequence model, this analysis can be conducted end to end without a complicated feature design. However, a sequence-to-sequence model that uses an attention mechanism for high performance has a time complexity of $O(n^2)$ for an input of length n. In this study, we propose a linear-time Korean morphological analysis model using a local monotonic attention mechanism that relies on monotonic alignment, which is a characteristic of Korean morphological analysis. The proposed model shows a marked improvement in a single-threaded environment and a high morphometric F1-measure, even for a hard attention model in which the attention mechanism formula is eliminated.
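The gap between quadratic and linear cost described above can be illustrated by counting attention score evaluations. The following is a minimal sketch, not the authors' model: the function name `attention_score_count`, the fixed window of 5 positions, and the assumption that the output length equals the input length are all hypothetical choices made for illustration.

```python
def attention_score_count(n_inputs, n_outputs, window=None):
    """Count attention score evaluations for one decoded sequence.

    window=None models global attention: every output step scores all inputs.
    An integer window models local attention: every output step scores at most
    `window` inputs around a position that only moves forward (monotonic).
    """
    per_step = n_inputs if window is None else min(window, n_inputs)
    return n_outputs * per_step


# Assumption: output length equals input length, and the window is 5 positions.
for n in (10, 100, 1000):
    full = attention_score_count(n, n)             # grows as n * n
    local = attention_score_count(n, n, window=5)  # grows as 5 * n
    print(n, full, local)
```

With a fixed window, the count grows in proportion to n, whereas global attention grows in proportion to n squared.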

Molecular Cloning of a cDNA Encoding a Ferritin Subunit from the Spider, Araneus ventricosus

  • Jin, Byung-Rea;Han, Ji-Hee;Kim, Seong-Ryul;Sohn, Hung-Dae
    • International Journal of Industrial Entomology and Biomaterials / v.4 no.2 / pp.163-168 / 2002
  • We report for the first time the cDNA sequence encoding a ferritin subunit from the spider, Araneus ventricosus. The complete cDNA sequence of the A. ventricosus ferritin subunit comprised 516 bp encoding 172 amino acid residues. The A. ventricosus ferritin subunit cDNA contained a conserved iron responsive element sequence in the 5' untranslated region. An alignment of the deduced protein sequence of the A. ventricosus ferritin subunit gene with those of other heavy chain ferritin molecules showed that the A. ventricosus ferritin subunit is most similar to the ferritin of the great pond snail, Lymnaea stagnalis, with 70.2% protein sequence identity.
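Percent sequence identity figures such as the one above are computed from an alignment. The sketch below shows one common convention (identical columns divided by all columns that are not gaps in both sequences); the abstract does not state which convention was used, and the function name `percent_identity` and the toy sequences are made up for illustration.

```python
def percent_identity(aligned_a: str, aligned_b: str, gap: str = "-") -> float:
    """Percent identity between two rows of a pairwise alignment.

    Assumed convention: identical residues / columns that are not gap-vs-gap.
    Other conventions (for example, dividing by the shorter sequence length)
    give different numbers for the same alignment.
    """
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have the same length")
    matches = 0
    compared = 0
    for x, y in zip(aligned_a, aligned_b):
        if x == gap and y == gap:
            continue  # column carries no information
        compared += 1
        if x == y:
            matches += 1  # x == y here implies neither is a gap
    return 100.0 * matches / compared if compared else 0.0


# Toy alignment (not real ferritin data): 4 identical columns out of 6.
print(round(percent_identity("MKT-LV", "MRTALV"), 1))
```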

An Efficient Local Alignment Algorithm for DNA Sequences including N and X (N과 X를 포함하는 DNA 서열을 위한 효율적인 지역정렬 알고리즘)

  • Kim, Jin-Wook
    • Journal of KIISE:Computing Practices and Letters / v.16 no.3 / pp.275-280 / 2010
  • A local alignment algorithm finds a pair of substrings, one from each of two given strings, that are similar to each other. A DNA sequence can consist not only of A, C, G, and T but also of N and X, which are used when the original bases have lost their information for various reasons. In this paper, we present an efficient local alignment algorithm for two DNA sequences including N and X using the affine gap penalty metric. Our algorithm is an extended version of the Kim-Park algorithm and can be extended further to include other characters that have properties similar to those of N and X.
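For orientation, local alignment with affine gap penalties can be sketched with the standard three-matrix dynamic program (Smith-Waterman with Gotoh-style gap handling). This is not the Kim-Park algorithm or the extension proposed in the paper. The scoring values, the function names, and the rule that any pair involving N or X scores 0 are assumptions chosen only to make the example concrete.

```python
NEG_INF = float("-inf")


def score_pair(x, y, match=2, mismatch=-1, wildcard=0):
    """Substitution score; pairs involving N or X get `wildcard` (assumption)."""
    if x in "NX" or y in "NX":
        return wildcard
    return match if x == y else mismatch


def local_align_score(a, b, gap_open=2, gap_extend=1):
    """Best local alignment score; a gap of length k costs gap_open + k * gap_extend."""
    rows, cols = len(a) + 1, len(b) + 1
    H = [[0] * cols for _ in range(rows)]        # best score ending at (i, j)
    E = [[NEG_INF] * cols for _ in range(rows)]  # alignment ends with a gap in a
    F = [[NEG_INF] * cols for _ in range(rows)]  # alignment ends with a gap in b
    best = 0
    for i in range(1, rows):
        for j in range(1, cols):
            E[i][j] = max(E[i][j - 1] - gap_extend,
                          H[i][j - 1] - gap_open - gap_extend)
            F[i][j] = max(F[i - 1][j] - gap_extend,
                          H[i - 1][j] - gap_open - gap_extend)
            diag = H[i - 1][j - 1] + score_pair(a[i - 1], b[j - 1])
            H[i][j] = max(0, diag, E[i][j], F[i][j])  # 0 restarts the alignment
            best = max(best, H[i][j])
    return best


print(local_align_score("ACGT", "ACGT"))  # four matches at 2 each: 8
print(local_align_score("ACNT", "ACGT"))  # N scores 0 against G: 6
```

The sketch runs in time proportional to the product of the two sequence lengths and returns only the score; recovering the aligned substrings would require a traceback.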

A Study on Clustering and Identifying Gene Sequences using Suffix Tree Clustering Method and BLAST (서픽스트리 클러스터링 방법과 블라스트를 통합한 유전자 서열의 클러스터링과 기능검색에 관한 연구)

  • Han, Sang-Il;Lee, Sung-Gun;Kim, Kyung-Hoon;Lee, Ju-Yeong;Kim, Young-Han;Hwang, Kyu-Suk
    • Journal of Institute of Control, Robotics and Systems / v.11 no.10 / pp.851-856 / 2005
  • DNA and protein data of diverse species are discovered daily and deposited in public archives according to each established format. Database systems in the public archives provide not only an easy-to-use, flexible interface to the public, but also in silico analysis tools for unidentified sequence data. Of such in silico analysis tools, multiple sequence alignment [1] methods relying on pairwise alignment and the Smith-Waterman algorithm [2] enable us to identify unknown DNA and protein sequences or phylogenetic relations among several species. However, in the existing multiple alignment methods, the runtime increases exponentially as the number of sequences increases. To remedy this problem, we adopted a parallel processing suffix tree algorithm that is able to search for common subsequences at one time without pairwise alignment. In addition, cross-matching subsequences, which trigger inexact matching among the searched common subsequences, may be produced, so a cross-matching masking process is suggested in this paper. To identify the function of the clusters generated by suffix tree clustering, BLAST was combined with the clustering tool. Our clustering and annotating tool is summarized in the following steps: (1) construction of the suffix tree; (2) masking of cross-matching pairs; (3) clustering of gene sequences; and (4) annotating gene clusters by BLAST search. The system was successfully evaluated with 22 gene sequences in the pyruvate pathway of bacteria, producing 7 clusters and finding representative common subsequences of each cluster.

Physiological and Phylogenetic Analysis of Burkholderia sp. HY1 Capable of Aniline Degradation

  • Kahng, Hyung-Yeel;Kukor, Jerome J.;Oh, Kye-Heon
    • Journal of Microbiology and Biotechnology / v.10 no.5 / pp.643-650 / 2000
  • A new aniline-utilizing microorganism, strain HY1 obtained from an orchard soil, was characterized by using the BIOLOG system, an analysis of the total cellular fatty acids, and a 16S rDNA sequence. Strain HY1 was identified as a Burkholderia species, and was designated Burkholderia sp. HY1. GC and HPLC analyses revealed that Burkholderia sp. HY1 was able to degrade aniline to produce catechol, which was subsequently converted to cis,cis-muconic acid through an ortho-ring fission pathway under aerobic conditions. Strain HY1 exhibited a drastic reduction in the rate of aniline degradation when glucose was added to the aniline media. However, the addition of peptone or nitrate to the aniline media dramatically accelerated the rate of aniline degradation. A fatty acid analysis showed that strain HY1 was able to produce lipids 16:0 2OH, and 11 methyl 18:1 ${\omega}7c$ approximately 3.7-, 2.2-, and 6-fold more, respectively, when grown on aniline media than when grown on TSA. An analysis of the 16S rDNA sequence revealed that strain HY1 was very closely related to Burkholderia graminis, with 95% similarity based on the alignment of a 1,435 bp fragment. A phylogenetic analysis of the 16S rDNA sequence based on a 1,420 bp multi-alignment showed that strain HY1 was placed among three major clonal types of $\beta$-Proteobacteria, including Burkholderia graminis, Burkholderia phenazinium, and Burkholderia glathei. The sequence GAT(C or G)$\underline{G}$, which is highly conserved in several locations in the 16S rDNA gene among the major clonal type strains of $\beta$-Proteobacteria, was frequently replaced with GAT(C or G)$\underline{A}$ in the 16S rDNA sequence from strain HY1.


An effective method for comparing similarity of document with Multi-Level alignment (다단계정렬을 활용한 효율적인 문서 유사도 비교법)

  • Seo, Jong-Kyu;Hwang, Hae-Lyen;Cho, Hwan-Gue
    • Proceedings of the Korea Information Processing Society Conference / 2012.04a / pp.402-405 / 2012
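  • Methods for measuring the similarity between documents fall broadly into two groups: those based on fingerprinting and those based on sequence alignment algorithms. The two approaches have the advantages of speed and accuracy, respectively. Multi-Level Alignment (MLA) combines the two so that the user can decide the balance between search speed and accuracy.[1] MLA divides two documents into basis blocks and measures similarity by comparing the vectors of the blocks. In this study, we describe a method that quickly generates and compares the basis block vectors through initial consonant (choseong) extraction and stem extraction, and a multi-level search scheme that measures similarity quickly while maintaining accuracy. Experimental results show that the proposed method makes large-scale document comparison using multi-level alignment more than twice as fast.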
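The block-vector comparison described above can be sketched as follows. This is only an illustration of the general idea, not the MLA method itself: the abstract does not say how block vectors are built or compared, so whitespace tokens, a fixed block size, token-count vectors, and cosine similarity are all assumptions, as are the function names.

```python
import math
from collections import Counter


def split_blocks(tokens, block_size):
    """Cut a token list into consecutive blocks of at most block_size tokens."""
    return [tokens[i:i + block_size] for i in range(0, len(tokens), block_size)]


def cosine(u: Counter, v: Counter) -> float:
    """Cosine similarity between two token-count vectors."""
    dot = sum(u[k] * v[k] for k in u.keys() & v.keys())
    norm = (math.sqrt(sum(c * c for c in u.values()))
            * math.sqrt(sum(c * c for c in v.values())))
    return dot / norm if norm else 0.0


def best_block_matches(doc_a: str, doc_b: str, block_size: int = 4):
    """For each block of doc_a, the highest similarity to any block of doc_b."""
    blocks_a = [Counter(b) for b in split_blocks(doc_a.split(), block_size)]
    blocks_b = [Counter(b) for b in split_blocks(doc_b.split(), block_size)]
    return [max((cosine(a, b) for b in blocks_b), default=0.0) for a in blocks_a]


# The second block of the first document reappears in the second document.
print(best_block_matches("a b c d e f g h", "e f g h x y z w"))
```

A cheap pass like this could flag candidate block pairs, leaving a slower alignment step for only the blocks that score highly.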

Sequence driven features for prediction of subcellular localization of proteins

  • Kim, Jong-Kyoung;Bang, Sung-Yang;Choi, Seung-Jin
    • Proceedings of the Korean Society for Bioinformatics Conference / 2005.09a / pp.237-242 / 2005
  • Predicting the cellular location of an unknown protein gives valuable information for inferring the possible function of the protein. For a more accurate prediction system, we need a good feature extraction method that transforms the raw sequence data into a numerical feature vector while minimizing information loss. In this paper, we propose new methods of extracting underlying features from the sequence data alone by computing pairwise sequence alignment scores. In addition, we use composition-based features to improve prediction accuracy. To construct an SVM ensemble from separately trained SVM classifiers, we propose specificity-based weighted majority voting. The overall prediction accuracy evaluated by 5-fold cross-validation reached 88.53% for the eukaryotic animal data set. By comparing the prediction accuracy of various feature extraction methods, we gained biological insight into the location of targeting information. Our numerical experiments confirm that the new feature extraction methods are very useful for predicting the subcellular localization of proteins.
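One plausible reading of specificity-based weighted majority voting is sketched below: each classifier votes for its predicted class, and the vote is weighted by that classifier's specificity for the class. The abstract does not give the exact weighting formula, so this rule, the function name `weighted_majority_vote`, the class labels, and the specificity values are all hypothetical.

```python
from collections import defaultdict


def weighted_majority_vote(predictions, specificity):
    """Combine one predicted label per classifier into a single label.

    predictions: list of labels, one per classifier.
    specificity: list of dicts, one per classifier, mapping a label to that
        classifier's specificity for the label (for example, measured on
        held-out data as TN / (TN + FP)).
    Ties are broken alphabetically so the result is deterministic.
    """
    totals = defaultdict(float)
    for clf_index, label in enumerate(predictions):
        totals[label] += specificity[clf_index].get(label, 0.0)
    return max(sorted(totals), key=lambda label: totals[label])


# Made-up example: one highly specific vote outweighs two weaker ones.
votes = ["nuclear", "cytoplasmic", "cytoplasmic"]
weights = [{"nuclear": 0.95}, {"cytoplasmic": 0.40}, {"cytoplasmic": 0.45}]
print(weighted_majority_vote(votes, weights))  # nuclear (0.95 vs 0.85)
```

Unlike plain majority voting, this lets a single classifier that rarely produces false positives for a class override a larger number of less reliable votes.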


Isolation and Characterization of Two Amino Acid-activating Domains of Peptide Synthetase Gene from Bacillus subtilis 713

  • Lee, Youl-Soon;You, Sang-Bae;Lee, Ji-Wan;Kim, Tae-Young;Kim, Sung-Uk;Bok, Song-Hae
    • Journal of Microbiology and Biotechnology / v.8 no.4 / pp.399-405 / 1998
  • From the sequence alignment of various non-ribosomal peptide synthetases, several motifs of highly conserved sequences have been identified within each domain of peptide synthetases. We designed PCR primers based on the highly conserved nucleotide sequences to amplify and isolate a ∼7.2-kb DNA fragment of the Bacillus subtilis 713 which was isolated and reported to produce an antifungal peptide compound. Nucleotide sequence analysis of 4.8 kb of the predicted amino acids revealed significant homology to various peptide synthetases over the whole sequence and also revealed two amino acid-activating domains with highly conserved Core 1 to Core 6 and spacer motif. This suggests that the isolated DNA fragment is part of a peptide synthetase gene for antifungal peptide.


cDNA Sequence and mRNA Expression of a Novel Peroxiredoxin from the Firefly, Pyrocoelia rufa

  • Jin, Byung-Rae;Lee, Kwang-Sik;Kim, Seong-Ryul;Sohn, Hung-Dae
    • International Journal of Industrial Entomology and Biomaterials / v.4 no.2 / pp.101-107 / 2002
  • We describe here the cDNA sequence and mRNA expression of a novel family of the antioxidant protein, peroxiredoxin, from the firefly, Pyrocoelia rufa. The 555 bp cDNA sequence codes for a 185 amino acid protein with a calculated molecular mass of approximately 21 kDa. The deduced protein of the P. rufa peroxiredoxin gene contains two conserved cysteine residues. Alignment of the deduced protein of the P. rufa peroxiredoxin gene showed 71.1% protein sequence identity to the known peroxiredoxin of the insect Drosophila melanogaster. Northern blot analysis revealed that the P. rufa peroxiredoxin is specifically expressed in the fat body of P. rufa larvae.

cDNA Sequence and mRNA Expression of a Novel Serine Protease from the Firefly, Pyrocoelia rufa

  • Lee, Kwang-Sik;Kim, Seong-Ryul;Sohn, Hung-Dae;Jin, Byung-Rae
    • International Journal of Industrial Entomology and Biomaterials / v.5 no.1 / pp.103-108 / 2002
  • We describe here the cDNA sequence and mRNA expression of a novel serine protease from the firefly, Pyrocoelia rufa. The 771 bp cDNA encodes 257 amino acid residues. The deduced protein of the P. rufa serine protease gene contains the catalytic triad and six conserved cysteine residues. Alignment of the deduced protein of the P. rufa serine protease gene showed 47.4% protein sequence identity to the known midgut trypsin-like enzyme of the coleopteran insect Rhyzopertha dominica. Northern blot analysis revealed that the P. rufa serine protease is specifically expressed in the midgut of P. rufa larvae.