• Title/Summary/Keyword: DNA strings

Efficient Indexing for Large DNA Sequence Databases (대용량 DNA 시퀀스 데이타베이스를 위한 효율적인 인덱싱)

  • Won Jung-Im;Yoon Jee-Hee;Park Sang-Hyun;Kim Sang-Wook
    • Journal of KIISE:Databases / v.31 no.6 / pp.650-663 / 2004
  • In molecular biology, DNA sequence searching is one of the most crucial operations. Since DNA databases contain a huge volume of sequences, a fast indexing mechanism is essential for processing DNA sequence searches efficiently. In this paper, we first identify the problems of the suffix tree in terms of storage overhead, search performance, and integration with DBMSs. Then, we propose a new index structure that solves those problems. The proposed index consists of two parts: the primary part represents the trie as bit strings without any pointers, and the secondary part provides fast access to the leaf nodes of the trie that need to be accessed for post-processing. We also suggest an efficient algorithm based on that index for DNA sequence searching. To verify the superiority of the proposed approach, we conducted a performance evaluation via a series of experiments. The results revealed that the proposed approach, which requires less storage space, achieves a 13- to 29-fold performance improvement over the suffix tree.
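For orientation, the kind of pointer-based structure the abstract compares against can be sketched as a naive suffix trie. This is only an illustrative baseline, not the bit-string index proposed in the paper; the names `TrieNode`, `build_suffix_trie`, and `find_occurrences` are hypothetical, and the quadratic construction is suitable for short sequences only.

```python
class TrieNode:
    """One node of a naive, pointer-based suffix trie (illustrative only)."""

    def __init__(self):
        self.children = {}   # base character -> child TrieNode
        self.starts = []     # start positions of suffixes passing through this node


def build_suffix_trie(seq):
    """Insert every suffix of seq; O(len(seq)^2) time and space."""
    root = TrieNode()
    for start in range(len(seq)):
        node = root
        for ch in seq[start:]:
            node = node.children.setdefault(ch, TrieNode())
            node.starts.append(start)
    return root


def find_occurrences(root, query):
    """Return the start positions of query in the indexed sequence."""
    node = root
    for ch in query:
        node = node.children.get(ch)
        if node is None:
            return []
    return node.starts


trie = build_suffix_trie("ACGTACGA")
print(find_occurrences(trie, "ACG"))
```

Every node here carries a dictionary of child references, which is the sort of per-node pointer overhead the abstract's storage concern refers to.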

A Local Alignment Algorithm using Normalization by Functions (함수에 의한 정규화를 이용한 local alignment 알고리즘)

  • Lee, Sun-Ho;Park, Kun-Soo
    • Journal of KIISE:Computer Systems and Theory / v.34 no.5_6 / pp.187-194 / 2007
  • A local alignment algorithm compares two strings and finds a substring pair with size l and similarity s. To find a pair with both sufficient size and high similarity, existing normalization approaches maximize the ratio of the similarity to the size. In this paper, we introduce normalization by functions, which maximizes f(s)/g(l), where f and g are non-decreasing functions. These functions, f and g, are determined by experiments comparing DNA sequences. In the experiments, our normalization by functions finds appropriate local alignments. For the previous algorithm, which evaluates the similarity by using the longest common subsequence, we show that the algorithm can also maximize the score normalized by functions, f(s)/g(l), with no increase in running time.
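The objective f(s)/g(l) can be illustrated with a brute-force sketch. It assumes that s is the longest-common-subsequence length of the substring pair and that l is the sum of the two substring lengths; both are assumptions about the abstract's definitions. The function name `best_normalized_alignment` and the example choices of f and g are hypothetical (the paper determines f and g experimentally), and this exhaustive search is far slower than the algorithm the abstract describes.

```python
def best_normalized_alignment(a, b, f, g):
    """Exhaustively maximize f(s)/g(l) over all non-empty substring pairs.

    s = LCS length of the pair, l = total length of the pair (assumed definitions).
    g must be positive on every l >= 2. Returns (score, (substring_a, substring_b)).
    """
    best_score, best_pair = float("-inf"), None
    for i in range(len(a)):
        for j in range(len(b)):
            rows, cols = len(a) - i, len(b) - j
            # lcs[p][q] = LCS length of a[i:i+p] and b[j:j+q]
            lcs = [[0] * (cols + 1) for _ in range(rows + 1)]
            for p in range(1, rows + 1):
                for q in range(1, cols + 1):
                    if a[i + p - 1] == b[j + q - 1]:
                        lcs[p][q] = lcs[p - 1][q - 1] + 1
                    else:
                        lcs[p][q] = max(lcs[p - 1][q], lcs[p][q - 1])
                    score = f(lcs[p][q]) / g(p + q)
                    if score > best_score:
                        best_score = score
                        best_pair = (a[i:i + p], b[j:j + q])
    return best_score, best_pair


# Plain ratio s/l: a single matching base already reaches the maximum of 1/2.
print(best_normalized_alignment("AC", "GA", lambda s: s, lambda l: l))
# Adding a constant to the size (a hypothetical g) rewards longer alignments.
print(best_normalized_alignment("AAAA", "AAAA", lambda s: s, lambda l: l + 4))
```

The two calls show why the choice of g matters: under the plain ratio a one-base match is already optimal, while the shifted g makes the full-length alignment score highest.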

Identifying Variable-Length Palindromic Pairs in DNA Sequences (DNA사슬 내에서 다양한 길이의 팰린드롬쌍 검색 연구)

  • Kim, Hyoung-Rae;Jeong, Kyoung-Hee;Jeon, Do-Hong
    • The KIPS Transactions:PartB / v.14B no.6 / pp.461-472 / 2007
  • The emphasis in genome projects has been moving towards sequence analysis in order to extract biological "meaning" (e.g., the evolutionary history of particular molecules or their functions) from the sequence. In particular, palindromic or direct repeats that appear in a sequence have a biophysical meaning, and the problem is to recognize interesting patterns and configurations of words (strings of characters) over complementary alphabets. In this paper, we propose an algorithm to identify variable-length palindromic pairs (longer than a threshold), where gaps (the distance between words) are allowed. The algorithm is called the palindrome algorithm (PA) and has O(N) time complexity. A palindromic pair forms a hairpin structure. By composing collected palindromic pairs, we build n-pair palindromic patterns. In addition, we plot some of the longest pairs in a circle to represent the structure of a DNA sequence. We ran the algorithm over several selected genomes, and the results for E. coli K12 are presented. The genomes contained very long palindromic pair patterns, which rarely occur in a random sequence.
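A palindromic pair with a gap can be illustrated with a naive search. The sketch assumes that a pair means a word followed, after at most `max_gap` bases, by its reverse complement, and that the sequence contains only A, C, G, and T. The function `palindromic_pairs` and its parameters are hypothetical, and this quadratic scan is not the O(N) palindrome algorithm (PA) from the paper.

```python
COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}


def palindromic_pairs(seq, min_len, max_gap):
    """Find maximal palindromic pairs (a word and its reverse complement).

    Returns tuples (left_start, right_start, arm_length, gap).
    Only pairs with arm_length >= min_len and gap <= max_gap are reported.
    """
    n = len(seq)
    pairs = []
    # left_end is the exclusive end of the left arm; right_start begins the right arm.
    for left_end in range(min_len, n - min_len + 1):
        last_start = min(left_end + max_gap, n - min_len)
        for right_start in range(left_end, last_start + 1):
            gap = right_start - left_end
            # Skip pairs that could grow inward; the tighter pair is reported instead.
            if gap >= 2 and COMPLEMENT[seq[left_end]] == seq[right_start - 1]:
                continue
            # Extend both arms outward while the bases stay complementary.
            arm = 0
            while (left_end - 1 - arm >= 0 and right_start + arm < n
                   and COMPLEMENT[seq[left_end - 1 - arm]] == seq[right_start + arm]):
                arm += 1
            if arm >= min_len:
                pairs.append((left_end - arm, right_start, arm, gap))
    return pairs


# "GATTC" and its reverse complement "GAATC" separated by a 3-base loop "TTT".
print(palindromic_pairs("GATTCTTTGAATC", min_len=5, max_gap=5))
```

In the example, the two arms would pair up to form the stem of a hairpin and the three bases between them would form the loop.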

Finding Approximate Covers of Strings (문자열의 근사커버 찾기)

  • Sim, Jeong-Seop;Park, Kun-Soo;Kim, Sung-Ryul;Lee, Jee-Soo
    • Journal of KIISE:Computer Systems and Theory / v.29 no.1 / pp.16-21 / 2002
  • Repetitive strings have been studied in such diverse fields as molecular biology, data compression, etc. Some important regularities that have been studied are periods, covers, seeds, and squares. A natural extension of the repetition problems is to allow errors. Among the four notions above, approximate squares and approximate periods have been studied. In this paper, we introduce the notion of approximate covers, which is an approximate version of covers. Given two strings P (|P|=m) and T (|T|=n), we propose an algorithm which finds the minimum distance t such that P is a t-approximate cover of T. The algorithm takes $O(mn)$ time for the edit distance and $O(mn^2)$ time [...]. Finding a string which is an approximate cover of T with minimum distance is NP-complete.
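The quantity being minimized can be made concrete with a small dynamic-programming sketch. It assumes that P is a t-approximate cover of T when T can be covered by possibly overlapping substrings, each within edit distance t of P; this reading of the definition is an assumption. The function `min_cover_distance` is hypothetical and runs in roughly O(mn^2 + n^3) time, so it does not reproduce the bounds stated in the abstract.

```python
def min_cover_distance(p, t):
    """Smallest d such that t is covered by substrings within edit distance d of p."""
    m, n = len(p), len(t)
    inf = float("inf")

    # dist[i][j] = edit distance between p and t[i:j], for 0 <= i < j <= n.
    dist = [[inf] * (n + 1) for _ in range(n + 1)]
    for i in range(n):
        prev = list(range(m + 1))          # distances from p[:k] to the empty string
        for j in range(i + 1, n + 1):
            cur = [j - i] + [0] * m        # distance from "" to t[i:j] is its length
            for k in range(1, m + 1):
                cost = 0 if p[k - 1] == t[j - 1] else 1
                cur[k] = min(prev[k] + 1, cur[k - 1] + 1, prev[k - 1] + cost)
            dist[i][j] = cur[m]
            prev = cur

    # best[j] = smallest d such that t[:j] is covered with the last piece ending at j.
    best = [inf] * (n + 1)
    best[0] = 0
    for j in range(1, n + 1):
        for i in range(j):
            # The previous pieces must reach at least position i to leave no hole.
            reach = min(best[i:j])
            best[j] = min(best[j], max(dist[i][j], reach))
    return best[n]


print(min_cover_distance("aba", "ababa"))  # exact cover with overlapping pieces
print(min_cover_distance("ab", "aba"))     # the trailing "a" needs one edit
```

The first call corresponds to an ordinary (exact) cover, where overlapping occurrences of P tile T; the second shows the approximate case, where the last piece must differ from P.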