• Title/Summary/Keyword: Suffix Tree


Time and Space Efficient Search with Suffix Arrays (접미사 배열을 이용한 시간과 공간 효율적인 검색)

  • Choi, Yong-Wook;Sim, Jeong-Seop;Park, Kun-Soo
    • Journal of KIISE:Computer Systems and Theory / v.32 no.5 / pp.260-267 / 2005
  • To search a text T of length n efficiently for a pattern P over an alphabet $\Sigma$, suffix trees and suffix arrays are widely used. For large texts, suffix arrays are preferred because they take less space than suffix trees. Recently, $O(|P| \cdot |\Sigma|)$-time and $O(|P| \cdot \log|\Sigma|)$-time search algorithms for suffix arrays were developed. In this paper we present time- and space-efficient search algorithms for suffix arrays. One algorithm runs in $O(|P|)$ time using $O(n \cdot |\Sigma|)$ bits of space, and the other runs in $O(n \cdot |\Sigma|)$ time using $O(n \log|\Sigma| + |\Sigma| \cdot n \log\log n / \log n)$ bits of space, which is more space efficient and still fast. Experiments show that our algorithms are efficient in both time and space compared with previous algorithms.
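
The abstract does not spell out the paper's algorithms, so the following is only a baseline sketch: the textbook binary search over a suffix array, which costs $O(|P| \log n)$ character comparisons rather than the bounds quoted above. The function names `build_suffix_array` and `find_occurrences` are illustrative assumptions, and the construction is deliberately naive.

```python
def build_suffix_array(text: str) -> list[int]:
    """Return the start positions of all suffixes in lexicographic order.

    Naive construction (sorting full suffixes); fine for a demo, far too
    slow and memory-hungry for large texts.
    """
    return sorted(range(len(text)), key=lambda i: text[i:])


def find_occurrences(text: str, sa: list[int], pattern: str) -> list[int]:
    """Return all start positions of `pattern` in `text`, in increasing order."""
    m = len(pattern)

    # Lower bound: first suffix whose length-m prefix is >= pattern.
    lo, hi = 0, len(sa)
    while lo < hi:
        mid = (lo + hi) // 2
        if text[sa[mid]:sa[mid] + m] < pattern:
            lo = mid + 1
        else:
            hi = mid
    start = lo

    # Upper bound: first suffix whose length-m prefix is > pattern.
    hi = len(sa)
    while lo < hi:
        mid = (lo + hi) // 2
        if text[sa[mid]:sa[mid] + m] <= pattern:
            lo = mid + 1
        else:
            hi = mid

    # All suffixes in sa[start:lo] begin with the pattern.
    return sorted(sa[start:lo])


if __name__ == "__main__":
    text = "banana"
    sa = build_suffix_array(text)
    print(sa)
    print(find_occurrences(text, sa, "ana"))
```

Because suffixes that share a prefix are contiguous in the sorted order, two binary searches are enough to delimit every occurrence of the pattern.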

Estimation of Substring Selectivity in Biological Sequence Database (생물학 서열 데이타베이스에서 부분 문자열의 선적도 추정)

  • 배진욱;이석호
    • Journal of KIISE:Databases / v.30 no.2 / pp.168-175 / 2003
  • Until now, substring selectivities have been estimated in two steps. The first step is to build a count-suffix tree, which holds statistical information about substrings, and the second step is to estimate substring selectivity using it. However, it is practically impossible to build a count-suffix tree from biological sequences because they are too long. This paper therefore proposes a novel data structure, the count q-gram tree, consisting of fixed-length substrings. The count q-gram tree retains the exact counts of all substrings whose lengths are at most q; it is generated in O(N) time, and its size does not depend on the total length N of all sequences. This paper also presents an estimation technique, k-MO. k-MO can choose the overlap length of the substrings split from a query string, and this choice affects both the accuracy of the selectivity estimate and the query processing time. Experiments show that k-MO estimates very accurately.
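
As a rough illustration of the idea, the sketch below keeps exact counts of every substring of length at most q in a flat `Counter` rather than a tree. The `estimate_count` helper is an assumption of this example: it is a generic Markov-style estimate with a fixed overlap of q-1 characters, not the paper's k-MO technique, and it assumes q >= 2.

```python
from collections import Counter


def count_qgrams(sequences: list[str], q: int) -> Counter:
    """Count every occurrence of every substring of length 1..q."""
    counts: Counter = Counter()
    for seq in sequences:
        for i in range(len(seq)):
            for k in range(1, q + 1):
                if i + k <= len(seq):
                    counts[seq[i:i + k]] += 1
    return counts


def estimate_count(counts: Counter, query: str, q: int) -> float:
    """Estimate the occurrence count of `query` (illustrative, assumes q >= 2).

    Queries of length <= q are answered exactly. Longer queries are split
    into q-grams that overlap by q-1 characters, and the counts are chained
    as conditional frequencies.
    """
    if len(query) <= q:
        return float(counts[query])
    estimate = float(counts[query[:q]])
    for i in range(1, len(query) - q + 1):
        window = query[i:i + q]
        overlap = window[:-1]
        if counts[overlap] == 0:
            return 0.0
        estimate *= counts[window] / counts[overlap]
    return estimate


if __name__ == "__main__":
    counts = count_qgrams(["ACGT", "ACGA"], q=2)
    print(counts["AC"], counts["A"])
    print(estimate_count(counts, "ACG", q=2))
```

The number of distinct keys is bounded by the number of possible strings of length at most q over the alphabet, which is why the structure's size does not grow with the total sequence length.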

An Effective Algorithm for Checking Subsumption Relation on String Data Containing Wildcard Characters (와일드카드 문자를 포함하는 스트링 데이터 사이의 포함관계 확인을 위한 효율적인 알고리즘)

  • Kim, Do-Han;Park, Hee-Jin;Paek, Eun-Ok
    • Journal of KIISE:Computer Systems and Theory / v.32 no.9 / pp.475-482 / 2005
  • String data containing wildcard characters may represent certain patterns in texts. A subsumption relation between two patterns can be defined as a subset relation between the sets of strings that match those patterns. Checking the subsumption relation is therefore important for determining whether each pattern represents a set of strings that does not overlap with another pattern. In this paper, we propose an effective algorithm that determines the subsumption relation between strings with wildcard characters. First, we consider a simple extension of the suffix tree algorithm so that it may include wildcard characters, and then we propose another method that checks the subsumption relation by dividing a suffix tree structure at each wildcard location in the string data.

Fast Construction of Suffix Arrays for DNA Strings (DNA 스트링에 대하여 써픽스 배열을 구축하는 빠른 알고리즘)

  • Jo, Jun-Ha;Kim, Nam-Hee;Kwon, Ki-Ryong;Kim, Dong-Kyue
    • Journal of KIISE:Computer Systems and Theory / v.34 no.8 / pp.319-326 / 2007
  • To search massive data such as DNA strings quickly, the most efficient method is to construct full-text index data structures for the given strings. The most widely used full-text index structures are suffix trees and suffix arrays. Since the suffix array uses less space than the suffix tree, the suffix array is better suited to DNA strings. Previously developed construction algorithms for suffix arrays are not well suited to DNA strings because they are designed for integer alphabets. We propose a fast algorithm for constructing suffix arrays on DNA strings, whose alphabet size is fixed at 4. We reduce the construction time by improving the encoding and merging steps of the algorithm of Kim et al. [1]. Experimental results show that our algorithm constructs suffix arrays on DNA strings 1.3-1.6 times faster than Kim et al.'s algorithm, and faster than other algorithms in most cases.
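
The abstract gives no details of the encoding or merging steps, so the sketch below only illustrates why a fixed alphabet of size 4 helps: each base fits in 2 bits, and the first k bases of a suffix pack into a small integer that can serve as a bucket key. The names `CODE`, `encode_kmer` and `dna_suffix_array`, and the bucket-then-sort scheme itself, are assumptions of this example and not Kim et al.'s algorithm or the paper's improvement.

```python
# 2-bit codes chosen so that integer order matches alphabetical order.
CODE = {"A": 0, "C": 1, "G": 2, "T": 3}


def encode_kmer(dna: str, start: int, k: int) -> int:
    """Pack the k bases starting at `start` into one integer, 2 bits each.

    Positions past the end of the string are padded with 0 (the smallest
    code), which keeps the integer order consistent with suffix order.
    """
    key = 0
    for j in range(k):
        pos = start + j
        code = CODE[dna[pos]] if pos < len(dna) else 0
        key = (key << 2) | code
    return key


def dna_suffix_array(dna: str, k: int = 4) -> list[int]:
    """Build a suffix array by bucketing on packed k-mers, then sorting buckets."""
    buckets: dict[int, list[int]] = {}
    for i in range(len(dna)):
        buckets.setdefault(encode_kmer(dna, i, k), []).append(i)

    sa: list[int] = []
    for key in sorted(buckets):
        # Ties inside a bucket are resolved by comparing the full suffixes.
        sa.extend(sorted(buckets[key], key=lambda i: dna[i:]))
    return sa


if __name__ == "__main__":
    print(dna_suffix_array("GATTACA", k=2))
```

Comparing packed integers replaces up to k character comparisons with one, which is the kind of saving a DNA-specific encoding can offer over algorithms written for general integer alphabets.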

Linear-Time Search in Suffix Arrays (접미사 배열을 이용한 선형시간 탐색)

  • Sin Jeong Seop;Kim Dong Kyue;Park Heejin;Park Kunsoo
    • Journal of KIISE:Computer Systems and Theory / v.32 no.5 / pp.255-259 / 2005
  • To search for a pattern P in a text, index data structures such as suffix trees and suffix arrays are widely used in diverse applications of string processing and computational biology. It is well known that searching in suffix trees is faster than in suffix arrays in terms of time complexity: on a constant-size alphabet it takes $O(|P|)$ time to search for P in a suffix tree, while it takes $O(|P| + \log n)$ time in a suffix array, where n is the length of the text. In this paper we present a linear-time search algorithm in suffix arrays for constant-size alphabets. For a general alphabet $\Sigma$, it takes $O(|P| \log|\Sigma|)$ time.

Elastic Rule Discovering in Sequence Databases (시퀀스 데이터베이스에서 유연 규칙의 탐사)

  • Park, Sang-Hyun;Kim, Sang-Wook;Kim, Man-Soon
    • Journal of Industrial Technology / v.21 no.A / pp.147-153 / 2001
  • This paper presents techniques for discovering rules with elastic patterns. Elastic patterns are useful for discovering rules from data sequences with different sampling rates. For fast discovery of rules whose heads and bodies are elastic patterns, we construct a suffix tree from succinct forms of data sequences. The suffix tree is a compact representation of rules, and is also used as an index structure for finding rules matched to a target head sequence. When matched rules cannot be found, the concept of rule relaxation is introduced. Using a cluster hierarchy and a relaxation error, we find the least relaxed rules that provide the most specific information on a target head sequence. Performance evaluation through extensive experiments reveals the effectiveness of the proposed approach.

Exact Matching Algorithm on Expanded Word Suffix Tree (확장된 단어 서픽스 트리에서의 완전매칭 알고리즘)

  • 박준영;정원형;김삼묘
    • Proceedings of the Korean Information Science Society Conference / 2000.10a / pp.575-577 / 2000
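  • The suffix tree has been proposed as a data structure that can be used efficiently for analyzing DNA base sequences. However, a suffix tree for a very large gene sequence requires a large amount of memory. To save memory, a method using the word suffix tree was therefore proposed. Despite this advantage, the word suffix tree is built around words as meaningful units, so it lacks the information needed to solve the exact matching problem, and only a restricted exact matching algorithm has been presented. The restricted exact matching algorithm fails to find a pattern that lies inside a substring of a word or that spans two or more words. In this paper, to solve the exact matching problem on word suffix trees, we present an expanded word suffix tree that uses a generalized suffix tree built from information about the suffixes of each word, and we propose an exact matching algorithm.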

Effective Biological Sequence Alignment Method using Divide Approach

  • Choi, Hae-Won;Kim, Sang-Jin;Pi, Su-Young
    • Journal of Korea Society of Industrial Information Systems / v.17 no.6 / pp.41-50 / 2012
  • This paper presents a new sequence alignment method using the divide approach, which solves the problem by decomposing sequence alignment into several sub-alignments with respect to exact matching subsequences. Exact matching subsequences in the proposed method are bounded on the generalized suffix tree of two sequences, such as protein domain length more than 7 and less than 7. Experimental results show that protein sequence pairs chosen from the PFAM database can be aligned using this method. In addition, this method reduces the time by about 15%, and also reduces the space, compared with the conventional dynamic programming approach. The sequences were also classified with 94% accuracy.

Implementation of Engine Generating Mutation Worm Signature Using LCSeq (LCSeq를 이용한 변형 웜 시그니쳐 생성 엔진 구현)

  • Ko, Joon-Sang;Lee, Jae-Kwang;Kim, Bong-Han
    • The Journal of the Korea Contents Association / v.7 no.11 / pp.94-101 / 2007
  • We introduce a way to detect mutated worms. We implemented a program that generates signatures using the LCSeq (Longest Common Subsequence) technique on a suffix tree, which has been studied as a pattern recognition algorithm. We also show the process of detecting mutations of the CodeRed and Nimda worms, and evaluate the signatures generated by Snort and LCSeq.
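
To make the core operation concrete, here is a minimal sketch of the longest common subsequence of two strings. It uses the textbook dynamic-programming table rather than the suffix-tree-based approach the abstract mentions, and the payload strings in the usage example are made up for illustration.

```python
def lcs(a: str, b: str) -> str:
    """Return one longest common subsequence of `a` and `b`."""
    n, m = len(a), len(b)
    # dp[i][j] = LCS length of a[:i] and b[:j].
    dp = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n):
        for j in range(m):
            if a[i] == b[j]:
                dp[i + 1][j + 1] = dp[i][j] + 1
            else:
                dp[i + 1][j + 1] = max(dp[i][j + 1], dp[i + 1][j])

    # Walk back through the table to recover the subsequence itself.
    out = []
    i, j = n, m
    while i > 0 and j > 0:
        if a[i - 1] == b[j - 1]:
            out.append(a[i - 1])
            i -= 1
            j -= 1
        elif dp[i - 1][j] >= dp[i][j - 1]:
            i -= 1
        else:
            j -= 1
    return "".join(reversed(out))


if __name__ == "__main__":
    # Hypothetical payloads of two worm variants (not real traffic).
    variant_1 = "GET /x.ida?AAAA%u9090"
    variant_2 = "GET /x.ida?BBBB%u9090"
    print(lcs(variant_1, variant_2))
```

The parts that survive across variants form the candidate signature, while the bytes that differ drop out of the subsequence.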

A Suffix Tree Approach for Efficient XML Path Indexing (접미어 트리 구조를 이용한 효율적인 XML 경로 인덱싱)

  • 이덕형;원정임;노관준;윤지희
    • Proceedings of the Korean Information Science Society Conference / 2002.10c / pp.88-90 / 2002
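  • As the use of XML documents on the Internet has rapidly become widespread, various XML query languages for information retrieval have been proposed. A common feature of XML queries is the convenient retrieval of structural information through regular path expressions that use characters such as '*'. In this paper, we propose a new path indexing technique based on a suffix tree. The proposed technique encodes each path in an XML document as a unique abbreviated string and stores all suffixes of each encoded string in the index. The technique processes structural queries containing general regular path expressions very efficiently, and it can also process queries effectively when the path information is specified inaccurately.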
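
A rough sketch of the indexing idea, under simplifying assumptions: each path is treated as a sequence of tag names, every suffix of that sequence is stored in a plain dictionary instead of a suffix tree, and the paper's abbreviated string encoding is skipped. The function names and sample paths are hypothetical.

```python
from collections import defaultdict


def build_suffix_index(paths: list[str]) -> dict[tuple[str, ...], set[str]]:
    """Map every suffix of every path (as a tuple of tags) to the full paths."""
    index: dict[tuple[str, ...], set[str]] = defaultdict(set)
    for path in paths:
        tags = tuple(path.strip("/").split("/"))
        for i in range(len(tags)):
            index[tags[i:]].add(path)
    return index


def query_suffix(index: dict[tuple[str, ...], set[str]], expr: str) -> list[str]:
    """Return all indexed paths that end with the tag sequence in `expr`.

    A leading '//' (descendant step) is accepted and simply stripped,
    since any suffix match already allows an arbitrary prefix.
    """
    key = tuple(expr.strip("/").split("/"))
    return sorted(index.get(key, set()))


if __name__ == "__main__":
    paths = [
        "/book/chapter/title",
        "/book/title",
        "/article/section/title",
    ]
    index = build_suffix_index(paths)
    print(query_suffix(index, "//chapter/title"))
    print(query_suffix(index, "//title"))
```

Storing every suffix is what lets a query that omits the beginning of a path be answered with a single lookup, at the cost of an index that grows with the total path length.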