• Title/Summary/Keyword: suffix matching

Search Result 19, Processing Time 0.028 seconds

Probabilistic Model for Performance Analysis of a Heuristic with Multi-byte Suffix Matching

  • Choi, Yoon-Ho
    • KSII Transactions on Internet and Information Systems (TIIS)
    • /
    • v.7 no.4
    • /
    • pp.711-725
    • /
    • 2013
  • A heuristic with multi-byte suffix matching plays an important role in real pattern matching algorithms. By skipping many characters at a time in the process of comparing a given pattern with the text, the pattern matching algorithm based on a heuristic with multi-byte suffix matching shows a faster average search time than algorithms based on deterministic finite automata. Based on various experimental results and simulations, the previous works show that the pattern matching algorithms with multi-byte suffix matching performs well. However, there have been limited studies on the mathematical model for analyzing the performance in a standard manner. In this paper, we propose a new probabilistic model, which evaluates the performance of a heuristic with multi-byte suffix matching in an average-case search. When the theoretical analysis results and experimental results were compared, the proposed probabilistic model was found to be sufficient for evaluating the performance of a heuristic with suffix matching in the real pattern matching algorithms.

Finding All-Pairs Suffix-Prefix Matching Using Suffix Array (접미사 배열을 이용한 Suffix-Prefix가 일치하는 모든 쌍 찾기)

  • Han, Seon-Mi;Woo, Jin-Woon
    • The KIPS Transactions:PartA
    • /
    • v.17A no.5
    • /
    • pp.221-228
    • /
    • 2010
  • Since string operations were applied to computational biology, security and search for Internet, various data structures and algorithms for computing efficient string operations have been studied. The all-pairs suffix-prefix matching is to find the longest suffix and prefix among given strings. The matching algorithm is importantly used for fast approximation algorithm to find the shortest superstring, as well as for bio-informatics and data compressions. In this paper, we propose an algorithm to find all-pairs suffix-prefix matching using the suffix array, which takes O($k{\cdot}m$)�� time complexity. The suffix array algorithm is proven to be better than the suffix tree algorithm by showing it takes less time and memory through experiments.

High Performance Pattern Matching algorithm with Suffix Tree Structure for Network Security (네트워크 보안을 위한 서픽스 트리 기반 고속 패턴 매칭 알고리즘)

  • Oh, Doohwan;Ro, Won Woo
    • Journal of the Institute of Electronics and Information Engineers
    • /
    • v.51 no.6
    • /
    • pp.110-116
    • /
    • 2014
  • Pattern matching algorithms are widely used in computer security systems such as computer networks, ubiquitous networks, sensor networks, and so on. However, the advances in information technology causes grow on the amount of data and increase on the computation complexity of pattern matching processes. Therefore, there is a strong demand for a novel high performance pattern matching algorithms. In light of this fact, this paper newly proposes a suffix tree based pattern matching algorithm. The suffix tree is constructed based on the suffix values of all patterns. Then, the shift nodes which informs how many characters can be skipped without matching operations are added to the suffix tree in order to boost matching performance. The proposed algorithm reduces memory usage on the suffix tree and the amount of matching operations by the shift nodes. From the performance evaluation, our algorithm achieved 24% performance gain compared with the traditional algorithm named as Wu-Manber.

Efficient Construction of Generalized Suffix Arrays by Merging Suffix Arrays (써픽스 배열 합병을 이용한 일반화된 써픽스 배열의 효율적인 구축 알고리즘)

  • Jeon, Jeong-Eun;Park, Heejin;Kim, Dong-Kyue
    • Journal of KIISE:Computer Systems and Theory
    • /
    • v.32 no.6
    • /
    • pp.268-278
    • /
    • 2005
  • We consider constructing the generalized suffix way of strings A and B when the suffix arrays of A and B are given, j.e., merging two suffix arrays of A and B. There are efficient algorithms to merge some special suffix arrays such as the odd array and the even array. However, for the general case that A and B are arbitrary strings, no efficient merging algorithms have been developed. Thus, one had to construct the generalized suffix arrays of A and B by constructing the suffix array of A$\#$B$\$$ from scratch, even though the suffix ways of A and B are given. In this paper, we Present efficient merging algorithms for the suffix arrays of two arbitrary strings A and B drawn from constant and integer alphabets. The experimental results show that merging two suffix ways of A and B are about 5 times faster than constructing the suffix way of A$\#$B$\$$ from scratch for constant alphabets. Our algorithms include searching all suffixes of string B in the suffix array of A. To do this, we use suffix links in suffix ways and we developed efficient algorithms for computing the suffix links. Efficient computation of suffix links is another contribution of this paper because it can be used to solve other problems occurred in bioinformatics that should search all suffixes of a given string in the suffix array of another string such as computing matching statistics, finding longest common substrings, and so on. The experimental results show that our methods for computing suffix links is about 3-4 times faster than the previous fastest methods.

An Index Data Structure for String Search in External Memory (외부 메모리에서 문자열을 효율적으로 탐색하기 위한 인덱스 자료 구조)

  • Na, Joong-Chae;Park, Kun-Soo
    • Journal of KIISE:Computer Systems and Theory
    • /
    • v.32 no.11_12
    • /
    • pp.598-607
    • /
    • 2005
  • We propose a new external-memory index data structure, the Suffix B-tree. The Suffix B-tree is a B-tree in which the key is a string like the String B-tree. While the node in the String B-tree is implemented with a Patricia trio, the node in the Suffix B-tree is implemented with an array. So the Suffix B-tree is simpler and easier to be Implemented than the String B-tree. Nevertheless, the branching algorithm of the Suffix B-tree is as efficient as that of the String B-tree. Consequently, the Suffix B-tree takes the same worst-case disk accesses as the String B-tree to solve the string matching problem, which is fundamental and important in the area of string algorithms.

A Study on Clustering and Identifying Gene Sequences using Suffix Tree Clustering Method and BLAST (서픽스트리 클러스터링 방법과 블라스트를 통합한 유전자 서열의 클러스터링과 기능검색에 관한 연구)

  • Han, Sang-Il;Lee, Sung-Gun;Kim, Kyung-Hoon;Lee, Ju-Yeong;Kim, Young-Han;Hwang, Kyu-Suk
    • Journal of Institute of Control, Robotics and Systems
    • /
    • v.11 no.10
    • /
    • pp.851-856
    • /
    • 2005
  • The DNA and protein data of diverse species have been daily discovered and deposited in the public archives according to each established format. Database systems in the public archives provide not only an easy-to-use, flexible interface to the public, but also in silico analysis tools of unidentified sequence data. Of such in silico analysis tools, multiple sequence alignment [1] methods relying on pairwise alignment and Smith-Waterman algorithm [2] enable us to identify unknown DNA, protein sequences or phylogenetic relation among several species. However, in the existing multiple alignment method as the number of sequences increases, the runtime increases exponentially. In order to remedy this problem, we adopted a parallel processing suffix tree algorithm that is able to search for common subsequences at one time without pairwise alignment. Also, the cross-matching subsequences triggering inexact-matching among the searched common subsequences might be produced. So, the cross-matching masking process was suggested in this paper. To identify the function of the clusters generated by suffix tree clustering, BLAST was combined with a clustering tool. Our clustering and annotating tool is summarized as the following steps: (1) construction of suffix tree; (2) masking of cross-matching pairs; (3) clustering of gene sequences and (4) annotating gene clusters by BLAST search. The system was successfully evaluated with 22 gene sequences in the pyrubate pathway of bacteria, clustering 7 clusters and finding out representative common subsequences of each cluster

Gene Sequences Clustering for the Prediction of Functional Domain (기능 도메인 예측을 위한 유전자 서열 클러스터링)

  • Han Sang-Il;Lee Sung-Gun;Hou Bo-Kyeng;Byun Yoon-Sup;Hwang Kyu-Suk
    • Journal of Institute of Control, Robotics and Systems
    • /
    • v.12 no.10
    • /
    • pp.1044-1049
    • /
    • 2006
  • Multiple sequence alignment is a method to compare two or more DNA or protein sequences. Most of multiple sequence alignment tools rely on pairwise alignment and Smith-Waterman algorithm to generate an alignment hierarchy. Therefore, in the existing multiple alignment method as the number of sequences increases, the runtime increases exponentially. In order to remedy this problem, we adopted a parallel processing suffix tree algorithm that is able to search for common subsequences at one time without pairwise alignment. Also, the cross-matching subsequences triggering inexact-matching among the searched common subsequences might be produced. So, the cross-matching masking process was suggested in this paper. To identify the function of the clusters generated by suffix tree clustering, BLAST and CDD (Conserved Domain Database)search were combined with a clustering tool. Our clustering and annotating tool consists of constructing suffix tree, overlapping common subsequences, clustering gene sequences and annotating gene clusters by BLAST and CDD search. The system was successfully evaluated with 36 gene sequences in the pentose phosphate pathway, clustering 10 clusters, finding out representative common subsequences, and finally identifying functional domains by searching CDD database.

A Generalization of the Linearized Suffix Tree to Square Matrices

  • Na, Joong-Chae;Lee, Sun-Ho;Kim, Dong-Kyue
    • Journal of Korea Multimedia Society
    • /
    • v.13 no.12
    • /
    • pp.1760-1766
    • /
    • 2010
  • The linearized suffix tree (LST) is an array data structure supporting traversals on suffix trees. We apply this LST to two dimensional (2D) suffix trees and obtain a space-efficient substitution of 2D suffix trees. Given an $n{\times}n$ text matrix and an $m{\times}m$ pattern matrix over an alphabet ${\Sigma}$, our 2D-LST provides pattern matching in $O(m^2log{\mid}{\Sigma}{\mid})$ time and $O(n^2)$ space.

Suffix Tree Constructing Algorithm for Large DNA Sequences Analysis (대용량 DNA서열 처리를 위한 서픽스 트리 생성 알고리즘의 개발)

  • Choi, Hae-Won
    • Journal of Korea Society of Industrial Information Systems
    • /
    • v.15 no.1
    • /
    • pp.37-46
    • /
    • 2010
  • A Suffix Tree is an efficient data structure that exposes the internal structure of a string and allows efficient solutions to a wide range of complex string problems, in particular, in the area of computational biology. However, as the biological information explodes, it is impossible to construct the suffix trees in main memory. We should find an efficient technique to construct the trees in a secondary storage. In this paper, we present a method for constructing a suffix tree in a disk for large set of DNA strings using new index scheme. We also show a typical application example with a suffix tree in the disk.

Effective Biological Sequence Alignment Method using Divide Approach

  • Choi, Hae-Won;Kim, Sang-Jin;Pi, Su-Young
    • Journal of Korea Society of Industrial Information Systems
    • /
    • v.17 no.6
    • /
    • pp.41-50
    • /
    • 2012
  • This paper presents a new sequence alignment method using the divide approach, which solves the problem by decomposing sequence alignment into several sub-alignments with respect to exact matching subsequences. Exact matching subsequences in the proposed method are bounded on the generalized suffix tree of two sequences, such as protein domain length more than 7 and less than 7. Experiment results show that protein sequence pairs chosen in PFAM database can be aligned using this method. In addition, this method reduces the time about 15% and space of the conventional dynamic programming approach. And the sequences were classified with 94% of accuracy.