• Title/Summary/Keyword: Suffix Tree


A Design for Efficient Similar Subsequence Search with a Priority Queue and Suffix Tree in Image Sequence Databases (이미지 시퀀스 데이터베이스에서 우선순위 큐와 접미어 트리를 이용한 효율적인 유사 서브시퀀스 검색의 설계)

  • 김인범
    • Journal of the Korea Computer Industry Society / v.4 no.4 / pp.613-624 / 2003
  • This paper proposes a design for efficient and accurate retrieval of similar image subsequences in an image sequence database, using the multi-dimensional time warping distance as the similarity measure. Two index structures are built in advance, one implemented with a priority queue and the other with a suffix tree. Given a query image sequence, the proposed method first searches the priority queue index for a candidate set of similar image subsequences. If the results are not satisfactory, it then retrieves another candidate set from the suffix tree index. When the query sequence is compared against the sequences stored in the two index structures, a lower-bound distance function filters out dissimilar subsequences without false dismissals.

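The filtering idea in the abstract above can be illustrated with a minimal sketch. It does not reproduce the paper's index structures or its particular lower-bound function, which the abstract does not specify. It assumes a sum-based time warping distance over feature vectors and a simple first/last-element bound; the names `dist`, `dtw`, `lower_bound`, and `range_search` are hypothetical.

```python
import math

def dist(p, q):
    """Euclidean distance between two equal-length feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(p, q)))

def dtw(a, b):
    """Time warping distance: total element cost along the cheapest warping path."""
    n, m = len(a), len(b)
    inf = float("inf")
    # cost[i][j] = cheapest cost of aligning a[:i] with b[:j]
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            step = dist(a[i - 1], b[j - 1])
            cost[i][j] = step + min(cost[i - 1][j], cost[i][j - 1], cost[i - 1][j - 1])
    return cost[n][m]

def lower_bound(a, b):
    """Assumed bound: every warping path pairs the first elements and the last elements."""
    first = dist(a[0], b[0])
    if len(a) == 1 and len(b) == 1:
        return first  # the first and last pair are the same cell
    return first + dist(a[-1], b[-1])

def range_search(query, candidates, eps):
    """Return the candidates whose time warping distance to query is at most eps."""
    hits = []
    for cand in candidates:
        if lower_bound(query, cand) > eps:
            continue  # safe to skip: the true distance is at least the bound
        if dtw(query, cand) <= eps:
            hits.append(cand)
    return hits
```

Because the bound never exceeds the true distance, skipping a candidate whose bound is above the threshold cannot discard a true match, which is what "without false dismissals" means here.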

Gene Sequences Clustering for the Prediction of Functional Domain (기능 도메인 예측을 위한 유전자 서열 클러스터링)

  • Han Sang-Il;Lee Sung-Gun;Hou Bo-Kyeng;Byun Yoon-Sup;Hwang Kyu-Suk
    • Journal of Institute of Control, Robotics and Systems / v.12 no.10 / pp.1044-1049 / 2006
  • Multiple sequence alignment is a method for comparing two or more DNA or protein sequences. Most multiple sequence alignment tools rely on pairwise alignment and the Smith-Waterman algorithm to generate an alignment hierarchy. As a result, the runtime of existing multiple alignment methods increases exponentially as the number of sequences grows. To remedy this problem, we adopted a parallel-processing suffix tree algorithm that can search for common subsequences at once, without pairwise alignment. Cross-matching subsequences, which trigger inexact matching, may appear among the common subsequences found, so this paper also proposes a cross-matching masking process. To identify the function of the clusters generated by suffix tree clustering, BLAST and CDD (Conserved Domain Database) searches were combined with the clustering tool. Our clustering and annotation tool consists of constructing the suffix tree, overlapping common subsequences, clustering gene sequences, and annotating gene clusters by BLAST and CDD search. The system was successfully evaluated on 36 gene sequences in the pentose phosphate pathway: it produced 10 clusters, found representative common subsequences, and identified functional domains by searching the CDD.

An Efficient Suffix Tree Reconstructing Algorithm for Biological Sequence Analysis (DNA 분석에 효율적인 서픽스 트리 재구성 알고리즘)

  • Choi, Hae-Won;Jung, Young-Seok;Kim, Sang-Jin
    • Journal of Digital Convergence / v.12 no.12 / pp.265-275 / 2014
  • This paper introduces a new algorithm for reconstructing the suffix tree of a character string when a substring is deleted from the string or a string is inserted into it as a substring. The algorithm has two main functions, delete-structure and insert-structure. Its main objective is to save time when constructing the suffix tree of an edited string, given that the suffix tree of the original string is available. We tested the performance of the algorithm on several DNA sequences. The tests show that delete-reconstructing saves time when the length of the deleted subsequence is less than 30% of the original sequence, and that insert-reconstructing takes less time with regard to the length of the inserted sequence.

A Study on Clustering and Identifying Gene Sequences using Suffix Tree Clustering Method and BLAST (서픽스트리 클러스터링 방법과 블라스트를 통합한 유전자 서열의 클러스터링과 기능검색에 관한 연구)

  • Han, Sang-Il;Lee, Sung-Gun;Kim, Kyung-Hoon;Lee, Ju-Yeong;Kim, Young-Han;Hwang, Kyu-Suk
    • Journal of Institute of Control, Robotics and Systems / v.11 no.10 / pp.851-856 / 2005
  • DNA and protein data from diverse species are discovered daily and deposited in public archives according to each archive's established format. The database systems of these archives provide not only an easy-to-use, flexible interface to the public but also in silico analysis tools for unidentified sequence data. Among such tools, multiple sequence alignment [1] methods that rely on pairwise alignment and the Smith-Waterman algorithm [2] enable us to identify unknown DNA or protein sequences, or the phylogenetic relations among several species. However, the runtime of existing multiple alignment methods increases exponentially as the number of sequences grows. To remedy this problem, we adopted a parallel-processing suffix tree algorithm that can search for common subsequences at once, without pairwise alignment. Cross-matching subsequences, which trigger inexact matching, may appear among the common subsequences found, so this paper also proposes a cross-matching masking process. To identify the function of the clusters generated by suffix tree clustering, BLAST was combined with the clustering tool. Our clustering and annotation tool works in four steps: (1) construction of the suffix tree; (2) masking of cross-matching pairs; (3) clustering of gene sequences; and (4) annotation of gene clusters by BLAST search. The system was successfully evaluated on 22 gene sequences in the pyruvate pathway of bacteria, producing 7 clusters and finding representative common subsequences for each cluster.

Document Clustering with Relational Graph Of Common Phrase and Suffix Tree Document Model (공통 Phrase의 관계 그래프와 Suffix Tree 문서 모델을 이용한 문서 군집화 기법)

  • Cho, Yoon-Ho;Lee, Sang-Keun
    • The Journal of the Korea Contents Association / v.9 no.2 / pp.142-151 / 2009
  • The previous document clustering method, NSTC, measures the similarity between pairs of documents using TF-IDF during web document clustering. In this paper, we propose a new similarity measure based on a relational graph of common phrases instead of TF-IDF. The method weights common phrases using a relational graph that represents the relationships among common phrases in the document collection. Experimental results indicate that the proposed method clusters document collections more effectively than NSTC.

A New merging Algorithm for Constructing suffix Trees for Integer Alphabets (정수 문자집합상의 접미사트리 구축을 위한 새로운 합병 알고리즘)

  • Kim, Dong-Kyu;Sim, Jeong-Seop;Park, Kun-Soo
    • Journal of KIISE:Computer Systems and Theory / v.29 no.2 / pp.87-93 / 2002
  • A new approach to constructing a suffix tree $T_s$ for a given string S is to recursively construct a suffix tree $T_o$ for the odd positions, construct a suffix tree $T_e$ for the even positions from $T_o$, and then merge $T_o$ and $T_e$ into $T_s$. Constructing suffix trees for integer alphabets in linear time had been a major open problem in index data structures. Farach used this approach and gave the first linear-time algorithm for integer alphabets. The hardest part of Farach's algorithm is the merging step. In this paper we present a new and simpler merging algorithm based on a coupled BFS (breadth-first search). Our merging algorithm is more intuitive than Farach's coupled DFS (depth-first search) merging, and thus it can be easily extended to other applications.

Scalable and Accurate Intrusion Detection using n-Gram Augmented Naive Bayes and Generalized k-Truncated Suffix Tree (N-그램 증강 나이브 베이스 알고리즘과 일반화된 k-절단 서픽스트리를 이용한 확장가능하고 정확한 침입 탐지 기법)

  • Kang, Dae-Ki;Hwang, Gi-Hyun
    • Journal of the Korea Institute of Information and Communication Engineering / v.13 no.4 / pp.805-812 / 2009
  • The n-gram approach has been widely applied in intrusion detection. However, it has shown several problems, including poor scalability and double counting of features. To address these problems, we applied n-gram augmented Naive Bayes with a k-truncated suffix tree (k-TST) storage mechanism directly to the classification of intrusive sequences. We compared its performance with that of Naive Bayes and Support Vector Machines (SVM) with n-gram features in experiments on host-based intrusion detection benchmark data sets. Applying n-gram features directly to Naive Bayes (i.e., Naive Bayes with n-gram features) violates the independence assumption, and the n-gram augmented method solves this problem. Experimental results on the University of New Mexico (UNM) benchmark data sets show that it yields intrusion detectors with higher accuracy than Naive Bayes with n-gram features and accuracy comparable to SVM with n-gram features. For scalable and efficient counting of n-gram features, we store them in a k-truncated suffix tree. With this storage mechanism, we tested the classifiers with up to 20-grams, which illustrates the scalability and accuracy of n-gram augmented Naive Bayes with k-truncated suffix tree storage.
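As a rough illustration of how a k-truncated structure can store n-gram counts, the sketch below builds an uncompressed, depth-limited suffix trie (not a compacted suffix tree, and not the authors' implementation) in which each node at depth d holds the number of occurrences of the corresponding d-gram. The names `Node`, `build_ktst`, and `ngram_count`, and the sample trace, are hypothetical.

```python
class Node:
    """Trie node: count of the n-gram spelled by the path from the root."""
    __slots__ = ("count", "children")

    def __init__(self):
        self.count = 0
        self.children = {}

def build_ktst(sequence, k):
    """Insert the first k symbols of every suffix, counting each visited node."""
    root = Node()
    for start in range(len(sequence)):
        node = root
        for symbol in sequence[start:start + k]:
            node = node.children.setdefault(symbol, Node())
            node.count += 1
    return root

def ngram_count(root, ngram):
    """Number of occurrences of ngram (length at most k); 0 if absent or longer than k."""
    node = root
    for symbol in ngram:
        node = node.children.get(symbol)
        if node is None:
            return 0
    return node.count

# Hypothetical system-call trace, used only for illustration.
trace = ["open", "read", "read", "close", "open", "read"]
tree = build_ktst(trace, k=3)
print(ngram_count(tree, ["open", "read"]))
```

A single structure of depth k answers count queries for every n up to k, so features of several lengths share storage instead of being tabulated separately.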

A New Algorithm for Constructing the Truncated Suffix Tree (절단 접미사 트리를 생성하는 새로운 알고리즘)

  • Na, Joong Chae
    • Proceedings of the Korea Information Processing Society Conference / 2009.04a / pp.999-1001 / 2009
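  • The truncated suffix tree is a truncated version of the suffix tree: a data structure that represents only those substrings of a given string whose length is at most a fixed bound. Truncated suffix trees are useful in applications that consider only strings of bounded length, and some of these applications, such as LZ77 compression, require an online construction algorithm. In this paper, we present a new algorithm that constructs the truncated suffix tree online.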

An Efficient Index Structure for Bottom-Up Query Processing of XML Documents (XML 문서의 상향식 질의처리를 지원하는 효율적인 색인구조)

  • Seo Dong-Min;Kim Eun-Jae;Seong Dong-Ook;Yoo Jae-Soo;Cho Ki-Hyung
    • Journal of Internet Computing and Services / v.7 no.4 / pp.101-113 / 2006
  • Path queries are used to query XML documents. Several index structures have been studied for processing path queries efficiently. Recently, index schemes that use a suffix tree with the structural join method have been proposed, of which ViST is the most representative. ViST processes queries using a suffix tree and uses a B+-tree to reduce document search time. However, its search performance degrades significantly when processing path queries, because it treats elements that are not in an ancestor-descendant relation in the document as descendants. In this paper, we propose an efficient index structure that solves this problem of ViST, together with a query processing method suited to the index structure. Various experiments show that the proposed index structure outperforms the existing index structure in query processing time.


Improvement of Practical Suffix Sorting Algorithm (실용적인 접미사 정렬 알고리즘의 개선)

  • Jeong, Tae-Young;Lee, Tae-Hyung;Park, Kun-Soo
    • Journal of KIISE:Computer Systems and Theory / v.36 no.2 / pp.68-72 / 2009
  • The suffix array is a data structure that stores all suffixes of a string in lexicographical order. It is widely used in string problems in place of the suffix tree, which requires a large amount of memory. Many studies have shown not only that the suffix array can be built in O(n) time, but also that it can be constructed with little time and space for real-world inputs. In this paper, we analyze a practical suffix sorting algorithm due to Maniscalco and Puglisi [1], and we propose an efficient algorithm that improves on its running time.
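To make the definition of the suffix array concrete, here is a minimal sketch that builds one by directly sorting suffixes and then locates a pattern by binary search. This is the naive construction, which is far slower than O(n); it is not the Maniscalco-Puglisi algorithm or the improvement proposed in the paper. The function names are hypothetical.

```python
def build_suffix_array(text):
    """Start positions of all suffixes of text, in lexicographical order (naive sort)."""
    return sorted(range(len(text)), key=lambda i: text[i:])

def find_occurrences(text, sa, pattern):
    """All start positions of pattern in text, found by binary search over sa."""
    m = len(pattern)
    # Leftmost suffix whose first m characters are >= pattern.
    lo, hi = 0, len(sa)
    while lo < hi:
        mid = (lo + hi) // 2
        if text[sa[mid]:sa[mid] + m] < pattern:
            lo = mid + 1
        else:
            hi = mid
    start = lo
    # Leftmost suffix whose first m characters are > pattern.
    hi = len(sa)
    while lo < hi:
        mid = (lo + hi) // 2
        if text[sa[mid]:sa[mid] + m] <= pattern:
            lo = mid + 1
        else:
            hi = mid
    return sorted(sa[start:lo])

text = "banana"
sa = build_suffix_array(text)
print(sa)                                 # [5, 3, 1, 0, 4, 2]
print(find_occurrences(text, sa, "ana"))  # [1, 3]
```

The array stores only one integer per suffix, which is the memory advantage over the suffix tree mentioned in the abstract.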