Efficient Construction of Generalized Suffix Arrays by Merging Suffix Arrays  

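Jeon, Jeong-Eun (Department of Computer Engineering, Pusan National University)
Park, Heejin (Division of Information and Communications, Hanyang University)
Kim, Dong-Kyue (Department of Computer Engineering, Pusan National University)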
Abstract
We consider constructing the generalized suffix array of strings A and B when the suffix arrays of A and B are given, i.e., merging the two suffix arrays of A and B. There are efficient algorithms to merge some special suffix arrays, such as the odd array and the even array. However, for the general case in which A and B are arbitrary strings, no efficient merging algorithms have been developed. Thus, one has had to construct the generalized suffix array of A and B by constructing the suffix array of $A\#B\$$ from scratch, even though the suffix arrays of A and B are given. In this paper, we present efficient merging algorithms for the suffix arrays of two arbitrary strings A and B drawn from constant and integer alphabets. The experimental results show that merging the two suffix arrays of A and B is about 5 times faster than constructing the suffix array of $A\#B\$$ from scratch for constant alphabets. Our algorithms include searching for all suffixes of string B in the suffix array of A. To do this, we use suffix links in suffix arrays, and we develop efficient algorithms for computing the suffix links. Efficient computation of suffix links is another contribution of this paper, because it can be used to solve other problems arising in bioinformatics that require searching for all suffixes of a given string in the suffix array of another string, such as computing matching statistics and finding longest common substrings. The experimental results show that our methods for computing suffix links are about 3-4 times faster than the previous fastest methods.
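To make the problem statement concrete, the following is a minimal Python sketch of what merging two suffix arrays means. It is not the suffix-link-based algorithm of the paper: it is a naive two-way merge that compares suffixes character by character, so it only illustrates the input and output. The function names, the (string_id, position) output format, and the assumption that both terminators sort below every alphabet character with # before $ are choices made for this illustration.

```python
def suffix_array(s):
    """Naive suffix array: start positions of the suffixes of s in sorted order."""
    return sorted(range(len(s)), key=lambda i: s[i:])


def merge_suffix_arrays(a, sa_a, b, sa_b):
    """Naively merge the suffix arrays of a and b.

    Returns (string_id, position) pairs in lexicographic order of the
    suffixes, with string_id 0 for a and 1 for b. Ties go to a, which
    corresponds to assuming '#' sorts before '$' in a#b$.
    """
    merged = []
    i = j = 0
    while i < len(sa_a) and j < len(sa_b):
        # Direct suffix comparison; this is the costly step that an
        # efficient merging algorithm has to avoid.
        if a[sa_a[i]:] <= b[sa_b[j]:]:
            merged.append((0, sa_a[i]))
            i += 1
        else:
            merged.append((1, sa_b[j]))
            j += 1
    merged.extend((0, p) for p in sa_a[i:])
    merged.extend((1, p) for p in sa_b[j:])
    return merged


def generalized_from_scratch(a, b):
    """Baseline: sort all suffixes of a#b$ and drop the two terminator suffixes.

    Assumes '#' and '$' do not occur in a or b and sort below every
    character that does (true for ASCII letters).
    """
    text = a + "#" + b + "$"
    result = []
    for p in suffix_array(text):
        if p < len(a):
            result.append((0, p))
        elif len(a) < p < len(text) - 1:
            result.append((1, p - len(a) - 1))
    return result


a, b = "banana", "bandana"
merged = merge_suffix_arrays(a, suffix_array(a), b, suffix_array(b))
print(merged == generalized_from_scratch(a, b))
```

Each comparison in this sketch can scan a whole suffix, so the merge may take quadratic time in the worst case. The abstract's approach of locating the suffixes of B in the suffix array of A with the help of suffix links is aimed at avoiding exactly that repeated scanning.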
Keywords
suffix arrays; merging suffix arrays; generalized suffix array; suffix link; computing matching statistics; finding longest common substrings