Comparisons of Practical Performance for Constructing Compressed Suffix Arrays

A Practical Performance Comparison of Compressed Suffix Array Construction

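  • 박치성 (Department of Computer Engineering, Pusan National University) ;
  • 김민환 (Department of Computer Engineering, Pusan National University) ;
  • 이석환 (Department of Information Security, Tongmyong University) ;
  • 권기룡 (Department of Computer Engineering, Pukyong National University) ;
  • 김동규 (Division of Electronics, Communications and Computer Engineering, Hanyang University)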
  • Published : 2007.06.15

Abstract

Suffix arrays are fundamental full-text index data structures that are efficient when the same text is queried for patterns many times. Although many useful full-text index data structures have been proposed, their O(n log n)-bit space consumption has motivated researchers to develop more space-efficient ones. Space-efficient versions such as the compressed suffix array and the FM-index have been developed, but they cannot reduce the practical working space because their construction is based on an existing suffix array. Recently, two algorithms have been proposed that construct compressed suffix arrays directly from the text, without first constructing the suffix array. In this paper, we compare the practical performance of these compressed suffix array construction algorithms with that of various suffix array construction algorithms by measuring construction time, peak memory usage during construction, and the size of the final output.
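To illustrate why a suffix array suits repeated pattern queries, the following sketch builds one by plain sorting and answers each query with two binary searches. It is not one of the construction algorithms compared in the paper: the function names `build_suffix_array` and `find_occurrences` are hypothetical, and the construction is deliberately naive (it sorts suffixes by direct comparison, which is far slower than the linear-time and lightweight algorithms the paper evaluates).

```python
def build_suffix_array(text):
    """Return the start positions of all suffixes of `text` in lexicographic order.

    Naive construction for illustration only: sorting by direct suffix
    comparison costs O(n^2 log n) time in the worst case.
    """
    return sorted(range(len(text)), key=lambda i: text[i:])


def find_occurrences(text, sa, pattern):
    """Return every position where `pattern` occurs in `text`, in increasing order.

    All suffixes that start with `pattern` form one contiguous block of the
    suffix array, so two binary searches locate the block boundaries.
    """
    m = len(pattern)

    # Lower bound: first suffix whose length-m prefix is >= pattern.
    lo, hi = 0, len(sa)
    while lo < hi:
        mid = (lo + hi) // 2
        if text[sa[mid]:sa[mid] + m] < pattern:
            lo = mid + 1
        else:
            hi = mid
    start = lo

    # Upper bound: first suffix whose length-m prefix is > pattern.
    hi = len(sa)
    while lo < hi:
        mid = (lo + hi) // 2
        if text[sa[mid]:sa[mid] + m] <= pattern:
            lo = mid + 1
        else:
            hi = mid

    return sorted(sa[start:lo])


text = "banana"
sa = build_suffix_array(text)            # [5, 3, 1, 0, 4, 2]
print(find_occurrences(text, sa, "ana"))  # [1, 3]
print(find_occurrences(text, sa, "na"))   # [2, 4]
```

The index is built once and then reused for every query, which is the setting the abstract describes. Each entry of `sa` is a text position, so an array of n entries needs about log n bits per entry; this is the O(n log n)-bit cost that compressed suffix arrays aim to avoid.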

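The suffix array is a fundamental full-text index data structure that can be used efficiently when pattern queries are performed repeatedly. Although many useful full-text index data structures have been proposed, their common drawback of requiring O(n log n) bits of space has created a need for methods that use space more efficiently. However, previously developed structures such as the compressed suffix array and the FM-index could not reduce the practical working space either, because they must be built from an existing suffix array. Recently, two algorithms have been proposed that construct a compressed suffix array directly from the text, with no need to build the suffix array. In this paper, we experimentally compare the practical performance of these two compressed suffix array construction algorithms with that of existing suffix array construction algorithms by measuring the construction time, the peak space required during construction, and the size of the final data structure.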
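The three quantities measured in the comparison (construction time, peak memory during construction, and final output size) can be collected with a small harness along the following lines. This is a minimal sketch and not the authors' experimental setup: `measure_construction` and `naive_suffix_array` are hypothetical names, the builder is a naive stand-in for the algorithms under test, and `tracemalloc` only sees memory allocated through the Python allocator and slows down the code it traces, so the timings are indicative at best.

```python
import time
import tracemalloc
from array import array


def naive_suffix_array(text):
    """Stand-in index builder: a suffix array stored as unsigned integers."""
    return array("I", sorted(range(len(text)), key=lambda i: text[i:]))


def measure_construction(build, text):
    """Run `build(text)` once and report time, peak memory, and output size.

    `build` is assumed to return an `array.array`, so the final size is
    simply the number of entries times the size of one entry.
    """
    tracemalloc.start()
    t0 = time.perf_counter()
    index = build(text)
    elapsed = time.perf_counter() - t0
    _, peak = tracemalloc.get_traced_memory()  # (current, peak) in bytes
    tracemalloc.stop()
    return {
        "seconds": elapsed,
        "peak_bytes": peak,
        "output_bytes": len(index) * index.itemsize,
    }


report = measure_construction(naive_suffix_array, "banana" * 100)
for name, value in report.items():
    print(f"{name}: {value}")
```

Reporting peak memory separately from output size matters here: an index that is small once built can still need a large working space if it is derived from a full suffix array, which is exactly the gap that direct construction algorithms target.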
