Linear-Time Search in Suffix Arrays

접미사 배열을 이용한 선형시간 탐색

  • Published : 2005.06.01

Abstract

Index data structures such as suffix trees and suffix arrays are widely used to search for a pattern P in a text, in diverse applications of string processing and computational biology. It is well known that searching in suffix trees is faster than searching in suffix arrays in terms of time complexity: on a constant-size alphabet, searching for P takes $O(|P|)$ time in a suffix tree, whereas it takes $O(|P| + \log n)$ time in a suffix array, where n is the length of the text. In this paper we present a linear-time search algorithm in suffix arrays for constant-size alphabets. For a general alphabet $\Sigma$, it takes $O(|P| \log |\Sigma|)$ time.

계산 생물학이나 문자열 연구 분야에 다양하게 응용되는 패턴 탐색 문제에 접미사 트리와 접미사 배열과 같은 인덱스 자료구조가 널리 사용되어 왔다. 접미사 트리를 이용한 패턴 탐색이 접미사 배열을 이용한 탐색보다 시간 복잡도 관점에서 더 빠른 것으로 알려져 왔다. 즉, 상수 크기의 알파벳에 대해 패턴 P를 길이 n인 텍스트에서 탐색하기 위해 접미사 트리는 $O(|P|)$ 시간이 필요한 반면 접미사 배열은 $O(|P| + \log n)$ 시간이 필요하다. 본 논문에서는 상수 크기 알파벳에 대해 접미사 배열을 이용한 선형시간 탐색 알고리즘을 제시한다. 본 알고리즘은 일반적인 알파벳 $\Sigma$에 대해서는 $O(|P| \log |\Sigma|)$ 시간이 필요하다.
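
For context, the baseline that the abstract compares against can be sketched as a plain binary search over a suffix array. This is not the linear-time algorithm of the paper; it is the textbook $O(|P| \log n)$ search, which extra longest-common-prefix information is known to improve to the $O(|P| + \log n)$ bound quoted above. The function names `build_suffix_array` and `find_occurrences` are hypothetical, and the construction shown is a naive sort chosen only to keep the sketch short.

```python
def build_suffix_array(text: str) -> list[int]:
    """Return the start positions of all suffixes of text in lexicographic order.

    Naive construction by sorting suffixes directly; adequate for a small
    demonstration, but far from the linear-time constructions in the literature.
    """
    return sorted(range(len(text)), key=lambda i: text[i:])


def find_occurrences(text: str, sa: list[int], pattern: str) -> list[int]:
    """Return all start positions of pattern in text, in increasing order.

    Each binary-search step compares at most len(pattern) characters, so the
    search costs O(|P| log n) character comparisons (ignoring slicing overhead).
    """
    m = len(pattern)

    # Lower bound: first suffix whose length-m prefix is >= pattern.
    lo, hi = 0, len(sa)
    while lo < hi:
        mid = (lo + hi) // 2
        if text[sa[mid]:sa[mid] + m] < pattern:
            lo = mid + 1
        else:
            hi = mid
    start = lo

    # Upper bound: first suffix whose length-m prefix is > pattern.
    hi = len(sa)
    while lo < hi:
        mid = (lo + hi) // 2
        if text[sa[mid]:sa[mid] + m] <= pattern:
            lo = mid + 1
        else:
            hi = mid

    # Suffixes in sa[start:lo] are exactly those that begin with pattern.
    return sorted(sa[start:lo])


if __name__ == "__main__":
    text = "banana"
    sa = build_suffix_array(text)
    print(sa)
    print(find_occurrences(text, sa, "ana"))
```

The search works because the suffixes that begin with P occupy one contiguous range of the suffix array, so two boundary searches are enough to report every occurrence.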
