An Index Data Structure for String Search in External Memory

An Index Data Structure for Efficient String Search in External Memory

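  • 나중채 (School of Electrical Engineering and Computer Science, Seoul National University)
  • 박근수 (School of Electrical Engineering and Computer Science, Seoul National University)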
  • Published : 2005.12.01

Abstract

We propose the Suffix B-tree, a new external-memory index data structure. Like the String B-tree, the Suffix B-tree is a B-tree whose keys are strings. Whereas each node of the String B-tree is implemented with a Patricia trie, each node of the Suffix B-tree is implemented with an array, so the Suffix B-tree is simpler and easier to implement. Nevertheless, the branching algorithm of the Suffix B-tree is as efficient as that of the String B-tree. Consequently, the Suffix B-tree needs the same number of disk accesses in the worst case as the String B-tree to solve the string matching problem, which is fundamental and important in the area of string algorithms.

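This paper proposes the Suffix B-tree, a new external-memory index data structure. Like the String B-tree, the Suffix B-tree is a B-tree whose keys are strings. Whereas a node of the String B-tree is implemented with a complex Patricia trie, a node of the Suffix B-tree is implemented with an array, as in an ordinary B-tree, which makes it simpler and easier to implement. Nevertheless, the Suffix B-tree can find the branch using the array as efficiently as the String B-tree does. As a result, string matching, a fundamental and important problem in the field of string algorithms, can be performed with the same number of disk accesses as with the String B-tree.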
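
To make the idea of an array-based node concrete, here is a minimal Python sketch. It is not the branching algorithm of the paper, whose details the abstract does not give. It assumes a node is simply a sorted array of suffix start positions into an in-memory text and finds the branch by plain binary search. The names `build_node`, `branch`, and `occurs_at` are hypothetical, and the sketch ignores disk blocks and makes no claim to match the efficiency stated above.

```python
def build_node(text, positions):
    """Hypothetical node layout: suffix start positions, sorted by suffix."""
    return sorted(positions, key=lambda i: text[i:])


def branch(text, node, pattern):
    """Return the index of the first key whose suffix is >= pattern.

    Plain binary search over the array. Each step compares a whole suffix
    with the pattern, so this is a naive baseline, not the paper's method.
    """
    lo, hi = 0, len(node)
    while lo < hi:
        mid = (lo + hi) // 2
        if text[node[mid]:] < pattern:  # slicing copies; fine for a sketch
            lo = mid + 1
        else:
            hi = mid
    return lo


def occurs_at(text, node, pattern):
    """True if some suffix stored in the node starts with pattern."""
    i = branch(text, node, pattern)
    return i < len(node) and text.startswith(pattern, node[i])


if __name__ == "__main__":
    text = "banana"
    node = build_node(text, range(len(text)))
    print(node)
    print(branch(text, node, "nan"), occurs_at(text, node, "nan"))
```

In a full B-tree the index returned by `branch` would select the child to descend into. Here it only locates the pattern among the keys of a single node.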
