The Consensus String Problem based on Radius is NP-complete

Korean title: 거리반경기반 대표문자열 문제의 NP-완전 (NP-Completeness of the Radius-Based Consensus String Problem)

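  • Joong Chae Na (나중채), Department of Computer Engineering, Sejong University
  • Jeong Seop Sim (심정섭), School of Computer and Information Engineering, Inha University
  • Published: 2009.06.15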

Abstract

Problems of computing the distances or similarities among multiple strings have been studied extensively in fields as diverse as pattern matching, web searching, bioinformatics, and computer security. One well-known way to compare the strings in a given set is to find a consensus string, that is, a single string that represents the whole set. Two objective functions are commonly used to find a consensus string: the radius and the consensus error. The radius of a string x with respect to a set S of strings is the smallest number r such that the distance between x and each string in S is at most r; equivalently, it is the maximum distance between x and any string in S. A consensus string of S based on radius is a string that minimizes the radius with respect to S. The consensus error of a string x with respect to S is the sum of the distances between x and all the strings in S. A consensus string of S based on consensus error is a string that minimizes the consensus error with respect to S. In this paper, we show that the problem of finding a consensus string based on radius is NP-complete when the distance function is a metric.

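Computing the similarity or distance (error) among multiple strings has been actively studied because of its relevance to diverse application areas such as pattern matching, web searching, bioinformatics, and computer security. One way to compare the distances among the strings in a given set is to find a single string (a consensus string) that represents all the strings in the set. The consensus string approach finds one string that is most similar to the given set of strings, and the objective functions mainly used are the radius and the distance sum (consensus error). The radius is defined as the maximum of the distances between a particular string and the strings in the set, and a string that attains the minimum radius among all strings is called a radius-based consensus string of the given set. The distance sum is defined as the sum of the distances between a particular string and the strings in the set, and a string that attains the minimum distance sum among all strings is called a distance-sum-based consensus string of the given set. In this paper, we prove that the radius-based consensus string problem is NP-complete for a metric distance function.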
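
The two objective functions defined in the abstract can be illustrated with a small sketch. It is not the construction used in the paper: it assumes the Hamming distance as the metric (the abstract only requires some metric), it assumes that candidate strings have the same length as the input strings, and the function names `hamming`, `radius`, `consensus_error`, and `brute_force_consensus` are hypothetical. The exhaustive search enumerates every candidate, so its running time is exponential in the string length and it is usable only on toy inputs.

```python
from itertools import product


def hamming(a: str, b: str) -> int:
    """Hamming distance between two equal-length strings (assumed metric)."""
    if len(a) != len(b):
        raise ValueError("Hamming distance requires equal-length strings")
    return sum(ca != cb for ca, cb in zip(a, b))


def radius(x: str, strings: list[str], dist=hamming) -> int:
    """Largest distance from x to any string in the set."""
    return max(dist(x, s) for s in strings)


def consensus_error(x: str, strings: list[str], dist=hamming) -> int:
    """Sum of the distances from x to every string in the set."""
    return sum(dist(x, s) for s in strings)


def brute_force_consensus(strings: list[str], alphabet: str, objective) -> str:
    """Return the first candidate (in enumeration order) minimizing objective.

    Assumption: candidates are restricted to the common length of the inputs.
    """
    length = len(strings[0])
    candidates = ("".join(p) for p in product(alphabet, repeat=length))
    return min(candidates, key=lambda x: objective(x, strings))


S = ["aa", "aa", "bb"]
by_radius = brute_force_consensus(S, "ab", radius)          # "ab", radius 1
by_error = brute_force_consensus(S, "ab", consensus_error)  # "aa", error 2
```

On this input the two objectives select different strings: "ab" is within distance 1 of every string in S, while "aa" has the smallest total distance because it coincides with two of the three strings.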
