String Matching Algorithm on Multi-byte Character Set Texts  

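Kim, Eun-Sang (School of Computer Science and Engineering, Seoul National University)
Kim, Jin-Wook (Medical Information Center, Seoul National University Hospital)
Park, Kun-Soo (School of Computer Science and Engineering, Seoul National University)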
Abstract
Exact string matching has been studied extensively, but little work has addressed matching in multi-byte character set texts such as EUC-KR. This paper shows that false matches may occur in such texts when the KMP algorithm is used, and presents a refined KMP algorithm that avoids false matches by applying a character-based prefix function. Experimental results show that our algorithm is faster than the string matching algorithms of the widely used editors Vim and Emacs, and faster than the existing automata-based algorithm.
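The false-match problem can be illustrated with a minimal sketch. It is not the refined algorithm of the paper, whose details are not given in this abstract. It assumes well-formed EUC-KR input in which every byte from 0xA1 to 0xFE starts a two-byte character and every other byte is a single-byte character. The function names `split_chars`, `prefix_function`, and `kmp_char_search` are hypothetical, and the byte values are arbitrary choices from the two-byte range. A byte-level search finds the pattern straddling two characters, while a KMP search whose prefix function is computed over whole characters does not.

```python
def split_chars(data: bytes) -> list[bytes]:
    """Split EUC-KR bytes into characters (assumes well-formed input)."""
    chars, i = [], 0
    while i < len(data):
        # Assumption: a byte in 0xA1..0xFE is the lead byte of a 2-byte character.
        width = 2 if 0xA1 <= data[i] <= 0xFE else 1
        chars.append(data[i:i + width])
        i += width
    return chars


def prefix_function(pat: list[bytes]) -> list[int]:
    """Standard KMP prefix function, computed over characters instead of bytes."""
    pi = [0] * len(pat)
    k = 0
    for q in range(1, len(pat)):
        while k > 0 and pat[k] != pat[q]:
            k = pi[k - 1]
        if pat[k] == pat[q]:
            k += 1
        pi[q] = k
    return pi


def kmp_char_search(text: bytes, pattern: bytes) -> list[int]:
    """Return byte offsets of matches that start on a character boundary."""
    p = split_chars(pattern)
    if not p:
        return []
    pi = prefix_function(p)
    matches, k, offset = [], 0, 0
    for ch in split_chars(text):
        while k > 0 and p[k] != ch:
            k = pi[k - 1]
        if p[k] == ch:
            k += 1
        offset += len(ch)  # byte position just past the current character
        if k == len(p):
            matches.append(offset - len(pattern))
            k = pi[k - 1]
    return matches


# Two 2-byte characters: (B2 B0) and (B1 B3). The pattern is the single
# character (B0 B1), which does not occur in the text as a character.
text = bytes([0xB2, 0xB0, 0xB1, 0xB3])
pattern = bytes([0xB0, 0xB1])

byte_level_hit = text.find(pattern)            # 1: a false match inside two characters
char_level_hits = kmp_char_search(text, pattern)  # []: no character-aligned match
```

Splitting the text into a list first keeps the sketch short, but it costs extra memory. A practical implementation would track character boundaries while scanning the bytes directly.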
Keywords
Exact string matching; EUC-KR; Multi-byte character set; False match; KMP;
References
1 GNU Emacs Editor, http://www.gnu.org/software/emacs
2 R. Nigel Horspool. Practical Fast Searching in Strings. Software: Practice and Experience, vol.10, no.6, pp.501-506, 1980.
3 Daniel M. Sunday. A Very Fast Substring Search Algorithm. Communications of the ACM, vol.33, no.8, pp.132-142, 1990.
4 G. De V. Smit. A comparison of three string matching algorithms. Software: Practice and Experience, vol.12, no.1, pp.57-66, 1982.
5 D. E. Knuth, J. H. Morris Jr, and V. R. Pratt. Fast Pattern Matching in Strings. SIAM Journal on Computing, vol.6, pp.323-350, 1977.
6 Robert S. Boyer and J. Strother Moore. A Fast String Searching Algorithm. Communications of the ACM, vol.20, no.10, pp.762-772, 1977.
7 Cyril Allauzen, Maxime Crochemore, and Mathieu Raffinot. Efficient experimental string matching by weak factor recognition. 12th Annual Symposium on Combinatorial Pattern Matching, Lecture Notes in Computer Science, vol.2089, pp.51-72, 2001.
8 Masayuki Takeda, Satoru Miyamoto, Takuya Kida, Ayumi Shinohara, Shuichi Fukamachi, Takeshi Shinohara, and Setsuo Arikawa. Processing Text Files as Is: Pattern matching over Compressed Texts, Multibyte Character Texts, and Semi-structured Texts. String Processing and Information Retrieval (SPIRE) 2002, LNCS 2476, pp.170-186, 2002.
9 Vim Editor, http://www.vim.org
10 Korean Industrial Standards, http://www.standard.go.kr
11 Heikki Hyyro, Jun Takaba, Ayumi Shinohara, and Masayuki Takeda. On Bit-parallel Processing of Multi-byte Text. Proceedings of the 1st Asia Information Retrieval Symposium (AIRS) 2004, LNCS 3411, pp.289-300, 2005.