• Title/Summary/Keyword: substring

Frequency Estimation of Substring for Scientific Database (과학 데이타베이스에서 부분 문자열의 발생 빈도 예측)

  • 배진욱;이석호
    • Proceedings of the Korean Information Science Society Conference / 2003.04a / pp.536-538 / 2003
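  • The problem of estimating how often a substring occurs in a large collection of short strings can be handled by building a count suffix tree in advance and querying it. A count suffix tree stores the occurrence frequency of every substring and is then pruned, which serves two goals: a bounded tree size and frequency estimation. However, when the stored strings are long, as with nucleotide sequences, building the count suffix tree becomes very difficult. In this paper, instead of the insert-first, prune-later count suffix tree, we propose a q-gram tree that inserts only strings of length at most q from the start. With a q-gram tree, the length of the substrings to store can be fixed in advance according to the tree size limit, and the tree can be built in O(N) time, where N is the total length of the strings stored in the database. Experiments showed that, even though the tree holds only a limited set of substrings, it can estimate the occurrence frequency of long substrings very accurately.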
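
The q-gram tree idea can be illustrated with a minimal Python sketch. The class name `QGramTree` and its methods are hypothetical, and the abstract does not say how frequencies of substrings longer than q are estimated; the `estimate` method below assumes a Markov-style product over overlapping q-grams, which is one common choice and not necessarily the authors' method.

```python
class QGramTree:
    """Trie holding occurrence counts of every substring of length <= q."""

    def __init__(self, q):
        self.q = q
        # The root count tracks the number of inserted start positions.
        self.root = {"count": 0, "children": {}}

    def add(self, text):
        # Insert the first q characters of every suffix: O(len(text) * q),
        # which is linear in the text length for a fixed q.
        for start in range(len(text)):
            node = self.root
            node["count"] += 1
            for ch in text[start:start + self.q]:
                node = node["children"].setdefault(
                    ch, {"count": 0, "children": {}})
                node["count"] += 1

    def count(self, s):
        # Exact count when len(s) <= q; 0 when the path is absent.
        node = self.root
        for ch in s:
            node = node["children"].get(ch)
            if node is None:
                return 0
        return node["count"]

    def estimate(self, s):
        # ASSUMPTION: Markov-style estimate for substrings longer than q,
        # chaining each q-gram with the (q-1)-gram it shares with its neighbor.
        if len(s) <= self.q:
            return float(self.count(s))
        est = float(self.count(s[:self.q]))
        for i in range(1, len(s) - self.q + 1):
            denom = self.count(s[i:i + self.q - 1])
            if denom == 0:
                return 0.0
            est *= self.count(s[i:i + self.q]) / denom
        return est


tree = QGramTree(3)
tree.add("ACGTACGT")
print(tree.count("ACG"), tree.estimate("ACGT"))
```

Because only paths of depth at most q are ever created, the tree size is bounded by the choice of q rather than by a later pruning pass.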

High Speed Substring Analysis Algorithm for Converting from the Korean Company Name to Roman Characters (한글 상호(商號)를 로마자로 변환하기 위한 고속 부분문자열 분석 알고리즘)

  • Myeong-jin Hwang;Sun-ho Jo;Hyuk-chul Kwon
    • Proceedings of the Korea Information Processing Society Conference / 2008.11a / pp.168-170 / 2008
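  • The Korean company-name Romanizer is a system that automatically converts company names written in Hangul into Roman characters. It generates a Romanized company name by combining previously used Romanized company names and business-type names with Roman strings produced by the standard Korean Romanization rules. This combination step requires an algorithm, and the stack algorithm previously used for similar purposes is inefficient here. This paper proposes a new algorithm to replace it. Compared with the existing stack algorithm, the new algorithm improves performance by reducing the complexity from O(b^d) to O(b*d).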

Predictive Morphological Analysis of Korean with Dynamic Programming (동적 프로그래밍기법에 근거한 예측중심의 한국어 형태소 분석)

  • 김덕봉;최기선
    • Korean Journal of Cognitive Science / v.4 no.2 / pp.145-180 / 1994
  • In this paper, we present an efficient morphological analysis model for Korean that produces, from an input word, all feasible sequences of morphemes in the word. The model is deterministic in applying spelling rules and performs few redundant computations when processing complex and ambiguous words. This results from three new techniques: first, a new method for interpreting spelling rules; second, predictive rule application, which restricts the candidates to the spelling rules suitable for the input word; third, dynamic programming, which lets the analyzer avoid recomputing already analyzed substrings when the input word is morphologically ambiguous. The model was tested on 413,975 words randomly selected from a corpus of Korean elementary school textbooks. Experimental results show that the model guarantees fast and reliable processing.
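
The dynamic programming point, reusing the analysis of a substring instead of recomputing it, can be sketched in Python as follows. This is only an illustration under simplifying assumptions: the function `all_segmentations` and the toy English lexicon are hypothetical, and the sketch ignores the spelling rules and predictive rule selection that the paper's model applies.

```python
from functools import lru_cache


def all_segmentations(word, lexicon):
    """Return every way to split `word` into pieces found in `lexicon`."""

    @lru_cache(maxsize=None)
    def analyze(start):
        # Memoized on `start`: each suffix of the word is analyzed once,
        # even when several ambiguous prefixes lead to the same position.
        if start == len(word):
            return ((),)
        results = []
        for end in range(start + 1, len(word) + 1):
            piece = word[start:end]
            if piece in lexicon:
                for rest in analyze(end):
                    results.append((piece,) + rest)
        return tuple(results)

    return [list(seg) for seg in analyze(0)]


# Hypothetical toy lexicon; a real analyzer would use a morpheme dictionary.
lexicon = frozenset({"un", "lock", "able", "unlock", "lockable"})
for seg in all_segmentations("unlockable", lexicon):
    print(seg)
```

In this example the suffix "able" is reached through both "un" + "lock" and "unlock", but its analysis is computed only once.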

A Local Alignment Algorithm using Normalization by Functions (함수에 의한 정규화를 이용한 local alignment 알고리즘)

  • Lee, Sun-Ho;Park, Kun-Soo
    • Journal of KIISE:Computer Systems and Theory / v.34 no.5_6 / pp.187-194 / 2007
  • A local alignment algorithm compares two strings and finds a substring pair with size l and similarity s. To find a pair with both sufficient size and high similarity, existing normalization approaches maximize the ratio of the similarity to the size. In this paper, we introduce normalization by functions, which maximizes f(s)/g(l), where f and g are non-decreasing functions. The functions f and g are determined by experiments comparing DNA sequences. In the experiments, our normalization by functions finds appropriate local alignments. For the previous algorithm, which evaluates similarity using the longest common subsequence, we show that it can also maximize the function-normalized score f(s)/g(l) without additional time cost.
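
To make the objective f(s)/g(l) concrete, the following Python sketch searches all substring pairs by brute force. It is not the paper's algorithm: the helper names are hypothetical, the size l is assumed to be the sum of the two substring lengths, and the sample functions f(s) = s and g(l) = l + 2 are chosen only for illustration.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of a and b (standard DP)."""
    prev = [0] * (len(b) + 1)
    for ch in a:
        cur = [0]
        for j, bj in enumerate(b, 1):
            if ch == bj:
                cur.append(prev[j - 1] + 1)
            else:
                cur.append(max(prev[j], cur[j - 1]))
        prev = cur
    return prev[-1]


def best_normalized_pair(x, y, f, g):
    """Brute-force search for the substring pair maximizing f(s) / g(l).

    ASSUMPTION: s is the LCS length of the pair and l is the sum of the
    two substring lengths. Only practical for very short inputs.
    """
    best = (float("-inf"), "", "")
    for i in range(len(x)):
        for j in range(i + 1, len(x) + 1):
            for k in range(len(y)):
                for m in range(k + 1, len(y) + 1):
                    a, b = x[i:j], y[k:m]
                    score = f(lcs_length(a, b)) / g(len(a) + len(b))
                    if score > best[0]:
                        best = (score, a, b)
    return best


# Illustrative choice of non-decreasing functions f and g.
print(best_normalized_pair("abcde", "xbcdy", lambda s: s, lambda l: l + 2))
```

The additive constant in g keeps very short pairs, such as a single matching character, from dominating the ratio.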

An Efficient Local Alignment Algorithm for DNA Sequences including N and X (N과 X를 포함하는 DNA 서열을 위한 효율적인 지역정렬 알고리즘)

  • Kim, Jin-Wook
    • Journal of KIISE:Computing Practices and Letters / v.16 no.3 / pp.275-280 / 2010
  • A local alignment algorithm finds a pair of substrings, one from each of two given strings, that are similar to each other. A DNA sequence can contain not only A, C, G, and T but also N and X, which are used when the original bases have lost their information for various reasons. In this paper, we present an efficient local alignment algorithm for two DNA sequences including N and X, using the affine gap penalty metric. Our algorithm is an extended version of the Kim-Park algorithm and can be further extended to handle other characters with properties similar to those of N and X.

A Geometric Proof on Shortest Paths of Bounded Curvature (제한된 곡률을 갖는 최단경로에 대한 기하학적 증명)

  • Ahn, Hee-Kap;Bae, Sang-Won;Cheong, Otfried
    • Journal of KIISE:Computer Systems and Theory / v.34 no.4 / pp.132-137 / 2007
  • A point-wise car-like robot moving in the plane changes its direction under a constraint on its turning curvature. In this paper, we consider the problem of computing a shortest path of bounded curvature between a prescribed initial configuration (position and orientation) and a polygonal goal, and propose a new geometric proof showing that the shortest path is of type CC or CS (or a substring of these), where C denotes a non-degenerate circular arc and S denotes a non-degenerate straight line segment. Based on this geometric property, the shortest path from a configuration to a polygonal goal can be computed in linear time.

Searching for Variants Using Trie-Index (트라이 인덱스를 이용한 이형태 검색)

  • Park, In-Cheol
    • Journal of the Korea Academia-Industrial cooperation Society / v.10 no.8 / pp.1986-1992 / 2009
  • Users often search for data by entering a variant, such as an abbreviation or substring of a word, or a misspelled word. The simplest approach to searching for variants is to build a variant dictionary. However, this entails enormous cost and time, and it cannot handle variants caused by misspelling. Approximate searching, that is, searching by approximate string matching, is a good alternative, but it cannot handle variants formed by abbreviation. This paper proposes a method for searching for various variants, including abbreviations and misspelled words, using trie indexing. First, the paper presents a variant matching method based on computing a path-weighted metric. In addition, it provides a variant searching algorithm that reduces search time.

Fast Search with Data-Oriented Multi-Index Hashing for Multimedia Data

  • Ma, Yanping;Zou, Hailin;Xie, Hongtao;Su, Qingtang
    • KSII Transactions on Internet and Information Systems (TIIS) / v.9 no.7 / pp.2599-2613 / 2015
  • Multi-index hashing (MIH) is the state-of-the-art method for indexing binary codes: it divides long codes into substrings and builds multiple hash tables. However, MIH assumes that the codes in the dataset are uniformly distributed, and it loses efficiency on non-uniformly distributed codes. In addition, many results share the same Hamming distance to a query, which makes the distance measure ambiguous. In this paper, we propose a data-oriented multi-index hashing method (DOMIH). We first compute the covariance matrix of the bits and learn an adaptive projection vector for each binary substring. Instead of using substrings as direct indices into hash tables, we project them with the corresponding projection vectors to generate new indices. With adaptive projection, the indices in each hash table are nearly uniformly distributed. Then, using the covariance matrix, we propose a ranking method for the binary codes. By assigning different bit-level weights to different bits, the returned binary codes are ranked at a finer-grained binary code level. Experiments conducted on large-scale reference datasets show that, compared to MIH, the time performance of DOMIH can be improved by 36.9%-87.4%, and the search accuracy can be improved by 22.2%. To demonstrate the potential of DOMIH, we further use near-duplicate image retrieval as an example application and show the good performance of our method.
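
The baseline idea of dividing codes into substrings and building one hash table per substring can be sketched in Python. This illustrates plain multi-index hashing only, not the DOMIH projection or ranking steps; the class and function names are hypothetical, codes are represented as bit strings, and the code length is assumed to be divisible by the number of substrings m.

```python
from collections import defaultdict
from itertools import combinations


def split_code(code, m):
    # ASSUMPTION: len(code) is divisible by m.
    step = len(code) // m
    return [code[i * step:(i + 1) * step] for i in range(m)]


def hamming(a, b):
    return sum(x != y for x, y in zip(a, b))


def neighbors(sub, radius):
    """Yield every bit string within Hamming distance `radius` of `sub`."""
    for r in range(radius + 1):
        for positions in combinations(range(len(sub)), r):
            bits = list(sub)
            for p in positions:
                bits[p] = "1" if bits[p] == "0" else "0"
            yield "".join(bits)


class MultiIndexHash:
    def __init__(self, codes, m):
        self.codes = list(codes)
        self.m = m
        self.tables = [defaultdict(list) for _ in range(m)]
        for idx, code in enumerate(self.codes):
            for table, sub in zip(self.tables, split_code(code, m)):
                table[sub].append(idx)

    def search(self, query, r):
        # Pigeonhole: a code within distance r of the query matches at least
        # one of the m substrings within distance r // m.
        sub_radius = r // self.m
        candidates = set()
        for table, sub in zip(self.tables, split_code(query, self.m)):
            for key in neighbors(sub, sub_radius):
                candidates.update(table.get(key, ()))
        # Verify candidates against the full code.
        return sorted(i for i in candidates
                      if hamming(self.codes[i], query) <= r)


index = MultiIndexHash(
    ["00000000", "00001111", "11110000", "11111111", "00000001"], m=2)
print(index.search("00000000", 1))
```

When codes are not uniformly distributed, some substring buckets grow large and the candidate sets swell, which is the inefficiency the abstract attributes to MIH.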

A New Algorithm for the Longest Common Non-superstring (최장공통비상위 문자열을 찾는 새로운 알고리즘)

  • Choi, Si-Won;Lee, Dok-Young;Kim, Dong-Kyue;Na, Joong-Chae;Sim, Jeong-Seop
    • Journal of KIISE:Computing Practices and Letters / v.15 no.1 / pp.67-71 / 2009
  • Recently, problems related to string non-inclusion have been studied actively. Given a set of strings F over a constant-size alphabet, consider a string x that does not include any string in F as a substring. We call x a Common Non-SuperString (CNSS for short) of F. Among the CNSSs of F, the longest one with finite length is called the Longest Common Non-SuperString (LCNSS for short) of F. In this paper, we first propose a new graph model using prefixes of F. Next, we suggest an O(N)-time algorithm for finding the LCNSS of F, where N is the sum of the lengths of all the strings in F.

Parallel Approximate String Matching with k-Mismatches for Multiple Fixed-Length Patterns in DNA Sequences on Graphics Processing Units (GPU을 이용한 다중 고정 길이 패턴을 갖는 DNA 시퀀스에 대한 k-Mismatches에 의한 근사적 병열 스트링 매칭)

  • Ho, ThienLuan;Kim, HyunJin;Oh, SeungRohk
    • The Transactions of The Korean Institute of Electrical Engineers / v.66 no.6 / pp.955-961 / 2017
  • In this paper, we propose a parallel approximate string matching algorithm with k-mismatches for multiple fixed-length patterns (PMASM) in DNA sequences. PMASM is developed from parallel single-pattern approximate string matching algorithms to efficiently calculate the Hamming distances for multiple patterns of a fixed length. In the preprocessing phase of PMASM, all target patterns are binary encoded and stored in a look-up memory. With each character read from the input string, the Hamming distances between a substring and all patterns are updated at the same time, based on the binary encoding information in the look-up memory. Moreover, PMASM uses graphics processing units (GPUs) to process the data computations in parallel. This paper presents three PMASM implementation methods on GPUs: thread PMASM, block-thread PMASM, and shared-mem PMASM. The shared-mem PMASM method shows how to make effective use of the GPU's parallel capacity, and it also exploits special features of the CUDA (Compute Unified Device Architecture) memory structure to optimize performance. In the experiments with DNA sequences, the proposed PMASM on GPU is 385, 77, and 64 times faster than the traditional naive algorithm, the shift-add algorithm, and the single-thread PMASM implementation on CPU, respectively. With the same NVIDIA GPU model, the performance of the proposed approach is enhanced by up to 44% and 21% compared with the naive and shift-add algorithms, respectively.
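
The per-character update of Hamming distances against several fixed-length patterns can be sketched sequentially in Python. This is a CPU illustration only, not the GPU implementation from the paper: the function name `k_mismatch_search` is hypothetical, and the look-up table of mismatch flags stands in for the binary-encoded look-up memory described in the abstract.

```python
def k_mismatch_search(text, patterns, k):
    """Report (start, pattern_index, distance) for windows within k mismatches.

    ASSUMPTION: all patterns have the same length m, as in the abstract.
    """
    m = len(patterns[0])
    if any(len(p) != m for p in patterns):
        raise ValueError("all patterns must have the same length")

    # Look-up table built once in preprocessing:
    # mismatch[c][p][j] is 1 when patterns[p][j] differs from character c.
    mismatch = {c: [[int(ch != c) for ch in p] for p in patterns]
                for c in set(text)}

    # dist[p][j] = mismatches between patterns[p][:j] and the last j
    # characters read so far.
    dist = [[0] * (m + 1) for _ in patterns]
    hits = []
    for i, c in enumerate(text):
        for p, d in enumerate(dist):
            row = mismatch[c][p]
            # Update from the longest prefix down so d[j - 1] is still
            # the value from the previous character.
            for j in range(m, 0, -1):
                d[j] = d[j - 1] + row[j - 1]
            if i >= m - 1 and d[m] <= k:
                hits.append((i - m + 1, p, d[m]))
    return hits


print(k_mismatch_search("ACGTACGA", ["ACGT", "ACGA"], k=1))
```

The inner loops over patterns and prefix lengths are independent of each other within one input character, which is the kind of work a GPU version could distribute across threads.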