• Title/Summary/Keyword: Substring Matching

Constant Time RMESH Algorithm for Computing Longest Common Substring and Maximal Repeat of String (문자열의 최장 공통 부분문자열과 최대 반복자를 구하기 위한 상수시간 RMESH 알고리즘)

  • Han, Seon-Mi;Woo, Jin-Woon
    • The KIPS Transactions:PartA / v.16A no.5 / pp.319-326 / 2009
  • Since string operations were first applied to computational biology, various data structures and algorithms for performing them efficiently have been studied. The longest common substring problem is to find the longest substring shared by two or more strings, and the maximal repeat problem is to find substrings that occur more than once in a given string. These operations are important in string processing tasks such as pattern matching and likelihood measurement. In this paper, we present algorithms that compute the longest common substring of two strings and find the maximal repeats of a string using a three-dimensional $n \times n \times n$ array of processors on an RMESH (Reconfigurable MESH). Our algorithms have O(1) time complexity.
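
For reference, the longest common substring problem described in this abstract can be sketched sequentially. The code below is a standard dynamic-programming formulation, not the paper's RMESH algorithm: it runs in O(|a|·|b|) time on one processor, and the function name `longest_common_substring` is a hypothetical choice for illustration.

```python
def longest_common_substring(a: str, b: str) -> str:
    """Return one longest common substring of a and b.

    Sequential dynamic-programming sketch (assumption: NOT the paper's
    constant-time RMESH algorithm). curr[j] holds the length of the
    longest common suffix of a[:i] and b[:j].
    """
    best_len = 0   # length of the best match found so far
    best_end = 0   # exclusive end index of that match in a
    prev = [0] * (len(b) + 1)
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            if a[i - 1] == b[j - 1]:
                # Extend the common suffix that ended at (i-1, j-1).
                curr[j] = prev[j - 1] + 1
                if curr[j] > best_len:
                    best_len = curr[j]
                    best_end = i
        prev = curr
    return a[best_end - best_len:best_end]


if __name__ == "__main__":
    print(longest_common_substring("GATTACA", "TTAC"))
```

When several substrings share the maximum length, this sketch returns the one that ends earliest in `a`.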

A SNOMED CT Browser System Supporting Structural Search of Clinical Terminology (의학용어의 구조 검색을 지원하는 SNOMED CT 브라우저 시스템)

  • Ryu, Wooseok
    • Proceedings of the Korean Institute of Information and Communication Sciences Conference / 2015.10a / pp.353-355 / 2015
  • A SNOMED CT browser is a search tool for finding and browsing the terminologies included in SNOMED CT. These terminologies form a structure through a variety of relationships. However, previous browsers merely list substring-matched search results rather than exploiting these structural characteristics. This paper proposes and implements a browser system that displays the search results as a sub-graph, enabling structural search of the results. The implementation includes substring-matching-based terminology search, tree-based graphical organization of the search results, and a history of concept views.

Automatic Generation of Training Character Samples for OCR Systems

  • Le, Ha;Kim, Soo-Hyung;Na, In-Seop;Do, Yen;Park, Sang-Cheol;Jeong, Sun-Hwa
    • International Journal of Contents / v.8 no.3 / pp.83-93 / 2012
  • In this paper, we propose a novel method that automatically generates real character images to familiarize existing OCR systems with new fonts. First, we generate synthetic character images using a simple degradation model. The synthetic data is used to train an OCR engine, and the trained OCR engine is used to recognize and label real character images segmented from ideal document images. Since the OCR engine cannot accurately recognize all real character images, a substring matching method is employed to fix wrongly labeled characters by comparing two strings: one is the string formed by the characters recognized in an ideal document image, and the other is the ordered string of characters that we intend to train and recognize. Based on our method, we build a system that automatically generates the 2350 most common Korean characters and 117 alphanumeric characters from new fonts. The ideal document images used in the system are postal envelope images with characters printed in ascending order of their codes. The proposed system achieved a labeling accuracy of 99%. Therefore, we believe that our system is effective in facilitating the generation of numerous character samples to enhance the recognition rate of existing OCR systems for fonts on which they have never been trained.

Searching for Variants Using Trie-Index (트라이 인덱스를 이용한 이형태 검색)

  • Park, In-Cheol
    • Journal of the Korea Academia-Industrial cooperation Society / v.10 no.8 / pp.1986-1992 / 2009
  • Users often search for data by entering a variant, such as an abbreviation or substring of a word, or a misspelled word. The simple approach to searching for variants is to build a variants dictionary. However, this entails enormous cost and time and cannot handle variants caused by misspelling. Approximate searching, that is, searching by approximate string matching, is a good alternative, but it cannot handle variants that are abbreviations. This paper proposes a method for searching for various variants, including abbreviations and misspelled words, by using trie indexing. First, the paper presents a variant matching method based on the calculation of a path-weighted metric. In addition, it provides a variant searching algorithm that reduces the search time.
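
To make the idea of approximate searching over a trie concrete, here is a minimal sketch that walks a trie while carrying one row of the Levenshtein edit-distance table per node. It is a well-known general technique, not the paper's path-weighted metric, and it only covers misspellings (not abbreviations). The names `TrieNode`, `build_trie`, and `fuzzy_search` are hypothetical.

```python
class TrieNode:
    def __init__(self):
        self.children = {}  # maps a character to the child TrieNode
        self.word = None    # set to the full word at terminal nodes


def build_trie(words):
    """Insert every (non-empty) word into a fresh trie and return its root."""
    root = TrieNode()
    for w in words:
        node = root
        for ch in w:
            node = node.children.setdefault(ch, TrieNode())
        node.word = w
    return root


def fuzzy_search(root, query, max_dist):
    """Return sorted (word, distance) pairs with Levenshtein distance <= max_dist."""
    results = []

    def walk(node, ch, prev_row):
        # Build the edit-distance row for the prefix ending at this node.
        row = [prev_row[0] + 1]
        for j in range(1, len(query) + 1):
            insert_cost = row[j - 1] + 1
            delete_cost = prev_row[j] + 1
            replace_cost = prev_row[j - 1] + (query[j - 1] != ch)
            row.append(min(insert_cost, delete_cost, replace_cost))
        if node.word is not None and row[-1] <= max_dist:
            results.append((node.word, row[-1]))
        # Prune: descend only if some entry can still stay within max_dist.
        if min(row) <= max_dist:
            for c, child in node.children.items():
                walk(child, c, row)

    first_row = list(range(len(query) + 1))
    for c, child in root.children.items():
        walk(child, c, first_row)
    return sorted(results)


if __name__ == "__main__":
    trie = build_trie(["search", "seraph", "speech", "trie"])
    print(fuzzy_search(trie, "serach", 2))
```

Sharing prefixes in the trie means the edit-distance rows for a common prefix are computed once for all words below it, which is where the search-time saving over comparing the query against each dictionary word comes from.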

Parallel Approximate String Matching with k-Mismatches for Multiple Fixed-Length Patterns in DNA Sequences on Graphics Processing Units (GPU을 이용한 다중 고정 길이 패턴을 갖는 DNA 시퀀스에 대한 k-Mismatches에 의한 근사적 병열 스트링 매칭)

  • Ho, ThienLuan;Kim, HyunJin;Oh, SeungRohk
    • The Transactions of The Korean Institute of Electrical Engineers / v.66 no.6 / pp.955-961 / 2017
  • In this paper, we propose a parallel approximate string matching algorithm with k-mismatches for multiple fixed-length patterns (PMASM) in DNA sequences. PMASM is developed from parallel single-pattern approximate string matching algorithms to efficiently calculate the Hamming distances for multiple patterns of a fixed length. In the preprocessing phase of PMASM, all target patterns are binary encoded and stored in a look-up memory. With each input character from the input string, the Hamming distances between a substring and all patterns can be updated at the same time based on the binary encoding information in the look-up memory. Moreover, PMASM adopts graphics processing units (GPUs) to process the data computations in parallel. This paper presents three PMASM implementation methods on GPUs: the thread PMASM, block-thread PMASM, and shared-mem PMASM methods. The shared-mem PMASM method is an example of making effective use of the GPU's parallel capacity. It also exploits special features of the CUDA (Compute Unified Device Architecture) memory structure to optimize performance. In the experiments with DNA sequences, the proposed PMASM on GPU is 385, 77, and 64 times faster than the traditional naive algorithm, the shift-add algorithm, and the single-thread PMASM implementation on CPU, respectively. With the same NVIDIA GPU model, the performance of the proposed approach is enhanced by up to 44% and 21% compared with the naive and the shift-add algorithms, respectively.
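
The k-mismatches problem for multiple fixed-length patterns can be illustrated with a plain sequential baseline. The sketch below corresponds to the kind of naive algorithm the abstract compares against, not to PMASM itself: it uses no binary encoding, look-up memory, or GPU parallelism, and the function name `k_mismatch_positions` is hypothetical.

```python
def k_mismatch_positions(text, patterns, k):
    """Map each pattern to the start positions where it matches text
    with Hamming distance <= k.

    Naive sequential sketch (assumption: a baseline for illustration,
    not the PMASM algorithm). All patterns must share one length.
    """
    if not patterns:
        return {}
    m = len(patterns[0])
    if any(len(p) != m for p in patterns):
        raise ValueError("all patterns must have the same length")

    hits = {p: [] for p in patterns}
    for start in range(len(text) - m + 1):
        window = text[start:start + m]
        for p in patterns:
            mismatches = 0
            for x, y in zip(window, p):
                if x != y:
                    mismatches += 1
                    if mismatches > k:
                        break  # early exit: this window already exceeds k
            if mismatches <= k:
                hits[p].append(start)
    return hits


if __name__ == "__main__":
    print(k_mismatch_positions("ACGTACGT", ["ACGT", "ACGA"], 1))
```

Each (window, pattern) comparison is independent of the others, which is why this workload lends itself to the kind of GPU parallelization the abstract describes.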

High-Speed Search for Pirated Content and Research on Heavy Uploader Profiling Analysis Technology (불법복제물 고속검색 및 Heavy Uploader 프로파일링 분석기술 연구)

  • Hwang, Chan-Woong;Kim, Jin-Gang;Lee, Yong-Soo;Kim, Hyeong-Rae;Lee, Tae-Jin
    • Journal of the Korea Institute of Information Security & Cryptology / v.30 no.6 / pp.1067-1078 / 2020
  • With the development of internet technology, a great deal of content is being produced, and the demand for it is increasing. Accordingly, the amount of content in circulation is growing, and so is the distribution of illegal copies that infringe on copyright. The Korea Copyright Protection Agency operates an illegal content blocking program based on substring matching, but accurate search is difficult because uploaders insert a large amount of noise to bypass it. Recently, natural language processing and deep learning technologies for removing noise, as well as various blockchain technologies for copyright protection, have been studied, but they have limitations. In this paper, noise is removed from data collected online, and illegal copies are searched for by keyword. In addition, accounts belonging to the same heavy uploader are identified through profiling analysis of heavy uploaders. In the future, copyright damage is expected to be minimized if the illegal copy search technology is combined with blocking and response technology based on the results of the profiling analysis of heavy uploaders.

An Index Structure for Substructure Searching In Chemical Databases (화학 데이타베이스에서 부분구조 검색을 위한 인덱스 구조)

  • Lee Hwangu;Cha Jaehyuk
    • Journal of KIISE:Databases / v.31 no.6 / pp.641-649 / 2004
  • The relationship between chemical structures and biological activities is actively researched in the field of medicinal chemistry. As the basis of structure-based drug design, medicinal chemists search for existing drugs whose chemical structure is similar to that of a target drug in order to develop a new drug. It is therefore necessary to have an automatic system that selects drug files containing a set of chemical moieties matching a user-defined query moiety. Substructure searching is the process of identifying a set of chemical moieties that match a specific query moiety. Testing for substructure searching was developed in the late 1950s. In graph-theoretical terms, this problem corresponds to determining which graphs in a set are subgraph isomorphic to a specified query moiety. Testing for subgraph isomorphism has been proved, in the general case, to be an NP-complete problem. Computational approaches have been proposed to overcome this difficulty. In the 1990s, a US patent was granted on an atom-centered indexing scheme used by the RS3 system; this scheme has the virtue that the generated indexes can be searched by direct text comparison. The system is in commercial use (http://www.acelrys.com/rs3). We identify the RS3 system's drawback and present a new indexing scheme. The RS3 system handles substructure searching as substring matching by expressing chemical structures as predefined strings. However, it has insufficient 'recall' and 'precision' because it cannot index structures uniquely when they contain the same atoms and the same bonds. To resolve this problem, we build a minimum-cost spanning tree for each center atom and describe a structure with paths per level. Expressing a 2D chemical structure as a single 1D string has limits. Therefore, we break the 2D chemical structure into 1D structure fragments. In this paper, we present a new index technique that substantially improves recall and precision.