• Title/Summary/Keyword: substring

Feature Selection and Classification of Protein CDS Using n-Block substring weighted Linear Model (N-Block substring 가중 선형모형을 이용한 단백질 CDS의 특징 추출 및 분류)

  • Choi, Seong-Yong;Kim, Jin-Su;Han, Seung-Jin;Choi, Jun-Hyeog;Rim, Kee-Wook;Lee, Jung-Hyun
    • Journal of the Korean Institute of Intelligent Systems / v.19 no.5 / pp.730-736 / 2009
  • Analyzing huge genomic data sets is increasingly important in bioinformatics. Here we present a novel data mining approach that predicts structure and function using only a protein's primary structure. We propose not only an n-Block substring search algorithm that effectively reduces the enormous search space in feature selection, but also a weighted linear algorithm that predicts the structure and function of a protein from its primary structure. We show that the approach is efficient for protein domain characterization and classification by calculating a weight value that determines the domain association of each selected substring. We also show that more efficient results are obtained by using the calculated model score to infer the degree of association of each CDS (coding sequence) with a domain.

Constant Time RMESH Algorithm for Computing Longest Common Substring and Maximal Repeat of String (문자열의 최장 공통 부분문자열과 최대 반복자를 구하기 위한 상수시간 RMESH 알고리즘)

  • Han, Seon-Mi;Woo, Jin-Woon
    • The KIPS Transactions:PartA / v.16A no.5 / pp.319-326 / 2009
  • Since string operations were first applied to computational biology, various data structures and algorithms for performing them efficiently have been studied. The longest common substring problem is to find the longest substring shared by two or more strings, and the maximal repeat problem is to find substrings that occur more than once in a given string. These operations are widely used in string processing, for example in pattern matching and likelihood measurement. In this paper, we present algorithms that compute the longest common substring of two strings and find the maximal repeats of a string using a three-dimensional $n \times n \times n$ array of processors on an RMESH (Reconfigurable MESH). Our algorithms have O(1) time complexity.
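
To make the longest common substring operation in the abstract above concrete, here is a minimal sequential sketch in Python. It uses the textbook dynamic-programming formulation, not the paper's constant-time RMESH algorithm; the function name `longest_common_substring` is an illustrative assumption.

```python
def longest_common_substring(a: str, b: str) -> str:
    """Return one longest common substring of a and b.

    Sequential dynamic programming, O(len(a) * len(b)) time.
    This is NOT the parallel RMESH algorithm from the paper.
    """
    best_len = 0
    best_end = 0  # exclusive end index in `a` of the best match so far
    prev = [0] * (len(b) + 1)  # match lengths for the previous row of `a`
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            if a[i - 1] == b[j - 1]:
                # Extend the common suffix ending at a[i-2], b[j-2].
                curr[j] = prev[j - 1] + 1
                if curr[j] > best_len:
                    best_len = curr[j]
                    best_end = i
        prev = curr
    return a[best_end - best_len:best_end]
```

For example, `longest_common_substring("GATTACA", "TTAC")` returns `"TTAC"`. When several substrings share the maximum length, this sketch returns the one that ends earliest in the first argument.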

Estimation of Substring Selectivity in Biological Sequence Database (생물학 서열 데이타베이스에서 부분 문자열의 선적도 추정)

  • 배진욱;이석호
    • Journal of KIISE:Databases / v.30 no.2 / pp.168-175 / 2003
  • Until now, substring selectivity has been estimated in two steps. The first step builds a count-suffix tree, which holds statistical information about substrings, and the second step uses it to estimate substring selectivity. However, it is practically impossible to build a count-suffix tree from biological sequences because they are too long. This paper therefore proposes a novel data structure, the count q-gram tree, which consists of fixed-length substrings. The count q-gram tree retains the exact counts of all substrings of length q or less, and it is generated in O(N) time with a size that does not depend on N, the total length of all sequences. This paper also presents an estimation technique, k-MO. k-MO can choose the overlap length of the substrings split from a query string, and this choice affects both the accuracy of the selectivity estimate and the query processing time. Experiments show that k-MO estimates selectivity very accurately.
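
The counting idea behind the count q-gram tree can be illustrated with a short Python sketch. It stores the counts in a flat `Counter` rather than a tree and counts every occurrence of each substring; both choices, and the name `count_qgrams`, are assumptions made for illustration. The k-MO estimation step is not shown.

```python
from collections import Counter
from typing import Iterable


def count_qgrams(sequences: Iterable[str], q: int) -> Counter:
    """Count every occurrence of every substring of length 1..q.

    A flat table standing in for the paper's count q-gram tree
    (assumption: counts are occurrence counts, not sequence counts).
    """
    counts: Counter = Counter()
    for seq in sequences:
        for start in range(len(seq)):
            # Longest substring starting here is capped by q and by the end of seq.
            max_len = min(q, len(seq) - start)
            for length in range(1, max_len + 1):
                counts[seq[start:start + length]] += 1
    return counts
```

With a fixed alphabet, the number of distinct keys is bounded by the number of possible strings of length at most q, which is why the table size does not grow with the total sequence length.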

A SNOMED CT Browser System Supporting Structural Search of Clinical Terminology (의학용어의 구조 검색을 지원하는 SNOMED CT 브라우저 시스템)

  • Ryu, Wooseok
    • Proceedings of the Korean Institute of Information and Communication Sciences Conference / 2015.10a / pp.353-355 / 2015
  • A SNOMED CT browser is a tool for searching and browsing the terminologies included in SNOMED CT. These terminologies have a structural form defined by a variety of relationships. However, previous browsers merely list substring-matched search results rather than exploiting these structural characteristics. This paper proposes and implements a browser system that shows a sub-graph of the search results, enabling structural search of the results. The implementation includes terminology search based on substring matching, tree-based graphical organization of the search results, and a history of concept views.

A Suffix Tree Transform Technique for Substring Selectivity Estimation (부분 문자열 선택도 추정을 위한 서픽스트리 변환 기법)

  • Lee, Hong-Rae;Shim, Kyu-Seok;Kim, Hyoung-Joo
    • Journal of KIISE:Databases / v.34 no.2 / pp.141-152 / 2007
  • Selectivity estimation has been a crucial component of query optimization in relational databases. While extensive research has been done on this topic for predicates on numerical data, little work has been done for substring predicates. We propose novel suffix tree transform algorithms for this problem. Unlike previous approaches, in which a full suffix tree is pruned and an estimation algorithm is then applied, we systematically transform a suffix tree into a suffix graph. In our approach, nodes with similar counts are merged while structural information in the original suffix tree is preserved in a controlled manner. We present both an error-bound algorithm and a space-bound algorithm. Experimental results with real-life data sets show that our algorithms have a lower average relative error than previous work, as well as good error distribution characteristics.

LR(k) Substring Recognition and Completion (LR(k) 서브 스트링 인식과 완성)

  • 김상헌;박용관;유재우
    • Proceedings of the Korean Information Science Society Conference / 2000.04a / pp.62-67 / 2000
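  • In an editing environment, syntax is rarely entered as complete sentences. Instead, parts of a sentence are entered piece by piece and the program is completed incrementally. This paper presents a method that analyzes partially entered sentences, predicts the missing parts, and completes the parse tree for the substring.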

Searching Algorithms for Protein Sequences and Weighted Strings (단백질 시퀀스와 가중치 스트링에 대한 탐색 알고리즘)

  • Kim, Sung-Kwon
    • Journal of KIISE:Computer Systems and Theory / v.29 no.8 / pp.456-462 / 2002
  • We are developing searching algorithms for weighted strings such as protein sequences. Let $\Sigma$ be an alphabet, and for each $a \in \Sigma$ let its weight $\mu(a)$ be given. Given a string $A = a_1 a_2 \ldots a_n$ with each $a_i \in \Sigma$, a substring $A(i,j) = a_i a_{i+1} \ldots a_j$ has weight $\mu(A(i,j)) = \mu(a_i) + \mu(a_{i+1}) + \ldots + \mu(a_j)$. The problem we are dealing with is to preprocess $A$ to build a searching structure and later, given a query weight $M$, to use the structure to answer the question of whether there is a substring $A(i,j)$ such that $M = \mu(A(i,j))$. In this paper an algorithm that improves over the previous result will be presented. The previously best known algorithm answers a query in $O(\frac{n \log\log n}{\log n})$ time using a searching structure that requires $O(n)$ memory. Our algorithm reduces the memory requirement to $O(\frac{n}{\log n})$ while achieving the same query time.
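
The query problem defined above can be made concrete with a naive Python baseline. It precomputes every substring weight from prefix sums, which takes O(n^2) time and memory, so it is far from the bounds quoted in the abstract and is not the paper's structure. The function names and the integer weights in the usage note are illustrative assumptions.

```python
from itertools import accumulate
from typing import Dict, Set


def build_weight_index(a: str, mu: Dict[str, int]) -> Set[int]:
    """Precompute the weights of all non-empty substrings of `a`.

    prefix[k] is the total weight of a[:k], so the weight of the
    substring a[i:j] is prefix[j] - prefix[i].
    Naive baseline: O(n^2) time and memory.
    """
    prefix = [0] + list(accumulate(mu[ch] for ch in a))
    return {
        prefix[j] - prefix[i]
        for i in range(len(prefix))
        for j in range(i + 1, len(prefix))
    }


def has_substring_with_weight(index: Set[int], m: int) -> bool:
    """Answer the query: does some substring have weight exactly m?"""
    return m in index
```

With `mu = {"a": 1, "b": 2, "c": 5}` and the string `"abc"`, the substring weights are 1, 2, 5, 3, 7, and 8, so a query for 7 succeeds and a query for 4 fails.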

An Efficient Suffix Tree Reconstructing Algorithm for Biological Sequence Analysis (DNA 분석에 효율적인 서픽스 트리 재구성 알고리즘)

  • Choi, Hae-Won;Jung, Young-Seok;Kim, Sang-Jin
    • Journal of Digital Convergence / v.12 no.12 / pp.265-275 / 2014
  • This paper introduces a new algorithm for reconstructing the suffix tree of a character string when a substring is deleted from the string or a string is inserted into it as a substring. The algorithm has two main functions, delete-structure and insert-structure. The main objective of this algorithm is to save time when constructing the suffix tree of an edited string, given that the suffix tree of the original string is available. We tested the performance of this algorithm on some DNA sequences. The test shows that delete-reconstructing saves time when the length of the deleted subsequence is less than 30% of the original sequence, and that insert-reconstructing takes less time with regard to the length of the inserted sequence.

Automatic Generation of Training Character Samples for OCR Systems

  • Le, Ha;Kim, Soo-Hyung;Na, In-Seop;Do, Yen;Park, Sang-Cheol;Jeong, Sun-Hwa
    • International Journal of Contents / v.8 no.3 / pp.83-93 / 2012
  • In this paper, we propose a novel method that automatically generates real character images to familiarize existing OCR systems with new fonts. First, we generate synthetic character images using a simple degradation model. The synthetic data is used to train an OCR engine, and the trained OCR engine is used to recognize and label real character images that are segmented from ideal document images. Since the OCR engine cannot recognize all real character images accurately, a substring matching method is employed to fix wrongly labeled characters by comparing two strings: one is the string formed from the characters recognized in an ideal document image, and the other is the ordered string of characters that we intend to train and recognize. Based on our method, we build a system that automatically generates the 2350 most common Korean characters and 117 alphanumeric characters from new fonts. The ideal document images used in the system are postal envelope images with characters printed in ascending order of their codes. The proposed system achieved a labeling accuracy of 99%. Therefore, we believe that our system is effective in facilitating the generation of numerous character samples to enhance the recognition rate of existing OCR systems for fonts on which they have never been trained.

Data Compression Capable of Error Control Using Block-sorting and VF Arithmetic Code (블럭정렬과 VF형 산술부호에 의한 오류제어 기능을 갖는 데이터 압축)

  • Lee, Jin-Ho;Cho, Suk-Hee;Park, Ji-Hwan;Kang, Byong-Uk
    • The Transactions of the Korea Information Processing Society / v.2 no.5 / pp.677-690 / 1995
  • In this paper, we propose a high-efficiency data compression method capable of error control, using block-sorting, move-to-front (MTF), and an arithmetic code with variable-length input and fixed-length output. First, a substring parsed to length N is cyclically shifted one symbol at a time, and the cyclically shifted rows are sorted in lexicographical order. Second, the MTF technique is applied to exploit locality of reference in the sorted substring. Then the preprocessed sequence is coded using a VF (variable-to-fixed) arithmetic code, which can limit error propagation to one codeword. The key point is how to split the fixed-length codeword set in proportion to the symbol probabilities in the VF arithmetic code. We develop a new VF arithmetic coding that completely splits the codeword set for an arbitrary source alphabet. In addition, an extended representation for symbol probability is designed using recursive Gray conversion. The performance of the proposed method is compared with other well-known source coding methods with respect to entropy, compression ratio, and coding time.
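
The first two stages described above, block-sorting followed by move-to-front, can be sketched in Python as follows. The VF arithmetic coding stage is not shown, and the function names and the explicit `alphabet` parameter are illustrative assumptions rather than the paper's interface.

```python
from typing import List, Sequence, Tuple


def block_sort(block: str) -> Tuple[str, int]:
    """Sort all cyclic shifts of `block` lexicographically.

    Returns the last column of the sorted rows and the row index of
    the original block, which a decoder would need to invert the sort.
    Naive O(n^2 log n) version, for illustration only.
    """
    rotations = sorted(block[i:] + block[:i] for i in range(len(block)))
    last_column = "".join(row[-1] for row in rotations)
    return last_column, rotations.index(block)


def move_to_front(text: str, alphabet: Sequence[str]) -> List[int]:
    """Encode each symbol as its current position in a recency list."""
    table = list(alphabet)
    output = []
    for ch in text:
        idx = table.index(ch)
        output.append(idx)
        # Move the symbol just seen to the front of the list.
        table.insert(0, table.pop(idx))
    return output
```

For the block `"banana"`, `block_sort` returns `("nnbaaa", 3)`, and `move_to_front("nnbaaa", "abn")` returns `[2, 0, 2, 2, 0, 0]`. The runs of equal symbols produced by the sort become zeros after MTF, which is what makes the sequence easier for the subsequent coder to compress.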
