• Title/Summary/Keyword: 서열정렬법 (sequence alignment method)

Search Result 8, Processing Time 0.033 seconds

A method for comparing documents using fingerprinting and sequence alignment. (지문법과 서열정렬법을 결합한 다단계 정렬 방법의 문서 유사도 비교)

  • Seo, Jongkyu;Ock, Chang-Seok;Cho, Hwan-Gue
    • Proceedings of the Korea Information Processing Society Conference
    • /
    • 2012.11a
    • /
    • pp.576-579
    • /
    • 2012
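  • Fingerprinting and sequence alignment are the best-known methods for comparing document similarity. Fingerprinting is fast but less accurate, whereas sequence alignment is slower but more accurate. Multi-level alignment compares document similarity while adjusting the relative weight of the two methods, and was designed to gain the advantages of each while compensating for its drawbacks [1]. This paper describes the multi-level alignment method and introduces a way to improve accuracy by eliminating the fragmentation problem that can arise in it.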
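As a rough illustration of the fingerprinting side of this trade-off, the sketch below hashes every k-character window of a document and compares the resulting sets with Jaccard similarity. The window length `k`, the truncated MD5 hash, and the function names are assumptions made for this example; the abstract does not say which fingerprinting scheme the authors use.

```python
import hashlib

def fingerprints(text, k=5):
    """Return the set of hashed k-character windows (k-grams) of text."""
    # A text shorter than k yields a single window holding the whole text.
    grams = (text[i:i + k] for i in range(max(len(text) - k + 1, 1)))
    return {int(hashlib.md5(g.encode("utf-8")).hexdigest()[:8], 16) for g in grams}

def fingerprint_similarity(a, b, k=5):
    """Jaccard similarity of the two fingerprint sets, in [0, 1]."""
    fa, fb = fingerprints(a, k), fingerprints(b, k)
    return len(fa & fb) / len(fa | fb)
```

The score says how much two documents overlap, but not where: set comparison discards position, which is the accuracy limitation that alignment-based methods address.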

Multi-Level Sequence Alignment : An Adaptive Control Method Between Speed and Accuracy for Document Comparison (계산속도 및 정확도의 적응적 제어가 가능한 다단계 문서 비교 시스템)

  • Seo, Jong-Kyu;Tak, Haesung;Cho, Hwan-Gue
    • Journal of KIISE
    • /
    • v.41 no.9
    • /
    • pp.728-743
    • /
    • 2014
  • Fingerprinting and sequence alignment are well-known approaches to document similarity comparison. Fingerprinting is simple and fast, but it cannot locate the specific regions that are similar. String alignment identifies similar regions by arranging the sequences of a string; it can locate specific similar regions, but it takes more computing time. Multi-Level Alignment (MLA) is a new method designed to combine the advantages of both. MLA divides the input documents into uniform-length blocks, extracts fingerprints from each block, and calculates the similarity of block pairs by comparing the fingerprints, producing a similarity table. Sequence alignment is then used to identify the longest similar regions in the similarity table. MLA lets users change the block size to control the relative share of the fingerprint algorithm and the sequence alignment. Because a document is divided into several blocks, similar regions are also fragmented across two or more blocks. To solve this fragmentation problem, we propose a united block method. Experimentally, we show that computing document similarity with the united block is more accurate than the original MLA method, with a minor loss in time.
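The pipeline described in this abstract (uniform blocks, per-block fingerprints, a similarity table, then alignment over the table) can be sketched as follows. This is a minimal sketch under stated assumptions, not the authors' implementation: raw k-gram sets stand in for fingerprints, Jaccard similarity fills the table, and a Smith-Waterman-style local alignment with a hypothetical `threshold` and `gap` penalty finds the best run of similar blocks. The united block method is not modeled.

```python
def split_blocks(text, size):
    """Cut text into consecutive blocks of uniform length `size`."""
    return [text[i:i + size] for i in range(0, len(text), size)]

def kgram_set(block, k=2):
    """Stand-in fingerprint: the set of k-character windows of a block."""
    return {block[i:i + k] for i in range(max(len(block) - k + 1, 1))}

def similarity_table(doc_a, doc_b, size, k=2):
    """table[i][j] = Jaccard similarity of block i of doc_a and block j of doc_b."""
    fa = [kgram_set(b, k) for b in split_blocks(doc_a, size)]
    fb = [kgram_set(b, k) for b in split_blocks(doc_b, size)]
    return [[len(x & y) / len(x | y) for y in fb] for x in fa]

def best_block_run(table, threshold=0.5, gap=1.0):
    """Local alignment over the table; returns (best score, end block pair)."""
    rows = len(table)
    cols = len(table[0]) if table else 0
    score = [[0.0] * (cols + 1) for _ in range(rows + 1)]
    best, end = 0.0, None
    for i in range(1, rows + 1):
        for j in range(1, cols + 1):
            s = max(0.0,
                    score[i - 1][j - 1] + table[i - 1][j - 1] - threshold,
                    score[i - 1][j] - gap,
                    score[i][j - 1] - gap)
            score[i][j] = s
            if s > best:
                best, end = s, (i - 1, j - 1)
    return best, end
```

A larger `size` shifts work toward the cheap fingerprint comparison, while a smaller one shifts it toward alignment. A shared passage that straddles a block boundary lowers the scores of both blocks it touches, which is the fragmentation effect the abstract mentions.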

An effective method for comparing similarity of document with Multi-Level alignment (다단계정렬을 활용한 효율적인 문서 유사도 비교법)

  • Seo, Jong-Kyu;Hwang, Hae-Lyen;Cho, Hwan-Gue
    • Proceedings of the Korea Information Processing Society Conference
    • /
    • 2012.04a
    • /
    • pp.402-405
    • /
    • 2012
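  • Methods for measuring the similarity between documents fall broadly into those based on fingerprints and those based on sequence alignment algorithms. The two approaches offer speed and accuracy, respectively. Multi-Level Alignment (MLA) combines them so that the user can decide the balance between search speed and accuracy [1]. MLA divides the two documents into basis blocks and measures similarity by comparing vectors between blocks. This study describes a way to generate and compare basis-block vectors quickly through initial-consonant (choseong) extraction and stemming, and a multi-level search scheme that measures similarity quickly while maintaining accuracy. Experiments show that the proposed method makes large-document comparison with MLA more than twice as fast.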
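Initial-consonant extraction can be done with plain Unicode arithmetic, because precomposed Hangul syllables are laid out in 19 groups of 588 code points starting at U+AC00. The sketch below shows only that step. How the authors turn the result into block vectors is not stated in the abstract, and the function name is an assumption.

```python
# The 19 initial consonants (choseong) in Unicode syllable order.
CHOSEONG = "ㄱㄲㄴㄷㄸㄹㅁㅂㅃㅅㅆㅇㅈㅉㅊㅋㅌㅍㅎ"

def extract_choseong(text):
    """Replace each precomposed Hangul syllable with its initial consonant."""
    out = []
    for ch in text:
        offset = ord(ch) - 0xAC00
        if 0 <= offset < 11172:                   # Hangul syllables block
            out.append(CHOSEONG[offset // 588])   # 588 = 21 vowels * 28 finals
        else:
            out.append(ch)                        # leave other characters as-is
    return "".join(out)
```

Reducing each syllable to one of 19 symbols shrinks the alphabet sharply, which is one plausible reason the step speeds up vector construction.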

Applying Genomic Sequence Alignment Methodology for Source Codes Plagiarism Detection (유전체 서열의 정렬 기법을 이용한 소스 코드 표절 검사)

  • 강은미;황미녕;조환규
    • Journal of KIISE:Computing Practices and Letters
    • /
    • v.9 no.3
    • /
    • pp.352-367
    • /
    • 2003
  • The syntactic and semantic characteristics of a computer program can be represented by the keyword sequence extracted from its source code. The similarities and differences between two programs can therefore be identified by comparing the keyword sequences obtained from them. Methods for measuring the similarity of two sequences have already been studied intensively in bioinformatics for manipulating biological genetic sequences. In this paper, we propose a new method for measuring the similarity of two programs and detecting partial plagiarism by exploiting sequence alignment techniques. To evaluate the performance of the proposed method, we experimented with actual program codes submitted by 70 students attending a Data Structure course in 2001. The experimental results show that the proposed method is more effective and powerful than the fingerprint method, which is the most commonly used for plagiarism detection.
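The idea of reducing programs to keyword sequences and aligning them locally can be sketched as follows. The keyword subset, the scoring values, and the use of Smith-Waterman scoring are assumptions for illustration; the abstract does not give the authors' keyword list or alignment parameters.

```python
import re

# Hypothetical keyword subset; a real checker would use the full language grammar.
KEYWORDS = {"for", "while", "if", "else", "return", "int", "float", "break"}

def keyword_sequence(source):
    """Keep only the language keywords, in order of appearance."""
    return [t for t in re.findall(r"[A-Za-z_]\w*", source) if t in KEYWORDS]

def local_alignment_score(a, b, match=2, mismatch=-1, gap=-1):
    """Smith-Waterman score of the best locally aligned region of a and b."""
    prev = [0] * (len(b) + 1)
    best = 0
    for x in a:
        cur = [0]
        for j, y in enumerate(b, 1):
            diag = prev[j - 1] + (match if x == y else mismatch)
            cur.append(max(0, diag, prev[j] + gap, cur[j - 1] + gap))
            best = max(best, cur[j])
        prev = cur
    return best
```

Because identifiers are discarded, renaming variables leaves the keyword sequence unchanged, and local alignment still scores a copied fragment highly when the surrounding code differs.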

A Study on the Genomic Patterns of SARS coronavirus using Bioinformatics Techniques (바이오인포매틱스 기법을 활용한 SARS 코로나바이러스의 유전정보 연구)

  • Ahn, Insung;Jeong, Byeong-Jin;Son, Hyeon S.
    • Proceedings of the Korea Contents Association Conference
    • /
    • 2007.11a
    • /
    • pp.522-526
    • /
    • 2007
  • Since the newly emerged disease Severe Acute Respiratory Syndrome (SARS) spread rapidly from Asia to North America and Europe in 2003, many researchers have tried to determine where the virus came from. From a phylogenetic point of view, the SARS virus is known to belong to the genus Coronavirus, but its overall sequence conservation was not highly similar to that of known coronaviruses. The natural reservoirs of SARS-CoV have not yet been clearly determined. In the present study, the genomic sequences of SARS-CoV were analyzed with bioinformatics techniques such as multiple sequence alignment and phylogenetic analysis, as well as multivariate statistical analysis. All calculations, including the relative synonymous codon usage (RSCU) and other genomic parameters for 30,305 coding sequences from two genera, Coronavirus and Lentivirus, and one family, Orthomyxoviridae, were performed on the SMP cluster at the KISTI Supercomputing Center. As a result, SARS-CoV showed RSCU patterns very similar to those of feline coronavirus on both axes of the correspondence analysis, and this result agreed better with the serological results for SARS-CoV than the phylogenetic result itself did. In addition, SARS-CoV, human immunodeficiency virus, and influenza A virus all showed very low RSCU differences within each synonymous codon group, and this low RSCU bias might give them an advantage in being transmitted from other species to humans. Large-scale genomic analysis using bioinformatics techniques may be useful in the field of genetic epidemiology.
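For readers unfamiliar with RSCU, the sketch below computes it for one coding sequence: each codon's observed count is divided by the count expected if all synonymous codons for that amino acid were used equally. It assumes the standard genetic code and an in-frame sequence; the function name and the handling of incomplete or ambiguous codons are choices made for this example, not details from the study.

```python
from collections import Counter, defaultdict
from itertools import product

# Standard genetic code, codons enumerated in TCAG order ('*' marks stop codons).
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}

def rscu(cds):
    """Map each codon of every amino acid present in cds to its RSCU value."""
    cds = cds.upper().replace("U", "T")
    usable = len(cds) - len(cds) % 3              # ignore a trailing partial codon
    counts = Counter(cds[i:i + 3] for i in range(0, usable, 3))
    families = defaultdict(list)
    for codon, aa in CODON_TABLE.items():
        families[aa].append(codon)
    result = {}
    for codons in families.values():
        total = sum(counts[c] for c in codons)
        if total == 0:
            continue                              # amino acid absent from cds
        for c in codons:
            # observed count / (total / number of synonymous codons)
            result[c] = counts[c] * len(codons) / total
    return result
```

A value of 1.0 means a codon is used exactly as often as expected under uniform usage; values that all sit close to 1.0 within a synonymous group correspond to the low RSCU bias described in the abstract.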

A Fragmentation and Search Method of Query Document for Partially Plagiarized Section Detection (부분표절구간 검출을 위한 질의문서의 분할 및 탐색 기법)

  • Ock, Chang-Seok;Seo, Jong-Kyu;Cho, Hwan-Gue
    • Proceedings of the Korea Information Processing Society Conference
    • /
    • 2012.11a
    • /
    • pp.586-589
    • /
    • 2012
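  • As plagiarism-related issues draw attention, research on plagiarism detection methods is active. Plagiarized sections are generally detected with relatively simple lexical string-processing methods rather than semantic approaches such as complex natural language processing. Representative methods include fingerprinting and sequence alignment. However, applying these methods to plagiarism checks on large documents runs into time and space complexity problems. To overcome this drawback, this paper adapts a search method based on the BWT (Burrows-Wheeler Transform) [1], as used in NGS (Next Generation Sequencing). In addition, to detect partially plagiarized sections and improve accuracy, the query document is split into small fragments and a query search is run on each fragment. The paper introduces two ways of splitting the query document, one using k-mer analysis and one using random-split analysis; the strengths and weaknesses of each are analyzed experimentally, and the detection accuracy for actual partially plagiarized sections is measured.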
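The combination of query splitting and BWT-based search can be sketched as follows: build a BWT index over the reference text once, cut the query into overlapping k-mers, and count each k-mer with backward search. This is a toy sketch. The naive suffix sort, the dense occurrence table, and all function names are assumptions chosen for brevity, and the random-split variant is not shown.

```python
def bwt_index(text):
    """Build a small BWT index (suffix array, first-column starts, occurrence counts)."""
    text += "\0"                                   # sentinel; must not occur in text
    sa = sorted(range(len(text)), key=lambda i: text[i:])   # naive suffix sort
    bwt = "".join(text[i - 1] for i in sa)
    first, total = {}, 0                           # first[c]: # of chars smaller than c
    for c in sorted(set(text)):
        first[c] = total
        total += text.count(c)
    occ = {c: [0] for c in first}                  # occ[c][i]: count of c in bwt[:i]
    for ch in bwt:
        for c in occ:
            occ[c].append(occ[c][-1] + (ch == c))
    return sa, first, occ

def count_occurrences(index, pattern):
    """Backward search: number of exact occurrences of pattern in the text."""
    sa, first, occ = index
    lo, hi = 0, len(sa)
    for c in reversed(pattern):
        if c not in first:
            return 0
        lo = first[c] + occ[c][lo]
        hi = first[c] + occ[c][hi]
        if lo >= hi:
            return 0
    return hi - lo

def kmers(query, k):
    """All overlapping length-k fragments of the query document."""
    return [query[i:i + k] for i in range(len(query) - k + 1)]
```

Each lookup takes time proportional to the fragment length rather than the reference length, which is why this kind of index suits large reference collections. Regions of the query whose k-mers mostly hit the index are candidates for partially plagiarized sections.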

Taxonomic status of Goodyera rosulacea (Orchidaceae): molecular evidence based on ITS and trnL sequences (로젯사철란(Goodyera rosulacea: Orchidaceae)의 분류학적 위치: ITS와 trnL 염기서열에 의한 분자적 증거)

  • Lee, Chang Shook;Eom, Sang Mi;Lee, Nam Sook
    • Korean Journal of Plant Taxonomy
    • /
    • v.36 no.3
    • /
    • pp.189-207
    • /
    • 2006
  • Goodyera rosulacea, which is morphologically similar to G. repens, was recently described as a new species based on distinct morphological characters such as rosette-formed leaves, a short rhizome, and its habitat. To verify the taxonomic identity of G. rosulacea and its relationship within Korean Goodyera taxa, sequences of the internal transcribed spacer (ITS) region of nuclear ribosomal DNA and the trnL region of cpDNA from 24 accessions, including 1 outgroup accession, were analyzed. Aligned sequences were analyzed using maximum parsimony and distance methods, and the taxonomic identity and relationships among the related taxa were estimated from the existence of private marker genes and the phylogenetic trees of the aligned sequences. Molecular data indicate that G. rosulacea has several private marker genes and is monophyletic in the phylogenetic trees of both ITS and trnL sequences. The pairwise distance between G. rosulacea and the other taxa of Korean Goodyera was 3.49-6.68% for the ITS region and 5.05-9.53% for the trnL region, indicating that G. rosulacea could be treated as an independent species. Our molecular data therefore support the taxonomic status of G. rosulacea as a distinct species of Korea. In the phylogenetic trees, G. rosulacea formed a clade with G. repens, which has similar morphological characters, and showed its lowest pairwise distance with G. repens among Korean Goodyera taxa. These molecular data suggest that G. rosulacea and G. repens are closely related taxa.
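A pairwise distance of the kind reported here can be illustrated with the uncorrected p-distance, the percentage of aligned sites at which two sequences differ. The abstract does not say which distance model the authors used, so the choice of p-distance and the rule of skipping gapped columns are assumptions.

```python
def p_distance(seq_a, seq_b):
    """Percentage of differing sites between two aligned sequences.

    Columns where either sequence has a gap ('-') are skipped.
    """
    if len(seq_a) != len(seq_b):
        raise ValueError("aligned sequences must have equal length")
    compared = differing = 0
    for x, y in zip(seq_a.upper(), seq_b.upper()):
        if x == "-" or y == "-":
            continue
        compared += 1
        differing += x != y
    if compared == 0:
        raise ValueError("no comparable sites")
    return 100.0 * differing / compared
```

For example, one mismatch across ten aligned, ungapped sites gives a distance of 10%.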

Development of Cleaved Amplified Polymorphic Sequence (CAPS) Marker for Selecting Powdery Mildew-Resistance Line in Strawberry (Fragaria×ananassa Duchesne) (딸기 흰가루병 저항성 계통 선발을 위한 분자마커 개발)

  • Je, Hee-Jeong;Ahn, Jae-Wook;Yoon, Hae-Suk;Kim, Min-Keun;Ryu, Jae-San;Hong, Kwang-Pyo;Lee, Sang-Dae;Park, Young-Hoon
    • Horticultural Science & Technology
    • /
    • v.33 no.5
    • /
    • pp.722-729
    • /
    • 2015
  • Powdery mildew (PM), caused by Podosphaera aphanis, is a major disease that can result in significant yield losses in strawberry (Fragaria × ananassa Duchesne). Pesticides are usually applied to strawberry to prevent PM. In this study, molecular markers were developed to increase the breeding efficiency of PM-resistant cultivars through marker-assisted selection (MAS). An F2 population derived from a cross between PM-resistant 'Seolhyang' and PM-susceptible 'Akihime' was evaluated for disease resistance to PM and analyzed by RAPD (random amplification of polymorphic DNA)-BSA (bulked segregant analysis). Among 200 RAPD primers tested, OPE10 primer amplified a 311bp-band present in with 331bp. Sequence alignment was performed to search for polymorphisms, and six single nucleotide polymorphisms (SNPs) were found in the amplified regions. To develop a polymorphic marker that distinguishes resistant from susceptible lines, the RAPD marker was converted to a cleaved amplified polymorphic sequence (CAPS) marker. Among the restriction enzymes associated with the six SNPs, Eae I (Y/GGCCR) successfully digested the product to 231bp in susceptible lines. The results suggest that the selected CAPS marker could be used to increase the efficiency of selecting powdery mildew-resistant strawberry in breeding systems.
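The logic of a CAPS marker can be sketched in a few lines: a SNP creates or destroys a restriction site, so the enzyme cuts the amplicon from one genotype but not the other. The sketch below encodes the Eae I site given in the abstract (Y/GGCCR, where Y is C or T, R is A or G, and the slash marks the cut position) and returns fragment lengths. Any sequences passed to it in examples are invented, since the abstract does not give the actual amplicon.

```python
import re

# Eae I recognition site Y/GGCCR; lookahead so overlapping sites are all found.
EAE_I_SITE = re.compile(r"(?=[CT]GGCC[AG])")

def digest(amplicon):
    """Return fragment lengths after cutting at every Eae I site.

    The enzyme cuts after the first base of the site (Y/GGCCR).
    """
    seq = amplicon.upper()
    cuts = [m.start() + 1 for m in EAE_I_SITE.finditer(seq)]
    bounds = [0] + cuts + [len(seq)]
    return [end - start for start, end in zip(bounds, bounds[1:])]
```

An amplicon carrying the site yields two or more fragments, while an allele whose SNP disrupts the site stays as a single full-length fragment; that difference is what shows up as distinct bands on a gel.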