• Title/Summary/Keyword: Edit Distance

Approximate Periods of Strings based on Distance Sum for DNA Sequence Analysis (DNA 서열분석을 위한 거리합기반 문자열의 근사주기)

  • Jeong, Ju Hui;Kim, Young Ho;Na, Joong Chae;Sim, Jeong Seop
    • KIPS Transactions on Software and Data Engineering / v.2 no.2 / pp.119-122 / 2013
  • Repetitive strings such as periods have been studied extensively in fields as diverse as data compression, computer-assisted music analysis, and bioinformatics. In bioinformatics, periods are closely related to repetitive patterns in DNA sequences known as tandem repeats. In some cases, patterns that are similar but not identical are repeated, so approximate string matching algorithms are needed to study tandem repeats in DNA sequences. In this paper, we propose a new definition of approximate periods of strings based on distance sum. Given two strings $p$ ($|p| = m$) and $x$ ($|x| = n$), we propose an algorithm that computes the minimum approximate period distance based on distance sum. Our algorithm runs in $O(mn^2)$ time for the weighted edit distance, $O(mn)$ time for the edit distance, and $O(n)$ time for the Hamming distance.
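
For reference, the two unweighted distances named in this abstract can be sketched as below. This is only the standard textbook Hamming distance and unit-cost edit distance, not the authors' approximate-period algorithm; the function names are illustrative.

```python
def hamming_distance(a: str, b: str) -> int:
    """Number of positions at which two equal-length strings differ."""
    if len(a) != len(b):
        raise ValueError("Hamming distance requires equal-length strings")
    return sum(x != y for x, y in zip(a, b))


def edit_distance(a: str, b: str) -> int:
    """Unit-cost edit distance (insert, delete, substitute) by dynamic programming.

    Runs in O(len(a) * len(b)) time and keeps only two rows of the table.
    """
    prev = list(range(len(b) + 1))  # distance from the empty prefix of a
    for i, ca in enumerate(a, 1):
        curr = [i]  # distance from a[:i] to the empty prefix of b
        for j, cb in enumerate(b, 1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # match or substitute
        prev = curr
    return prev[-1]
```

A weighted edit distance would replace the constant costs of 1 with a per-symbol cost table.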

Context-Weighted Metrics for Example Matching (문맥가중치가 반영된 문장 유사 척도)

  • Kim, Dong-Joo;Kim, Han-Woo
    • Journal of the Institute of Electronics Engineers of Korea CI / v.43 no.6 s.312 / pp.43-51 / 2006
  • This paper proposes a metric for example matching in example-based English-Korean machine translation. The metric, which serves as a similarity measure, is based on the edit-distance algorithm and is used to retrieve the example sentences most similar to a given query. It makes use of simple information such as the lemma and part of speech of typographically mismatched words. The edit-distance algorithm, however, cannot fully reflect the context of matched word units: as long as the matched word units appear in order, a fully contiguous match contributes the same to similarity as a partial match in which mismatched word units intervene. To overcome this drawback, we propose a context-weighting scheme that uses the contiguity of matched word units to capture the full context. We normalize the metric in order to convert the edit-distance metric, which represents dissimilarity, into a similarity metric, to apply it to the example matching problem, and to rank examples by similarity. In addition, we generalize previous methods that use linguistic information into one representative system. To verify the proposed context-weighted metric, we carry out an experiment comparing it with the generalized previous methods.
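
The idea of normalizing a word-level edit distance into a similarity score can be sketched as follows. Dividing by the longer sentence length is an assumption made for illustration; the abstract does not state the paper's normalization, and the context-weighting scheme itself is not reproduced here. The function names are hypothetical.

```python
from typing import Sequence


def word_edit_distance(a: Sequence[str], b: Sequence[str]) -> int:
    """Unit-cost edit distance over word tokens instead of characters."""
    prev = list(range(len(b) + 1))
    for i, wa in enumerate(a, 1):
        curr = [i]
        for j, wb in enumerate(b, 1):
            cost = 0 if wa == wb else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def similarity(a: Sequence[str], b: Sequence[str]) -> float:
    """Map the distance to [0, 1], where 1.0 means identical token sequences.

    Assumption: normalize by the length of the longer sequence.
    """
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0  # two empty sentences are treated as identical
    return 1.0 - word_edit_distance(a, b) / longest
```

With this normalization, candidate example sentences can be ranked by descending `similarity` against the query.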

A Method to Measure the Self-Supplied News Volumes of Internet Newspaper Company

  • Kim, Dong-Joo;Lee, Won Joo
    • Journal of the Korea Society of Computer and Information / v.20 no.10 / pp.99-105 / 2015
  • The growth of Internet infrastructure and the sharp increase in Internet users have spurred the founding of Internet newspaper companies that can find and publish their own news articles. Despite this quantitative growth, the quality of these companies has not kept pace. To enforce social responsibility and build a healthy media environment, the Korean government has therefore introduced a registration system for Internet newspaper companies. Under this system, an Internet newspaper company must produce over 30 percent of its weekly publications in-house, which creates a need to verify compliance. This paper investigates technologies for measuring the self-supplied news volume of an Internet newspaper company, examines their validity, and presents an appropriate measurement method. To compare large numbers of news articles quickly, the presented method is based on a modified edit distance that reflects human cognition of words and related empirical information. To demonstrate the correctness of the method, we show experimental results on real Internet news articles.

Finding Approximate Covers of Strings (문자열의 근사커버 찾기)

  • Sim, Jeong-Seop;Park, Kun-Soo;Kim, Sung-Ryul;Lee, Jee-Soo
    • Journal of KIISE:Computer Systems and Theory / v.29 no.1 / pp.16-21 / 2002
  • Repetitive strings have been studied in such diverse fields as molecular biology and data compression. Some important regularities that have been studied are periods, covers, seeds, and squares. A natural extension of the repetition problems is to allow errors. Among the four notions above, approximate squares and approximate periods have been studied. In this paper, we introduce the notion of approximate covers, which is an approximate version of covers. Given two strings $P$ ($|P| = m$) and $T$ ($|T| = n$), we propose an algorithm which finds the minimum distance $t$ such that $P$ is a $t$-approximate cover of $T$. The algorithm takes $O(mn)$ time for the edit distance and $O(mn^2)$ time for the weighted edit distance. We also show that the problem of finding a string which is an approximate cover of $T$ with minimum distance is NP-complete.

Effective Image Clustering Using Shock Graphs (쇼크 그래프를 이용한 효과적인 영상 군집화)

  • Jang, Seok-Woo;Khanam, Solima;Paik, Woo-Jin
    • Proceedings of the Korean Society of Computer Information Conference / 2011.01a / pp.249-252 / 2011
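  • This paper proposes a method that applies a k-means clustering algorithm based on graph edit distance (edit cost) to classify shape information using skeleton features derived from shock graphs. The proposed method first extracts skeleton-based shock graphs from the query image and the target database images, then adaptively selects end points and branch points using weights. Next, it computes the edit distance between two images and uses it as the distance measure of the k-means clustering algorithm, thereby classifying large volumes of images more effectively. To evaluate performance, we applied the proposed algorithm to the MPEG-7 database, and the results confirmed that the proposed method classifies shape-based images more effectively than existing image classification methods.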

Detecting Anomalous Trajectories of Workers using Density Method

  • Lan, Doi Thi;Yoon, Seokhoon
    • International Journal of Internet, Broadcasting and Communication / v.14 no.2 / pp.109-118 / 2022
  • Workers' anomalous trajectories allow us to detect emergency situations in the workplace, such as worker accidents, security threats, and fire. In this work, we develop a scheme to detect abnormal trajectories of workers using the edit distance on real sequence (EDR) and a density method. Our anomaly detection scheme consists of two phases: an offline phase and an online phase. In the offline phase, we design a method to determine the algorithm parameters, namely the distance threshold and the density threshold, using accumulated trajectories. In the online phase, an input trajectory is classified as normal or abnormal. To this end, the neighbor density of the input trajectory is calculated using the distance threshold. The input trajectory is then marked as an anomaly if its density is less than the density threshold. We also evaluate the performance of the proposed scheme on the MIT Badge dataset. The experimental results show that over 80% of anomalous trajectories are detected with a precision of about 70%, and the F1-score reaches 74.68%.
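
The online phase described in this abstract can be sketched roughly as below. The sketch uses the usual EDR recurrence, in which two points match when both coordinates differ by at most a tolerance `eps`. Treating density as a raw count of stored trajectories within the distance threshold is an assumption, as are all function and parameter names; the paper's offline parameter selection is not shown.

```python
from typing import Sequence, Tuple

Point = Tuple[float, float]


def edr(r: Sequence[Point], s: Sequence[Point], eps: float) -> int:
    """Edit distance on real sequence between two 2-D trajectories."""
    prev = list(range(len(s) + 1))
    for i, (rx, ry) in enumerate(r, 1):
        curr = [i]
        for j, (sx, sy) in enumerate(s, 1):
            # Points match when both coordinates are within eps of each other.
            match = abs(rx - sx) <= eps and abs(ry - sy) <= eps
            curr.append(min(prev[j - 1] + (0 if match else 1),  # match/substitute
                            prev[j] + 1,                         # skip a point of r
                            curr[j - 1] + 1))                    # skip a point of s
        prev = curr
    return prev[-1]


def is_anomalous(traj: Sequence[Point],
                 history: Sequence[Sequence[Point]],
                 eps: float,
                 distance_threshold: float,
                 density_threshold: int) -> bool:
    """Flag traj when too few stored trajectories lie within distance_threshold.

    Assumption: density is the raw count of such neighbors.
    """
    density = sum(1 for h in history if edr(traj, h, eps) <= distance_threshold)
    return density < density_threshold
```

A trajectory that stays close to several stored ones gets a high neighbor count and passes, while an isolated one falls below `density_threshold` and is flagged.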

Semantic Correspondence of Database Schema from Heterogeneous Databases using Self-Organizing Map

  • Dumlao, Menchita F.;Oh, Byung-Joo
    • Journal of IKEEE / v.12 no.4 / pp.217-224 / 2008
  • This paper provides a framework for semantic correspondence of heterogeneous databases using a self-organizing map. It addresses the problem of overlap between databases that arises from their different schemas. A clustering technique using self-organizing maps (SOM) is tested and evaluated to assess its performance on different kinds of data. Before clustering, the databases are preprocessed using an edit distance algorithm, principal component analysis (PCA), and a normalization function to identify the features necessary for clustering.

Construction of Auto-replicators and Hyper-cycles Using Genetic Algorithms (유전자 알고리즘을 이용한 자가복제자와 하이퍼사이클의 구성)

  • Gwac Cho-Hwa;Wee Kyu-Bum
    • Proceedings of the Korean Information Science Society Conference / 2006.06a / pp.31-33 / 2006
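  • Typogenetics is a formal system used in artificial life research and is an effective model for studying the emergence of self-replicators and hypercycles. In this study, we use edit distance to measure the differences and similarities of replicators to be added to a hypercycle, and we generate a variety of hypercycles larger than those produced in previous work.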

A Study on XML Document Similarity based on Function Modeling (함수 변환 모델링에 의한 XML 문서의 유사성 비교에 대한 연구)

  • Lee Ho-Suk
    • Proceedings of the Korean Information Science Society Conference / 2006.06c / pp.58-60 / 2006
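  • As XML documents have recently become the standard means of exchanging information on the Internet, large amounts of data are being stored in XML document format. Approaches to XML document similarity fall broadly into methods that use edit distance, methods that use a graph model of the document, and methods that use a matrix model of the document. More recently, a method that encodes documents and applies the Fourier transform has been reported. This paper proposes a method that compares the structural similarity of documents by transforming XML documents into functions and modeling them. Using the proposed method, we modeled XML documents as functions and compared the similarity between XML documents.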

Recognition of Korean Menu for Online to Offline Stores : VGG-ResNet Fusion Model with Attention Mechanism (Online to Offline 상점을 위한 한글 메뉴판 인식 : 어텐션 메커니즘을 적용한 VGG-ResNet 융합 모델)

  • Jongwook Si;Sangjin Lee;Sungyoung Kim
    • The Journal of Korea Institute of Information, Electronics, and Communication Technology / v.17 no.4 / pp.190-197 / 2024
  • The O2O store model dissolves the boundaries between online and offline platforms, providing significant convenience to customers. To operate such platforms effectively, small business owners must provide the necessary information in digital format. Digitizing Korean menus manually can lead to multiple issues, and OCR technology often yields high error rates because of its low accuracy in recognizing Korean. In response, this paper proposes an enhanced OCR model based on the popular EasyOCR framework, aimed at improving the recognition accuracy of Korean. The proposed model integrates the structural advantages of VGG and ResNet and incorporates an attention mechanism to significantly improve the recognition performance of Korean. Experimental results indicate that the proposed model achieved an improvement of approximately 3.5% in accuracy and around 1% in both confidence score and normalized edit distance compared to EasyOCR. These results demonstrate that the proposed method effectively addresses the existing challenges.