• Title/Summary/Keyword: 시퀀스 유사도 계산

Search Result 35, Processing Time 0.037 seconds

A Sequence Similarity Measure Considering the Product Taxonomy in Transaction Data (구매이력 데이터에서 상품 분류 체계를 고려한 시퀀스 유사도 측정 기법)

  • Yang, Yu-Jeong;Lee, Ki Yong
    • Proceedings of the Korea Information Processing Society Conference
    • /
    • 2019.05a
    • /
    • pp.367-370
    • /
    • 2019
  • 본 논문은 구매이력 데이터에서 상품간의 분류 체계를 고려하여 시퀀스 간의 유사도를 계산하는 새로운 방법을 제안한다. 시퀀스란 두 항목간의 순서가 존재하는 데이터를 의미한다. 항목 간의 선후관계가 중요한 시퀀스 데이터에서는 두 시퀀스 간의 유사도를 정확히 정의하는 것이 중요하다. 본 논문에서는 대표적인 시퀀스 유사도 측정 알고리즘인 편집 거리 알고리즘을 활용하여 구매이력 데이터에서 시퀀스 간의 유사도를 정의한다. 상품은 상품의 특성에 따라 항목 분류 체계에서 여러 범주로 분류된다. 이 경우 기존의 편집 거리 알고리즘에서 문자의 일치유무에 따라 단순히 0 또는 1을 부여하는 것은 부정확하다. 따라서 본 논문은 편집 거리 알고리즘의 수정 연산 중 대체 연산 비용 계산 시 항목 분류 트리를 사용하여 연산 비용이 0 에서 1 사이의 값을 가지도록 세분화하였다. 실험 결과 제안 방법은 대체 연산 비용 계산 시 두 문자가 다르면 단순히 1 을 부여하는 기존의 편집 거리 알고리즘에 비해 시퀀스 간의 유사도를 더 정확하게 계산함을 확인하였다.

Efficient Retrieval of Similar Shape-Based Subsequences for Sequence Database (시퀀스 데이터베이스를 위한 모양기반의 유사 부분시퀀스 검색)

  • 이정화;윤지희
    • Proceedings of the Korean Information Science Society Conference
    • /
    • 1999.10a
    • /
    • pp.340-342
    • /
    • 1999
  • 시퀀스 데이터(sequence data)에서는 각 데이터 값보다는 전후 그들 사이의 변화추세 등이 더 큰 정보로 작용하는 것이 일반적이다. 본문에서는 시퀀스 데이터베이스를 대상으로 하여 주어진 시퀀스 패턴과 모양이 유사한 모든 부분시퀀스를 검색해 내는 새로운 방식을 제안한다. 본 방식에서는 시퀀스 데이터의 모양 추출을 위한 데이터 변환, 유사 모양 패턴 클러스터링, 새로운 유사도 계산 방식 등을 도입함으로써, 기존의 방식이 매우 제한적인 패턴만을 유사패턴으로 간주하던 것에 비하여, 패턴이 데이터축 혹은 타임축으로 각각 확대, 축소, 이동된 경우에도 유사패턴으로 검색이 가능하다.

  • PDF

Purchase Transaction Similarity Measure Considering Product Taxonomy (상품 분류 체계를 고려한 구매이력 유사도 측정 기법)

  • Yang, Yu-Jeong;Lee, Ki Yong
    • KIPS Transactions on Software and Data Engineering
    • /
    • v.8 no.9
    • /
    • pp.363-372
    • /
    • 2019
  • A sequence refers to data in which the order exists on the two items, and purchase transaction data in which the products purchased by one customer are listed is one of the representative sequence data. In general, all goods have a product taxonomy, such as category/ sub-category/ sub-sub category, and if they are similar to each other, they are classified into the same category according to their characteristics. Therefore, in this paper, we not only consider the purchase order of products to compare two purchase transaction sequences, but also calculate their similarity by giving a higher score if they are in the same category in spite of their difference. Especially, in order to choose the best similarity measure that directly affects the calculation performance of the purchase transaction sequences, we have compared the performance of three representative similarity measures, the Levenshtein distance, dynamic time warping distance, and the Needleman-Wunsch similarity. We have extended the existing methods to take into account the product taxonomy. For conventional similarity measures, the comparison of goods in two sequences is calculated by simply assigning a value of 0 or 1 according to whether or not the product is matched. However, the proposed method is subdivided to have a value between 0 and 1 using the product taxonomy tree to give a different degree of relevance between the two products, even if they are different products. Through experiments, we have confirmed that the proposed method was measured the similarity more accurately than the previous method. Furthermore, we have confirmed that dynamic time warping distance was the most suitable measure because it considered the degree of association of the product in the sequence and showed good performance for two sequences with different lengths.

A Hierarchical Clustering Algorithm Using Extended Sequence Element-based Similarity Measure (확장된 시퀀스 요소 기반의 유사도를 이용한 계층적 클러스터링 알고리즘)

  • Oh, Seung-Joon
    • Journal of the Korea Society of Computer and Information
    • /
    • v.11 no.5 s.43
    • /
    • pp.321-327
    • /
    • 2006
  • Recently there has been enormous growth in the amount of commercial and scientific data. Such datasets consist of sequence data that have an inherent sequential nature. However, only a few of the existing clustering algorithms consider sequentiality. This study presents a similarity measure and a method for clustering such sequence datasets. Especially, we present an extended concept of the measure of similarity, which considers various conditions. Using a splice dataset, we show that the quality of clusters generated by our proposed clustering algorithm is better than that of clusters produced by traditional clustering algorithms.

  • PDF

A Design for Efficient Similar Subsequence Search with a Priority Queue and Suffix Tree in Image Sequence Databases (이미지 시퀀스 데이터베이스에서 우선순위 큐와 접미어 트리를 이용한 효율적인 유사 서브시퀀스 검색의 설계)

  • 김인범
    • Journal of the Korea Computer Industry Society
    • /
    • v.4 no.4
    • /
    • pp.613-624
    • /
    • 2003
  • This paper proposes a design for efficient and accurate retrieval of similar image subsequences using the multi-dimensional time warping distance as similarity evaluation tool in image sequence database after building of two indexing structures implemented with priority queue and suffix tree respectively. Receiving query image sequence, at first step, the proposed method searches the candidate set of similar image subsequences in priory queue index structure. If it can not get satisfied results, it retrieves another candidate set in suffix tree index structure at second step. The using of the low-bound distance function can remove the dissimilar subsequence without false dismissals during similarity evaluating process between query image sequence and stored sequences in two index structures.

  • PDF

A K-Nearest Neighbor Algorithm for Categorical Sequence Data (범주형 시퀀스 데이터의 K-Nearest Neighbor알고리즘)

  • Oh Seung-Joon
    • Journal of the Korea Society of Computer and Information
    • /
    • v.10 no.2 s.34
    • /
    • pp.215-221
    • /
    • 2005
  • TRecently, there has been enormous growth in the amount of commercial and scientific data, such as protein sequences, retail transactions, and web-logs. Such datasets consist of sequence data that have an inherent sequential nature. In this Paper, we study how to classify these sequence datasets. There are several kinds techniques for data classification such as decision tree induction, Bayesian classification and K-NN etc. In our approach, we use a K-NN algorithm for classifying sequences. In addition, we propose a new similarity measure to compute the similarity between two sequences and an efficient method for measuring similarity.

  • PDF

Maximizing the Early Abandon Effect in Time-Series Distance Computation (시계열 거리 계산에서 미리 버림 효과의 최대화)

  • Lee, Jeong-Gon;Kim, Sang-Pil;Moon, Yang-Sae
    • Proceedings of the Korea Information Processing Society Conference
    • /
    • 2011.04a
    • /
    • pp.1226-1228
    • /
    • 2011
  • 본 논문에서는 유사 시퀀스 매칭에서 미리 버림 계산의 효율적인 방법을 제안한다. 미리 버림은 유사 시퀀스 매칭에서 유클리디안 거리 계산 도중 거리 계산 값이 허용치보다 큰 경우 나머지 거리 계산을 하지 않는 방법이다. 기존의 방법은 시퀀스 첫 엔트리를 시작으로 하여 유클리디안 거리 계산을 진행한다. 이 방법은 데이터 고려 없이 계산이 진행되기 때문에 데이터의 특성에 따라 효과가 크게 다른 점을 보인다. 본 논문에서는 미리 버림의 효과를 최대화 시키기 위해 유클리디안 거리 계산 시작점을 오프셋이라 정의하고, 이를 데이터 특성에 맞게 조절하는 방법을 제안한다. 실험 결과, 제안한 오프셋 조절 미리 버림 방법이 대용량의 데이터 베이스 기반 시스템에서 기존 기법에 비해 좋은 성능 향상시킨 것으로 나타났다.

A Scalable Clustering Method for Categorical Sequences (범주형 시퀀스들에 대한 확장성 있는 클러스터링 방법)

  • Oh, Seung-Joon;Kim, Jae-Yearn
    • Journal of the Korean Institute of Intelligent Systems
    • /
    • v.14 no.2
    • /
    • pp.136-141
    • /
    • 2004
  • There has been enormous growth in the amount of commercial and scientific data, such as retail transactions, protein sequences, and web-logs. Such datasets consist of sequence data that have an inherent sequential nature. However, few clustering algorithms consider sequentiality. In this paper, we study how to cluster sequence datasets. We propose a new similarity measure to compute the similarity between two sequences. We also present an efficient method for determining the similarity measure and develop a clustering algorithm. Due to the high computational complexity of hierarchical clustering algorithms for clustering large datasets, a new clustering method is required. Therefore, we propose a new scalable clustering method using sampling and a k-nearest-neighbor method. Using a real dataset and a synthetic dataset, we show that the quality of clusters generated by our proposed approach is better than that of clusters produced by traditional algorithms.

An Index-Based Search Method for Performance Improvement of Set-Based Similar Sequence Matching (집합 유사 시퀀스 매칭의 성능 향상을 위한 인덱스 기반 검색 방법)

  • Lee, Juwon;Lim, Hyo-Sang
    • KIPS Transactions on Software and Data Engineering
    • /
    • v.6 no.11
    • /
    • pp.507-520
    • /
    • 2017
  • The set-based similar sequence matching method measures similarity not for an individual data item but for a set grouping multiple data items. In the method, the similarity of two sets is represented as the size of intersection between them. However, there is a critical performances issue for the method in twofold: 1) calculating intersection size is a time consuming process, and 2) the number of set pairs that should be calculated the intersection size is quite large. In this paper, we propose an index-based search method for improving performance of set-based similar sequence matching in order to solve these performance issues. Our method consists of two parts. In the first part, we convert the set similarity problem into the intersection size comparison problem, and then, provide an index structure that accelerates the intersection size calculation. Second, we propose an efficient set-based similar sequence matching method which exploits the proposed index structure. Through experiments, we show that the proposed method reduces the execution time by 30 to 50 times then the existing methods. We also show that the proposed method has scalability since the performance gap becomes larger as the number of data sequences increases.

Shape-Based Subsequence Retrieval Supporting Multiple Models in Time-Series Databases (시계열 데이터베이스에서 복수의 모델을 지원하는 모양 기반 서브시퀀스 검색)

  • Won, Jung-Im;Yoon, Jee-Hee;Kim, Sang-Wook;Park, Sang-Hyun
    • The KIPS Transactions:PartD
    • /
    • v.10D no.4
    • /
    • pp.577-590
    • /
    • 2003
  • The shape-based retrieval is defined as the operation that searches for the (sub) sequences whose shapes are similar to that of a query sequence regardless of their actual element values. In this paper, we propose a similarity model suitable for shape-based retrieval and present an indexing method for supporting the similarity model. The proposed similarity model enables to retrieve similar shapes accurately by providing the combination of various shape-preserving transformations such as normalization, moving average, and time warping. Our indexing method stores every distinct subsequence concisely into the disk-based suffix tree for efficient and adaptive query processing. We allow the user to dynamically choose a similarity model suitable for a given application. More specifically, we allow the user to determine the parameter p of the distance function $L_p$ when submitting a query. The result of extensive experiments revealed that our approach not only successfully finds the subsequences whose shapes are similar to a query shape but also significantly outperforms the sequence search.