Mining Clusters of Sequence Data using Sequence Element-based Similarity Measure

Korean title (translated): Clustering Sequence Data Using Sequence Element-Based Similarity

  • 오승준 (Department of Industrial Engineering, Hanyang University)
  • 김재련 (Department of Industrial Engineering, Hanyang University)
  • Published : 2004.11.01

Abstract

Commercial and scientific data, such as protein sequences, retail transactions, and web logs, have grown enormously in recent years. Such datasets consist of sequence data, in which the order of the elements is inherent to the data. However, only a few existing clustering algorithms take this sequential order into account. This study presents a method for clustering sequence datasets. Because the similarity between sequences must be defined before they can be clustered, we propose a new similarity measure that compares two sequences on the basis of their sequence elements. We then propose two clustering algorithms built on this measure: a hierarchical clustering algorithm, and a scalable clustering algorithm that uses sampling and a k-nearest neighbor method. Using a splice dataset and synthetic datasets, we show that the clusters generated by the proposed algorithms are of better quality than those produced by traditional clustering algorithms.
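The two-stage idea described in the abstract can be illustrated with a minimal sketch. The abstract does not define the proposed similarity measure, so the code below substitutes Python's `difflib.SequenceMatcher` ratio as a stand-in order-aware similarity; this is an assumption, not the authors' measure. The function names (`similarity`, `hierarchical_clusters`, `sample_and_knn`), the average-linkage rule, and the parameters `sample_size` and `n_neighbors` are likewise hypothetical choices made only for illustration.

```python
from collections import Counter
from difflib import SequenceMatcher
import random


def similarity(a, b):
    """Stand-in order-aware similarity in [0, 1]. NOT the paper's measure."""
    return SequenceMatcher(None, a, b).ratio()


def hierarchical_clusters(seqs, k):
    """Average-linkage agglomerative clustering; returns one label per sequence."""
    n = len(seqs)
    # Pairwise similarities, computed once and keyed by (smaller, larger) index.
    sim = {(i, j): similarity(seqs[i], seqs[j])
           for i in range(n) for j in range(i + 1, n)}

    def linkage(c1, c2):
        total = sum(sim[(min(i, j), max(i, j))] for i in c1 for j in c2)
        return total / (len(c1) * len(c2))

    clusters = [[i] for i in range(n)]
    while len(clusters) > k:
        # Merge the two clusters with the highest average similarity.
        pairs = ((p, q) for p in range(len(clusters))
                 for q in range(p + 1, len(clusters)))
        p, q = max(pairs, key=lambda pq: linkage(clusters[pq[0]], clusters[pq[1]]))
        clusters[p] = clusters[p] + clusters[q]
        del clusters[q]

    labels = [0] * n
    for label, members in enumerate(clusters):
        for i in members:
            labels[i] = label
    return labels


def sample_and_knn(seqs, k, sample_size, n_neighbors=3, seed=0):
    """Cluster a random sample, then label the rest by k-nearest-neighbor vote."""
    rng = random.Random(seed)
    sample_idx = rng.sample(range(len(seqs)), sample_size)
    sample_labels = hierarchical_clusters([seqs[i] for i in sample_idx], k)

    labels = [None] * len(seqs)
    for i, lab in zip(sample_idx, sample_labels):
        labels[i] = lab
    for i, s in enumerate(seqs):
        if labels[i] is None:
            # Rank sampled sequences by similarity to s and take a majority vote.
            nearest = sorted(range(sample_size),
                             key=lambda j: similarity(s, seqs[sample_idx[j]]),
                             reverse=True)[:n_neighbors]
            labels[i] = Counter(sample_labels[j] for j in nearest).most_common(1)[0][0]
    return labels


seqs = ["ACGTACGT", "ACGTACGA", "ACGAACGT", "TTGGCCAA", "TTGGCCAT", "TTGCCCAA"]
full_labels = hierarchical_clusters(seqs, 2)
scaled_labels = sample_and_knn(seqs, k=2, sample_size=4)
```

Clustering only a sample keeps the quadratic pairwise-similarity cost bounded by the sample size, while each remaining sequence costs one pass over the sample. How the sample is drawn and how votes are weighted in the actual method are not stated in the abstract.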

Keywords