Sequential Pattern Mining with Optimization Calling MapReduce Function on MapReduce Framework

Kim, Jin-Hyun; Shim, Kyu-Seok

doi:10.3745/KIPSTD.2011.18D.2.081

The KIPS Transactions:PartD (정보처리학회논문지D)

Volume 18D, Issue 2 / Pages 81-88 / 2011 / 1598-2866 (pISSN)

Korea Information Processing Society (한국정보처리학회)


Korean title (translated): A Sequential Pattern Mining Technique That Optimizes MapReduce Function Calls on the MapReduce Framework

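Jin-Hyun Kim (School of Electrical Engineering and Computer Science, Seoul National University);
Kyu-Seok Shim (School of Electrical Engineering and Computer Science, Seoul National University)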

Received : 2010.09.27
Accepted : 2011.02.14
Published : 2011.04.30


Abstract

Sequential pattern mining, which finds the frequent patterns that appear in a given set of sequences, is an important data mining problem with broad applications. For example, it can find web access patterns, customer purchase patterns, and DNA sequences related to specific diseases. In this paper, we develop sequential pattern mining algorithms using the MapReduce framework. Our algorithms distribute the input data across several machines and find frequent sequential patterns in parallel. Using synthetic data sets, we conducted a comprehensive performance study while varying several parameters. Our experimental results show that the algorithms achieve linear speedup as the number of machines increases.

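Sequential pattern mining, which finds the frequent sequential patterns in given sequence data, is an important data mining problem used in many applications. It is applied in a wide range of areas, such as finding web access patterns, customer purchase patterns, and DNA sequences of specific diseases. In this paper, we develop a sequential pattern mining algorithm that optimizes MapReduce function calls on the MapReduce framework. The algorithm distributes the data across multiple machines and finds frequent sequential patterns in parallel. We comprehensively evaluated the performance of the proposed algorithm in experiments on a variety of data while varying the parameter values. The experimental results confirm that the proposed algorithm achieves a speedup that is linear in the number of machines.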
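
To make the idea of distributing sequences and counting patterns in parallel concrete, the following is a minimal single-process sketch in Python. It is not the algorithm proposed in the paper, and it does not model the optimization of MapReduce function calls. It assumes each sequence is a list of single items (no itemsets), simulates machines with plain lists, and uses brute-force enumeration of length-k subsequences. The names `map_sequence`, `reduce_counts`, and `frequent_patterns` are hypothetical.

```python
from collections import defaultdict
from itertools import combinations


def map_sequence(sequence, k):
    """Map step: emit (pattern, 1) once per distinct length-k subsequence."""
    # combinations() keeps the input order, so each tuple is a subsequence.
    patterns = set(combinations(sequence, k))
    return [(pattern, 1) for pattern in patterns]


def reduce_counts(pairs):
    """Reduce step: sum the counts emitted for each pattern."""
    support = defaultdict(int)
    for pattern, count in pairs:
        support[pattern] += count
    return dict(support)


def frequent_patterns(partitions, k, min_support):
    """Map over every partition, reduce, then keep patterns meeting min_support."""
    emitted = []
    for partition in partitions:  # each partition stands in for one machine
        for sequence in partition:
            emitted.extend(map_sequence(sequence, k))
    support = reduce_counts(emitted)
    return {p: c for p, c in support.items() if c >= min_support}


# Toy input (made up for illustration): four sequences split over two partitions.
partitions = [
    [["a", "b", "c"], ["a", "c"]],
    [["b", "a", "c"], ["a", "b"]],
]
result = frequent_patterns(partitions, k=2, min_support=2)
print(result)
```

With this toy input, the pattern ("a", "c") is supported by three sequences, ("a", "b") and ("b", "c") by two each, and ("b", "a") by only one, so it is filtered out. In a real MapReduce job the map calls would run on separate machines and the framework would group the emitted pairs by key before the reduce step.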
