Analysis and Evaluation of Frequent Pattern Mining Technique based on Landmark Window

랜드마크 윈도우 기반의 빈발 패턴 마이닝 기법의 분석 및 성능평가

  • Pyun, Gwangbum (Dept. of Computer Engineering, Sejong University)
  • Yun, Unil (Dept. of Computer Engineering, Sejong University)
  • Received : 2014.02.05
  • Accepted : 2014.04.14
  • Published : 2014.06.30

Abstract

With the development of online services, databases have shifted from static structures to dynamic stream structures. Earlier data mining techniques have served as decision-making tools, for example in establishing marketing strategies and in DNA analysis. However, emerging areas such as sensor networks, robotics, and artificial intelligence require the ability to analyze real-time data more quickly. Landmark window-based frequent pattern mining, one of the stream mining approaches, performs mining operations on parts of a database, or on each transaction as it arrives, rather than on all the data at once. In this paper, we analyze and evaluate two well-known landmark window-based frequent pattern mining algorithms, Lossy counting and hMiner. When Lossy counting mines frequent patterns from a set of new transactions, it performs a union operation between the previous and current mining results. hMiner, a state-of-the-art algorithm based on the landmark window model, conducts a mining operation whenever a new transaction occurs. Because hMiner extracts frequent patterns as soon as a new transaction is entered, it yields up-to-date mining results that reflect real-time information. For this reason, such algorithms are also called online mining approaches. We evaluate and compare the performance of the original algorithm, Lossy counting, and the more recent one, hMiner. As performance criteria, we first consider each algorithm's total runtime and average processing time per transaction. In addition, to compare the efficiency of their storage structures, we evaluate their maximum memory usage. Lastly, we show how stably the two algorithms mine databases in which the number of items gradually increases.
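The landmark-window idea of counting everything since a fixed starting point while periodically pruning rare entries can be illustrated with a minimal sketch. It follows the generally published description of Lossy Counting for single items, not the itemset version evaluated in this paper, and the names `lossy_count`, `frequent_items`, `epsilon`, and `support` are illustrative assumptions rather than identifiers from the original work.

```python
import math


def lossy_count(stream, epsilon):
    """Approximate item counts over a stream (single-item sketch, assumed names).

    Each entry holds [f, delta]: f is the count since the item was inserted,
    delta is the maximum number of occurrences that may have been missed.
    """
    width = math.ceil(1 / epsilon)        # bucket width
    counts = {}                           # item -> [f, delta]
    n = 0                                 # items seen since the landmark
    for item in stream:
        n += 1
        bucket = (n + width - 1) // width  # current bucket id, ceil(n / width)
        if item in counts:
            counts[item][0] += 1
        else:
            counts[item] = [1, bucket - 1]
        if n % width == 0:                # bucket boundary: prune rare entries
            rare = [k for k, (f, d) in counts.items() if f + d <= bucket]
            for key in rare:
                del counts[key]
    return counts, n


def frequent_items(counts, n, support, epsilon):
    """Report items whose stored count reaches (support - epsilon) * n."""
    threshold = (support - epsilon) * n
    return {item for item, (f, _) in counts.items() if f >= threshold}


# Usage: "a" appears in half of the stream, every other item only once.
counts, n = lossy_count(["a", "b", "a", "c", "a", "d", "a", "e"], epsilon=0.25)
result = frequent_items(counts, n, support=0.5, epsilon=0.25)  # -> {'a'}
```

In this sketch the rare items are discarded at each bucket boundary, so only `"a"` remains in memory at the end of the stream. This is the pruning effect that the abstract refers to.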
In the evaluation of mining time and per-transaction processing time, hMiner is faster than Lossy counting. Because hMiner stores candidate frequent patterns in a hash structure, it can access them directly. Lossy counting, in contrast, stores them in a lattice and therefore has to traverse multiple nodes to reach a candidate frequent pattern. On the other hand, hMiner performs worse than Lossy counting in terms of maximum memory usage. hMiner must keep the complete information of every candidate frequent pattern in order to store it in a hash bucket, whereas Lossy counting's lattice stores the patterns in reduced form. Because the lattice can share items that appear in multiple patterns, Lossy counting uses memory more efficiently than hMiner. In the scalability evaluation, however, hMiner is more efficient than Lossy counting, for the following reasons. As the number of items increases, the number of shared items decreases, which weakens Lossy counting's memory efficiency. Furthermore, as the number of transactions grows, its pruning becomes less effective. From the experimental results, we conclude that landmark window-based frequent pattern mining algorithms are suitable for real-time systems, although they require a significant amount of memory. Hence, their data structures need to be made more efficient before they can also be used in resource-constrained environments such as wireless sensor networks (WSNs).
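The storage trade-off described above can be made concrete with a small sketch. It is only an illustration under stated assumptions: a Python `dict` keyed by whole patterns stands in for a hash-based store, a nested-`dict` prefix tree stands in for a lattice-style store that shares items, and the number of stored item slots is used as a rough proxy for memory. None of the function names come from hMiner or Lossy counting.

```python
def hash_store(patterns):
    """Keep every pattern whole under its own key: one lookup per access."""
    return {tuple(sorted(p)): 0 for p in patterns}


def hash_item_slots(table):
    """Count item slots held by the hash store (each key keeps all its items)."""
    return sum(len(key) for key in table)


def trie_store(patterns):
    """Keep patterns in a prefix tree so that common prefixes are stored once."""
    root = {}
    for p in patterns:
        node = root
        for item in sorted(p):            # access walks one node per item
            node = node.setdefault(item, {})
    return root


def trie_item_slots(node):
    """Count nodes in the prefix tree (one item slot per node)."""
    return sum(1 + trie_item_slots(child) for child in node.values())


# Patterns that overlap heavily: the prefix tree shares "a" and "a, b".
overlapping = [{"a"}, {"a", "b"}, {"a", "b", "c"}, {"a", "c"}]
hash_item_slots(hash_store(overlapping))   # -> 8
trie_item_slots(trie_store(overlapping))   # -> 4

# Patterns over many distinct items: nothing is shared, so the saving vanishes.
disjoint = [{"a", "b"}, {"c", "d"}, {"e", "f"}]
hash_item_slots(hash_store(disjoint))      # -> 6
trie_item_slots(trie_store(disjoint))      # -> 6
```

The second case mirrors the scalability observation: when the patterns share few items, the shared structure no longer saves space, while access still requires walking several nodes.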

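In this paper, we analyze landmark window-based frequent pattern mining techniques and evaluate their performance. We analyze the Lossy counting and hMiner algorithms. hMiner, a state-of-the-art landmark algorithm, mines frequent patterns whenever a transaction occurs. For this reason, landmark-based frequent pattern mining such as hMiner is called online mining. We evaluate and analyze the performance of Lossy counting, an early landmark window mining algorithm, and hMiner, the latest one. As performance measures, we evaluate mining time and average processing time per transaction. To assess the efficiency of the storage structures, we evaluate maximum memory usage. Finally, to assess whether the algorithms can mine stably, we perform a scalability evaluation in which the number of items in the database is varied. The evaluation results for the two algorithms show that landmark window-based frequent pattern mining is well suited to real-time systems but uses a large amount of memory.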
