Analysis and Performance Evaluation of Pattern Condensing Techniques Used in Representative Pattern Mining

  • Lee, Gang-In (Dept. of Computer Engineering, Sejong University)
  • Yun, Un-Il (Dept. of Computer Engineering, Sejong University)
  • Received: 2015.01.26
  • Reviewed: 2015.03.18
  • Published: 2015.04.30

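Abstract

Frequent pattern mining, one of the major areas actively studied in data mining, is a method for extracting hidden, useful pattern information from large data sets or databases. Because its results make it easier to analyze the various important characteristics of a database automatically, it has also been actively applied in many application areas. However, traditional frequent pattern mining, which simply extracts every pattern satisfying a user-specified minimum support threshold, can generate an extremely large number of result patterns depending on the characteristics of the database and the threshold setting, which wastes time and space resources. The difficulty of analyzing the excessively generated patterns is also a serious problem. To address these problems, the concept of representative pattern mining and various related techniques have been proposed; rather than mining all possible frequent patterns from a database, they selectively extract only the patterns that represent them. In this paper, we describe the characteristics of pattern condensing techniques that consider the maximality or closure of each generated pattern, and we compare and analyze them. Mining maximal frequent patterns or closed frequent patterns enables effective pattern compression and allows mining to be performed with fewer time and space resources. Compressed patterns can also be restored to their original pattern form when needed; in particular, with the closed pattern approach, no information is lost in compressing and decompressing the patterns. By evaluating these techniques on real data sets accumulated from the real world, using algorithms at the same implementation level on the same platform, we present an in-depth analysis of how each technique affects practical mining performance such as pattern generation, runtime, and memory usage.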

Frequent pattern mining, one of the major areas actively studied in data mining, is a method for extracting useful pattern information hidden in large data sets or databases. Frequent pattern mining approaches have been actively employed in a variety of application fields because their results allow us to analyze various important characteristics of databases more easily and automatically. However, traditional frequent pattern mining methods, which simply extract every frequent pattern whose support is not smaller than a user-given minimum support threshold, have the following problems. First, depending on the features of a given database and the threshold setting, traditional approaches may have to generate an enormous number of patterns, and that number can grow geometrically. In addition, generating so many patterns wastes runtime and memory resources. Furthermore, the excessive number of resulting patterns makes the mining results difficult to analyze. To solve these issues, the concept of representative pattern mining and various related works have been proposed. In contrast to traditional approaches that find all possible frequent patterns in a database, representative pattern mining approaches selectively extract a smaller number of patterns that represent the full set of frequent patterns. In this paper, we describe the details and characteristics of pattern condensing techniques that consider the maximality or closure property of generated frequent patterns, and we compare and analyze these techniques.
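As a rough illustration of the pattern explosion described above, the following minimal Python sketch enumerates frequent patterns by brute force. It is not one of the algorithms evaluated in the paper; the function names and the toy database are hypothetical and chosen only for illustration.

```python
from itertools import combinations

def support(pattern, transactions):
    """Count the transactions that contain every item of `pattern`."""
    return sum(1 for t in transactions if pattern <= t)

def frequent_patterns(transactions, min_support):
    """Brute-force enumeration (illustrative only): return {pattern: support}
    for every non-empty itemset whose support is at least `min_support`."""
    items = sorted(set().union(*transactions))
    result = {}
    for size in range(1, len(items) + 1):
        for combo in combinations(items, size):
            pattern = frozenset(combo)
            count = support(pattern, transactions)
            if count >= min_support:
                result[pattern] = count
    return result

# Hypothetical toy database: two identical transactions over n items.
# Every non-empty subset is frequent, so the result has 2**n - 1 patterns.
for n in (4, 8, 12):
    toy_db = [frozenset(range(n))] * 2
    found = frequent_patterns(toy_db, min_support=2)
    print(n, len(found))
```

In this toy setting the number of frequent patterns roughly doubles with each added item, which is the kind of growth that motivates extracting only representative patterns.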
Given a frequent pattern, satisfying maximality signifies that every proper superset of the pattern has a support smaller than the user-specified minimum support threshold; meanwhile, satisfying the closure property means that no proper superset of the pattern has a support equal to that of the pattern. By mining maximal or closed frequent patterns, we can achieve effective pattern compression and perform mining operations with much smaller time and space resources. In addition, compressed patterns can be converted back into the original frequent patterns if necessary; in particular, the closed frequent pattern notation can convert representative patterns back into the original ones without any information loss. That is, we can obtain the complete set of original frequent patterns from the closed frequent ones. Although the maximal frequent pattern notation does not guarantee complete recovery in the conversion process, it has the advantage of extracting a smaller number of representative patterns more quickly than the closed frequent pattern notation. In this paper, we show the performance results and characteristics of these techniques in terms of pattern generation, runtime, and memory usage by evaluating them on various real data sets collected from the real world. For a more exact comparison, we also run the algorithms implementing these techniques on the same platform and at the same implementation level.
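The definitions of maximality and closure, and the lossless recovery from closed patterns, can be sketched in Python as follows. This is a brute-force illustration under assumed names and a made-up four-transaction database, not the tree-based algorithms the paper evaluates.

```python
from itertools import combinations

def frequent_patterns(transactions, min_support):
    """Brute force: {pattern: support} for every frequent non-empty itemset."""
    items = sorted(set().union(*transactions))
    result = {}
    for size in range(1, len(items) + 1):
        for combo in combinations(items, size):
            pattern = frozenset(combo)
            count = sum(1 for t in transactions if pattern <= t)
            if count >= min_support:
                result[pattern] = count
    return result

def closed_patterns(frequent):
    """Keep patterns that have no proper superset with equal support."""
    return {p: s for p, s in frequent.items()
            if not any(p < q and s == frequent[q] for q in frequent)}

def maximal_patterns(frequent):
    """Keep patterns that have no frequent proper superset."""
    return {p: s for p, s in frequent.items()
            if not any(p < q for q in frequent)}

def recover_from_closed(closed):
    """Rebuild every frequent pattern with its support: the support of a
    pattern is the largest support among the closed patterns containing it."""
    recovered = {}
    for c, s in closed.items():
        for size in range(1, len(c) + 1):
            for combo in combinations(sorted(c), size):
                p = frozenset(combo)
                recovered[p] = max(recovered.get(p, 0), s)
    return recovered

# Hypothetical toy database and threshold, for illustration only.
db = [frozenset("abc"), frozenset("abc"), frozenset("ab"), frozenset("ad")]
frequent = frequent_patterns(db, min_support=2)
closed = closed_patterns(frequent)
maximal = maximal_patterns(frequent)

print(len(frequent), len(closed), len(maximal))  # 7 3 1
print(recover_from_closed(closed) == frequent)   # True
```

In this toy example the single maximal pattern {a, b, c} still tells us which patterns are frequent, but not that {a} has support 4 or {a, b} has support 3, which is the information loss the abstract attributes to the maximal notation.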
