Clustering of Web Objects with Similar Popularity Trends

  • Woong-Kee Loh (School of Multimedia, Sungkyul University)
  • Published: 2008.08.29

Abstract

As the Internet becomes widely used, the number of web objects such as search keywords, multimedia objects, web pages, and blogs is growing rapidly. The popularity of these web objects changes over time, and mining the temporal patterns in their popularity is an important research problem for many web applications. For example, analyzing the popularity patterns of search keywords makes it possible to predict which keywords will become popular, which helps web search companies set prices when selling keywords to advertisers. However, because popularity varies over time and the number of web objects is enormous (on the order of millions), analyzing web object popularity is difficult, and previous techniques are hard to scale up. This paper proposes an efficient algorithm for mining temporal patterns in the popularity of web objects. We represent the popularity of each web object as a time series and propose the gap measure to quantify the similarity between the popularities of two web objects. To reduce the cost of computing the gap measure, we present an algorithm based on the Fast Fourier Transform (FFT), and we use a density-based clustering algorithm built on the gap measure to find clusters of web objects with similar popularity trends. We do not assume that the popularity of web objects follows any particular probability distribution or is periodic. Experiments on search keyword popularity data obtained from the Google Trends web site illustrate the scalability and usefulness of the proposed approach in real-world applications.
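
The pipeline outlined in the abstract (time series, an FFT-accelerated similarity measure, density-based clustering) can be illustrated with a minimal sketch. The abstract does not define the gap measure, so the code below substitutes a different, well-known quantity: the maximum normalized cross-correlation over all lags, computed with an FFT. It is a stand-in chosen for illustration, not the paper's measure. All function names, the toy series, and the parameters `eps` and `min_pts` are hypothetical.

```python
import cmath
import math

def fft(x, inverse=False):
    """Recursive radix-2 FFT (unscaled); len(x) must be a power of two."""
    n = len(x)
    if n == 1:
        return list(x)
    even = fft(x[0::2], inverse)
    odd = fft(x[1::2], inverse)
    sign = 1 if inverse else -1
    out = [0j] * n
    for k in range(n // 2):
        tw = cmath.exp(sign * 2j * cmath.pi * k / n) * odd[k]
        out[k] = even[k] + tw
        out[k + n // 2] = even[k] - tw
    return out

def znorm(series):
    """Subtract the mean and scale to unit Euclidean norm."""
    mean = sum(series) / len(series)
    centered = [v - mean for v in series]
    norm = math.sqrt(sum(v * v for v in centered))
    return [v / norm for v in centered] if norm > 0 else centered

def max_cross_correlation(a, b):
    """Largest normalized cross-correlation of a and b over all lags.

    Stand-in similarity, NOT the paper's gap measure. Zero padding to at
    least 2n avoids wrap-around between positive and negative lags.
    """
    n = len(a)
    size = 1
    while size < 2 * n:
        size *= 2
    fa = fft([complex(v) for v in znorm(a)] + [0j] * (size - n))
    fb = fft([complex(v) for v in znorm(b)] + [0j] * (size - n))
    prod = [x * y.conjugate() for x, y in zip(fa, fb)]
    corr = fft(prod, inverse=True)  # unscaled inverse, so divide by size
    return max(c.real for c in corr) / size

def dbscan(dist, eps, min_pts):
    """Plain DBSCAN on a precomputed distance matrix; -1 marks noise."""
    n = len(dist)
    labels = [None] * n
    cluster = -1
    for i in range(n):
        if labels[i] is not None:
            continue
        neighbors = [j for j in range(n) if dist[i][j] <= eps]
        if len(neighbors) < min_pts:
            labels[i] = -1
            continue
        cluster += 1
        labels[i] = cluster
        queue = [j for j in neighbors if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == -1:
                labels[j] = cluster  # former noise becomes a border point
            if labels[j] is not None:
                continue
            labels[j] = cluster
            nb = [k for k in range(n) if dist[j][k] <= eps]
            if len(nb) >= min_pts:  # j is a core point: keep expanding
                queue.extend(nb)
    return labels

# Toy popularity series (hypothetical data): two bursts at different
# times and two rising trends with different scales.
burst_early = [0, 0, 1, 3, 1] + [0] * 11
burst_late = [0] * 7 + [1, 3, 1] + [0] * 6
ramp = [float(t) for t in range(16)]
ramp_scaled = [2.0 * t + 1.0 for t in range(16)]
series = [burst_early, burst_late, ramp, ramp_scaled]

dist = [[1.0 - max_cross_correlation(a, b) for b in series] for a in series]
labels = dbscan(dist, eps=0.1, min_pts=2)
print(labels)
```

Each FFT-based comparison costs O(n log n) for series of length n, compared with O(n^2) for evaluating every lag directly, which is the kind of saving that matters when millions of web objects must be compared. A density-based algorithm is a natural fit here because it needs only pairwise distances and does not require the number of clusters in advance.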
