Parallel Multithreaded Processing for Data Set Summarization on Multicore CPUs

  • Received : 2010.10.26
  • Accepted : 2011.03.16
  • Published : 2011.06.30

Abstract

Data mining algorithms should exploit new hardware technologies to accelerate computation. This goal is difficult to achieve in a database management system (DBMS) because of its complex internal subsystems and because numeric data mining computations on large data sets are difficult to optimize. This paper explores how to exploit the multithreading capabilities of multicore CPUs, together with caching in RAM, to efficiently compute summaries of a large data set, a fundamental data mining problem. We introduce parallel multithreaded algorithms that overcome the bottleneck of accessing secondary storage during row aggregation while maintaining linear time complexity with respect to data set size. Our proposal combines table scans with parallel multithreaded processing across the cores of the CPU. We introduce several database-style and hardware-level optimizations: caching row blocks of the input table, managing available RAM, interleaving I/O and CPU processing, and tuning the number of worker threads. We benchmark our algorithms experimentally with large data sets on a DBMS running on a computer with a multicore CPU. We show that our algorithms outperform existing DBMS mechanisms in computing aggregations for multidimensional data summaries, especially as dimensionality grows. Furthermore, we show that local memory allocation (RAM block size) has no significant impact when the thread management algorithm distributes the workload among a fixed number of threads. Our proposal is unique in that we neither modify nor require access to the DBMS source code; instead, we extend the DBMS with analytic functionality by developing User-Defined Functions.
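The block-wise scheme described in the abstract can be illustrated with a minimal sketch. It is not the paper's implementation (which runs as User-Defined Functions inside a DBMS); it only shows the general pattern of splitting rows into blocks, summarizing each block on a worker thread, and merging the partial results. The function names `summarize_block`, `merge`, and `parallel_summary` are hypothetical, and the choice of summary (row count, per-dimension sum, per-dimension sum of squares) is an assumption made for illustration.

```python
from concurrent.futures import ThreadPoolExecutor


def summarize_block(rows):
    """Summarize one non-empty block of rows.

    Returns (n, L, Q): row count, per-dimension sum, and
    per-dimension sum of squares (an assumed summary, for illustration).
    """
    d = len(rows[0])
    n = 0
    L = [0.0] * d
    Q = [0.0] * d
    for row in rows:
        n += 1
        for j, x in enumerate(row):
            L[j] += x
            Q[j] += x * x
    return n, L, Q


def merge(a, b):
    """Combine two partial summaries; the summary is additive across blocks."""
    n1, L1, Q1 = a
    n2, L2, Q2 = b
    return (
        n1 + n2,
        [x + y for x, y in zip(L1, L2)],
        [x + y for x, y in zip(Q1, Q2)],
    )


def parallel_summary(rows, block_size=1024, num_threads=4):
    """Split rows into blocks, summarize blocks on a thread pool, merge results."""
    if not rows:
        return 0, [], []
    blocks = [rows[i:i + block_size] for i in range(0, len(rows), block_size)]
    with ThreadPoolExecutor(max_workers=num_threads) as pool:
        partials = list(pool.map(summarize_block, blocks))
    total = partials[0]
    for partial in partials[1:]:
        total = merge(total, partial)
    return total


if __name__ == "__main__":
    data = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
    print(parallel_summary(data, block_size=2, num_threads=2))
```

Because each block is scanned once and the merge step is a fixed amount of work per block, the total work stays linear in the number of rows. In CPython the global interpreter lock limits the speedup of pure-Python threads, so this sketch shows the structure of the computation rather than its performance.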

Cited by

  1. Intelligent Traffic Prediction by Multi-sensor Fusion using Multi-threaded Machine Learning vol.5, pp.6, 2016, https://doi.org/10.5573/IEIESPC.2016.5.6.430