Parallel Multithreaded Processing for Data Set Summarization on Multicore CPUs

Ordonez, Carlos; Navas, Mario; Garcia-Alvarado, Carlos

doi:10.5626/JCSE.2011.5.2.111

Journal of Computing Science and Engineering

Volume 5, Issue 2, Pages 111-120, 2011, pISSN 1976-4677, eISSN 2093-8020

Korean Institute of Information Scientists and Engineers (한국정보과학회)

Ordonez, Carlos (Department of Computer Science, University of Houston);
Navas, Mario (Department of Computer Science, University of Houston);
Garcia-Alvarado, Carlos (Department of Computer Science, University of Houston)

Received : 2010.10.26
Accepted : 2011.03.16
Published : 2011.06.30

Abstract

Data mining algorithms should exploit new hardware technologies to accelerate computations. Such a goal is difficult to achieve in a database management system (DBMS) because of its complex internal subsystems and because numeric data mining computations on large data sets are difficult to optimize. This paper explores how to take advantage of the existing multithreaded capabilities of multicore CPUs, as well as caching in RAM, to efficiently compute summaries of a large data set, a fundamental data mining problem. We introduce parallel algorithms that run on multiple threads and overcome the row aggregation bottleneck of accessing secondary storage, while maintaining linear time complexity with respect to data set size. Our proposal combines table scans with parallel multithreaded processing across the cores of the CPU. We introduce several database-style and hardware-level optimizations: caching row blocks of the input table, managing available RAM, interleaving I/O and CPU processing, and tuning the number of working threads. We experimentally benchmark our algorithms with large data sets on a DBMS running on a computer with a multicore CPU. We show that our algorithms outperform existing DBMS mechanisms in computing aggregations of multidimensional data summaries, especially as dimensionality grows. Furthermore, we show that local memory allocation (RAM block size) does not have a significant impact when the thread management algorithm distributes the workload among a fixed number of threads. Our proposal is unique in that we neither modify nor require access to the DBMS source code; instead, we extend the DBMS with analytic functionality by developing User-Defined Functions.
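
The general pattern described in the abstract (scanning the table in row blocks and aggregating the blocks on a fixed number of threads) can be illustrated with a minimal sketch. It is not the authors' UDF implementation. It assumes, since the abstract does not define the summary, that the summary consists of the row count n, the per-dimension sums L, and the matrix Q of sums of cross-products. The names summarize_block, merge, summarize, block_size, and threads are hypothetical. In CPython the global interpreter lock limits CPU parallelism across threads, so the sketch shows only the structure of the computation, not the speedups reported in the paper.

```python
from concurrent.futures import ThreadPoolExecutor


def summarize_block(block):
    """Summarize one block of rows as (n, L, Q).

    n is the row count, L holds per-dimension sums, and Q holds the
    d x d sums of cross-products. This choice of summary is an
    assumption; the abstract does not define it.
    """
    d = len(block[0])
    n = 0
    L = [0.0] * d
    Q = [[0.0] * d for _ in range(d)]
    for row in block:
        n += 1
        for a in range(d):
            L[a] += row[a]
            for b in range(d):
                Q[a][b] += row[a] * row[b]
    return n, L, Q


def merge(s1, s2):
    """Combine two partial summaries by element-wise addition."""
    n1, L1, Q1 = s1
    n2, L2, Q2 = s2
    L = [x + y for x, y in zip(L1, L2)]
    Q = [[x + y for x, y in zip(r1, r2)] for r1, r2 in zip(Q1, Q2)]
    return n1 + n2, L, Q


def summarize(rows, block_size=2, threads=4):
    """Split rows into blocks, summarize blocks on a thread pool, merge."""
    if not rows:
        return 0, [], []
    blocks = [rows[i:i + block_size] for i in range(0, len(rows), block_size)]
    with ThreadPoolExecutor(max_workers=threads) as pool:
        partials = list(pool.map(summarize_block, blocks))
    total = partials[0]
    for partial in partials[1:]:
        total = merge(total, partial)
    return total


if __name__ == "__main__":
    data = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
    print(summarize(data, block_size=2, threads=2))
```

Because each partial summary is a set of sums, blocks can be processed in any order and merged by addition. This is why the work can be divided among threads in a single pass over the data.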

Cited by

Intelligent Traffic Prediction by Multi-sensor Fusion using Multi-threaded Machine Learning vol.5, pp.6, 2016, https://doi.org/10.5573/IEIESPC.2016.5.6.430
