A Benchmark Test of Spatial Big Data Processing Tools and a MapReduce Application

  • Nguyen, Minh Hieu (Dep. of Civil Engineering, Yonsei University) ;
  • Ju, Sungha (Dep. of Civil Engineering, Yonsei University) ;
  • Ma, Jong Won (Dep. of Civil Engineering, Yonsei University) ;
  • Heo, Joon (Dep. of Civil Engineering, Yonsei University)
  • Received : 2017.09.29
  • Accepted : 2017.10.31
  • Published : 2017.10.31

Abstract

Spatial data processing is challenging because of the unique characteristics of spatial data, and the difficulty grows when the data are big. Several tools have been developed for this purpose; however, they are not yet familiar to typical users. This paper presents a benchmark test of two notable spatial big data processing tools: GIS Tools for Hadoop and SpatialHadoop. In addition, a MapReduce application is introduced as a baseline for evaluating the effectiveness of the two tools and for measuring the impact of the number of maps and reduces on performance. Using these tools and New York taxi trajectory data, we perform a spatial processing task that filters the drop-off locations falling within the Manhattan area. The performance of the tools is then observed as the data size increases and as the number of worker nodes changes. The results of this study are as follows: 1) GIS Tools for Hadoop automatically creates a Quadtree index in each spatial processing task, which improves performance significantly. However, users should be familiar with Java to use this tool conveniently. 2) SpatialHadoop does not automatically create a spatial index for the data. As a result, its performance is much lower than that of GIS Tools for Hadoop on the same spatial processing task. However, SpatialHadoop achieved the best result when performing a range query. 3) The performance of our MapReduce application increased fourfold after the number of reduces was changed from 1 to 12.
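The filtering task described in the abstract can be illustrated with a minimal in-process sketch of a MapReduce-style job. This is not the authors' application: the record layout (`trip_id,dropoff_lon,dropoff_lat`), the coarse `MANHATTAN` quadrilateral, and the helper names `map_record`, `partition`, and `run_job` are all assumptions made for illustration, and the job runs in a single Python process rather than on Hadoop. It only shows how a point-in-polygon filter in the map phase and a configurable number of reduces fit together.

```python
from collections import defaultdict

# Hypothetical, very coarse quadrilateral (longitude, latitude) standing in
# for the Manhattan boundary; NOT the polygon used in the paper.
MANHATTAN = [(-74.02, 40.70), (-73.97, 40.70), (-73.90, 40.88), (-73.95, 40.88)]


def point_in_polygon(x, y, polygon):
    """Ray casting: toggle 'inside' for each edge crossed by a ray toward +x."""
    inside = False
    n = len(polygon)
    for i in range(n):
        x1, y1 = polygon[i]
        x2, y2 = polygon[(i + 1) % n]
        # The edge straddles the horizontal line through y.
        if (y1 > y) != (y2 > y):
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside
    return inside


def map_record(line):
    """Map step: parse one record (assumed layout 'trip_id,lon,lat') and emit
    (trip_id, (lon, lat)) only if the drop-off lies inside the polygon."""
    fields = line.strip().split(",")
    try:
        trip_id, lon, lat = fields[0], float(fields[1]), float(fields[2])
    except (IndexError, ValueError):
        return []  # skip malformed records
    if point_in_polygon(lon, lat, MANHATTAN):
        return [(trip_id, (lon, lat))]
    return []


def partition(key, num_reduces):
    """Assign a key to one of num_reduces buckets (simple deterministic hash)."""
    return sum(key.encode("utf-8")) % num_reduces


def run_job(lines, num_reduces):
    """Run map, shuffle into num_reduces buckets, then 'reduce' by sorting
    each bucket; returns {reducer_index: [(trip_id, (lon, lat)), ...]}."""
    buckets = defaultdict(list)
    for line in lines:
        for key, value in map_record(line):
            buckets[partition(key, num_reduces)].append((key, value))
    return {r: sorted(buckets[r]) for r in range(num_reduces)}


if __name__ == "__main__":
    sample = [
        "t1,-73.98,40.75",   # inside the illustrative polygon
        "t2,-73.78,40.64",   # outside
        "t3,bad,row",        # malformed, skipped
        "t4,-73.96,40.80",   # inside
    ]
    for r, records in run_job(sample, num_reduces=3).items():
        print(r, records)
```

Changing `num_reduces` here only changes how the surviving records are split across buckets. On a real cluster, each bucket would be handled by a separate reduce task, which is where the number of reduces can affect run time.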
