SPARQL Query Processing in Distributed In-Memory System

분산 메모리 시스템에서의 SPARQL 질의 처리

  • Received : 2015.01.21
  • Accepted : 2015.06.22
  • Published : 2015.09.15

Abstract

In this paper, we propose a query processing approach that uses Spark, a functional programming and distributed in-memory system, to reduce the computational overhead of SPARQL. On the semantic web, RDF ontology data is produced at large scale, and the main challenge is to query and manipulate such large ontologies with high throughput. Most existing studies on SPARQL have focused on deploying the Hadoop MapReduce framework. Although approaches based on Hadoop MapReduce have shown promising results, they achieve low throughput because of the underlying distributed file processing. Therefore, to speed up query processing, we suggest query-processing methods based on memory caching in a distributed memory system. Our approach is also integrated with a clause unification method that propagates bindings between clauses by exploiting Spark's join, map, and filter methods along with caching. In our experiments, we achieved a high level of performance relative to other approaches. In particular, our performance was comparable to that of Sempala, which has been considered the fastest query processing system.

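In this paper, we propose a method that reduces the overhead of SPARQL query processing by using Spark, a functional programming and distributed in-memory environment. Because RDF ontology data on the semantic web has grown explosively in recent years, efficiently processing queries over large-scale ontology data has become a major issue. Existing studies on SPARQL query processing have focused on Hadoop's MapReduce framework. However, because Hadoop performs its work on top of distributed file processing, performance can degrade. To improve query processing speed, this paper therefore proposes a method for processing queries on a distributed in-memory system. We also propose a scheme that uses Spark's join, the map and filter operations of functional programming, and Spark's caching to propagate binding values between SPARQL query clauses. In our experiments, the approach achieved high performance compared with other techniques. In particular, its performance was similar to that of Sempala, currently the fastest SPARQL query engine.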
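The abstract describes evaluating SPARQL clauses with filter, map, and join operations over triples held in memory. The following is a minimal single-process sketch of that idea in plain Python, not the authors' implementation: the toy triples, the function names `match_pattern`, `join_on_key`, and `students_and_courses`, and the example query are all assumptions made for illustration. In Spark itself the triples would live in a cached distributed dataset and the same three steps would run as cluster operations.

```python
from collections import defaultdict

# Hypothetical toy triples (subject, predicate, object); not data from the paper.
TRIPLES = [
    ("alice", "rdf:type", "ub:Student"),
    ("bob", "rdf:type", "ub:Student"),
    ("carol", "rdf:type", "ub:Professor"),
    ("alice", "ub:takesCourse", "course1"),
    ("bob", "ub:takesCourse", "course2"),
    ("carol", "ub:teacherOf", "course1"),
]


def match_pattern(triples, predicate, obj=None):
    """Filter triples by one clause, then map each match to (subject, object)."""
    selected = filter(
        lambda t: t[1] == predicate and (obj is None or t[2] == obj), triples
    )
    return list(map(lambda t: (t[0], t[2]), selected))


def join_on_key(left, right):
    """Hash join of two (key, value) lists on the key, i.e. the shared variable."""
    index = defaultdict(list)
    for key, value in left:
        index[key].append(value)
    return [
        (key, (left_value, right_value))
        for key, right_value in right
        for left_value in index.get(key, [])
    ]


def students_and_courses(triples):
    """Evaluate: SELECT ?s ?c WHERE { ?s rdf:type ub:Student . ?s ub:takesCourse ?c }"""
    cached = list(triples)  # stands in for keeping the triple set in memory
    students = match_pattern(cached, "rdf:type", "ub:Student")
    courses = match_pattern(cached, "ub:takesCourse")
    joined = join_on_key(students, courses)  # bindings of ?s propagate via the join
    return sorted((s, c) for s, (_type, c) in joined)


if __name__ == "__main__":
    print(students_and_courses(TRIPLES))
```

Both clauses read the same in-memory triple list, which is the role caching plays in the approach: the data is loaded once and reused by every clause instead of being re-read from distributed files.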

Acknowledgement

Grant: Development of a Distributed Parallel Inference Platform for Large-Scale Knowledge Processing

Supported by: Institute for Information & Communications Technology Promotion (IITP)

Cited by

  1. SPARQL Query Processing System over Scalable Triple Data using SparkSQL Framework, vol. 43, no. 4, 2016, https://doi.org/10.5626/JOK.2016.43.4.450