SPARQL Query Processing System over Scalable Triple Data using SparkSQL Framework

Jeon, MyungJoong; Hong, JinYoung; Park, YoungTack

doi:10.5626/JOK.2016.43.4.450

Journal of KIISE

Vol. 43, No. 4 / pp. 450-459 / 2016 / pISSN 2383-630X / eISSN 2383-6296

Korean Institute of Information Scientists and Engineers

SparQLing: Building a SparkSQL-Based SPARQL Query System for Large-Scale Triple Data

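MyungJoong Jeon (Department of Computer Engineering, Soongsil University);
JinYoung Hong (Department of Computer Engineering, Soongsil University);
YoungTack Park (Department of Computer Engineering, Soongsil University)

Received: 2015.10.20
Reviewed: 2016.01.29
Published: 2016.04.15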

https://doi.org/10.5626/JOK.2016.43.4.450

Abstract

RDFS data grows larger every year, so SPARQL processing must change to keep queries fast. Many studies therefore process SPARQL queries on large-scale distributed processing frameworks. Among existing approaches, query engines based on Hadoop (MapReduce) cannot answer queries in real time, because their iterative jobs incur frequent I/O. In-memory distributed query engines are also difficult to build, because they must be implemented in a low-level language with the distributed structure in mind. In this paper, we propose a method for building a query engine that speeds up SPARQL query processing over large-scale triple data by using SparkSQL, an in-memory distributed query processing framework. SparkSQL is a high-level distributed query engine built on Spark that accepts conventional SQL statements. To process a SPARQL query, the system first generates an Algebra Tree with Jena, translates it into a Spark Algebra Tree suited to the Spark system, and then generates the SparkSQL query from that tree. We also propose a DataFrame-based triple property table design for more efficient query processing in the Spark in-memory system, and we apply it within the SparkSQL framework. Finally, we verify the validity of the approach through a comparative evaluation against query engines built on existing distributed processing frameworks.
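As a rough illustration of the property-table idea mentioned in the abstract, the sketch below pivots triples into one row per subject and one column per predicate, then translates a star-shaped SPARQL pattern (several triple patterns sharing one subject) into a single SQL SELECT with no self-joins. This is not the authors' implementation. It uses Python's built-in sqlite3 as a stand-in for SparkSQL so that it runs without a cluster, it skips the Jena Algebra Tree step entirely, and the table name `props`, the helper names, and the sample triples are all hypothetical. It also assumes every predicate is single-valued.

```python
import sqlite3

# Made-up sample triples (subject, predicate, object), for illustration only.
TRIPLES = [
    ("alice", "name", "Alice"),
    ("alice", "age", "30"),
    ("bob", "name", "Bob"),
]


def build_property_table(conn, triples):
    """Pivot triples into a table with one row per subject, one column per predicate."""
    predicates = sorted({p for _, p, _ in triples})
    columns = ", ".join(f'"{p}" TEXT' for p in predicates)
    conn.execute(f"CREATE TABLE props (subject TEXT PRIMARY KEY, {columns})")

    # Group objects by subject; assumes each predicate is single-valued.
    rows = {}
    for s, p, o in triples:
        rows.setdefault(s, {})[p] = o

    placeholders = ", ".join("?" for _ in predicates)
    for s, values in rows.items():
        conn.execute(
            f"INSERT INTO props VALUES (?, {placeholders})",
            [s] + [values.get(p) for p in predicates],
        )
    return predicates


def star_bgp_to_sql(patterns):
    """Translate [(predicate, variable), ...] sharing one subject into one SELECT.

    A pattern such as { ?s name ?n . ?s age ?a } needs no join here,
    because both predicates are columns of the same row.
    """
    select = ", ".join(f'"{p}" AS {v}' for p, v in patterns)
    where = " AND ".join(f'"{p}" IS NOT NULL' for p, _ in patterns)
    return f"SELECT subject, {select} FROM props WHERE {where} ORDER BY subject"


def demo():
    """Build the toy table and run the translated query against it."""
    conn = sqlite3.connect(":memory:")
    build_property_table(conn, TRIPLES)
    sql = star_bgp_to_sql([("name", "n"), ("age", "a")])
    return conn.execute(sql).fetchall()


if __name__ == "__main__":
    print(demo())
```

On a plain three-column triple table, the same pattern would need one self-join per additional triple pattern, which is the cost a property table layout is meant to avoid.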

Keywords

in-memory based distributed query engine;
RDFS;
SPARQL;
Spark;
SparkSQL;
Sempala
