• Title/Summary/Keyword: Distributed Query Processing

Development of a CUBRID-Based Distributed Parallel Query Processing System

  • Kim, Hyeong-Il;Yang, HyeonSik;Yoon, Min;Chang, Jae-Woo
    • Journal of Information Processing Systems / v.13 no.3 / pp.518-532 / 2017
  • With the rapid growth in data volume, research on big data processing has drawn attention. For big data processing, CUBRID Shard supports parallel query processing by dividing the database across a number of CUBRID servers. However, CUBRID Shard can answer a user's query only when the query needs to access a single CUBRID server rather than multiple ones. To solve this problem, we propose a CUBRID-based distributed parallel query processing system that can answer a user's query in a parallel and distributed manner. Finally, through a performance evaluation, we show that the proposed system achieves 2-3 times better query processing time than the existing CUBRID Shard.
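
The idea of answering one query that spans several database servers can be sketched as a scatter-gather step. The following is a minimal Python illustration under stated assumptions: in-memory lists stand in for the servers, and `SHARDS`, `run_on_shard`, and `scatter_gather` are hypothetical names. It is not the paper's system and does not use any CUBRID API.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical stand-ins for three database servers, each holding one shard.
SHARDS = [
    [{"id": 1, "amount": 40}, {"id": 4, "amount": 15}],
    [{"id": 2, "amount": 70}],
    [{"id": 3, "amount": 5}, {"id": 6, "amount": 90}],
]

def run_on_shard(rows, min_amount):
    """Evaluate the same filter locally on one shard."""
    return [r for r in rows if r["amount"] >= min_amount]

def scatter_gather(min_amount):
    """Send the sub-query to every shard in parallel, then merge the results."""
    with ThreadPoolExecutor(max_workers=len(SHARDS)) as pool:
        partials = pool.map(lambda rows: run_on_shard(rows, min_amount), SHARDS)
        merged = [row for part in partials for row in part]
    # A final sort gives a deterministic order regardless of shard layout.
    return sorted(merged, key=lambda r: r["id"])

result = scatter_gather(40)
```

Here each shard evaluates the filter independently, and only the merge step sees rows from more than one server.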

SPARQL Query Processing System over Scalable Triple Data using SparkSQL Framework (SparQLing : SparkSQL 기반 대용량 트리플 데이터를 위한 SPARQL 질의 시스템 구축)

  • Jeon, MyungJoong;Hong, JinYoung;Park, YoungTack
    • Journal of KIISE / v.43 no.4 / pp.450-459 / 2016
  • RDFS data grows larger every year; hence, SPARQL processing needs to change to keep queries fast. SPARQL query processing methods have been studied using scalable distributed processing frameworks. Current studies indicate that a query engine based on a scalable distributed processing framework such as Hadoop (MapReduce) is not suitable for real-time processing because of its repetitive tasks; in addition, it is difficult to construct a query engine on an in-memory distributed query engine, because the low-level distributed structure must be considered. In this paper, we propose a method for constructing a query engine that speeds up query processing over massive triple data. The query engine processes SPARQL queries using SparkSQL, an in-memory distributed query processing framework. SparkSQL is a high-level distributed query engine that supports existing SQL statements. To process a SPARQL query, an Algebra Tree is generated using Jena and then translated into a Spark Algebra Tree for use in the Spark system, and a system that generates the SparkSQL query is constructed. Furthermore, we propose a DataFrame-based triple property table design for more efficient query processing in the Spark system. Finally, we verify the validity of the approach through a comparative evaluation against a query engine built on an existing distributed processing framework.

Matrix-based Filtering and Load-balancing Algorithm for Efficient Similarity Join Query Processing in Distributed Computing Environment (분산 컴퓨팅 환경에서 효율적인 유사 조인 질의 처리를 위한 행렬 기반 필터링 및 부하 분산 알고리즘)

  • Yang, Hyeon-Sik;Jang, Miyoung;Chang, Jae-Woo
    • The Journal of the Korea Contents Association / v.16 no.7 / pp.667-680 / 2016
  • With the development of distributed computing platforms such as Hadoop MapReduce, conventional query processing techniques that were designed for a single machine need to be executed efficiently in distributed computing environments. In particular, similarity join query processing in distributed computing environments has been studied, where a similarity join retrieves all data pairs with high similarity between two given data sets. However, the existing similarity join query processing schemes for distributed computing environments suffer from a skewed computing load among clusters because they consider only the data transmission cost. In this paper, we propose a matrix-based load-balancing algorithm for efficient similarity join query processing in distributed computing environments. To balance the load evenly across clusters, the proposed algorithm estimates the expected computing cost by using a matrix and generates partitions based on the estimated cost. In addition, it reduces the computing load by filtering out data that are not used in query processing in the clusters. Finally, our performance evaluation shows that the proposed algorithm achieves better query processing performance than the existing one.

Hilbert-curve based Multi-dimensional Indexing Key Generation Scheme and Query Processing Algorithm for Encrypted Databases (암호화 데이터를 위한 힐버트 커브 기반 다차원 색인 키 생성 및 질의처리 알고리즘)

  • Kim, Taehoon;Jang, Miyoung;Chang, Jae-Woo
    • Journal of Korea Multimedia Society / v.17 no.10 / pp.1182-1188 / 2014
  • Recently, research on database outsourcing has been active owing to the popularity of cloud computing. However, because users' data may contain sensitive personal information, such as health, financial, and location information, data encryption methods have attracted much interest. Existing data encryption schemes process a query without decrypting the encrypted database in order to protect user privacy. In addition, to handle the large amount of data in cloud computing efficiently, it is necessary to study distributed index structures. However, existing index structures and query processing algorithms are limited in that they consider only single-column query processing. In this paper, we propose a grid-based multi-column indexing scheme and an encrypted query processing algorithm. To support multi-column query processing, the multi-dimensional index keys are generated by using a space decomposition method, i.e., a grid index. To support query processing over encrypted data, we adopt the Hilbert curve when generating an index key. Finally, we show that the proposed scheme is more efficient than the existing scheme for processing exact and range queries.
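
To illustrate how a grid index and a Hilbert curve can turn two column values into a single index key, here is a minimal Python sketch. It uses the standard coordinate-to-distance conversion for a Hilbert curve on an n-by-n grid (n a power of two). The helper `grid_cell`, the column ranges, and the sample values are assumptions made for illustration; the sketch leaves out encryption entirely and is not the paper's scheme.

```python
def grid_cell(value, lo, hi, n):
    """Map a column value in [lo, hi) to one of n equal-width grid slots."""
    slot = int((value - lo) * n / (hi - lo))
    return min(max(slot, 0), n - 1)

def hilbert_key(n, x, y):
    """Return the position of grid cell (x, y) along a Hilbert curve
    that fills an n-by-n grid, where n is a power of two."""
    d = 0
    s = n // 2
    while s > 0:
        rx = 1 if x & s else 0
        ry = 1 if y & s else 0
        d += s * s * ((3 * rx) ^ ry)
        # Rotate or flip the quadrant so the sub-curve has the right orientation.
        if ry == 0:
            if rx == 1:
                x = n - 1 - x
                y = n - 1 - y
            x, y = y, x
        s //= 2
    return d

# Hypothetical two-column record: age in [0, 100), salary in [0, 100000).
N = 4
key = hilbert_key(N, grid_cell(35, 0, 100, N), grid_cell(72000, 0, 100000, N))
```

Because cells with consecutive keys are always neighbours in the grid, a range query over both columns tends to map to a small number of contiguous key ranges.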

A Genetic Algorithm for Minimizing Query Processing Time in Distributed Database Design: Total Time Versus Response Time (분산 데이타베이스에서의 질의실행시간 최소화를 위한 유전자알고리즘: 총 시간 대 반응시간)

  • Song, Suk-Kyu
    • The KIPS Transactions:PartD / v.16D no.3 / pp.295-306 / 2009
  • Query execution time minimization is an important objective in distributed database design. While total time minimization is the objective for On-Line Transaction Processing (OLTP), response time minimization is the objective for decision support queries. We formulate the sub-query allocation problem using analytical models and solve it with a genetic algorithm (GA). We show that query execution plans that minimize total time are inefficient from the response time perspective, and vice versa. The procedure is tested with simulation experiments for queries of up to 20 joins. Comparison with exhaustive enumeration indicates that the GA produced optimal solutions in all cases in much less time.
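
The difference between the two objectives can be made concrete with a toy allocation problem. The Python sketch below is an illustration under assumptions: the cost table `COST` is invented, communication cost is ignored, sub-queries at different sites are assumed to run fully in parallel, and the search is plain exhaustive enumeration rather than the genetic algorithm the paper uses.

```python
from itertools import product

# Hypothetical cost (seconds) of running each sub-query at each site.
COST = {
    "q1": {"site_a": 3, "site_b": 4, "site_c": 5},
    "q2": {"site_a": 3, "site_b": 4, "site_c": 5},
    "q3": {"site_a": 3, "site_b": 5, "site_c": 4},
}
SITES = ["site_a", "site_b", "site_c"]

def total_time(plan):
    """Sum of all work done, regardless of where it runs."""
    return sum(COST[q][site] for q, site in plan.items())

def response_time(plan):
    """Elapsed time when sites work in parallel: the busiest site decides."""
    per_site = {}
    for q, site in plan.items():
        per_site[site] = per_site.get(site, 0) + COST[q][site]
    return max(per_site.values())

# Enumerate every allocation of sub-queries to sites (3^3 = 27 plans).
plans = [dict(zip(COST, sites)) for sites in product(SITES, repeat=len(COST))]
best_total = min(plans, key=total_time)
best_response = min(plans, key=response_time)
```

With these numbers the total-time optimum puts every sub-query on the cheapest site, which serializes the work there, while the response-time optimum spreads the sub-queries out at a higher total cost.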

Meta Data Caching Mechanism in Distributed Directory Database Systems (분산 디렉토리 데이터베이스 시스템에서의 메타 데이터 캐싱 기법)

  • Lee, Kang-Woo;Koh, Jin-Gwang
    • The Transactions of the Korea Information Processing Society / v.7 no.6 / pp.1746-1752 / 2000
  • In this paper, a cache mechanism is proposed to improve the speed of query processing in distributed directory database systems. To decrease the search time for requested objects and the query processing time, query requests and results about objects at a remote site are stored in the cache of a local site. The cache system architecture is designed according to the classified information, and a cache schema is designed for each kind of cache information. Operational algorithms are developed for the meta data cache, which has a meta data tree. This tree improves the speed of query processing by reducing the search space. Finally, a performance evaluation compares the proposed cache mechanism with X.500.

A Distributed SPARQL Query Processing Scheme Considering Data Locality and Query Execution Path (데이터 지역성 및 질의 수행 경로를 고려한 분산 SPARQL 질의 처리 기법)

  • Kim, Byounghoon;Kim, Daeyun;Ko, Geonsik;Noh, Yeonwoo;Lim, Jongtae;Bok, Kyoungsoo;Lee, Byoungyup;Yoo, Jaesoo
    • KIISE Transactions on Computing Practices / v.23 no.5 / pp.275-283 / 2017
  • A large amount of RDF data has been generated along with the increase in semantic web services. Various distributed storage and query processing schemes have been studied to use this massive amount of RDF data efficiently. In this paper, we propose a distributed SPARQL query processing scheme that considers the data locality and query execution path of large RDF data in order to reduce join and communication costs. In a distributed environment, a SPARQL query is divided into several sub-queries according to the conditions of its WHERE clause, taking data locality into account. The proposed scheme reduces data communication costs by grouping and processing the sub-queries through an index based on associated nodes. In addition, to reduce unnecessary joins and latency when processing the query, it creates an efficient query execution path that considers the data parsing cost, the amount of data each node communicates, and latency. Various performance evaluations show that the proposed scheme outperforms the existing scheme.

Query Optimization on Large Scale Nested Data with Service Tree and Frequent Trajectory

  • Wang, Li;Wang, Guodong
    • Journal of Information Processing Systems / v.17 no.1 / pp.37-50 / 2021
  • Query applications over nested data, the most commonly used form of data representation on the web, are becoming more widely used, especially for precise queries. MapReduce, a distributed architecture with parallel computing power, provides a good solution for big data processing. However, in practice, query requests are usually concurrent, which causes bottlenecks in server processing. To solve this problem, this paper first combines a column storage structure and an inverted index to build an index for nested data on MapReduce. On this basis, it puts forward an optimization strategy that combines a query execution service tree and frequent sub-query trajectories to reduce the response time of frequent queries and further improve the efficiency of multi-user concurrent queries on large-scale nested data. Experiments show that this method greatly improves the efficiency of nested data queries.

SPARQL Query Processing in Distributed In-Memory System (분산 메모리 시스템에서의 SPARQL 질의 처리)

  • Jagvaral, Batselem;Lee, Wangon;Kim, Kang-Pil;Park, Young-Tack
    • Journal of KIISE / v.42 no.9 / pp.1109-1116 / 2015
  • In this paper, we propose a query processing approach that uses Spark functional programming and a distributed memory system to reduce the computational overhead of SPARQL. In the semantic web, RDF ontology data is produced at large scale, and the main challenge is to query and manipulate such large ontologies with high throughput. Most existing studies on SPARQL have focused on deploying the Hadoop MapReduce framework, and although approaches based on Hadoop MapReduce have shown promising results, they achieve low throughput because of the underlying distributed file processing. Therefore, to speed up query processing, we suggest query processing methods based on memory caching in a distributed memory system. Our approach is also integrated with a clause unification method for propagating between the clauses that exploits Spark join, map and filter methods along with caching. In our experiments, we achieved a high level of performance relative to other approaches. In particular, our performance was close to that of Sempala, which has been considered the fastest query processing system.
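
The filter, map, and join steps that such an approach builds on can be shown on a tiny triple set. The sketch below is plain single-process Python standing in for the corresponding Spark operations; the triples, the helper names `match` and `join_on_subject`, and the sample query are all hypothetical, and nothing here reproduces the paper's clause unification method.

```python
# Hypothetical RDF-like triples: (subject, predicate, object).
TRIPLES = [
    ("alice", "type", "Student"),
    ("bob", "type", "Student"),
    ("carol", "type", "Professor"),
    ("alice", "advisor", "carol"),
    ("bob", "advisor", "dave"),
    ("erin", "advisor", "carol"),
]

def match(predicate, obj=None):
    """Filter triples by predicate (and optionally object), then map each
    surviving triple to a (subject, object) binding."""
    return [(s, o) for s, p, o in TRIPLES
            if p == predicate and (obj is None or o == obj)]

def join_on_subject(left, right):
    """Hash join of two binding lists on their shared subject variable."""
    index = {}
    for s, value in right:
        index.setdefault(s, []).append(value)
    return [(s, value) for s, _ in left for value in index.get(s, [])]

# Query sketch: SELECT ?x ?y WHERE { ?x type Student . ?x advisor ?y }
students = match("type", "Student")
advisors = match("advisor")
result = join_on_subject(students, advisors)
```

Each triple pattern becomes one filtered binding list, and shared variables between patterns become join keys.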

A Novel Air Indexing Scheme for Window Query in Non-Flat Wireless Spatial Data Broadcast

  • Im, Seok-Jin;Youn, Hee-Yong;Choi, Jin-Tak;Ouyang, Jinsong
    • Journal of Communications and Networks / v.13 no.4 / pp.400-407 / 2011
  • Various air indexing and data scheduling schemes for wireless broadcast of spatial data have been developed for energy-efficient query processing. The existing schemes are not effective when the clients' data access patterns are skewed toward some items, because they are based on flat broadcast, which does not take the popularity of the data items into consideration. In this paper, we therefore propose a data scheduling scheme that lets popular items appear more frequently on the channel, and a grid-based distributed index for non-flat broadcast (GDIN) for window query processing. The proposed GDIN allows quick and energy-efficient processing of window queries, matching the clients' linear channel access pattern and letting the clients access only the queried data items. The simulation results show that the proposed GDIN significantly outperforms the existing schemes in terms of access time, tuning time, and energy efficiency.
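
A non-flat broadcast cycle, in which popular items recur more often than the rest, can be sketched in a few lines of Python. This is only an illustration of the scheduling idea under assumptions: the function `nonflat_schedule`, the hot/cold split, and the item names are hypothetical, and the sketch does not implement GDIN or any air index.

```python
def nonflat_schedule(hot, cold, repeat):
    """Build one broadcast cycle in which every hot item appears `repeat`
    times and every cold item appears once. `repeat` must be at least 1."""
    chunk = -(-len(cold) // repeat)  # ceiling division
    cycle = []
    for i in range(repeat):
        cycle.extend(hot)                               # hot items lead each segment
        cycle.extend(cold[i * chunk:(i + 1) * chunk])   # one slice of the cold items
    return cycle

cycle = nonflat_schedule(["h1", "h2"], ["c1", "c2", "c3", "c4"], repeat=2)
```

A client waiting for a hot item then waits at most about one segment instead of a full cycle, at the cost of a longer cycle for cold items.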