• Title/Summary/Keyword: join

Effective Load Shedding for Multi-Way windowed Joins Based on the Arrival Order of Tuples on Data Streams (다중 윈도우 조인을 위한 튜플의 도착 순서에 기반한 효과적인 부하 감소 기법)

  • Kwon, Tae-Hyung;Lee, Ki-Yong;Son, Jin-Hyun;Kim, Myoung-Ho
    • Journal of KIISE:Databases / v.37 no.1 / pp.1-11 / 2010
  • Recently, there has been a growing interest in the processing of continuous queries over multiple data streams. When the arrival rates of tuples exceed the memory capacity of the system, a load shedding technique is used to avoid the system becoming overloaded by dropping some subset of input tuples. In this paper, we propose an effective load shedding algorithm for multi-way windowed joins over multiple data streams. Most previous load shedding algorithms estimate the productivity of each tuple, i.e., the number of join output tuples produced by the tuple, based on its "join attribute value" and drop tuples with the lowest productivity. However, the productivity of a tuple cannot be accurately estimated from its join attribute value when the join attribute values are unique and do not repeat, or the distribution of the join attribute values changes over time. For these cases, we estimate the productivity of a tuple based on its "arrival order" on data streams, rather than its join attribute value. The proposed method can effectively estimate the productivity of a tuple even when the productivity of a tuple cannot be accurately estimated from its join attribute value. Through extensive experiments and analysis, we show that our proposed method outperforms the previous methods in terms of effectiveness and efficiency.

A Pipelined Hash Join Method for Load Balancing (부하 균형 유지를 고려한 파이프라인 해시 조인 방법)

  • Moon, Jin-Gue;Park, No-Sang;Kim, Pyeong-Jung;Jin, Seong-Il
    • The KIPS Transactions:PartD / v.9D no.5 / pp.755-768 / 2002
  • We investigate the effect of data skew in join attributes on the performance of a pipelined multi-way hash join method, and propose two new hash join methods with load balancing capabilities. The first method allocates buckets statically in round-robin fashion, and the second allocates buckets adaptively based on a frequency distribution. With hash-based joins, multiple joins can be pipelined so that the early results of a join are sent to the next join, before the whole join is completed, without being staged on disk. Unless the pipelined execution of multiple hash joins includes some load balancing mechanism, the skew effect can severely degrade system performance. In this paper, we derive an execution model of the pipeline segment and a cost model, and develop a simulator for the study. Our simulation over a wide range of parameters shows that join selectivities and relation sizes degrade system performance more as the degree of data skew increases. However, the proposed method, using a large number of buckets and a tuning technique, offers substantial robustness against a wide range of skew conditions.
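
The two allocation policies named in this abstract can be contrasted with a minimal sketch. It assumes the per-bucket tuple counts are already known and uses a greedy "largest bucket to least-loaded node" rule as a stand-in for the frequency-based policy; the function names and the greedy rule are illustrative assumptions, not the authors' algorithm.

```python
def round_robin_allocation(num_buckets, num_nodes):
    """Static policy: bucket b is assigned to node b mod num_nodes."""
    return {b: b % num_nodes for b in range(num_buckets)}


def frequency_allocation(bucket_sizes, num_nodes):
    """Assumed adaptive policy: place the largest buckets first,
    each on the node that currently holds the fewest tuples."""
    loads = [0] * num_nodes
    assignment = {}
    ordered = sorted(bucket_sizes.items(), key=lambda kv: (-kv[1], kv[0]))
    for bucket, size in ordered:
        node = min(range(num_nodes), key=lambda n: loads[n])
        assignment[bucket] = node
        loads[node] += size
    return assignment


def node_loads(assignment, bucket_sizes, num_nodes):
    """Total number of tuples each node receives under an assignment."""
    loads = [0] * num_nodes
    for bucket, node in assignment.items():
        loads[node] += bucket_sizes.get(bucket, 0)
    return loads


# Skewed input: buckets 0 and 2 hold most of the tuples.
sizes = {0: 90, 1: 10, 2: 80, 3: 20}
print(node_loads(round_robin_allocation(4, 2), sizes, 2))
print(node_loads(frequency_allocation(sizes, 2), sizes, 2))
```

With skewed bucket sizes, the static policy can pile the heavy buckets onto one node, while the frequency-aware policy spreads them out. Using many small buckets gives such a policy more freedom to even out the load.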

Continuous Query Processing in Data Streams Using Duality of Data and Queries (데이타와 질의의 이원성을 이용한 데이타스트림에서의 연속질의 처리)

  • Lim Hyo-Sang;Lee Jae-Gil;Lee Min-Jae;Whang Kyu-Young
    • Journal of KIISE:Databases / v.33 no.3 / pp.310-326 / 2006
  • In this paper, we deal with a method of efficiently processing continuous queries in a data stream environment. We classify previous query processing methods into two dual categories - data-initiative and query-initiative - depending on whether query processing is initiated by selecting a data element or a query. This classification stems from the fact that data and queries have been treated asymmetrically. For processing continuous queries, only data-initiative methods have traditionally been employed, and thus, the performance gain that could be obtained by query-initiative methods has been overlooked. To solve this problem, we focus on an observation that data and queries can be treated symmetrically. In this paper, we propose the duality model of data and queries and, based on this model, present a new viewpoint of transforming the continuous query processing problem to a multi-dimensional spatial join problem. We also present a continuous query processing algorithm based on spatial join, named Spatial Join CQ. Spatial Join CQ processes continuous queries by finding the pairs of overlapping regions from a set of data elements and a set of queries defined as regions in the multi-dimensional space. The algorithm achieves the effects of both of the two dual methods by using the spatial join, which is a symmetric operation. Experimental results show that the proposed algorithm outperforms earlier methods by up to 36 times for simple selection continuous queries and by up to 7 times for sliding window join continuous queries.
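
The duality idea can be illustrated with a small sketch in which both data elements and queries are regions in a multi-dimensional space (a data element being a degenerate region, i.e. a point), so that query matching becomes a search for overlapping pairs. The `Region` type and the nested-loop join below are assumptions made for illustration. They are not the Spatial Join CQ algorithm itself, whose details the abstract does not give.

```python
from typing import NamedTuple, Tuple


class Region(NamedTuple):
    lo: Tuple[float, ...]  # lower corner, one value per dimension
    hi: Tuple[float, ...]  # upper corner, one value per dimension


def point(*coords):
    """A data element is a region whose corners coincide."""
    return Region(tuple(coords), tuple(coords))


def overlaps(a, b):
    """Two regions overlap if their extents overlap in every dimension."""
    return all(al <= bh and bl <= ah
               for al, ah, bl, bh in zip(a.lo, a.hi, b.lo, b.hi))


def spatial_join(data, queries):
    """Naive symmetric join: every (data, query) pair of overlapping regions."""
    return [(d_id, q_id)
            for d_id, d in data.items()
            for q_id, q in queries.items()
            if overlaps(d, q)]


# Hypothetical two-attribute stream, e.g. (temperature, humidity).
data = {"d1": point(5, 20), "d2": point(50, 70)}
queries = {
    "q1": Region((0, 0), (10, 30)),     # 0 <= temp <= 10 and 0 <= hum <= 30
    "q2": Region((40, 60), (60, 80)),
    "q3": Region((0, 0), (100, 100)),
}
print(spatial_join(data, queries))
```

Because `overlaps` is symmetric, the same routine works whether processing is driven from the data side or the query side. A practical implementation would replace the nested loop with an index-based or partition-based spatial join.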

A Load Balancing Method using Partition Tuning for Pipelined Multi-way Hash Join (다중 해시 조인의 파이프라인 처리에서 분할 조율을 통한 부하 균형 유지 방법)

  • Mun, Jin-Gyu;Jin, Seong-Il;Jo, Seong-Hyeon
    • Journal of KIISE:Databases / v.29 no.3 / pp.180-192 / 2002
  • We investigate the effect of data skew in join attributes on the performance of a pipelined multi-way hash join method, and propose two new hash join methods for the shared-nothing multiprocessor environment. The first method allocates buckets statically in round-robin fashion, and the second allocates buckets dynamically based on a frequency distribution. With hash-based joins, multiple joins can be pipelined so that the early results of a join are sent to the next join, before the whole join is completed, without being staged on disk. The shared-nothing multiprocessor architecture is known to scale better in supporting very large databases. However, this hardware structure is very sensitive to data skew. Unless the pipelined execution of multiple hash joins includes some dynamic load balancing mechanism, the skew effect can severely degrade system performance. In this paper, we derive an execution model of the pipeline segment and a cost model, and develop a simulator for the study. Our simulation over a wide range of parameters shows that join selectivities and relation sizes degrade system performance more as the degree of data skew increases. However, the proposed method, using a large number of buckets and a tuning technique, offers substantial robustness against a wide range of skew conditions.

A Study on Selecting Bitmap Join Index to Speed up Complex Queries in Relational Data Warehouses (관계형 데이터 웨어하우스의 복잡한 질의의 처리 효율 향상을 위한 비트맵 조인 인덱스 선택에 관한 연구)

  • An, Hyoung-Geun;Koh, Jae-Jin
    • The KIPS Transactions:PartD / v.19D no.1 / pp.1-14 / 2012
  • Because data warehouses are large, the choice of indices affects the efficiency of query processing. Indices lower the query processing cost, but they occupy large storage areas and incur maintenance costs whenever the database is updated. Bitmap join indices are well suited to optimizing star join queries, which join a fact table with many dimension tables and apply selections on the dimension tables. Although bitmap join indices, with their binary representation, have a low storage cost, selecting the indexing attributes from the huge number of candidate attributes is difficult. Index selection proceeds by first reducing the number of candidate attributes and then selecting the indexing attributes from among them. In this paper, we address the bitmap join index selection problem by reducing the number of candidate attributes with data mining techniques. Whereas existing techniques reduce the candidates using only the frequencies of attributes, we consider the frequencies of attributes together with the sizes of the dimension tables, the sizes of their tuples, and the disk page size. We use frequent itemset mining to prune the large number of candidate attributes. We then build the bitmap join indices that have the lowest cost and the smallest storage area within the storage constraints, using cost functions applied to the bitmap join indices of the candidate attributes. We compare our technique with the existing ones and analyze the results to evaluate its efficiency.
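
For readers unfamiliar with the structure being selected, the following sketch shows what a bitmap join index is, under simplifying assumptions: one bitmap per dimension attribute value, with one bit per fact-table row, stored here as a Python integer. The table and column names are hypothetical, and the sketch does not cover the paper's selection method.

```python
def build_bitmap_join_index(fact_rows, dim_rows, fk, dim_key, dim_attr):
    """Map each value of a dimension attribute to a bitmap over fact rows.

    Bit i of the bitmap is set when fact row i joins to a dimension row
    carrying that attribute value.
    """
    attr_of = {d[dim_key]: d[dim_attr] for d in dim_rows}
    index = {}
    for pos, row in enumerate(fact_rows):
        value = attr_of[row[fk]]
        index[value] = index.get(value, 0) | (1 << pos)
    return index


def rows_from_bitmap(bitmap):
    """Positions of the fact rows whose bit is set."""
    return [pos for pos in range(bitmap.bit_length()) if (bitmap >> pos) & 1]


# Hypothetical star schema: sales facts with store and product dimensions.
stores = [{"store_id": 1, "city": "Seoul"}, {"store_id": 2, "city": "Busan"}]
products = [{"product_id": 10, "category": "book"},
            {"product_id": 11, "category": "toy"}]
sales = [
    {"store_id": 1, "product_id": 10},
    {"store_id": 2, "product_id": 10},
    {"store_id": 1, "product_id": 11},
    {"store_id": 1, "product_id": 10},
]

by_city = build_bitmap_join_index(sales, stores, "store_id", "store_id", "city")
by_category = build_bitmap_join_index(sales, products, "product_id", "product_id", "category")

# Star join with two selections, answered by AND-ing bitmaps
# without touching the dimension tables.
print(rows_from_bitmap(by_city["Seoul"] & by_category["book"]))
```

Each indexed attribute adds one bitmap per distinct value, so the storage and maintenance cost grows with the number of attributes chosen.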

Performance Evaluation of Hash Join Algorithm on Flash Memory SSDs (플래쉬 메모리 SSD 기반 해쉬 조인 알고리즘의 성능 평가)

  • Park, Jang-Woo;Park, Sang-Shin;Lee, Sang-Won;Park, Chan-Ik
    • Journal of KIISE:Computing Practices and Letters / v.16 no.11 / pp.1031-1040 / 2010
  • Hash join is one of the core algorithms in database management systems. However, if a hash join cannot complete in one pass because the available memory is insufficient (i.e., hash table overflow), it may incur a few sequential writes and excessive random reads. With a hard disk as the temporary storage for hash joins, the I/O time is dominated by slow random reads in the probing phase. Meanwhile, flash memory based SSDs (flash SSDs) are becoming popular, and in the foreseeable future flash SSDs are expected to replace hard disks in enterprise databases. In contrast to a hard disk, a flash SSD has no mechanical components and offers low latency for random reads, and thus it can boost hash join performance. In this paper, we investigate several important and practical issues that arise when a flash SSD is used as temporary storage for hash join. First, we reveal the I/O patterns of hash join in detail and explain why a flash SSD can outperform a hard disk by more than an order of magnitude. Second, we present and analyze the impact of cluster size (i.e., the I/O unit in hash join) on performance. Finally, we empirically demonstrate that, while a commercial query optimizer is error-prone in predicting the execution time with a hard disk as temporary storage, it can precisely estimate the execution time with a flash SSD. In summary, we show that, when used as temporary storage for hash join, a flash SSD provides more reliable cost estimation as well as fast performance.
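
The overflow case described above is commonly handled by partitioning both inputs before joining them, as in a Grace-style hash join. The sketch below keeps the partitions in memory purely for illustration. In a real system each partition would be written to and read back from temporary storage, which is where the I/O pattern discussed in the abstract arises. This is a generic textbook scheme, not the implementation evaluated in the paper.

```python
from collections import defaultdict


def partition(rows, key, num_partitions):
    """Partitioning phase: route each row to a partition by hashing its key.
    On disk, each partition would be a file written in cluster-sized units."""
    parts = defaultdict(list)
    for row in rows:
        parts[hash(row[key]) % num_partitions].append(row)
    return parts


def grace_hash_join(build_rows, probe_rows, key, num_partitions=4):
    """Join two inputs partition by partition, so that only one partition's
    hash table has to fit in memory at a time."""
    build_parts = partition(build_rows, key, num_partitions)
    probe_parts = partition(probe_rows, key, num_partitions)
    result = []
    for p, b_rows in build_parts.items():
        # Build phase: in-memory hash table for this partition only.
        table = defaultdict(list)
        for b in b_rows:
            table[b[key]].append(b)
        # Probe phase: matching rows can only be in the same partition.
        for r in probe_parts.get(p, []):
            for b in table.get(r[key], []):
                result.append({**b, **r})
    return result


customers = [{"id": 1, "name": "a"}, {"id": 2, "name": "b"}]
orders = [{"id": 1, "amt": 10}, {"id": 1, "amt": 20}, {"id": 3, "amt": 5}]
print(grace_hash_join(customers, orders, "id"))
```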

Parallel Processing of Multi-Way Spatial Join (다중 공간 조인의 병렬 처리)

  • Ryu, Woo-Seok;Hong, Bong-Hee
    • Journal of KIISE:Databases / v.27 no.2 / pp.256-268 / 2000
  • A multi-way spatial join is a nested expression of two or more spatial joins. Processing a multi-way spatial join is costly, but no scheme for processing it in parallel has yet been reported. In this paper, parallel processing of a multi-way spatial join consists of a parallel multi-way spatial filter and a parallel spatial refinement. The parallel spatial refinement is executed in two steps. The first generates a graph, from the table of candidate object pairs produced by the multi-way spatial filter, that is used to reduce duplication of both spatial objects and spatial operations. The second performs the parallel spatial refinement using that graph. Refinement using the graph is shown to be more efficient than the alternatives. In task creation for parallel refinement, minimum duplication partitioning of the Spatial_Object_On_Node graph shows the best performance.

A Design of Filtering Technique on LBSNS using Spatial Join (LBSNS에서의 공간조인을 이용한 필터링 기법의 설계)

  • Lee, Eun-Sik;Cho, Dae-Soo
    • Proceedings of the Korean Institute of Information and Communication Sciences Conference / 2011.05a / pp.230-232 / 2011
  • With the advent of GPS-equipped digital devices such as smartphones and tablet PCs, a number of LBSNS applications have been released, and even general SNS applications now offer various location-based services. In Twitter's case, news about an area of interest is not delivered to users through automatic subscription; users must search for it on the website. This paper describes a system designed for users who want to subscribe to local news without procedures such as searching with operators. The system uses PBSM (Partition Based Spatial-Merge Join), which requires no index, to batch-process a massive number of queries. The results of the spatial join are stored in a materialized view and then provided to users.
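
The index-free, partition-based approach mentioned here can be sketched as follows. Rectangles are `(xmin, ymin, xmax, ymax)` tuples, space is cut into a uniform grid, each rectangle is copied into every cell it touches, and pairs are joined cell by cell. The grid size, the identifiers, and the nested loop inside each cell are simplifying assumptions. PBSM as published uses a plane sweep within each partition.

```python
from collections import defaultdict
from itertools import product


def cells_for(rect, cell_size):
    """Grid cells touched by a rectangle (xmin, ymin, xmax, ymax)."""
    xmin, ymin, xmax, ymax = rect
    xs = range(int(xmin // cell_size), int(xmax // cell_size) + 1)
    ys = range(int(ymin // cell_size), int(ymax // cell_size) + 1)
    return product(xs, ys)


def intersects(a, b):
    """Closed-rectangle intersection test."""
    return a[0] <= b[2] and b[0] <= a[2] and a[1] <= b[3] and b[1] <= a[3]


def pbsm_join(left, right, cell_size):
    """Partition both inputs on the grid, then join within each cell."""
    left_parts, right_parts = defaultdict(list), defaultdict(list)
    for lid, rect in left.items():
        for cell in cells_for(rect, cell_size):
            left_parts[cell].append(lid)
    for rid, rect in right.items():
        for cell in cells_for(rect, cell_size):
            right_parts[cell].append(rid)
    pairs = set()  # a set removes pairs found in more than one cell
    for cell, lids in left_parts.items():
        for lid in lids:
            for rid in right_parts.get(cell, []):
                if intersects(left[lid], right[rid]):
                    pairs.add((lid, rid))
    return pairs


# Hypothetical data: user interest areas and geotagged posts.
interest_areas = {"u1": (0, 0, 10, 10), "u2": (25, 25, 30, 30)}
posts = {"p1": (5, 5, 5, 5), "p2": (8, 8, 22, 22), "p3": (50, 50, 50, 50)}
print(sorted(pbsm_join(interest_areas, posts, cell_size=10)))
```

The resulting set of pairs is the kind of output that could be stored in a materialized view and served to subscribers.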

Causality join query processing for data stream by spatio-temporal sliding window (시공간 슬라이딩윈도우기법을 이용한 데이터스트림의 인과관계 결합질의처리방법)

  • Kwon, O-Je;Li, Ki-Joune
    • Spatial Information Research / v.16 no.2 / pp.219-236 / 2008
  • Data streams collected from sensors contain a large amount of useful information, including causality relationships. A causality join query over data streams retrieves a set of (cause, effect) pairs from the streams. However, some causality pairs may be lost from the query result because of the delay between the sensors and the data stream management system and because of the limited size of the sliding windows. In this paper, we first investigate the spatial, temporal, and spatio-temporal aspects of the causality join query over data streams. Second, we propose several strategies for sliding window management based on these observations. The accuracy of the proposed strategies is studied through intensive experiments, and the results show that they improve the accuracy of causality join queries over data streams compared with a simple FIFO strategy.
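
The baseline that the abstract compares against, a sliding window managed in FIFO order, can be sketched to show how a causality pair gets lost. The tuple layout `(id, location, time)`, the count-based window, and the `same_place_then_later` predicate are assumptions for illustration. The paper's own window management strategies are not reproduced here.

```python
from collections import deque


def fifo_window_join(events, window_size, match):
    """Symmetric sliding-window join with count-based FIFO windows.

    events: iterable of (stream, item) in arrival order, where stream is
            "cause" or "effect".
    match:  predicate match(cause_item, effect_item) -> bool.
    """
    windows = {
        "cause": deque(maxlen=window_size),   # oldest item is evicted first
        "effect": deque(maxlen=window_size),
    }
    results = []
    for stream, item in events:
        other = "effect" if stream == "cause" else "cause"
        for candidate in windows[other]:
            pair = (item, candidate) if stream == "cause" else (candidate, item)
            if match(*pair):
                results.append(pair)
        windows[stream].append(item)
    return results


def same_place_then_later(cause, effect):
    """Assumed causality test: same location, and the cause precedes the effect."""
    return cause[1] == effect[1] and cause[2] < effect[2]


events = [
    ("cause", ("c1", "A", 1)),
    ("cause", ("c2", "B", 2)),
    ("effect", ("e1", "A", 3)),  # arrives after c1 may already be evicted
]
print(fifo_window_join(events, 1, same_place_then_later))
print(fifo_window_join(events, 2, same_place_then_later))
```

With a window of one item, `c1` is evicted before its effect arrives and the pair is missed. A window of two retains it. A window policy that takes spatial or temporal proximity into account aims to keep the tuples most likely to match, instead of simply the most recent ones.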

Performance of Spatial Join Operations using Multi-Attribute Access Methods (다중-속성 색인기법을 이용한 공간조인 연산의 성능)

  • 황병연
    • Spatial Information Research / v.7 no.2 / pp.271-282 / 1999
  • In this paper, we derive an efficient indexing scheme, the SJ tree, which handles multi-attribute data and spatial join operations efficiently. In addition, a number of algorithms for manipulating multi-attribute data are given, together with their computational and I/O complexity. Moreover, we show that the SJ tree is a kind of generalized B-tree. This means that the SJ tree can easily be implemented on top of the built-in B-tree of most storage managers, since its structure resembles that of a B-tree. The spatial join operation with spatial output is benchmarked using the R-tree, B-tree, K-D-B tree, and SJ tree. The benchmark results indicate that the SJ tree outperforms the other indexing schemes on spatial joins with point data.
