• Title/Summary/Keyword: Hash Join

An Efficient M-way Stream Join Algorithm Exploiting a Bit-vector Hash Table (비트-벡터 해시 테이블을 이용한 효율적인 다중 스트림 조인 알고리즘)

  • Kwon, Tae-Hyung;Kim, Hyeon-Gyu;Lee, Yu-Won;Kim, Myoung-Ho
    • Journal of KIISE:Databases / v.35 no.4 / pp.297-306 / 2008
  • MJoin was proposed as an algorithm for efficiently joining multiple data streams whose characteristics change unpredictably. It extends the symmetric hash join to handle multiple data streams. Whenever a tuple arrives from a remote stream source, MJoin checks whether all of the hash tables contain matching tuples. However, when a join involves many data streams with low join selectivity, the performance of this check depends heavily on the order in which the hash tables are probed. In this paper, we propose the BiHT-Join algorithm, which extends MJoin to perform this check in constant time regardless of the join order. BiHT-Join maintains a bit-vector that represents the existence of tuples in each stream and decides whether a join will succeed by comparing bit-vectors. Based on this decision, BiHT-Join performs the hash join only for tuples that will join successfully. Our experimental results show that the proposed BiHT-Join outperforms MJoin when processing multiple streams.
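
The bit-vector check described in this abstract can be illustrated with a small sketch. It is not the authors' implementation: the class name `BitVectorMJoin`, the per-key integer bitmask, and the omission of stream windows and tuple expiration are all assumptions made for illustration. Bit i of the mask is set when stream i holds at least one tuple with the key, so a single integer comparison decides whether probing the hash tables is worthwhile.

```python
from collections import defaultdict


class BitVectorMJoin:
    """Toy m-way symmetric hash join on one join key (hypothetical sketch)."""

    def __init__(self, num_streams):
        self.num_streams = num_streams
        # Mask with one bit per stream, all set.
        self.full_mask = (1 << num_streams) - 1
        # One hash table per stream: join key -> list of payloads.
        self.tables = [defaultdict(list) for _ in range(num_streams)]
        # Join key -> bitmask; bit i is set iff stream i holds that key.
        self.presence = defaultdict(int)

    def insert(self, stream_id, key, payload):
        """Store an arriving tuple and return the join results it completes."""
        self.tables[stream_id][key].append(payload)
        self.presence[key] |= 1 << stream_id

        # One comparison, independent of the probing order:
        # if any stream lacks the key, the join cannot succeed.
        if self.presence[key] != self.full_mask:
            return []

        # Every stream has a match, so probe all tables and build the results.
        results = [()]
        for i in range(self.num_streams):
            candidates = [payload] if i == stream_id else self.tables[i][key]
            results = [r + (c,) for r in results for c in candidates]
        return results
```

For example, with three streams, inserting key `"k"` into streams 0 and 1 returns no results and probes nothing; the insert into stream 2 completes the mask and yields one combined tuple.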

Performance Evaluation of Hash Join Algorithm on Flash Memory SSDs (플래쉬 메모리 SSD 기반 해쉬 조인 알고리즘의 성능 평가)

  • Park, Jang-Woo;Park, Sang-Shin;Lee, Sang-Won;Park, Chan-Ik
    • Journal of KIISE:Computing Practices and Letters / v.16 no.11 / pp.1031-1040 / 2010
  • Hash join is one of the core algorithms in database management systems. However, if a hash join cannot complete in one pass because the available memory is insufficient (i.e., hash table overflow), it may incur a few sequential writes and excessive random reads. With a hard disk as the temporary storage for hash joins, the I/O time is dominated by slow random reads in the probing phase. Meanwhile, flash memory based SSDs (flash SSDs) are becoming popular, and we expect flash SSDs to replace hard disks in enterprise databases in the foreseeable future. In contrast to a hard disk, a flash SSD has no mechanical components and offers low latency for random reads, so it can boost hash join performance. In this paper, we investigate several important and practical issues that arise when a flash SSD is used as temporary storage for hash join. First, we reveal the I/O patterns of hash join in detail and explain why a flash SSD can outperform a hard disk by more than an order of magnitude. Second, we present and analyze the impact of cluster size (i.e., the I/O unit in hash join) on performance. Finally, we empirically demonstrate that, while a commercial query optimizer is error-prone in predicting the execution time with a hard disk as temporary storage, it can precisely estimate the execution time with a flash SSD. In summary, we show that, when used as temporary storage for hash join, a flash SSD provides more reliable cost estimation as well as faster performance.
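
As background for the overflow case mentioned in this abstract, the following is a minimal sketch of a partitioned (Grace-style) hash join. It is a generic textbook illustration, not the setup evaluated in the paper: the function name `grace_hash_join` and the `num_partitions` parameter are assumptions, and in-memory lists stand in for the temporary partition files that a real system would write to and read back from disk or SSD.

```python
from collections import defaultdict


def grace_hash_join(build_rows, probe_rows, key, num_partitions=4):
    """Join two lists of dicts on `key` by partitioning first, then joining."""
    # Partition phase: route every row to a partition chosen by its key hash.
    # In a real system each partition would be a temporary file.
    build_parts = defaultdict(list)
    probe_parts = defaultdict(list)
    for row in build_rows:
        build_parts[hash(row[key]) % num_partitions].append(row)
    for row in probe_rows:
        probe_parts[hash(row[key]) % num_partitions].append(row)

    # Join phase: for each partition pair, build a hash table on the build
    # side and probe it with the matching probe-side partition.
    out = []
    for p, build_part in build_parts.items():
        table = defaultdict(list)
        for row in build_part:
            table[row[key]].append(row)
        for row in probe_parts.get(p, []):
            for match in table.get(row[key], []):
                out.append({**match, **row})
    return out
```

Because rows with equal keys always land in the same partition number, each partition pair can be joined independently with a hash table that only needs to hold one partition at a time.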

A Spatial Hash Strip Join Algorithm for Effective Handling of Skewed Data (편중 데이타의 효율적인 처리를 위한 공간 해쉬 스트립 조인 알고리즘)

  • Shim Young-Bok;Lee Jong-Yun
    • Journal of KIISE:Databases / v.32 no.5 / pp.536-546 / 2005
  • In this paper, we focus on the filtering step of candidate objects for spatial join operations on input tables where neither input is indexed. Over the last decade, several spatial join algorithms for input tables with indexes have been extensively studied. Those algorithms show excellent performance on most spatial data, but little research has addressed the performance degradation that occurs in the presence of skewed data. Therefore, we propose a spatial hash strip join (SHSJ) algorithm that mitigates the skewed-data problem of the conventional spatial hash join (SHJ) algorithm. The basic idea is similar to the conventional SHJ algorithm; the differences are that bucket capacities are not limited when data are allocated to buckets, and that the SSSJ algorithm is applied to the bucket join operations. Finally, in experiments using the Tiger/line data set, the spatial hash strip join outperformed the existing SHJ and SSSJ algorithms.
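
The two ideas named in this abstract, buckets without a capacity limit and a sweep-based join inside each bucket, can be sketched roughly as follows. This is a simplified illustration under stated assumptions rather than the SHSJ algorithm itself: buckets are uniform vertical strips instead of SHJ's derived bucket extents, rectangles are tuples `(id, xmin, ymin, xmax, ymax)`, duplicates from replicated rectangles are removed with a set, and all function names are hypothetical.

```python
from collections import defaultdict


def overlaps(a, b):
    """True if two rectangles (id, xmin, ymin, xmax, ymax) intersect."""
    return a[1] <= b[3] and b[1] <= a[3] and a[2] <= b[4] and b[2] <= a[4]


def to_strips(rects, x0, width, n):
    """Put each rectangle in every vertical strip it touches (no capacity limit)."""
    strips = defaultdict(list)
    for r in rects:
        first = min(n - 1, max(0, int((r[1] - x0) // width)))
        last = min(n - 1, max(0, int((r[3] - x0) // width)))
        for s in range(first, last + 1):
            strips[s].append(r)
    return strips


def sweep_join(rs, ss):
    """Plane sweep along y: return (r_id, s_id) pairs whose rectangles intersect."""
    events = [(r[2], 0, r) for r in rs] + [(s[2], 1, s) for s in ss]
    events.sort(key=lambda e: (e[0], e[1]))
    active = ([], [])  # rectangles from each side still crossing the sweep line
    pairs = []
    for ymin, side, rect in events:
        other = active[1 - side]
        # Drop rectangles that ended before the sweep line reached ymin.
        other[:] = [o for o in other if o[4] >= ymin]
        for o in other:
            if overlaps(rect, o):
                pairs.append((rect[0], o[0]) if side == 0 else (o[0], rect[0]))
        active[side].append(rect)
    return pairs


def spatial_hash_strip_join(rs, ss, x0, x1, n):
    """Partition both inputs into n strips over [x0, x1], then sweep each strip."""
    width = (x1 - x0) / n
    r_strips = to_strips(rs, x0, width, n)
    s_strips = to_strips(ss, x0, width, n)
    result = set()  # a set removes pairs reported by more than one strip
    for s, bucket in r_strips.items():
        result.update(sweep_join(bucket, s_strips.get(s, [])))
    return result
```

A heavily skewed input simply produces one long strip list; the sweep inside that strip still processes it in sorted order instead of triggering a repartitioning step.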

Skewed Data Handling Technique Using an Enhanced Spatial Hash Join Algorithm (개선된 공간 해쉬 조인 알고리즘을 이용한 편중 데이터 처리 기법)

  • Shim Young-Bok;Lee Jong-Yun
    • The KIPS Transactions:PartD / v.12D no.2 s.98 / pp.179-188 / 2005
  • Spatial join has been studied extensively over the last decade. In this paper, we focus on the filtering step of candidate objects for spatial join operations on input tables where neither input is indexed. For this case, many algorithms have been presented and have shown excellent performance on most spatial data. However, if the data sets of the input tables are skewed, join performance degrades dramatically, and little research has addressed this problem. Therefore, we propose a spatial hash strip join (SHSJ) algorithm that combines properties of the existing spatial hash join (SHJ) algorithm, which is based on spatial partitioning according to the distribution of the input data set, and the SSSJ algorithm. Finally, to show that SHSJ performs better in both uniform and skewed cases, we evaluate SHSJ using the Tiger/line data sets and compare it with the SHJ algorithm.

X+ Join : The improved X join scheme for the duplicate check overhead reduction (엑스플러스 조인 : 조인 중복체크의 오버헤드를 줄이기 위한 개선된 방법)

  • Baek, Joo-Hyun;Park, Sung-Wook;Jung, Sung-Won
    • Proceedings of the Korean Information Science Society Conference / 2006.10c / pp.28-32 / 2006
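  • In environments such as ubiquitous computing, where externally supplied data arrives in real time as a stream and the end of the input cannot be known, conventional join methods cannot solve the problem. In addition, in such environments the size and characteristics of the data all differ, and the input is strongly affected by network conditions. For join operations in such stream environments, many algorithms such as double pipelined hash join, Xjoin, and Pjoin have represented the existing research. Among them, Xjoin uses features of the symmetric hash join and the hybrid hash join to reactively adjust the join process according to the flow of incoming data, thereby performing joins over streaming data. However, its performance suffers from the overhead of checking for duplicate results produced by its multiple stages. In this paper, we present a method that modifies the execution process of Xjoin to improve on this point. We propose a simple method that avoids producing duplicates by adding only an identifier to each partition, and we add a partition selection method that reduces unnecessary computation and I/O. This considerably simplifies the process of checking whether an operation is a duplicate, which should yield better performance, and it also reduces the overhead of storing timestamps, saving the storage space required for the overall operation.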

Performance Comparison of Join Operations Parallelization by using GPGPU (GPGPU 기반 조인 연산 병렬화 성능 비교)

  • Lee, Jong-Sub;Lee, Sang-Back;Lee, Kyu-Chul
    • Database Research / v.34 no.3 / pp.28-44 / 2018
  • In a database system, the join is the most expensive of the relational operations. CPU-based join operations generally run in parallel on anywhere from 1 to at most 16 cores, which does not significantly improve performance. In contrast, GPGPU (General-Purpose computing on Graphics Processing Units) allows parallel processing across thousands of processing units, greatly reducing the time required to perform join operations. The operations are parallelized on GPGPU using NVIDIA's CUDA SDK. In this paper, we implement parallelized join operations using GPGPU and compare their performance. The join operations used are Nested Loop Join (NLJ), Sort Merge Join (SMJ), and Hash Join (HJ), and the GPGPU devices are TITAN Xp, GTX 1080 Ti, and GTX 1080. We measure and compare the performance of CPU-based and GPGPU-based join operations, and we also compare these results with those of a previous study on GPGPU-based join operations. The experimental results show that the GPGPU-based joins are 6 to 328 times faster than the CPU-based ones.

Block Allocation Method for Efficiently Managing Temporary Files of Hash Joins on SSDs (SSD상에서 해시조인 임시 파일의 효과적인 관리를 위한 블록 할당 방법)

  • Joontae, Kim;Sangwon, Lee
    • KIPS Transactions on Computer and Communication Systems / v.11 no.12 / pp.429-436 / 2022
  • Temporary files are generated when a hash join is performed on tables larger than memory. During the join, each temporary file is deleted in turn once its I/O operations complete. This paper shows that the fallocate system call and the trim options related to file deletion significantly affect hash join performance when temporary files are managed on SSDs rather than hard disks. The experiments were conducted on various commercial and research SSDs using PostgreSQL, a representative open-source database. We find that, depending on whether the fallocate and trim options are used for temporary files, join performance can be improved by up to 3 to 5 times compared to the default combination. In addition, we investigate the write amplification and trim command overhead in the SSD for each combination of the two options for temporary files.

Design of a Spatial Hash Strip Join Algorithm using Efficient Bucket Partitioning and Joining Methods (효율적인 버킷 분할과 조인 방법을 이용한 공간 해쉬 스트립 조인 알고리즘 설계)

  • Shim, Young-Bok;Lee, Jong-Yun;Jung, Soon-Key
    • Proceedings of the Korea Information Processing Society Conference / 2003.11c / pp.1367-1370 / 2003
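  • In this paper, we propose a spatial hash join algorithm that can perform an optimal join operation even on two input relations that have no index. We combine the existing Spatial Hash Join (SHJ) and Scalable Sweeping-Based Spatial Join (SSSJ) algorithms, which are used to process relations without indexes, and propose the Spatial Hash Strip Join (SHSJ) algorithm, which can mitigate the degradation of join performance on skewed data that has been pointed out as a weakness of the SHJ algorithm. Whereas SHJ handles hash bucket overflow for skewed data by repartitioning buckets, the SHSJ algorithm proposed in this paper inserts the data into the bucket instead of repartitioning it, and then applies the SSSJ algorithm to buckets that overflowed during the join operation, thereby improving the processing performance for skewed input relations.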

A Skewed Data Handling Method using Spatial Hash Join Algorithm (공간 해쉬 조인 알고리즘을 이용한 편중 데이터 처리 기법)

  • 심영복;이종연
    • Proceedings of the Korean Information Science Society Conference / 2004.04b / pp.19-21 / 2004
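  • This paper focuses on the filtering step of the spatial join operation on two input tables that have no index. Representative related work includes the Spatial Hash Join (SHJ) and Scalable Sweeping-Based Spatial Join (SSSJ) algorithms. However, they suffer from a sharp drop in performance when the objects of the input tables are skewed in distribution. Therefore, to solve this problem, this paper proposes the Spatial Hash Strip Join (SHSJ) algorithm, which uses properties of the existing SHJ and SSSJ algorithms. It differs from the existing SHJ algorithm in that it places no limit on bucket capacity when assigning the input data sets to buckets, and in that it uses the superior SSSJ algorithm in the bucket join step to improve I/O performance. Finally, experiments on real Tiger/line data verified that the proposed SHSJ algorithm is far superior to the existing SHJ and SSSJ algorithms for join operations on skewed input tables.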

A Pipelined Hash Join Method for Load Balancing (부하 균형 유지를 고려한 파이프라인 해시 조인 방법)

  • Moon, Jin-Gue;Park, No-Sang;Kim, Pyeong-Jung;Jin, Seong-Il
    • The KIPS Transactions:PartD / v.9D no.5 / pp.755-768 / 2002
  • We investigate the effect of data skew in join attributes on the performance of a pipelined multi-way hash join method, and propose two new hash join methods with load balancing capabilities. The first method allocates buckets statically in a round-robin fashion, and the second allocates buckets adaptively using a frequency distribution. With hash-based joins, multiple joins can be pipelined so that early results from one join are sent to the next join before the whole join completes, without being staged on disk. Unless the pipelined execution of multiple hash joins includes a load balancing mechanism, skew can severely degrade system performance. In this paper, we derive an execution model of the pipeline segment and a cost model, and develop a simulator for the study. Our simulations over a wide range of parameters show that join selectivities and relation sizes degrade system performance more as the degree of data skew increases. However, the proposed method, using a large number of buckets and a tuning technique, offers substantial robustness against a wide range of skew conditions.
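
The contrast between the two allocation strategies in this abstract can be sketched as follows. The abstract does not give the exact rules, so both functions are assumptions: `round_robin_allocation` assigns buckets to nodes in arrival order, and `frequency_based_allocation` uses a greedy largest-first heuristic driven by per-bucket tuple counts, which is only one plausible reading of "adaptively via a frequency distribution".

```python
import heapq


def round_robin_allocation(bucket_sizes, num_nodes):
    """Static allocation: bucket i goes to node i mod num_nodes."""
    loads = [0] * num_nodes
    assignment = {}
    for i, (bucket, size) in enumerate(bucket_sizes.items()):
        node = i % num_nodes
        assignment[bucket] = node
        loads[node] += size
    return assignment, loads


def frequency_based_allocation(bucket_sizes, num_nodes):
    """Adaptive allocation: largest buckets first, each to the least loaded node."""
    heap = [(0, node) for node in range(num_nodes)]  # (current load, node id)
    heapq.heapify(heap)
    assignment = {}
    for bucket, size in sorted(bucket_sizes.items(), key=lambda kv: -kv[1]):
        load, node = heapq.heappop(heap)
        assignment[bucket] = node
        heapq.heappush(heap, (load + size, node))
    loads = [0] * num_nodes
    for load, node in heap:
        loads[node] = load
    return assignment, loads


# Hypothetical skewed bucket sizes (tuples per bucket).
sizes = {"b0": 90, "b1": 10, "b2": 50, "b3": 10, "b4": 40, "b5": 10}
_, rr_loads = round_robin_allocation(sizes, 2)
_, freq_loads = frequency_based_allocation(sizes, 2)
print(rr_loads, freq_loads)
```

With these sample sizes the large buckets all fall on the same node under round-robin, while the frequency-based variant spreads them out; using many small buckets gives the allocator finer pieces to balance with.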