• Title/Summary/Keyword: join

Search Result 1,155, Processing Time 0.025 seconds

A Skewed Data Handling Method using Spatial Hash Join Algorithm (공간 해쉬 조인 알고리즘을 이용한 편중 데이터 처리 기법)

  • 심영복;이종연
    • Proceedings of the Korean Information Science Society Conference
    • /
    • 2004.04b
    • /
    • pp.19-21
    • /
    • 2004
  • 이 논문은 인덱스가 존재하지 않는 두 입력 테이블의 공간 조인 연산 과정 중 여과 단계 처리에 중점을 둔다. 관련 연구는 Spatial Hash Join(SHJ)과 Scalable Sweeping-Based Spatial Join(SSSJ) 알고리즘이 대표적이다. 하지만 조인을 위한 입력 테이블의 객체들이 편중 분포할 경우 성능이 급격히 저하되는 문제를 가지고 있다. 따라서, 이 논문에서는 이러한 문제를 해결하기 위해 기존 SHJ 알고리즘과 SSSJ 알고리즘의 특성을 이용한 Spatial Hash Strip Join(SHSJ) 알고리즘을 제안한다. 기존 SHJ 알고리즘과의 차이점은 입력 데이터 집합을 버킷에 할당할 때 버킷 용량에 제한을 두지 않는다는 점과 버킷의 조인 단계에서 I/O 성능의 향상을 위해 우수한 SSSJ 알고리즘을 사용한다는 것이다. 끝으로 이 논문에서 제안한 SHSJ 알고리즘의 성능은 실제 Tiger/line 데이터를 이용하여 실험한 결과 기존의 SHJ와 SSSJ 알고리즘 보다 편중된 입력 테이블의 조인 연산에 대해 월등히 우수함이 검증되었다.

  • PDF

A Join Query with Aggregation functions Using Mapreduce (집계 함수를 포함하는 조인 질의의 맵리듀스를 사용한 효율적인 처리 기법)

  • Oh, So Hyeon;Lee, Ki Yong
    • Proceedings of the Korea Information Processing Society Conference
    • /
    • 2015.04a
    • /
    • pp.132-135
    • /
    • 2015
  • 맵리듀스(MapReduce)는 분산 환경에서의 빅데이터(Big Data), 즉 대용량 데이터를 처리하는 프로그래밍 모델이다. 대용량의 데이터를 분석하기 위해서 집계 함수(Aggregation function)로 데이터를 처리할 수 있다. 본 논문에서는 맵리듀스 환경을 기반으로 SQL 쿼리에서 집계 함수를 더 적은 비용으로 수행하며 효율적으로 처리할 수 있는 두 가지 전략을 제안한다. 두 가지 전략 중 더 높은 성능을 보이는 전략을 더 효율적인 처리 방법으로 판단한다. 첫 번째 전략은 두 테이블을 Join하여 집계 함수를 처리하는 방법이다. 두 번째 전략은 집계 함수를 처리하여 Join에 참여할 튜플의 수를 최소로 줄인 후 Join을 수행하고 다시 집계 함수를 처리하는 방법이다. 두 제안 방법을 비교하기 위하여 실험을 한 결과 두 번째 전략이 더 적은 비용이 드므로 더 효율적인 처리 방법인 것으로 보인다.

Preprocessing Method for Handling Multi-Way Join Continuous Queries over Data Streams (데이터 스트림에서 다중 조인 연속질의의 효과적인 처리를 위한 전처리 기법)

  • Seo, Ki-Yeon;Lee, Joo-Il;Lee, Won-Suk
    • Journal of Internet Computing and Services
    • /
    • v.13 no.3
    • /
    • pp.93-105
    • /
    • 2012
  • A data stream is a series of tuples which are generated in real-time, incessant, immense, and volatile manner. As new information technologies are actively emerging, stream processing methods are being needed to efficiently handle data streams. Especially, finding out an efficient evaluation for a multi-way join would make outstanding contributions toward improving the performance of a data stream management system because a join operation is one of the most resource-consuming operators for evaluating queries. In this paper, in order to evaluate efficiently a multi-way join continuous query, we propose a novel method to decrease the cost of a query by eliminating unsuccessful intermediate results. For this, we propose a matrix-based structure for monitoring data streams and estimate the number of final result tuples of the query and find out unsuccessful tuples by matrix multiplication operations. And then using these information, we process efficiently a multi-way join continuous query by filtering out the unsuccessful tuples in advance before actual evaluation of the query.

Closest Pairs and e-distance Join Query Processing Algorithms using a POI-based Materialization Technique in Spatial Network Databases (공간 네트워크 데이터베이스에서 POI 기반 실체화 기법을 이용한 Closest Pairs 및 e-distance 조인 질의처리 알고리즘)

  • Kim, Yong-Ki;Chang, Jae-Woo
    • Journal of Korea Spatial Information System Society
    • /
    • v.9 no.3
    • /
    • pp.67-80
    • /
    • 2007
  • Recently, many studies on query processing algorithms has been done for spatial networks, such as roads and railways, instead of Euclidean spaces, in order to efficiently support LBS(location-based service) and Telematics applications. However, both a closest pairs query and an e-distance join query require a very high cost in query processing because they can be answered by processing a set of POIs, instead of a single POI. Nevertheless, the query processing cost for closest pairs and e-distance join queries is rapidly increased as the number of k (or the length of radius) is increased. Therefore, we propose both a closest pairs query processing algorithm and an e-distance join query processing algorithm using a POI-based materialization technique so that we can process closest pairs and e-distance join queries in an efficient way. In addition, we show the retrieval efficiency of the proposed algorithms by making a performance comparison of the conventional algorithms.

  • PDF

Matrix-based Filtering and Load-balancing Algorithm for Efficient Similarity Join Query Processing in Distributed Computing Environment (분산 컴퓨팅 환경에서 효율적인 유사 조인 질의 처리를 위한 행렬 기반 필터링 및 부하 분산 알고리즘)

  • Yang, Hyeon-Sik;Jang, Miyoung;Chang, Jae-Woo
    • The Journal of the Korea Contents Association
    • /
    • v.16 no.7
    • /
    • pp.667-680
    • /
    • 2016
  • As distributed computing platforms like Hadoop MapReduce have been developed, it is necessary to perform the conventional query processing techniques, which have been executed in a single computing machine, in distributed computing environments efficiently. Especially, studies on similarity join query processing in distributed computing environments have been done where similarity join means retrieving all data pairs with high similarity between given two data sets. But the existing similarity join query processing schemes for distributed computing environments have a problem of skewed computing load balance between clusters because they consider only the data transmission cost. In this paper, we propose Matrix-based Load-balancing Algorithm for efficient similarity join query processing in distributed computing environment. In order to uniform load balancing of clusters, the proposed algorithm estimates expected computing cost by using matrix and generates partitions based on the estimated cost. In addition, it can reduce computing loads by filtering out data which are not used in query processing in clusters. Finally, it is shown from our performance evaluation that the proposed algorithm is better on query processing performance than the existing one.

Using a Greedy Algorithm for the Improvement of a MapReduce, Theta join, M-Bucket-I Heuristic (그리디 알고리즘을 이용한 맵리듀스 세타조인 M-Bucket-I 휴리스틱의 개선)

  • Kim, Wooyeol;Shim, Kyuseok
    • Journal of KIISE
    • /
    • v.43 no.2
    • /
    • pp.229-236
    • /
    • 2016
  • Theta join is one of the essential and important types of queries in database systems. As the amount of data needs to be processed increases, processing theta joins with a single machine becomes impractical. Therefore, theta join algorithms using distributed computing frameworks have been studied widely. Although one of the state-of-the-art theta-join algorithms uses M-Bucket-I heuristic, it is hard to use since running time of M-Bucket-I heuristic, which computes a mapping from a record to a reducer (i.e., reducer mapping), is O(n) where n is the size of input data. In this paper, we propose MBI-I algorithm which reduces the running time of M-Bucket-I heuristic to $O(r_{max}log\;n)$ and gives the same result as M-Bucket-I heuristic does. We also conducted several experiments to show algorithm and confirmed that our algorithm can improve the performance of a theta join by 10%.

Transformation-based Spatial Partition Join (변환기반 공간 파티션 조인)

  • 이민재;한욱신;이재길;황규영
    • Journal of KIISE:Databases
    • /
    • v.31 no.4
    • /
    • pp.352-361
    • /
    • 2004
  • Spatial joins find all pairs of spatial objects that satisfy a given spatial relationship. In this paper, we propose the transformation-based spatial partition join algorithm (TSPJ), a new spatial join algorithm that performs join in the transform space without using indexes. Since the existing algorithms deal with extents of spatial objects in the original space, they either need to replicate the spatial objects or have a relatively complex partition structure-resulting in degrading performance. In contrast, TSPJ transforms objects in the original space into points in the transform space and deals only with points having no extents. The transformation does not incur any additional overhead. Thus, our algorithm has advantages over existing ones in that it obviates the need for replicating spatial objects, and its partition structure is simple. As a result, it always has better performance compared with existing algorithms. Extensive experiments show that TSPJ improves performance by 20.5∼38.0% over the existing algorithms compared.

An Efficient Multiple Event Detection in Sensor Networks (센서 네트워크에서 효율적인 다중 이벤트 탐지)

  • Yang, Dong-Yun;Chung, Chin-Wan
    • Journal of KIISE:Databases
    • /
    • v.36 no.4
    • /
    • pp.292-305
    • /
    • 2009
  • Wireless sensor networks have a lot of application areas such as industrial process control, machine and resource management, environment and habitat monitoring. One of the main objects of using wireless sensor networks in these areas is the event detection. To detect events at a user's request, we need a join processing between sensor data and the predicates of the events. If there are too many predicates of events compared with a node's capacity, it is impossible to store them in a node and to do an in-network join with the generated sensor data This paper proposes a predicate-merge based in-network join approach to efficiently detect multiple events, considering the limited capacity of a sensor node and many predicates of events. It reduces the number of the original predicates of events by substituting some pairs of original predicates with some merged predicates. We create an estimation model of a message transmission cost and apply it to the selection algorithm of targets for merged predicates. The experiments validate the cost estimation model and show the superior performance of the proposed approach compared with the existing approaches.

Block Allocation Method for Efficiently Managing Temporary Files of Hash Joins on SSDs (SSD상에서 해시조인 임시 파일의 효과적인 관리를 위한 블록 할당 방법)

  • Joontae, Kim;Sangwon, Lee
    • KIPS Transactions on Computer and Communication Systems
    • /
    • v.11 no.12
    • /
    • pp.429-436
    • /
    • 2022
  • Temporary files are generated when the Hash Join is performed on tables larger than the memory. During the join process, each temporary file is deleted sequentially after it completes the I/O operations. This paper reveals for that the fallocate system call and file deletion-related trim options significantly impact the hash join performance when temporary files are managed on SSDs rather than hard disks. The experiment was conducted on various commercial and research SSDs using PostgreSQL, a representative open-source database. We find that it is possible to improve the join performance up to 3 to 5 times compared to the default combination depending on whether fallocate and trim options are used for temporary files. In addition, we investigate the write amplification and trim command overhead in the SSD according to the combination of the two options for temporary files.