• Title/Summary/Keyword: Join Algorithm

Uniform Load Distribution Using Sampling-Based Cost Estimation in Parallel Join (병렬 조인에서 샘플링 기반 비용 예측 기법을 이용한 균등 부하 분산)

  • Park, Ung-Gyu
    • The Transactions of the Korea Information Processing Society / v.6 no.6 / pp.1468-1480 / 1999
  • In database systems, join operations are the most complex and time-consuming operations and limit overall system performance. Many parallel join algorithms have been proposed for such systems, but they do not consider data skew, such as attribute value skew (AVS) and join product skew (JPS), and their performance degrades in skewed environments. This paper proposes a framework for uniform load distribution and an efficient parallel join algorithm that uses the framework to handle AVS and JPS. The algorithm estimates the data distributions of the input and output relations of the join using sampling and evaluates the join cost for the estimated distributions. Then, using histogram equalization, it distributes data among nodes to achieve good load balance in the local joining phase. For performance comparison, we present a simulation model of our algorithm and other join algorithms, together with the results of simulation experiments. The results indicate that our algorithm outperforms the other algorithms in the skewed case.
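
The sampling and equalization steps described in this abstract can be illustrated with a minimal Python sketch. It is not the paper's algorithm: the cost model (input tuples plus estimated join product per key) and the names `estimate_costs`, `equalized_boundaries`, and `node_for` are assumptions made for illustration only.

```python
from bisect import bisect_left
from collections import Counter


def estimate_costs(sample_r, sample_s):
    """Estimate a per-key join cost from samples of the two join columns.

    Assumed cost model (hypothetical): tuples to scan on both sides plus
    the estimated number of output tuples (the join product) for that key.
    """
    freq_r, freq_s = Counter(sample_r), Counter(sample_s)
    keys = set(freq_r) | set(freq_s)
    return {k: freq_r[k] + freq_s[k] + freq_r[k] * freq_s[k] for k in keys}


def equalized_boundaries(cost_by_key, num_nodes):
    """Choose key-range boundaries so each node gets a similar estimated cost."""
    remaining = sum(cost_by_key.values())
    boundaries, load = [], 0
    for key in sorted(cost_by_key):
        nodes_left = num_nodes - len(boundaries)
        if nodes_left == 1:
            break  # the last node takes every remaining key
        load += cost_by_key[key]
        # Close the current range once it holds its share of the remaining cost.
        if load >= remaining / nodes_left:
            boundaries.append(key)
            remaining -= load
            load = 0
    return boundaries


def node_for(key, boundaries):
    """Map a join key to a node index using the chosen range boundaries."""
    return bisect_left(boundaries, key)


costs = estimate_costs([1, 1, 1, 1, 2, 3, 4, 5], [1, 1, 2, 3, 4, 5, 6, 6])
boundaries = equalized_boundaries(costs, 3)
```

In this example the heavily skewed key 1 ends up alone on the first node, while the lighter keys are grouped onto the remaining nodes. A single key is never split across nodes in this sketch, so a very heavy key can still dominate one node.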

An Efficient Join Algorithm for Data Streams with Overlapping Window (중첩 윈도우를 가진 데이터 스트링을 위한 효율적인 조인 알고리즘)

  • Kim, Hyeon-Gyu;Kang, Woo-Lam;Kim, Myoung-Ho
    • Journal of KIISE:Computing Practices and Letters / v.15 no.5 / pp.365-369 / 2009
  • Overlapping windows are generally used in queries that process continuous data streams. Nevertheless, existing approaches discuss join algorithms only for basic types of windows, such as tumbling windows and tuple-driven windows. In this paper, we propose an efficient join algorithm for overlapping windows, which are a more general type of window. The proposed algorithm is based on an incremental window join and focuses on producing join results continuously when memory overflow occurs frequently. It consists of (1) a method to use incremental and full joins selectively, (2) a victim selection algorithm to minimize the latency of join processing, and (3) an idle-time processing algorithm. Our experiments show that the selective use of incremental and full joins provides better performance than using only one of them.

Transformation of Continuous Aggregation Join Queries over Data Streams

  • Tran, Tri Minh;Lee, Byung-Suk
    • Journal of Computing Science and Engineering / v.3 no.1 / pp.27-58 / 2009
  • Aggregation join queries are an important class of queries over data streams. These queries involve both join and aggregation operations, with window-based joins followed by an aggregation on the join output. All existing research addresses join query optimization and aggregation query optimization as separate problems. We observe that, by putting them within the same scope of query optimization, more efficient query execution plans are possible through more versatile query transformations. The enabling idea is to perform aggregation before the join so that the join execution time may be reduced. There has been some research on such query transformations in relational databases, but none in data streams. Doing it in data streams brings new challenges due to the incremental and continuous arrival of tuples. These challenges are addressed in this paper. Specifically, we first present a query processing model geared to facilitate query transformations and propose a query transformation rule specialized to work with streams. The rule is simple and yet covers all possible cases of transformation. Then we present a generic query processing algorithm that works with all alternative query execution plans possible with the transformation, and develop the cost formulas of the query execution plans. Based on the processing algorithm, we validate the rule theoretically by proving the equivalence of the query execution plans. Finally, through extensive experiments, we validate the cost formulas and study the performance of the alternative query execution plans.
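
The enabling idea of aggregating before the join can be shown on a single static window. The following Python sketch is only an illustration under simplifying assumptions (an equi-join on one key, a SUM aggregate, and no incremental tuple arrival); it is not the paper's transformation rule, and the function names are hypothetical.

```python
from collections import Counter, defaultdict


def join_then_aggregate(r_window, s_window):
    """Baseline plan: join R(key, value) with S(key), then SUM(value) per key."""
    totals = defaultdict(int)
    for r_key, r_value in r_window:
        for s_key in s_window:
            if r_key == s_key:
                totals[r_key] += r_value
    return dict(totals)


def aggregate_then_join(r_window, s_window):
    """Transformed plan: pre-aggregate R per key, then join with S.

    Each pre-aggregated sum is scaled by the number of matching S tuples,
    which keeps the result equal to the baseline plan.
    """
    partial_sums = defaultdict(int)
    for r_key, r_value in r_window:
        partial_sums[r_key] += r_value
    s_counts = Counter(s_window)
    return {
        key: total * s_counts[key]
        for key, total in partial_sums.items()
        if key in s_counts
    }


r_window = [("a", 1), ("a", 2), ("b", 5), ("c", 7)]
s_window = ["a", "a", "b", "d"]
assert join_then_aggregate(r_window, s_window) == aggregate_then_join(r_window, s_window)
```

The transformed plan touches each R tuple once and performs one lookup per distinct key, whereas the baseline compares every R tuple with every S tuple in the window.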

Grid-based Index Generation and k-nearest-neighbor Join Query-processing Algorithm using MapReduce (맵리듀스를 이용한 그리드 기반 인덱스 생성 및 k-NN 조인 질의 처리 알고리즘)

  • Jang, Miyoung;Chang, Jae Woo
    • Journal of KIISE / v.42 no.11 / pp.1303-1313 / 2015
  • MapReduce provides high levels of system scalability and fault tolerance for large-scale data processing. A MapReduce-based k-nearest-neighbor (k-NN) join algorithm seeks to produce the k nearest neighbors of each point of one dataset from another dataset. The algorithm has been considered important in big data analysis. However, the existing k-NN join query-processing algorithm suffers from a high index-construction cost that makes it unsuitable for big data processing. To solve this problem, we propose a new grid-based k-NN join query-processing algorithm. Our algorithm retrieves only the data neighboring a query cell and sends them to each MapReduce task, making it possible to reduce the overhead of data transmission and computation. Our performance analysis shows that our algorithm outperforms the existing scheme by up to seven-fold in terms of query-processing time, while also achieving a high level of query-result accuracy.
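
The idea of restricting candidates to the cells around a query cell can be sketched in a single process, without MapReduce. The Python code below is an approximate illustration under stated assumptions: 2-D points, a uniform grid with a caller-chosen `cell_size`, and a fixed 3x3 neighborhood. The names `build_grid` and `knn_join` are hypothetical, and the result can miss true neighbors that lie outside the neighborhood.

```python
import math
from collections import defaultdict


def build_grid(points, cell_size):
    """Index 2-D points by the grid cell that contains them."""
    grid = defaultdict(list)
    for point in points:
        cell = (int(point[0] // cell_size), int(point[1] // cell_size))
        grid[cell].append(point)
    return grid


def knn_join(r_points, s_points, k, cell_size):
    """For each point in R, return up to k nearest points of S.

    Only S points in the query cell and its eight neighboring cells are
    considered, so the answer is approximate by construction.
    """
    grid = build_grid(s_points, cell_size)
    result = {}
    for r in r_points:
        cx, cy = int(r[0] // cell_size), int(r[1] // cell_size)
        candidates = [
            s
            for dx in (-1, 0, 1)
            for dy in (-1, 0, 1)
            for s in grid.get((cx + dx, cy + dy), [])
        ]
        candidates.sort(key=lambda s: math.dist(r, s))
        result[r] = candidates[:k]
    return result


s_points = [(0.5, 0.5), (1.5, 0.5), (5.0, 5.0), (0.6, 0.4)]
neighbors = knn_join([(0.4, 0.4)], s_points, k=2, cell_size=1.0)
```

The choice of `cell_size` trades accuracy against work: larger cells examine more candidates per query point and miss fewer true neighbors.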

Semijoin-Based Spatial Join Processing in Multiple Sensor Networks

  • Kim, Min-Soo;Kim, Ju-Wan;Kim, Myoung-Ho
    • ETRI Journal / v.30 no.6 / pp.853-855 / 2008
  • This paper presents an energy-efficient spatial join algorithm for multiple sensor networks employing a spatial semijoin strategy. For optimization of the algorithm, we propose a GR-tree index and a grid-ID-based spatial approximation method, which are unique to sensor networks. The GR-tree is a distributed spatial index over the sensor nodes, which efficiently prunes away the nodes that will not participate in a spatial join result. The grid-ID-based approximation provides great reduction in communication cost by approximating many spatial objects in simpler forms. Our experiments demonstrate that the algorithm outperforms existing methods in reducing energy consumption at the nodes.

Using Indirect Predicates in Multi-way Spatial Joins (다중 공간 조인에서 간접 술어의 활용)

  • 박호현;정진완
    • Journal of KIISE:Databases / v.30 no.6 / pp.593-605 / 2003
  • Since spatial join processing consumes much time, several algorithms have been proposed to improve spatial join performance. The M-way R-tree join (MRJ) is a join algorithm which synchronously traverses M R-trees in the M-way spatial join. In this paper, we introduce indirect predicates which do not directly come from the multi-way join conditions but are indirectly derived from them. By applying the concept of indirect predicates to MRJ, we improve the performance of MRJ. We call such a multi-way R-tree join algorithm using indirect predicates indirect predicate filtering (IPF). Through experiments using synthetic data and real data, we show that IPF significantly

Optimizing Multi-way Join Query Over Data Streams (데이타 스트림에서의 다중 조인 질의 최적화 방법)

  • Park, Hong-Kyu;Lee, Won-Suk
    • Journal of KIISE:Databases / v.35 no.6 / pp.459-468 / 2008
  • A data stream is a massive, unbounded sequence of data elements continuously generated at a rapid rate. Many emerging applications need to deal with data streams, such as web click monitoring, sensor data processing, network traffic analysis, telephone records, and multimedia data. In these applications, processing is not performed on stored data; instead, pre-registered queries are evaluated over newly arriving data, and results are returned immediately or periodically. Recently, many studies have focused on data streams rather than stored data sets, and in particular on optimizing continuous queries so that they execute efficiently. This paper proposes a query optimization algorithm for continuous queries with multiple join operators (multi-way joins) over data streams. It is called Extended Greedy query optimization and is based on a greedy algorithm. It defines join cost as the operations required to compute a join and to process its result, and stores the join costs and all information needed to compute them in a statistics catalog. To overcome the weakness of a plain greedy algorithm, which can yield poor performance, the algorithm selects a set of operators with small costs instead of only the operator with the smallest cost. The size of this set influences the accuracy and execution time of the algorithm and can be controlled adaptively by two user-defined values. Experimental results illustrate the performance of the EGA algorithm in various stream environments.

Segment Join Technique for Processing XML Queries Fast (빠른 XML질의 처리를 위한 세그먼트 조인 기법)

  • ;Moon Bongki;Lee Sukho
    • Journal of KIISE:Databases / v.32 no.3 / pp.334-343 / 2005
  • Complex queries such as path and twig patterns have been the focus of much research on processing XML data. Structural join algorithms use a form of encoded structural information for elements in an XML document to facilitate join processing. Recently, structural join algorithms such as TwigStack and TSGeneric+ have been developed to process such complex queries, and their processing costs have been shown to be linearly proportional to the sum of the input data. However, these algorithms have the shortcoming that their processing costs increase with the length of a query. To overcome this shortcoming, we propose the segment join technique, which augments the structural join with structural indexes such as the 1-Index. The SegmentTwig algorithm, based on the segment join technique, performs joins between pairs of segments, each of which is a series of query nodes, rather than joins between pairs of query nodes. Consequently, the query can be processed by reading only one query node per segment. Our experimental study shows that segment join algorithms outperform the structural join methods consistently and considerably for various data sets.

Performance Study of the Index-based Parallel Join

  • Jeong, Byeong-Soo;Edward Omiecinski
    • The Journal of Information Technology and Database / v.2 no.2 / pp.87-109 / 1995
  • Index files have been used to access database records effectively. The join operation in a relational database system requires a long execution time, especially when handling large tables. If indexes are available on the joining attributes of both relations involved in the join and the join selectivity is relatively small, the execution time of the join operation can be improved. In this paper, we investigate the performance trade-offs of parallel index-based join algorithms that use different indexing schemes. We also present a comparison of our index-based parallel join algorithms with the hash-based parallel join algorithm.

Matrix-based Filtering and Load-balancing Algorithm for Efficient Similarity Join Query Processing in Distributed Computing Environment (분산 컴퓨팅 환경에서 효율적인 유사 조인 질의 처리를 위한 행렬 기반 필터링 및 부하 분산 알고리즘)

  • Yang, Hyeon-Sik;Jang, Miyoung;Chang, Jae-Woo
    • The Journal of the Korea Contents Association / v.16 no.7 / pp.667-680 / 2016
  • With the development of distributed computing platforms such as Hadoop MapReduce, conventional query processing techniques that were designed for a single machine need to be executed efficiently in distributed computing environments. In particular, similarity join query processing in distributed environments has been studied, where a similarity join retrieves all pairs of data with high similarity from two given data sets. However, existing similarity join schemes for distributed computing environments suffer from skewed computing load among clusters because they consider only the data transmission cost. In this paper, we propose a matrix-based load-balancing algorithm for efficient similarity join query processing in distributed computing environments. To balance the load uniformly across clusters, the proposed algorithm estimates the expected computing cost using a matrix and generates partitions based on the estimated cost. In addition, it reduces the computing load by filtering out data that are not used in query processing in the clusters. Finally, our performance evaluation shows that the proposed algorithm achieves better query processing performance than the existing one.