• Title/Summary/Keyword: Loop Parallelization

On-line Trace Based Automatic Parallelization of Java Programs on Multicore Platforms

  • Sun, Yu;Zhang, Wei
    • Journal of Computing Science and Engineering
    • /
    • v.6 no.2
    • /
    • pp.105-118
    • /
    • 2012
  • We propose two new approaches that automatically parallelize Java programs at runtime. Both rely on run-time trace information collected during program execution and dynamically recompile Java bytecode so that it can be executed in parallel. One approach uses trace information to improve traditional loop parallelization; the other parallelizes traces instead of loop iterations. We also describe a cost/benefit model that makes intelligent parallelization decisions, as well as a parallel execution environment for the parallelized programs. These techniques are built on Jikes RVM. We evaluate the approach by parallelizing sequential Java programs and comparing their performance to that of manually parallelized code. The experimental results show that our approach has low overhead and achieves speedups competitive with manual parallelization. Moreover, trace parallelization can exploit parallelism beyond loop iterations.
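
To make the run-time decision concrete, here is a minimal Java sketch of a cost/benefit check before parallel dispatch. The overhead constant and per-iteration estimate are hypothetical stand-ins for what the paper derives from collected traces inside Jikes RVM:

```java
import java.util.stream.IntStream;

// Hypothetical sketch: decide at run time whether parallel dispatch pays off.
public class CostBenefitLoop {
    static final long DISPATCH_OVERHEAD_NS = 50_000; // assumed thread start-up cost

    static void process(double[] data, long nsPerIteration) {
        long sequentialCost = nsPerIteration * data.length;
        // Parallelize only if the predicted saving beats the dispatch overhead
        // (assuming roughly half the sequential time is saved on a multicore).
        if (sequentialCost / 2 > DISPATCH_OVERHEAD_NS) {
            IntStream.range(0, data.length).parallel()
                     .forEach(i -> data[i] = Math.sqrt(data[i]));
        } else {
            for (int i = 0; i < data.length; i++) {
                data[i] = Math.sqrt(data[i]);
            }
        }
    }

    public static void main(String[] args) {
        double[] data = new double[1 << 20];
        java.util.Arrays.fill(data, 4.0);
        process(data, 20); // ~20 ns per iteration, an assumed trace-derived estimate
        System.out.println(data[0]); // 2.0
    }
}
```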

Locality-Conscious Nested-Loops Parallelization

  • Parsa, Saeed;Hamzei, Mohammad
    • ETRI Journal
    • /
    • v.36 no.1
    • /
    • pp.124-133
    • /
    • 2014
  • To speed up data-intensive programs, two complementary techniques should be considered: nested-loop parallelization and data locality optimization. Effective parallelization distributes the computation and the necessary data across different processors, whereas locality optimization places each piece of data on the processor that computes with it. Locality and parallelization may therefore demand different loop transformations, so an integrated approach that combines the two can produce much better results than either one alone. This paper proposes a unified approach that integrates both techniques to derive a single loop transformation. Applying this transformation yields coarse-grained parallelism, by exposing the largest possible groups of outer permutable loops, together with data locality, by satisfying dependences at the inner loops. These groups can be further tiled to improve locality by exploiting data reuse in multiple dimensions.
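
The sketch below shows the shape of code such a transformation aims for. The tile size and the matrix-multiply kernel are assumptions (the paper derives the transformation automatically from the dependence structure): the outer tile loop carries no dependence and runs in parallel, while the inner loops reuse cached blocks.

```java
import java.util.stream.IntStream;

// Minimal sketch: parallel outer tile loop plus cache tiling of the inner loops.
public class TiledMatMul {
    static final int T = 64; // assumed cache-friendly tile size

    static void multiply(double[][] a, double[][] b, double[][] c, int n) {
        // Distinct values of `it` write disjoint rows of c, so this loop is race-free.
        IntStream.range(0, (n + T - 1) / T).parallel().forEach(it -> {
            int i0 = it * T;
            for (int j0 = 0; j0 < n; j0 += T)
                for (int k0 = 0; k0 < n; k0 += T)
                    for (int i = i0; i < Math.min(i0 + T, n); i++)
                        for (int k = k0; k < Math.min(k0 + T, n); k++) {
                            double aik = a[i][k]; // scalar reused across the j loop
                            for (int j = j0; j < Math.min(j0 + T, n); j++)
                                c[i][j] += aik * b[k][j];
                        }
        });
    }

    public static void main(String[] args) {
        int n = 256;
        double[][] a = new double[n][n], b = new double[n][n], c = new double[n][n];
        for (int i = 0; i < n; i++) { a[i][i] = 1.0; b[i][i] = 2.0; } // I times 2I
        multiply(a, b, c, n);
        System.out.println(c[0][0]); // 2.0
    }
}
```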

A Loop Transformation for Parallelism from Single Loops

  • Jeong, Sam-Jin
    • International Journal of Contents
    • /
    • v.2 no.4
    • /
    • pp.8-11
    • /
    • 2006
  • This paper first describes existing loop partitioning techniques for exploiting parallelism from single loops, such as loop splitting by thresholds and Polychronopoulos' loop splitting method. We then propose an improved loop splitting method that maximizes the parallelism of single loops with non-constant dependence distances. By using the distance of the source of the first dependence, together with theorems we define, we present generalized and optimal algorithms for single loops with non-uniform dependences. The algorithms show, in general form, how to transform such single loops into parallel loops.
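
A minimal sketch of threshold-based loop splitting, assuming a known minimum dependence distance d (the kernel and the use of Java parallel streams are illustrative assumptions): each strip of d consecutive iterations is dependence-free and can run in parallel, while the strips themselves execute in order.

```java
import java.util.stream.IntStream;

// Hypothetical sketch of threshold-based loop splitting. The original loop
//   for (i = 0; i < n; i++) a[i + d] = a[i] * 2;
// carries a dependence of distance d, so each strip of d consecutive
// iterations touches disjoint elements.
public class LoopSplitting {
    public static void main(String[] args) {
        int n = 1000, d = 8; // d: distance of the first dependence source
        double[] a = new double[n + d];
        for (int i = 0; i < a.length; i++) a[i] = i;

        for (int strip = 0; strip < n; strip += d) {
            int lo = strip, hi = Math.min(strip + d, n);
            // Within one strip, writes a[lo+d..hi-1+d] and reads a[lo..hi-1]
            // cannot overlap, because hi - lo <= d.
            IntStream.range(lo, hi).parallel().forEach(i -> a[i + d] = a[i] * 2);
        }
        System.out.println(a[n]); // same result as the sequential loop
    }
}
```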

Parallelizing Imperfectly Nested Loops

  • Kim, Ki-Chang
    • Journal of Electrical Engineering and Information Science
    • /
    • v.1 no.1
    • /
    • pp.140-150
    • /
    • 1996
  • Loops are among the richest program constructs in which parallelism is available. Exploiting fine-grain parallelism from these constructs is particularly important in light of the growing popularity of superscalar and VLIW machines. This paper explains how fine-grain parallelization techniques can be generalized to handle nested loops. Our technique integrates nested-loop parallelization techniques at the fine-grain level, thus exposing more fine-grain parallelism, and is flexible enough to handle non-perfectly nested loops. Examples and some experimental results are presented to illustrate our approach.
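
The paper targets fine-grain (instruction-level) parallelism for superscalar and VLIW machines; the coarse-grained Java toy below only illustrates the structural problem it handles, with a statement sitting between the two loop levels, and how such a nest can still be parallelized across the outer loop (the kernel is an assumption):

```java
import java.util.stream.IntStream;

// The nest is imperfect because the initialization sits between the loop
// levels, yet outer iterations are independent, so the whole nest parallelizes.
public class ImperfectNest {
    public static void main(String[] args) {
        int n = 512;
        double[][] a = new double[n][n];
        double[] rowSum = new double[n];

        IntStream.range(0, n).parallel().forEach(i -> {
            rowSum[i] = 0.0;                // statement outside the inner loop
            for (int j = 0; j < n; j++) {   // the perfectly nested part
                rowSum[i] += a[i][j];
            }
        });
        System.out.println(rowSum[0]); // 0.0 for the all-zero matrix
    }
}
```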

Performance Comparison of Join Operations Parallelization by using GPGPU (GPGPU 기반 조인 연산 병렬화 성능 비교)

  • Lee, Jong-Sub;Lee, Sang-Back;Lee, Kyu-Chul
    • Database Research
    • /
    • v.34 no.3
    • /
    • pp.28-44
    • /
    • 2018
  • In a database system, the most expensive relational operation is the join. CPU-based join operations typically use parallel processing with at most 1 to 16 cores, which does not improve performance significantly. On the other hand, GPGPU (General-Purpose computing on Graphics Processing Units) allows parallel processing across thousands of processing units, greatly reducing the time required to perform join operations. In this paper, we implement parallel join operations with GPGPU, using NVIDIA's CUDA SDK, and compare their performance. The join operations used are Nested Loop Join (NLJ), Sort Merge Join (SMJ), and Hash Join (HJ), and the GPGPU devices are a TITAN Xp, a GTX 1080 Ti, and a GTX 1080. We measure and compare the performance of CPU-based and GPGPU-based join operations, and also compare against a previous study of GPGPU-based joins. The experimental results show that GPGPU-based execution is 6 to 328 times faster than the CPU-based one.
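
The paper's joins are CUDA kernels; as a CPU-side analogue only, the following Java sketch parallelizes a nested loop join by partitioning the outer relation across workers (the relation contents and sizes are assumptions):

```java
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.IntStream;

// Each outer tuple is matched against all of s independently,
// so the outer loop needs no synchronization.
public class ParallelNestedLoopJoin {
    record Pair(int left, int right) {}

    public static void main(String[] args) {
        int[] r = IntStream.range(0, 2_000).toArray();
        int[] s = IntStream.range(1_000, 3_000).toArray();

        List<Pair> result = IntStream.range(0, r.length).parallel()
            .boxed()
            .flatMap(i -> Arrays.stream(s)
                .filter(v -> v == r[i])
                .mapToObj(v -> new Pair(r[i], v)))
            .collect(Collectors.toList());

        System.out.println(result.size()); // 1000 matching pairs
    }
}
```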

The Procedure Transformation using Data Dependency Elimination Methods (자료 종속성 제거 방법을 이용한 프로시저 변환)

  • Jang, Yu-Suk;Park, Du-Sun
    • The KIPS Transactions:PartA
    • /
    • v.9A no.1
    • /
    • pp.37-44
    • /
    • 2002
  • Most research on transforming sequential programs into parallel programs has been based on loop structure transformation methods. However, most programs also contain implicit interprocedural parallelism. This paper suggests a way of extracting parallelism from loops with procedure calls using a data dependency elimination method. Most parallelization of loops with procedure calls has targeted uniform code only; here we propose an interprocedural transformation that can be applied to both uniform and nonuniform code. We show examples of parallelizing uniform, nonuniform, and complex code, and evaluate the various transformation methods on a CRAY-T3E system. The comparison results show that the proposed algorithm outperforms other conventional methods.
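
A simplified sketch of the underlying idea, under the assumption that interprocedural analysis (after loop embedding or procedure cloning, for instance) proves the called procedure touches only its own iteration's data, so the loop around the call parallelizes (the kernel is an assumption):

```java
import java.util.stream.IntStream;

// Once analysis shows update(a, i) touches only a[i], the loop around the
// call has no cross-iteration dependence and can be run in parallel.
public class InterproceduralLoop {
    static void update(double[] a, int i) {
        a[i] = a[i] * a[i] + 1.0; // reads and writes only element i
    }

    public static void main(String[] args) {
        double[] a = new double[100_000];
        // Original: for (int i = 0; i < a.length; i++) update(a, i);
        IntStream.range(0, a.length).parallel().forEach(i -> update(a, i));
        System.out.println(a[0]); // 1.0
    }
}
```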

Parallel LDPC Decoder for CMMB on CPU and GPU Using OpenCL (OpenCL을 활용한 CPU와 GPU 에서의 CMMB LDPC 복호기 병렬화)

  • Park, Joo-Yul;Hong, Jung-Hyun;Chung, Ki-Seok
    • IEMEK Journal of Embedded Systems and Applications
    • /
    • v.11 no.6
    • /
    • pp.325-334
    • /
    • 2016
  • Recently, Open Computing Language (OpenCL) has been proposed to provide a framework that supports heterogeneous computing platforms. By using an OpenCL framework, digital communication systems can support various protocols in a unified computing environment to achieve both high portability and high performance. This article introduces a parallel software decoder of Low Density Parity Check (LDPC) codes for China Multimedia Mobile Broadcasting (CMMB) on a heterogeneous platform. Each step of LDPC decoding has different parallelization characteristics. In this paper, steps suitable for task-level parallelization are executed on the CPU, and steps suitable for data-level parallelization are processed by the GPU. To improve the performance of the proposed OpenCL kernels for LDPC decoding operations, explicit thread scheduling, loop-unrolling, and effective data transfer techniques are applied. The proposed LDPC decoder achieves high performance by using heterogeneous multi-core processors on a unified computing framework.
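
A schematic Java analogue of this division of labor, not the paper's OpenCL code: coarse, dissimilar jobs run as concurrent tasks (the task-parallel steps the paper assigns to the CPU), while a uniform per-element update runs data-parallel (the steps assigned to the GPU). The arithmetic is a placeholder for the real LDPC check-node and variable-node updates.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.stream.IntStream;

public class HeterogeneousDecoderSketch {
    public static void main(String[] args) throws Exception {
        float[] llr = new float[4096]; // soft decoder inputs (all zero here)
        ExecutorService cpu = Executors.newFixedThreadPool(2);

        for (int iter = 0; iter < 10; iter++) {
            // Task-level parallelism: two unlike jobs overlap on CPU threads.
            Future<?> sched = cpu.submit(() -> { /* build message schedule */ });
            Future<?> perm  = cpu.submit(() -> { /* permute parity data    */ });
            sched.get();
            perm.get();

            // Data-level parallelism: one lightweight update per element.
            IntStream.range(0, llr.length).parallel()
                     .forEach(i -> llr[i] = 0.9f * llr[i] + 0.1f);
        }
        cpu.shutdown();
        System.out.println(llr[0]);
    }
}
```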

The Parallelism Extraction in Loops with Procedure Calls (프로시저 호출을 가진 루프에서 병렬성 추출)

  • Jang, Yu-Suk;Park, Du-Sun
    • Journal of Korea Multimedia Society
    • /
    • v.4 no.3
    • /
    • pp.270-279
    • /
    • 2001
  • Since most program execution time is spent in loop structures, extracting parallelism from sequential loop programs has been a major focus. However, most programs also have implicit interprocedural parallelism. This paper presents a generalized parallelism extraction for loops with procedure calls. Most parallelization of loops with procedure calls focuses only on uniform code, where the data dependency distance is constant. We present algorithms that can be applied to uniform, nonuniform, and complex code. The proposed algorithms, using loop extraction, loop embedding, and procedure cloning transformation methods, are evaluated on a CRAY-T3E. The performance evaluation shows that the proposed algorithms are effective.

Interprocedural Transformations for Parallel Computing (병렬 계산을 위한 프로시저 전환)

  • Jang, Yu-Suk;Park, Du-Sun
    • Journal of Internet Computing and Services
    • /
    • v.2 no.4
    • /
    • pp.91-99
    • /
    • 2001
  • Since most of the program execution time is spent in loop structures, the problem of extracting parallelism from sequential loops has been one of the most important research issues. However, most programs have implicit interprocedural parallelism. This paper presents a generalized method for extracting parallelism from loops containing procedure calls. Most parallelization of such loops focuses on uniform code, where the data dependency distance is constant. We present algorithms that can be applied to uniform, nonuniform, and complex code. The performance of the proposed algorithms, using loop extraction, loop embedding, and procedure cloning transformation methods, has been evaluated on a CRAY-T3E. The results show the effectiveness of the proposed algorithms.

The Accuracy of the Non-continuous I Test for One-Dimensional Arrays with References Created by Induction Variables

  • Zhang, Qing
    • Journal of Information Processing Systems
    • /
    • v.10 no.4
    • /
    • pp.523-542
    • /
    • 2014
  • One-dimensional arrays with subscripts formed by induction variables appear quite frequently in real programs. For most well-known data dependence testing methods, checking whether integer-valued solutions exist for one-dimensional arrays with references created by induction variables is very difficult. The I test, a refined combination of the GCD and Banerjee tests, is an efficient and precise data dependence testing technique for deciding whether integer-valued solutions exist for one-dimensional arrays with constant bounds and single increments. In this paper, the non-continuous I test, an extension of the I test, is proposed to determine whether or not integer-valued solutions exist for one-dimensional arrays with constant bounds and non-singular increments. Experiments with benchmarks drawn from the Livermore and Vector Loop suites reveal definitive results for the 67 pairs of one-dimensional arrays that were tested.
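
For reference, the GCD component that the I test refines can be sketched in a few lines, assuming subscript pairs of the form A[a*i1 + c1] (write) and A[b*i2 + c2] (read); the class and method names are hypothetical. The dependence equation has an integer solution iff the gcd of the coefficients divides the constant term; the I test then layers Banerjee-style interval reasoning on top to respect the loop bounds.

```java
public class GcdDependenceTest {
    static int gcd(int x, int y) { return y == 0 ? Math.abs(x) : gcd(y, x % y); }

    // May the two references touch the same element? The dependence equation
    // a*i1 - b*i2 = c2 - c1 has an integer solution iff gcd(a, b) divides c2 - c1.
    static boolean mayDepend(int a, int c1, int b, int c2) {
        return (c2 - c1) % gcd(a, b) == 0;
    }

    public static void main(String[] args) {
        // A[2*i] vs A[2*i + 1]: even vs odd indices can never collide.
        System.out.println(mayDepend(2, 0, 2, 1)); // false
        // A[2*i] vs A[2*i + 2]: a dependence is possible.
        System.out.println(mayDepend(2, 0, 2, 2)); // true
    }
}
```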