• Title/Summary/Keyword: Parallel computation

A Study on GPU Computing of Bi-conjugate Gradient Method for Finite Element Analysis of the Incompressible Navier-Stokes Equations (유한요소 비압축성 유동장 해석을 위한 이중공액구배법의 GPU 기반 연산에 대한 연구)

  • Yoon, Jong Seon;Jeon, Byoung Jin;Jung, Hye Dong;Choi, Hyoung Gwon
    • Transactions of the Korean Society of Mechanical Engineers B / v.40 no.9 / pp.597-604 / 2016
  • A parallel bi-conjugate gradient algorithm was developed with CUDA for the parallel computation of the incompressible Navier-Stokes equations. The governing equations were discretized using the splitting P2P1 finite element method. An asymmetric stenotic flow problem was solved to validate the proposed algorithm, and the parallel performance of the GPU was then examined by measuring elapsed times. The GPU performance for sparse matrix-vector multiplication was also investigated using a matrix from a fluid-structure interaction problem. A kernel was constructed to compute simultaneously the inner product of each row of the sparse matrix with a vector, and it was optimized using both parallel reduction and memory coalescing. In constructing the kernel, the effect of the warp on the parallel performance of the present CUDA implementation was also examined. In double precision, the present GPU computation was more than 7 times faster than a single CPU.
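
The row-wise inner-product kernel with parallel reduction can be sketched as follows. This is a sequential Python model, not the authors' CUDA code: it assumes a CSR matrix layout and assumes that one warp of 32 lanes (`WARP_SIZE`) cooperates on each row, each lane summing a strided slice of the row before a tree reduction combines the lane sums. The names `warp_row_dot` and `csr_spmv` are hypothetical.

```python
from typing import List

WARP_SIZE = 32  # assumed lanes cooperating on one row (an NVIDIA warp)


def warp_row_dot(values: List[float], cols: List[int], start: int, end: int,
                 x: List[float]) -> float:
    """Inner product of one CSR row with x, computed the way a warp would."""
    # Each lane accumulates a strided partial sum; neighbouring lanes read
    # neighbouring entries of values/cols, which is what coalesces loads on a GPU.
    partial = [0.0] * WARP_SIZE
    for lane in range(WARP_SIZE):
        for k in range(start + lane, end, WARP_SIZE):
            partial[lane] += values[k] * x[cols[k]]
    # Tree (parallel) reduction: halve the active lanes each step, log2(32) = 5 steps.
    offset = WARP_SIZE // 2
    while offset > 0:
        for lane in range(offset):
            partial[lane] += partial[lane + offset]
        offset //= 2
    return partial[0]


def csr_spmv(row_ptr: List[int], cols: List[int], values: List[float],
             x: List[float]) -> List[float]:
    """y = A x for A in CSR form; every row is an independent task."""
    return [warp_row_dot(values, cols, row_ptr[i], row_ptr[i + 1], x)
            for i in range(len(row_ptr) - 1)]


if __name__ == "__main__":
    # [[4, 1, 0], [1, 4, 1], [0, 1, 4]] stored in CSR form
    row_ptr = [0, 2, 5, 7]
    cols = [0, 1, 0, 1, 2, 1, 2]
    values = [4.0, 1.0, 1.0, 4.0, 1.0, 1.0, 4.0]
    print(csr_spmv(row_ptr, cols, values, [1.0, 2.0, 3.0]))  # [6.0, 12.0, 14.0]
```

In a real kernel the reduction loop would use warp shuffles or shared memory; the arithmetic is the same.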

Design and Performance Analysis of a Parallel Optimal Branch-and-Bound Algorithm for MIN-based Multiprocessors (MIN-based 다중 처리 시스템을 위한 효율적인 병렬 Branch-and-Bound 알고리즘 설계 및 성능 분석)

  • Yang, Myung-Kook
    • Journal of IKEEE / v.1 no.1 s.1 / pp.31-46 / 1997
  • In this paper, a parallel Optimal Best-First search Branch-and-Bound (B&B) algorithm (pobs) is designed and evaluated for MIN-based multiprocessor systems. The proposed algorithm decomposes a problem into G subproblems, each of which is processed by a group of P processors. Each processor group uses the sub-Global Best-First search technique to find a local solution. The local solutions are broadcast through the network to compute the global solution. This broadcast provides not only the comparison of the G local solutions but also load balancing among the processor groups. A performance analysis is then conducted to estimate the speed-up of the proposed parallel B&B algorithm. The analytical model is based on the probabilistic properties of the B&B algorithm and accounts for both computation time and communication overhead, so that it reflects the realistic performance of the algorithm in a parallel processing environment. To validate the evaluation model, the parallel B&B algorithm was also simulated on a MIN-based system, and the analytical and simulation results match closely. The proposed Optimal Best-First search B&B algorithm is also shown to outperform other reported schemes, owing to advantages such as fewer subproblem evaluations, proper load balancing, and a limited scope of remote communication.
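
The decompose, solve locally, then compare pattern can be illustrated with a small example. The sketch below is an assumption-laden stand-in, not the pobs algorithm: it uses 0/1 knapsack as a hypothetical test problem, fixes the first `k` item decisions to create G = 2**k subproblems, solves each with an ordinary best-first B&B (fractional-knapsack bound) in a thread, and models the broadcast step as taking the maximum of the local solutions. It does not model the MIN network, processor groups of size P, or load balancing.

```python
import heapq
from concurrent.futures import ThreadPoolExecutor
from itertools import product
from typing import List, Sequence, Tuple

Item = Tuple[int, int]  # (value, weight), weight > 0


def upper_bound(i: int, value: float, weight: int, cap: int, items: List[Item]) -> float:
    # Fractional-knapsack bound over items[i:]; valid because items are sorted
    # by value/weight in descending order.
    bound = value
    for v, w in items[i:]:
        if weight + w <= cap:
            weight += w
            bound += v
        else:
            return bound + v * (cap - weight) / w
    return bound


def best_first_bb(items: List[Item], cap: int, prefix: Sequence[bool]) -> float:
    """Best-first B&B on the subproblem whose first len(prefix) decisions are fixed."""
    k = len(prefix)
    value = sum(v for (v, _), t in zip(items, prefix) if t)
    weight = sum(w for (_, w), t in zip(items, prefix) if t)
    if weight > cap:
        return float("-inf")  # infeasible subproblem
    best = value  # taking nothing else is always feasible
    heap = [(-upper_bound(k, value, weight, cap, items), k, value, weight)]
    while heap:
        neg_bound, i, value, weight = heapq.heappop(heap)
        if -neg_bound <= best:
            break  # the most promising open node cannot beat the incumbent
        if i == len(items):
            continue
        v, w = items[i]
        for nv, nw in ((value + v, weight + w), (value, weight)):  # take / skip item i
            if nw > cap:
                continue
            best = max(best, nv)
            bound = upper_bound(i + 1, nv, nw, cap, items)
            if bound > best:
                heapq.heappush(heap, (-bound, i + 1, nv, nw))
    return best


def parallel_bb(items: List[Item], cap: int, k: int = 2) -> float:
    """Split into G = 2**k subproblems, solve each in its own worker, keep the best."""
    items = sorted(items, key=lambda it: it[0] / it[1], reverse=True)
    prefixes = list(product((True, False), repeat=k))
    with ThreadPoolExecutor(max_workers=len(prefixes)) as pool:
        local = list(pool.map(lambda p: best_first_bb(items, cap, p), prefixes))
    return max(local)  # models the broadcast/compare of the G local solutions


if __name__ == "__main__":
    print(parallel_bb([(60, 10), (100, 20), (120, 30)], cap=50))  # 220
```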

A Scalable Parallel Labeling Algorithm on Mesh Connected SIMD Computers (메쉬 구조형 SIMD 컴퓨터 상에서 신축적인 병렬 레이블링 알고리즘)

  • 박은진;이갑섭;성효경;최흥문
    • Proceedings of the IEEK Conference / 1998.10a / pp.731-734 / 1998
  • A scalable parallel algorithm is proposed for efficient image component labeling with local operators on a mesh-connected SIMD computer. Unlike conventional parallel labeling algorithms, which assign a single pixel to each PE, the proposed algorithm is scalable and can assign an $m\times m$ pixel set to each PE according to the input image size. Each assigned pixel set is converted to a single pixel with a representative value, which greatly reduces the required memory and processing time. For an $N\times N$ image, if an $m\times m$ pixel set is assigned to each PE of a $P\times P$ mesh, where $P=N/m$, the communication time complexity of each PE and the computation complexity are reduced to $O(P\log P)$ bit operations and $O(P)$ bit operations, respectively, each of which is $1/m$ of that of the conventional method. The method also reduces the memory in each PE to $O(P)$ and decreases the number of PEs to $O(P^2)=\Theta(N^2/m^2)$, compared with $O(N^2)$ for the conventional method. Because the proposed parallel labeling algorithm is scalable, it can accommodate larger images without changing the hardware of the given mesh-connected SIMD computer.
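
The idea of assigning an $m\times m$ pixel set to each PE and then reconciling labels across PE boundaries can be sketched sequentially. This is not the paper's bit-serial SIMD mesh algorithm: as a swapped-in technique, each block is labeled locally with union-find, and a second phase merges labels along block edges, standing in for communication between neighbouring PEs. It assumes 4-connectivity and that N is divisible by m; the function names are hypothetical.

```python
from typing import Dict, List, Tuple

Pixel = Tuple[int, int]


def find(parent: Dict[Pixel, Pixel], a: Pixel) -> Pixel:
    while parent[a] != a:
        parent[a] = parent[parent[a]]  # path halving
        a = parent[a]
    return a


def union(parent: Dict[Pixel, Pixel], a: Pixel, b: Pixel) -> None:
    ra, rb = find(parent, a), find(parent, b)
    if ra != rb:
        parent[max(ra, rb)] = min(ra, rb)


def label_blocks(img: List[List[int]], m: int) -> List[List[int]]:
    """4-connected labels of an N x N binary image, N divisible by m."""
    n = len(img)
    parent: Dict[Pixel, Pixel] = {}
    # Phase 1: each m x m block (one "PE") links foreground pixels inside itself only.
    for bi in range(0, n, m):
        for bj in range(0, n, m):
            for i in range(bi, bi + m):
                for j in range(bj, bj + m):
                    if not img[i][j]:
                        continue
                    parent.setdefault((i, j), (i, j))
                    if i > bi and img[i - 1][j]:
                        union(parent, (i, j), (i - 1, j))
                    if j > bj and img[i][j - 1]:
                        union(parent, (i, j), (i, j - 1))
    # Phase 2: neighbouring blocks exchange their edge pixels and merge labels.
    for i, j in list(parent):
        if i % m == 0 and i > 0 and img[i - 1][j]:
            union(parent, (i, j), (i - 1, j))
        if j % m == 0 and j > 0 and img[i][j - 1]:
            union(parent, (i, j), (i, j - 1))
    # Relabel roots as 1, 2, 3, ...; background stays 0.
    out = [[0] * n for _ in range(n)]
    roots: Dict[Pixel, int] = {}
    for i, j in parent:
        out[i][j] = roots.setdefault(find(parent, (i, j)), len(roots) + 1)
    return out
```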

Parallel Processing for Integral Imaging Pickup Using Multiple Threads

  • Jang, Young-Hee;Park, Chan;Park, Jae-Hyeung;Kim, Nam;Yoo, Kwan-Hee
    • International Journal of Contents / v.5 no.4 / pp.30-34 / 2009
  • Many studies have addressed integral imaging pickup, whose objective is to efficiently obtain elemental images of three-dimensional (3D) objects from a lens array. In the integral imaging pickup process, an elemental image must be rendered for each elemental lens in the lens array, and the elemental images are then combined into one total image. Multiple viewpoint rendering (MVR) is one of several methods for integral imaging pickup. However, it suffers from long computation and rendering times when obtaining elemental images from a large number of elemental lenses. To solve this problem, this paper proposes a parallel MVR (PMVR) method that generates elemental images in parallel by distributing the elemental lenses across multiple threads. The computation time of integral imaging using PMVR is significantly reduced compared with a sequential approach, showing that PMVR is very useful.
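
Distributing elemental lenses across threads can be sketched as below. The renderer is a hypothetical simplification, not the paper's MVR pipeline: each lens is treated as a pinhole on a square grid with pitch `pitch`, a point cloud is projected onto an image plane `gap` behind it, and there is no depth test. Each elemental image depends only on its own lens, so the lenses are handed to a thread pool and the results are tiled into the total image.

```python
import math
from concurrent.futures import ThreadPoolExecutor
from typing import List, Tuple

Point = Tuple[float, float, float, float]  # (X, Y, Z, intensity), Z > 0 in front of the array


def render_elemental(lens: Tuple[int, int], points: List[Point], pitch: float,
                     gap: float, res: int) -> List[List[float]]:
    """Render one elemental image by pinhole projection through the lens centre."""
    li, lj = lens
    cx, cy = (lj + 0.5) * pitch, (li + 0.5) * pitch  # assumed square lens layout
    img = [[0.0] * res for _ in range(res)]
    for X, Y, Z, intensity in points:
        du = -gap * (X - cx) / Z  # image is inverted behind the lens
        dv = -gap * (Y - cy) / Z
        px = math.floor((du / pitch + 0.5) * res)
        py = math.floor((dv / pitch + 0.5) * res)
        if 0 <= px < res and 0 <= py < res:
            img[py][px] = max(img[py][px], intensity)  # no depth test in this sketch
    return img


def pmvr(n_lens: int, points: List[Point], pitch: float = 1.0, gap: float = 3.0,
         res: int = 8, workers: int = 4) -> List[List[float]]:
    lenses = [(i, j) for i in range(n_lens) for j in range(n_lens)]
    # Distribute the elemental lenses over a pool of threads.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        images = list(pool.map(
            lambda ij: render_elemental(ij, points, pitch, gap, res), lenses))
    # Combine the elemental images into one total image.
    total = [[0.0] * (n_lens * res) for _ in range(n_lens * res)]
    for (i, j), img in zip(lenses, images):
        for y in range(res):
            total[i * res + y][j * res:(j + 1) * res] = img[y]
    return total
```

Because every lens is independent, the threaded result matches a one-worker run exactly. In CPython, pure-Python rendering is limited by the GIL, so real speedups would need a native renderer or processes.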

Implementation of Ray Tracing Processor for the Parallel Processing (병렬처리를 위한 고속 Ray Tracing 프로세서의 설계)

  • Choe, Gyu-Yeol;Jeong, Deok-Jin
    • The Transactions of the Korean Institute of Electrical Engineers A / v.48 no.5 / pp.636-642 / 1999
  • The synthesis of 3D images is the most important part of virtual reality. Ray tracing is the best method for realism in 3D graphics, but it requires long computation times to synthesize 3D images. We therefore implement ray tracing in both software and hardware; specifically, we designed the hit-test unit for ray tracing with an FPGA tool. The hit-test unit is critical to the speed of ray tracing. In this paper, we propose a new hit-test algorithm and apply a parallel architecture to the hit-test unit to improve its speed. Because the critical path of the hit-test unit lies in the multiplication part, we optimized the arithmetic unit: we used the Booth algorithm and the Baugh-Wooley algorithm to reduce the partial products, and adopted CSA and CLA to improve the efficiency of partial-product addition. Our new ray tracing processor can produce an image in about 512 ms per frame and can be adapted to real-time applications with only 10 parallel processors.
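
How Booth recoding reduces partial products can be shown with a short software model. The abstract does not say which Booth variant the hardware used; this sketch assumes radix-4 (modified) Booth recoding, which turns an n-bit two's-complement multiplier into n/2 digits in {-2, -1, 0, 1, 2}, halving the number of partial products. Each partial product is then a shift and/or negation of the multiplicand. Baugh-Wooley sign handling and the CSA/CLA adder tree are not modeled; the function names are hypothetical.

```python
from typing import List


def booth_radix4_digits(y: int, n: int) -> List[int]:
    """Recode an n-bit two's-complement multiplier (n even) into n/2 Booth digits."""
    if n % 2 or not -(1 << (n - 1)) <= y < (1 << (n - 1)):
        raise ValueError("y must fit in n bits and n must be even")
    bits = y & ((1 << n) - 1)
    digits, prev = [], 0  # prev is the implicit bit y_{-1} = 0
    for i in range(0, n, 2):
        b0, b1 = (bits >> i) & 1, (bits >> (i + 1)) & 1
        digits.append(-2 * b1 + b0 + prev)  # digit = -2*y_{i+1} + y_i + y_{i-1}
        prev = b1
    return digits


def booth_multiply(x: int, y: int, n: int = 16) -> int:
    # Each digit selects one partial product from {0, +-x, +-2x}, weighted by 4**i.
    partials = [d * x << (2 * i) for i, d in enumerate(booth_radix4_digits(y, n))]
    return sum(partials)


if __name__ == "__main__":
    print(booth_radix4_digits(-3, 8))  # 4 digits instead of 8 partial products
    print(booth_multiply(-7, 13))      # -91
```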

One-node and two-node hybrid coarse-mesh finite difference algorithm for efficient pin-by-pin core calculation

  • Song, Seongho;Yu, Hwanyeal;Kim, Yonghee
    • Nuclear Engineering and Technology / v.50 no.3 / pp.327-339 / 2018
  • This article presents a new global-local hybrid coarse-mesh finite difference (HCMFD) method for efficient parallel calculation in pin-by-pin heterogeneous core analysis. In the HCMFD method, the one-node coarse-mesh finite difference (CMFD) scheme is combined nonlinearly with a nodal expansion method (NEM)-based two-node CMFD method. In the global-local HCMFD algorithm, the global problem is a coarse-mesh eigenvalue problem, whereas the local problems are fixed-source problems with incoming partial current boundary conditions, which can be solved in parallel. The global problem is formulated by one-node CMFD, in which two correction factors are introduced on each interface to preserve both the surface-average flux and the net current. For accurate and efficient pin-wise core analysis, the local problems are solved by the conventional NEM-based two-node CMFD method. We investigated the numerical characteristics of the HCMFD method for several benchmark problems and compared them with those of the conventional two-node NEM-based CMFD algorithm. The HCMFD algorithm was also parallelized with the OpenMP parallel interface, and its numerical performance was evaluated for several benchmarks.
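
For context on what a CMFD correction factor does, the sketch below uses the conventional single-correction-factor one-node CMFD relation, $J_{i+1/2} = -\tilde{D}_{i+1/2}(\bar\phi_{i+1}-\bar\phi_i) - \hat{D}_{i+1/2}(\bar\phi_{i+1}+\bar\phi_i)$, where $\hat{D}$ is chosen so that the coarse-mesh relation reproduces a net current from a higher-order local solution. The paper's global problem uses two correction factors per interface so that the surface-average flux is preserved as well; that variant is not shown here. The input numbers are hypothetical.

```python
def d_tilde(d_left: float, d_right: float, h_left: float, h_right: float) -> float:
    """Finite-difference coupling coefficient between two neighbouring coarse nodes."""
    return 2.0 * d_left * d_right / (d_left * h_right + d_right * h_left)


def d_hat(j_ref: float, phi_left: float, phi_right: float, dt: float) -> float:
    """Correction factor that makes the CMFD relation reproduce j_ref exactly."""
    return -(j_ref + dt * (phi_right - phi_left)) / (phi_right + phi_left)


def cmfd_current(phi_left: float, phi_right: float, dt: float, dh: float) -> float:
    """One-node CMFD net current at the interface between two coarse nodes."""
    return -dt * (phi_right - phi_left) - dh * (phi_right + phi_left)


if __name__ == "__main__":
    # Hypothetical node-average fluxes, diffusion coefficients, widths and a
    # reference net current taken from a higher-order (e.g. NEM) local solve.
    dt = d_tilde(1.2, 1.0, 1.0, 1.0)
    dh = d_hat(0.3, 1.0, 0.8, dt)
    print(cmfd_current(1.0, 0.8, dt, dh))  # reproduces the reference current
```

When the reference current already equals the plain finite-difference current, the correction factor is zero, which is the consistency check used in the test.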

A PARALLEL PRECONDITIONER FOR GENERALIZED EIGENVALUE PROBLEMS BY CG-TYPE METHOD

  • MA, SANGBACK;JANG, HO-JONG
    • Journal of the Korean Society for Industrial and Applied Mathematics / v.5 no.2 / pp.63-69 / 2001
  • In this study, we are concerned with computing in parallel a few of the smallest eigenvalues, and their corresponding eigenvectors, of the eigenvalue problem $Ax=\lambda Bx$, where A is symmetric, B is symmetric positive definite, and both A and B are large and sparse. Iterative algorithms based on optimization of the Rayleigh quotient have recently been developed, and the CG scheme for optimizing the Rayleigh quotient has proven to be a very attractive and promising technique for finding the smallest eigenvalues of large sparse eigenproblems. As with systems of linear equations, the successful application of the CG scheme to eigenproblems also depends on preconditioning, and a proper choice of preconditioner significantly improves the convergence of the CG scheme. The idea underlying the present work is a parallel computation of Multi-Color Block SSOR preconditioning for the CG optimization of the Rayleigh quotient, together with deflation techniques. Multi-coloring is a simple technique for obtaining parallelism of order n, where n is the dimension of the matrix. Block SSOR is a symmetric preconditioner that is expected to minimize interprocessor communication because of its blocking. We implemented the method on a CRAY-T3E with 128 nodes, using the MPI (Message Passing Interface) library for interprocessor communication. The test problems were drawn from finite difference discretizations of partial differential equations.

Simulation of Deformable Objects using GLSL 4.3

  • Sung, Nak-Jun;Hong, Min;Lee, Seung-Hyun;Choi, Yoo-Joo
    • KSII Transactions on Internet and Information Systems (TIIS) / v.11 no.8 / pp.4120-4132 / 2017
  • In this research, we implement a deformable object simulation system using OpenGL's shading language, GLSL 4.3. The deformable objects are simulated with a volumetric mass-spring system, which among deformable object simulation methods is well suited to real-time simulation. The compute shader in GLSL 4.3, which provides access to GPU resources, is used to parallelize the operations of existing deformable object simulation systems. The proposed system uses a compute shader for parallel processing and includes a bounding box-based collision detection solution, since collision detection is generally one of the most severe computational bottlenecks when simulating multiple deformable objects. To validate the efficiency of the system, we performed experiments with 3D volumetric objects, comparing the performance of multiple deformable object simulations on the CPU and GPU to analyze the effectiveness of parallel processing with GLSL. We also measured the computation time of the bounding box-based collision detection to show that collision detection can be processed in real time. Experiments with 3D volumetric models with 10K faces showed that the GPU-based parallel simulation improves performance by 98% over the CPU-based simulation, and that all steps, including collision detection and rendering, could be processed at a real-time frame rate of 218.11 FPS.
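
The two computational pieces named above, a mass-spring time step and a bounding box overlap test, can be sketched on the CPU as follows. This is an illustrative Python model, not the GLSL compute shader: it assumes Hooke springs, uniform point masses, gravity along -y, semi-implicit Euler integration, and axis-aligned bounding boxes (AABBs). In a compute shader each vertex update would typically be one invocation; the per-spring force accumulation here would need atomics or a per-vertex gather on the GPU. The function names are hypothetical.

```python
import math
from typing import List, Sequence, Tuple

Vec3 = Tuple[float, float, float]
Spring = Tuple[int, int, float]  # (vertex a, vertex b, rest length)


def spring_forces(pos: Sequence[Vec3], springs: Sequence[Spring], k: float) -> List[List[float]]:
    forces = [[0.0, 0.0, 0.0] for _ in pos]
    for a, b, rest in springs:
        d = [pos[b][c] - pos[a][c] for c in range(3)]
        length = math.sqrt(sum(x * x for x in d))
        if length == 0.0:
            continue
        s = k * (length - rest) / length  # Hooke's law along the spring direction
        for c in range(3):
            forces[a][c] += s * d[c]  # stretched spring pulls a towards b
            forces[b][c] -= s * d[c]
    return forces


def step(pos: Sequence[Vec3], vel: Sequence[Vec3], springs: Sequence[Spring], k: float,
         mass: float, dt: float, gravity: Vec3 = (0.0, -9.81, 0.0)):
    """One semi-implicit Euler step; each vertex update is independent."""
    forces = spring_forces(pos, springs, k)
    new_vel = [tuple(v[c] + dt * (f[c] / mass + gravity[c]) for c in range(3))
               for v, f in zip(vel, forces)]
    new_pos = [tuple(p[c] + dt * v[c] for c in range(3)) for p, v in zip(pos, new_vel)]
    return new_pos, new_vel


def aabb(pos: Sequence[Vec3]) -> Tuple[Vec3, Vec3]:
    return (tuple(min(p[c] for p in pos) for c in range(3)),
            tuple(max(p[c] for p in pos) for c in range(3)))


def aabbs_overlap(a: Tuple[Vec3, Vec3], b: Tuple[Vec3, Vec3]) -> bool:
    # Boxes overlap only if their intervals overlap on all three axes.
    return all(a[0][c] <= b[1][c] and b[0][c] <= a[1][c] for c in range(3))
```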

Many-objective joint optimization for dependency-aware task offloading and service caching in mobile edge computing

  • Xiangyu Shi;Zhixia Zhang;Zhihua Cui;Xingjuan Cai
    • KSII Transactions on Internet and Information Systems (TIIS) / v.18 no.5 / pp.1238-1259 / 2024
  • Previous studies on the joint optimization of computation offloading and service caching policies in Mobile Edge Computing (MEC) have often neglected the impact of dependency-aware subtasks, edge server resource constraints, and multiple users on policy formulation. To remedy this deficiency, this paper proposes a many-objective joint optimization model for dependency-aware task offloading and service caching (MaJDTOSC). MaJDTOSC considers the impact of dependencies between subtasks on the joint optimization of task offloading and service caching in multi-user, resource-constrained MEC scenarios, and takes task completion time, energy consumption, subtask hit rate, load variability, and storage resource utilization as optimization objectives. To solve MaJDTOSC more effectively, a many-objective evolutionary algorithm, TSMSNSGAIII, based on a three-stage mating selection strategy is also proposed. Simulation results show that TSMSNSGAIII performs well and stably in solving MaJDTOSC for different numbers of users, and that it converges faster. TSMSNSGAIII is therefore expected to provide appropriate subtask offloading and service caching strategies in multi-user, resource-constrained MEC scenarios, which can greatly improve system offloading efficiency and enhance the user experience.
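
NSGA-III-family algorithms, including the TSMSNSGAIII variant named above, rank candidate solutions by Pareto dominance before applying further selection. The sketch below shows only that shared building block, the fast non-dominated sort, and does not implement the paper's three-stage mating selection or reference-point niching. It assumes all objectives are minimized, so maximized objectives such as hit rate or utilization would be negated first; the objective vectors in the test are hypothetical.

```python
from typing import List, Sequence


def dominates(a: Sequence[float], b: Sequence[float]) -> bool:
    """a dominates b: no worse in every objective and strictly better in one (minimization)."""
    return all(x <= y for x, y in zip(a, b)) and any(x < y for x, y in zip(a, b))


def non_dominated_sort(points: Sequence[Sequence[float]]) -> List[List[int]]:
    """Return Pareto fronts as lists of indices, best front first."""
    n = len(points)
    dominated = [[] for _ in range(n)]  # indices that point i dominates
    count = [0] * n                     # how many points dominate point i
    fronts: List[List[int]] = [[]]
    for i in range(n):
        for j in range(n):
            if i == j:
                continue
            if dominates(points[i], points[j]):
                dominated[i].append(j)
            elif dominates(points[j], points[i]):
                count[i] += 1
        if count[i] == 0:
            fronts[0].append(i)
    k = 0
    while fronts[k]:
        nxt = []
        for i in fronts[k]:
            for j in dominated[i]:
                count[j] -= 1
                if count[j] == 0:  # all dominators already placed in earlier fronts
                    nxt.append(j)
        fronts.append(nxt)
        k += 1
    return fronts[:-1]  # drop the trailing empty front
```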

A Study on Parallel AES Cipher Algorithm based on Multi Processor (멀티프로세서 기반의 병렬 AES 암호 알고리즘에 관한 연구)

  • Park, Jung-Oh;Oh, Gi-Oug
    • Journal of the Korea Society of Computer and Information / v.17 no.1 / pp.171-181 / 2012
  • This paper describes the AES cipher algorithm, a symmetric-key cipher, and proposes the design of a parallel cipher algorithm that makes the fullest possible use of the resources of a multi-core processor. By allocating the cipher computation according to the number of cores, the proposed parallel algorithm was confirmed to execute in parallel, with a performance increase of about 30% over the sequential AES cipher algorithm. Its encryption/decryption correctness was verified with a binary comparison tool, which confirmed that the AES cipher algorithm and the proposed parallel algorithm produce identical binary output and that the decrypted binaries are also identical. The proposed parallel cipher algorithm for multi-core environments can be applied to authentication and payment in financial services on PCs, laptops, servers, and mobile devices, and can be used wherever high-speed encryption of large data is required.
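
Allocating cipher work by core count requires a mode of operation whose blocks are independent. The abstract does not state which mode was used; this sketch assumes a counter (CTR) mode, in which each core can encrypt its share of 16-byte blocks knowing only its starting counter. To stay within the standard library, the block function is a clearly labeled stand-in built from SHA-256 and is NOT AES and not suitable for real use. The check mirrors the paper's binary comparison: the parallel output must equal the single-core output, and decrypting must restore the input. The function names are hypothetical.

```python
import hashlib
from concurrent.futures import ThreadPoolExecutor

BLOCK = 16  # AES block size in bytes


def keystream_block(key: bytes, nonce: bytes, counter: int) -> bytes:
    # STAND-IN for AES_k(nonce || counter). Not AES, not secure; illustration only.
    return hashlib.sha256(key + nonce + counter.to_bytes(8, "big")).digest()[:BLOCK]


def xor_chunk(key: bytes, nonce: bytes, data: bytes, first_block: int) -> bytes:
    """CTR-style XOR of one contiguous chunk that starts at block index first_block."""
    out = bytearray(len(data))
    for off in range(0, len(data), BLOCK):
        ks = keystream_block(key, nonce, first_block + off // BLOCK)
        for i, byte in enumerate(data[off:off + BLOCK]):
            out[off + i] = byte ^ ks[i]
    return bytes(out)


def parallel_ctr(key: bytes, nonce: bytes, data: bytes, cores: int = 4) -> bytes:
    """Encrypt or decrypt (same operation in CTR) with one contiguous chunk per core."""
    n_blocks = (len(data) + BLOCK - 1) // BLOCK
    per_core = max(1, (n_blocks + cores - 1) // cores)  # blocks assigned to each core
    starts = range(0, n_blocks, per_core)
    with ThreadPoolExecutor(max_workers=cores) as pool:
        parts = pool.map(
            lambda b: xor_chunk(key, nonce, data[b * BLOCK:(b + per_core) * BLOCK], b),
            starts)
        return b"".join(parts)  # map preserves chunk order
```

With a real AES implementation in a native library, the same chunking would let each core run independently; in pure Python the threads only demonstrate the partitioning, not the speedup.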