• Title/Summary/Keyword: General purpose computing

High Throughput Parallel KMP Algorithm Considering CPU-GPU Memory Hierarchy (CPU-GPU 메모리 계층을 고려한 고처리율 병렬 KMP 알고리즘)

  • Park, Soeun;Kim, Daehee;Lee, Myungho;Park, Neungsoo
    • The Transactions of The Korean Institute of Electrical Engineers / v.67 no.5 / pp.656-662 / 2018
  • Pattern matching algorithms are widely used in many application fields such as bioinformatics and intrusion detection. Among the many string matching algorithms, the KMP (Knuth-Morris-Pratt) algorithm is commonly used because of its fast execution time on large texts. However, the processing speed of the KMP algorithm is still limited when the text size increases significantly. In this paper, we propose a high-throughput parallel KMP algorithm that considers the CPU-GPU memory hierarchy, based on OpenCL for GPGPU (General-Purpose computing on Graphics Processing Units). We focus on optimizing the allocation of work-items and work-groups, copying the pattern data and the failure table into local memory, and overlapping data transfer with the string matching operations. The experimental results show that the optimized parallel KMP algorithm runs about 3.6 times faster than the non-optimized parallel KMP algorithm.
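
As a rough illustration of the ideas named in this abstract (and not the authors' OpenCL implementation), the Python sketch below builds a KMP failure table and then searches the text in independent chunks, the way separate work-items might. The function names, the sequential loop standing in for parallel execution, and the `chunk_size` parameter are all assumptions made for this example. Each chunk is extended by `len(pattern) - 1` characters so that a match straddling a chunk boundary is reported exactly once.

```python
def failure_table(pattern):
    """For each position, length of the longest proper prefix that is also a suffix."""
    fail = [0] * len(pattern)
    k = 0
    for i in range(1, len(pattern)):
        while k > 0 and pattern[i] != pattern[k]:
            k = fail[k - 1]
        if pattern[i] == pattern[k]:
            k += 1
        fail[i] = k
    return fail


def kmp_search(text, pattern, fail, base=0):
    """Return the start offsets (shifted by base) of every match; pattern must be non-empty."""
    hits, k = [], 0
    for i, ch in enumerate(text):
        while k > 0 and ch != pattern[k]:
            k = fail[k - 1]
        if ch == pattern[k]:
            k += 1
        if k == len(pattern):
            hits.append(base + i - k + 1)
            k = fail[k - 1]
    return hits


def chunked_kmp(text, pattern, chunk_size):
    """Search fixed-size chunks independently (hypothetical stand-in for work-items)."""
    m = len(pattern)
    fail = failure_table(pattern)  # shared, read-only table (a local-memory candidate)
    hits = []
    for start in range(0, len(text), chunk_size):
        # Overlap of m - 1 characters catches matches that cross the chunk boundary.
        end = min(len(text), start + chunk_size + m - 1)
        hits.extend(kmp_search(text[start:end], pattern, fail, base=start))
    return hits
```

For example, `chunked_kmp("abababcabab", "abab", 3)` reports the same offsets as a single pass of `kmp_search` over the whole text.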

Implementation of high performance parallel LU factorization program for multi-threads on GPGPUs (GPGPU의 멀티 쓰레드를 활용한 고성능 병렬 LU 분해 프로그램의 구현)

  • Shin, Bong-Hi;Kim, Young-Tae
    • Journal of Internet Computing and Services / v.12 no.3 / pp.131-137 / 2011
  • GPUs were originally designed for graphics processing, and GPGPUs are general-purpose GPUs for numerical computation with high performance and low power consumption. In this paper, we implemented a parallel LU factorization program for GPGPUs. In CUDA, the computational environment for Nvidia GPGPUs, the domain is divided into blocks, and multiple threads compute the sub-blocks simultaneously. In an LU factorization program, the computation order must be imposed explicitly because of data dependences. To resolve the data dependency, we propose a parallel LU program for GPGPUs and also explain a parallel reduction algorithm for the partial pivoting of LU factorization. Finally, we present a performance analysis to show the efficiency of the parallel LU factorization program based on multi-threading on GPGPUs.
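
The data dependence and the pivot reduction mentioned in this abstract can be sketched in plain Python. This is an assumption-laden illustration, not the authors' CUDA program: `argmax_abs_reduce` and `lu_partial_pivot` are hypothetical names, and the pairwise loop only mimics the shape of a tree reduction that threads would carry out in parallel. Column `k` cannot start before columns `0..k-1` are finished, while the row updates inside one column step are independent of each other.

```python
def argmax_abs_reduce(values):
    """Pairwise (tree-shaped) reduction: index of the entry with the largest |value|."""
    candidates = list(range(len(values)))
    while len(candidates) > 1:
        survivors = []
        for j in range(0, len(candidates) - 1, 2):
            a, b = candidates[j], candidates[j + 1]
            survivors.append(a if abs(values[a]) >= abs(values[b]) else b)
        if len(candidates) % 2:
            survivors.append(candidates[-1])  # odd element advances unchanged
        candidates = survivors
    return candidates[0]


def lu_partial_pivot(a):
    """Doolittle LU with row pivoting on a copy of a; returns (perm, lu).

    lu holds U on and above the diagonal and the multipliers of L below it,
    so that row perm[i] of a equals row i of L times U.
    """
    n = len(a)
    lu = [row[:] for row in a]
    perm = list(range(n))
    for k in range(n):
        # Sequential outer loop: step k depends on the results of steps 0..k-1.
        p = k + argmax_abs_reduce([lu[i][k] for i in range(k, n)])
        if lu[p][k] == 0:
            raise ValueError("matrix is singular")
        lu[k], lu[p] = lu[p], lu[k]
        perm[k], perm[p] = perm[p], perm[k]
        # The updates of rows k+1..n-1 do not depend on one another.
        for i in range(k + 1, n):
            lu[i][k] /= lu[k][k]
            for j in range(k + 1, n):
                lu[i][j] -= lu[i][k] * lu[k][j]
    return perm, lu
```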

A Smart Slab Allocator for Wireless Sensor Operating Systems (무선 센서 운영체제를 위한 지능형 슬랩 할당기)

  • Min, Hong;Yi, Sang-Ho;Heo, Jun-Young;Kim, Seok-Hyun;Cho, Yoo-Kun;Hong, Ji-Man
    • Journal of KIISE:Computing Practices and Letters / v.14 no.7 / pp.708-712 / 2008
  • Existing dynamic memory allocation schemes for general-purpose operating systems cannot be applied directly to wireless sensor networks (WSNs). Because these schemes do not consider the features of WSNs, they consume a lot of energy and waste memory space through fragmentation. In this paper, we identify the features of WSN applications and build a model that addresses these issues. Based on this research, we propose a slab allocator that reduces the execution time and the memory management space. We also evaluate the performance of our scheme by comparing it with one of the previous systems.

Parallel Process System and its Application to Steam Generator Structural Analysis

  • Chang Yoon-Suk;Ko Han-Ok;Choi Jae-Boong;Kim Young-Jin
    • Journal of Mechanical Science and Technology / v.19 no.11 / pp.2007-2015 / 2005
  • Large-scale analysis to evaluate complex material and structural behaviors is a topic of interest in diverse engineering and scientific fields. Also, the utilization of massively parallel processors has been a recent trend in high performance computing. The objective of this paper is to introduce a parallel process system which consists of a general-purpose finite element analysis solver as well as a parallelized PC cluster. The latter was constructed using eight processing elements, and the former was developed by adopting both the hierarchical domain decomposition method and the balancing domain decomposition method. Then, to verify the efficiency of the established system, it was applied to the structural analysis of a steam generator in a nuclear power plant. Since the prototype evaluation results agreed well with the corresponding reference solutions, it is believed that, after reinforcement of the PC cluster by increasing the number of processing elements, the parallel process system can be utilized as a useful tool for advanced structural integrity evaluation.

The study on the Efficient methodology to apply the GPU for military information system improvement (국방정보시스템 성능향상을 위한 효율적인 GPU적용방안 연구)

  • Kauh, Janghyuk;Lee, Dongho
    • Journal of Korea Society of Digital Industry and Information Management / v.11 no.1 / pp.27-35 / 2015
  • With the increasing number of GPU (Graphics Processing Unit) cores, studies on high performance computing platforms using GPUs have been actively conducted in recent years. This trend has led to the development of GPGPU (General Purpose GPU) and the CUDA (Compute Unified Device Architecture) framework. In this paper, we explain the benefits of GPU-based systems and propose the ICIDF (Identify Compute-Intensive Data set and Function) methodology for applying GPU technology to legacy military information systems to improve performance. To demonstrate the efficiency of this methodology, we applied it to a CPU-based AES program obtained from the Internet. Simply changing the data structure improved the performance of the AES program. As a result, the performance of the GPU-based AES program improved gradually, by up to 10 times. Depending on the developer's ability, additional performance improvement can be expected. The remaining problem is heat, but this has been much improved by advances in cooling technology.

A framework for parallel processing in multiblock flow computations (다중블록 유동해석에서 병렬처리를 위한 시스템의 구조)

  • Park, Sang-Geun;Lee, Geon-U
    • Transactions of the Korean Society of Mechanical Engineers B / v.21 no.8 / pp.1024-1033 / 1997
  • The past several years have witnessed an ever-increasing acceptance and adoption of parallel processing, both for high performance scientific computing and for more general-purpose applications. Furthermore, with the increasing need to perform complex flow calculations efficiently, the use of the message passing model on distributed networks has emerged as an important alternative to expensive supercomputers. This work attempts to provide a generic framework to enable the parallelization of all CFD-related work using the master-slave model. This framework consists of (1) input geometry, (2) domain decomposition, (3) grid generation, (4) flow computations, (5) flow visualization, and (6) output display as the sequential components, but performs the computations for (2) to (5) in parallel on a workstation cluster. The flow computations are parallelized by having multiple copies of the flow code solve a PDE on different spatial regions on different processors, while their flow data are exchanged across the region boundaries and the solution is time-stepped. The Parallel Virtual Machine (PVM) is used for distributed communication in this work.
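
The pattern described in this abstract (copies of a solver working on separate spatial regions, exchanging boundary data, then time-stepping) can be illustrated with a toy problem. The sketch below is an assumption for illustration only: it uses a 1D explicit diffusion update with fixed zero boundaries instead of a flow code, plain Python lists instead of PVM messages, and hypothetical function names. It requires `parts` to be no larger than the number of cells.

```python
def step(u, left, right, alpha):
    """One explicit diffusion step on a block, given its two boundary neighbours."""
    padded = [left] + u + [right]
    return [padded[i] + alpha * (padded[i - 1] - 2 * padded[i] + padded[i + 1])
            for i in range(1, len(padded) - 1)]


def solve_single(u, steps, alpha=0.25):
    """Reference solution: time-step the whole domain with zero boundary values."""
    for _ in range(steps):
        u = step(u, 0.0, 0.0, alpha)
    return u


def solve_decomposed(u, parts, steps, alpha=0.25):
    """Split the domain into blocks and exchange edge values before every step."""
    size = len(u) // parts
    blocks = [u[i * size:(i + 1) * size] for i in range(parts - 1)]
    blocks.append(u[(parts - 1) * size:])  # last block takes the remainder
    for _ in range(steps):
        # Boundary exchange: each block reads its neighbours' edge cells.
        # In a distributed setting this is where messages would be sent.
        lefts = [0.0] + [b[-1] for b in blocks[:-1]]
        rights = [b[0] for b in blocks[1:]] + [0.0]
        # Each block update is independent and could run on its own processor.
        blocks = [step(b, l, r, alpha) for b, l, r in zip(blocks, lefts, rights)]
    return [x for b in blocks for x in b]
```

Because every block sees the same neighbour values it would see in the undivided domain, the decomposed run reproduces the single-domain result.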

DMGL: An OpenGL ES Based Mobile 3D Rendering Libraries (DMGL: OpenGL ES 기반 모바일 3D 렌더링 라이브러리)

  • Hwang, Gyu-Hyun;Park, Sang-Hun
    • Journal of Korea Multimedia Society / v.11 no.8 / pp.1160-1168 / 2008
  • Recent technological innovations in mobile hardware, which make it possible to implement real-time 3D rendering effects in mobile environments, have created the potential to develop realistic mobile application programs. This paper presents platform-independent, OpenGL ES based, real-time mobile rendering libraries, called DMGL, for supporting high quality 3D rendering on handheld devices. The libraries allow programmers who develop mobile graphics software to generate a variety of advanced real-time 3D graphics effects without great effort. Moreover, the GPGPU-based libraries provide a set of functions to solve complex equations for simulating natural phenomena such as smoke and fire, and to render the results in real time.

Performance Evaluation of an On-Chip Multiprocessor for Object Recognition (객체 인식을 위한 다중처리 마이크로프로세서의 성능 평가)

  • Chung, Yong-Wha;Park, Kyoung;Choi, Sung-Hoon;Hahn, Woo-Jong
    • Journal of KIISE:Computer Systems and Theory / v.27 no.6 / pp.558-566 / 2000
  • Object recognition is a challenging application for high-performance computing. Currently, the superscalar architecture dominates today's microprocessor marketplace. As more transistors are integrated onto larger dies, however, an on-chip multiprocessor is regarded as a promising alternative to the superscalar microprocessor. This paper examines the behavior of object recognition on the on-chip multiprocessor, which will be employed in general-purpose parallel machines. To obtain the performance characteristics of the microprocessor, a program-driven simulator and its programming environment were developed. The simulation results showed that the on-chip multiprocessor can exploit thread-level parallelism effectively and offers a promising architecture for the object recognition application.

Depth-adaptive Sharpness Adjustments for Stereoscopic Perception Improvement and Hardware Implementation

  • Kim, Hak Gu;Kang, Jin Ku;Song, Byung Cheol
    • IEIE Transactions on Smart Processing and Computing / v.3 no.3 / pp.110-117 / 2014
  • This paper reports a depth-adaptive sharpness adjustment algorithm for stereoscopic perception improvement, and presents its field-programmable gate array (FPGA) implementation results. The first step of the proposed algorithm is to estimate the depth information of an input stereo video on a block basis. Second, the objects in the input video are segmented according to their depths. Third, the sharpness of the foreground objects is enhanced and that of the background is maintained or weakened. This paper proposes a new sharpness enhancement algorithm to suppress visually annoying artifacts, such as jagging and halos. The simulation results show that the proposed algorithm can improve stereoscopic perception without intentional depth adjustments. In addition, the hardware architecture of the proposed algorithm was designed and implemented on a general-purpose FPGA board. Real-time processing for full high-definition stereo videos was accomplished using 30,278 look-up tables, 24,553 registers, and 1,794,297 bits of memory at an operating frequency of 200 MHz.

Algorithmic GPGPU Memory Optimization

  • Jang, Byunghyun;Choi, Minsu;Kim, Kyung Ki
    • JSTS:Journal of Semiconductor Technology and Science / v.14 no.4 / pp.391-406 / 2014
  • The performance of General-Purpose computation on Graphics Processing Units (GPGPU) is heavily dependent on memory access behavior. This sensitivity is due to a combination of the underlying Massively Parallel Processing (MPP) execution model present on GPUs and the lack of architectural support for handling irregular memory access patterns. Application performance can be significantly improved by applying memory-access-pattern-aware optimizations that exploit knowledge of the characteristics of each access pattern. In this paper, we present an algorithmic methodology to semi-automatically find the best mapping of the memory accesses present in serial loop nests onto underlying data-parallel architectures, based on a comprehensive static memory access pattern analysis. To that end, we present a simple yet powerful mathematical model that captures all memory access pattern information present in serial data-parallel loop nests. We then show how this model is used in practice to select the most appropriate memory space for data and to search for an appropriate thread mapping and work-group size from a large design space. To evaluate the effectiveness of our methodology, we report execution speedups using selected benchmark kernels that cover a wide range of memory access patterns commonly found in GPGPU workloads. Our experimental results are reported using the industry-standard heterogeneous programming language, OpenCL, targeting the NVIDIA GT200 architecture.