• Title/Summary/Keyword: parallel performance

Hardware Design and Implementation of a Parallel Processor for High-Performance Multimedia Processing (고성능 멀티미디어 처리용 병렬프로세서 하드웨어 설계 및 구현)

  • Kim, Yong-Min;Hwang, Chul-Hee;Kim, Cheol-Hong;Kim, Jong-Myon
    • Journal of the Korea Society of Computer and Information / v.16 no.5 / pp.1-11 / 2011
  • As the use of mobile multimedia devices has increased in recent years, so has the need for high-performance multimedia processors. In this regard, we propose a SIMD (Single Instruction, Multiple Data) based parallel processor that supports high-performance multimedia applications with low energy consumption. The proposed parallel processor consists of 16 processing elements (PEs) and operates with a 3-stage pipeline. Experimental results indicate that the proposed parallel processor outperforms conventional parallel processors. In addition, it outperforms the commercial high-performance TI C6416 DSP in terms of performance (1.4-31.4x better) and energy efficiency (5.9-8.1x better) with the same 130nm technology and 720 MHz clock frequency. The proposed parallel processor was developed in Verilog HDL and verified with an FPGA prototype system.

Design of Parallel Algorithms for Conventional Matched-Field Processing over Array of DSP Processors (다중 DSP 프로세서 기반의 병렬 수중정합장처리 알고리즘 설계)

  • Kim, Keon-Wook
    • Journal of the Institute of Electronics Engineers of Korea SP / v.44 no.4 s.316 / pp.101-108 / 2007
  • Parallel processing algorithms, coupled with advanced networking and distributed computing architectures, improve the overall computational performance, dependability, and versatility of a digital signal processing system. In this paper, novel parallel algorithms are introduced and investigated for an advanced sonar algorithm, conventional matched-field processing (CMFP). Based on a specific domain, each parallel algorithm decomposes the sequential workload in order to obtain scalable parallel speedup. Depending on the processing requirements of the algorithm, the computational performance of the parallel algorithm reveals different characteristics. The high-complexity algorithm, CMFP, shows scalable parallel performance on the array of DSP processors. The impact on parallel performance of workload balancing, communication scheme, algorithm complexity, processor speed, network performance, and testbed configuration is explored.

Implementation and Performance Analysis of a Parallel SIMPLER Model Based on Domain Decomposition (영역 분할에 의한 SIMPLER 모델의 병렬화와 성능 분석)

  • Kwak Ho Sang;Lee Sangsan
    • Journal of computational fluids engineering / v.3 no.1 / pp.22-29 / 1998
  • A parallel implementation of a SIMPLER finite volume model is presented. The parallelization is based on domain decomposition and explicit message passing using MPI and SHMEM. Two parallel solvers for the tridiagonal matrix equation are employed. The implementation is verified on the Cray T3E system for a benchmark problem of natural convection in a sidewall-heated cavity. The test results illustrate good scalability of the present parallel models. Performance issues are discussed in terms of convergence as well as conventional parallel overheads and single-processor performance. The effectiveness of a localized matrix solution algorithm is demonstrated.

A Study on Improvement of Low-power Memory Architecture in IoT/edge Computing (IoT/에지 컴퓨팅에서 저전력 메모리 아키텍처의 개선 연구)

  • Cho, Doosan
    • Journal of the Korean Society of Industry Convergence / v.24 no.1 / pp.69-77 / 2021
  • Low-cost design methodologies are widely used for IoT devices. In such a networked device, memory is composed of flash memory, SRAM, DRAM, etc., and because the device processes a large amount of data, memory design is an important factor in system performance. Each device therefore selects optimized design factors such as function, performance, and cost according to market demand. The design space of a memory architecture for low-cost IoT devices is very limited, being confined to configurations of SRAM, flash memory, and DRAM. In order to process as much data as possible in the same space, an architecture that supports parallel processing units is usually provided. Such a parallel architecture provides high performance at low cost. However, it needs precise software techniques for mapping instructions and data onto the parallel architecture. This paper proposes an instruction/data mapping method to support optimized parallel processing performance. The proposed method optimizes system performance by actively using hardware and software parallelism.

PERFORMANCE ANALYSIS OF THE PARALLEL CUPID CODE IN DISTRIBUTED MEMORY SYSTEM BASED ETHERNET AND INFINIBAND NETWORK (이더넷과 인피니밴드 네트워크 기반의 분산 메모리 시스템에서 병렬성능 분석)

  • Jeon, B.J.;Choi, H.G.
    • Journal of computational fluids engineering / v.19 no.2 / pp.24-29 / 2014
  • In this study, the parallel performance of the CUPID code has been investigated on both Ethernet and InfiniBand network systems to examine the effect of cache memory and network speed. The bi-conjugate gradient solver of the CUPID code has been parallelised using the domain decomposition method and the Message Passing Interface (MPI). It is shown that the parallel performance of the Ethernet system is worse than that of the InfiniBand system due to its slow network speed and small cache memory. It is also found that the parallel performance of each system deteriorates for a small problem due to the communication overhead, but the InfiniBand system performs better than the Ethernet system thanks to its much faster network speed. For a large problem, the parallel performance depends less on the network system.

Parallel implementations and their performance evaluations of a SOFM neural network on the multicomputer (다중컴퓨터망에서 SOFM 신경회로망의 병렬구현 및 성능평가)

  • 김선종;최흥문
    • Journal of the Korean Institute of Telematics and Electronics B / v.33B no.10 / pp.90-97 / 1996
  • This paper presents efficient parallel implementations of a SOFM neural network on a multicomputer, together with their performance evaluations. We investigate the parallel performance as the size of the neural network N, the number of patterns L, and the number of processors p increase. We propose an analytical performance evaluation model for each of the parallel implementations and verify the validity of the model through experiments. Analytical results show that the number of processors giving maximum speedup for the network decomposition and the training-set decomposition increases in proportion to √N and √L, respectively. The performance of both decompositions depends on the number of training patterns L and the size of the neural network N and, if L ≥ 0.423N, the performance of the training-set decomposition is proved to be better than that of the network decomposition.

A Numerical Analysis on Performance of Parallel Type Ejector for High Altitude Simulation (고공 환경 모사를 위한 병렬형 이젝터 구성에 따른 특성 연구)

  • Shin, Donghae;Yu, Isang;Shin, Minku;Oh, Jeonghwa;Ko, Youngsung;Kim, Sunjin
    • Journal of the Korean Society of Propulsion Engineers / v.23 no.1 / pp.52-60 / 2019
  • In this study, the performance and structure of a parallel ejector comprised of multiple single ejectors were confirmed through numerical analysis. The same design variables (mass suction ratio, compression ratio, and expansion ratio) relevant to the performance of a single ejector were considered in the design of the parallel ejector. Analytical results showed that there was no significant difference in the performance of either system related to the operating mass suction ratio; however, the system size was significantly reduced. In addition, it was confirmed that when ejectors of the same performance capacity are arranged in parallel, the combined mass suction ratio is lower than that of the single ejector, allowing a lower pressure to be realized. The results of the analysis indicated that the parallel ejector's performance is not significantly different from that of any single ejector, but confirmed that the parallel ejector can offer a configuration-dependent advantage in size and operation.

Development of Real time Air Quality Prediction System

  • Oh, Jai-Ho;Kim, Tae-Kook;Park, Hung-Mok;Kim, Young-Tae
    • Proceedings of the Korean Environmental Sciences Society Conference / 2003.11a / pp.73-78 / 2003
  • In this research, we implement the Realtime Air Diffusion Prediction System, a parallel Fortran model running on distributed-memory parallel computers. The system is designed for air diffusion simulations with four-dimensional data assimilation. For regional air quality forecasting, a series of dynamic downscaling techniques is adopted using the NCAR/Penn State MM5 atmospheric model. The realtime initial data are provided daily from the KMA (Korea Meteorological Administration) global spectral model output. Producing a 24-hour air quality forecast with this four-step dynamic downscaling (27km, 9km, 3km, and 1km) takes huge computational resources. Parallel implementation of the realtime system is imperative to achieve increased throughput, since the realtime system has to run with correct timing behavior and the sequential code requires a large amount of CPU time for typical simulations. The parallel system uses MPI (Message Passing Interface), a standard library that provides high-level routines for message passing. We validate the parallel model by comparing it with the sequential model. For realtime running, we build a cluster computer, a distributed-memory parallel computer that links high-performance PCs with a high-speed interconnection network. We use 32 2-CPU nodes and a Myrinet network for the cluster. Since cluster computers are more cost-effective than conventional distributed parallel computers, we can build a dedicated realtime computer. The system also includes a web-based GUI (Graphical User Interface) for convenient system management and performance monitoring, so that end users can easily restart the system when it fails. Performance of the parallel model is analyzed by comparing its execution time with that of the sequential model, and by calculating communication overhead and load imbalance, which are common problems in parallel processing. Performance analysis is carried out on our cluster of 32 2-CPU nodes.
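
The abstract above names execution-time comparison, communication overhead, and load imbalance as its performance measures but does not give formulas. The following is a minimal sketch using common textbook definitions, which are assumptions here and not necessarily the ones the authors used. All function names and the sample timings are hypothetical.

```python
from typing import Sequence


def speedup(t_seq: float, t_par: float) -> float:
    """Ratio of sequential to parallel wall-clock time (assumed definition)."""
    return t_seq / t_par


def efficiency(t_seq: float, t_par: float, n_procs: int) -> float:
    """Speedup divided by the number of processors used."""
    return speedup(t_seq, t_par) / n_procs


def load_imbalance(busy_times: Sequence[float]) -> float:
    """Max per-process busy time over the mean, minus 1 (0.0 means perfectly balanced)."""
    mean = sum(busy_times) / len(busy_times)
    return max(busy_times) / mean - 1.0


def communication_fraction(comm_time: float, total_time: float) -> float:
    """Share of the parallel run time spent in message passing."""
    return comm_time / total_time


if __name__ == "__main__":
    # Hypothetical timings in seconds, not taken from the paper.
    print(speedup(3200.0, 125.0))
    print(efficiency(3200.0, 125.0, 64))
    print(load_imbalance([110.0, 120.0, 125.0, 105.0]))
    print(communication_fraction(20.0, 125.0))
```

With measured per-process timings, these four numbers are usually enough to tell whether poor scaling comes from communication cost or from uneven work distribution.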

A Study on the Parallel Line Pivoted Pad Thrust Bearing (평행선 지지식 추력베어링에 관한 연구)

  • 이경우;김종수;제양규
    • Tribology and Lubricants / v.15 no.1 / pp.24-28 / 1999
  • This paper describes a new pivoting technique to improve bearing performance in pivoted pad thrust bearings. The technique sets the pivot line of a line-pivoted pad thrust bearing parallel to the trailing edge of a sector-shaped pad. Bearing performance factors such as load-carrying capacity, frictional torque, and flow rate are numerically investigated for conventional point-pivoted and line-pivoted pads and for the new parallel-line pivoting technique. It is shown that the load-carrying capacity can be maximized with the new technique.

An efficient Storage Reclamation Algorithm for RISC Parallel Processing (RISC 병렬 처리를 위한 기억공간의 효율적인 활용 알고리즘)

  • 이철원;임인칠
    • Journal of the Korean Institute of Telematics and Electronics B / v.28B no.9 / pp.703-711 / 1991
  • In this paper, an efficient storage reclamation algorithm for RISC parallel processing in object-oriented programming environments is presented. Memory management for dynamic memory allocation and frequent memory access in object-oriented programming is the main factor that decreases RISC parallel processing performance. The proposed algorithm efficiently allocates the memory space of a RISC computer, which requires frequent memory access, and thereby increases RISC parallel processing performance. The efficiency of the proposed algorithm is verified by an implementation in the C language on a SUN SPARC (4.3 BSD UNIX).
