• Title/Summary/Keyword: Parallel pipeline

Search Result 172, Processing Time 0.024 seconds

1V-2.7ns 32b self-timed parallel carry look-ahead adder with wave pipeline dclock control (웨이브 파이프라인 클럭 제어에 의한 1V-2.7ns 32비트 자체동기방식 병렬처리 덧셈기의 설계)

  • 임정식;조제영;손일헌
    • Journal of the Korean Institute of Telematics and Electronics C
    • /
    • v.35C no.7
    • /
    • pp.37-45
    • /
    • 1998
  • A 32-b self-timed parallel carry look-ahead adder (PCLA) designed for 0.5.mum. single threshold low power CMOS technology is demonstrated to operate with 2.7nsec delay at 8mW under 1V power supply. Compared to static PCLA and DPL adder, the self-timed PCLA designed with NORA logic provides the best performance at the power consumption comparable to other adder structures. The wave pipelined clock control play a crucial role in achieving the low power, high performance of this adder by eliminating the unnecessary power consumption due to the short-circuit current during the precharge phase. Th enoise margin has been improved by adopting the physical design of staic CMOS logic structure with controlled transistor sizes.

  • PDF

A Study on VLSI-Oriented 2-D Systolic Array Processor Design for APP (Algebraic Path Problem) (VLSI 지향적인 APP용 2-D SYSTOLIC ARRAY PROCESSOR 설계에 관한 연구)

  • 이현수;방정희
    • Journal of the Korean Institute of Telematics and Electronics B
    • /
    • v.30B no.7
    • /
    • pp.1-13
    • /
    • 1993
  • In this paper, the problems of the conventional special-purpose array processor such as the deficiency of flexibility have been investigated. Then, a new modified methodology has been suggested and applied to obtain the common solution of the three typical App algorithms like SP(Shortest Path), TC(Transitive Closure), and MST(Minimun Spanning Tree) among the various APP algorithms using the similar method to obtain the solution. In the newly proposed APP parallel algorithm, real-time Processing is possible, without the structure enhancement and the functional restriction. In addition, we design 2-demensional bit-parallel low-triangular systolic array processor and the 1-PE in detail. For its evaluation, we consider its computational complexity according to bit-processing method and describe relationship of total chip size and execution time. Therefore, the proposed processor obtains, on which a large data inputs in real-time, 3n-4 execution time which is optimal o(n) time complexity, o(n$^{2}$) space complexity which is the number of total gate and pipeline period rate is one.

  • PDF

Exploiting Thread-Level Parallelism in Lockstep Execution by Partially Duplicating a Single Pipeline

  • Oh, Jaeg-Eun;Hwang, Seok-Joong;Nguyen, Huong Giang;Kim, A-Reum;Kim, Seon-Wook;Kim, Chul-Woo;Kim, Jong-Kook
    • ETRI Journal
    • /
    • v.30 no.4
    • /
    • pp.576-586
    • /
    • 2008
  • In most parallel loops of embedded applications, every iteration executes the exact same sequence of instructions while manipulating different data. This fact motivates a new compiler-hardware orchestrated execution framework in which all parallel threads share one fetch unit and one decode unit but have their own execution, memory, and write-back units. This resource sharing enables parallel threads to execute in lockstep with minimal hardware extension and compiler support. Our proposed architecture, called multithreaded lockstep execution processor (MLEP), is a compromise between the single-instruction multiple-data (SIMD) and symmetric multithreading/chip multiprocessor (SMT/CMP) solutions. The proposed approach is more favorable than a typical SIMD execution in terms of degree of parallelism, range of applicability, and code generation, and can save more power and chip area than the SMT/CMP approach without significant performance degradation. For the architecture verification, we extend a commercial 32-bit embedded core AE32000C and synthesize it on Xilinx FPGA. Compared to the original architecture, our approach is 13.5% faster with a 2-way MLEP and 33.7% faster with a 4-way MLEP in EEMBC benchmarks which are automatically parallelized by the Intel compiler.

  • PDF

QoS Guarantee in Partial Failure of Clustered VOD Server (클러스터 VOD 서버의 부분적 장애에서 QoS 보장)

  • Lee, Joa-Hyoung;Jung, In-Bum
    • The KIPS Transactions:PartC
    • /
    • v.16C no.3
    • /
    • pp.363-372
    • /
    • 2009
  • For large scale VOD service, cluster servers are spotlighted to their high performance and low cost. A cluster server usually consists of a front-end node and multiple back-end nodes. Though increasing the number of back-end nodes can result in the more QoS streams for clients, the possibility of failures in back-end nodes is proportionally increased. The failure causes not only the stop of all streaming service but also the loss of the current playing positions. In this paper, when a back-end node becomes a failed state, the recovery mechanisms are studied to support the unceasing streaming service. For the actual VOD service environment, we implement a cluster-based VOD servers composed of general PCs and adopt the parallel processing for MPEG movies. From the implemented VOD server, a video block recovery mechanism is designed on parity algorithms. However, without considering the architecture of cluster-based VOD server, the application of the basic technique causes the performance bottleneck of the internal network for recovery and also results in the inefficiency CPU usage of back-end nodes. To address these problems, we propose a new failure recovery mechanism based on the pipeline computing concept.

Memory Reduction Method of Radix-22 MDF IFFT for OFDM Communication Systems (OFDM 통신시스템을 위한 radix-22 MDF IFFT의 메모리 감소 기법)

  • Cho, Kyung-Ju
    • The Journal of Korea Institute of Information, Electronics, and Communication Technology
    • /
    • v.13 no.1
    • /
    • pp.42-47
    • /
    • 2020
  • In OFDM-based very high-speed communication systems, FFT/IFFT processor should have several properties of low-area and low-power consumption as well as high throughput and low processing latency. Thus, radix-2k MDF (multipath delay feedback) architectures by adopting pipeline and parallel processing are suitable. In MDF architecture, the feedback memory which increases in proportion to the input signal word-length has a large area and power consumption. This paper presents a feedback memory size reduction method of radix-22 MDF IFFT processor for OFDM applications. The proposed method focuses on reducing the feedback memory size in the first two stages of MDF architectures since the first two stages occupy about 75% of the total feedback memory. In OFDM transmissions, IFFT input signals are composed of modulated data and pilot, null signals. In order to reduce the IFFT input word-length, the integer mapping which generates mapped data composed of two signed integer corresponding to modulated data and pilot/null signals is proposed. By simulation, it is shown that the proposed method has achieved a feedback memory reduction up to 39% compared to conventional approach.

A Design of 4×4 Block Parallel Interpolation Motion Compensation Architecture for 4K UHD H.264/AVC Decoder (4K UHD급 H.264/AVC 복호화기를 위한 4×4 블록 병렬 보간 움직임보상기 아키텍처 설계)

  • Lee, Kyung-Ho;Kong, Jin-Hyeung
    • Journal of the Institute of Electronics and Information Engineers
    • /
    • v.50 no.5
    • /
    • pp.102-111
    • /
    • 2013
  • In this paper, we proposed a $4{\times}4$ block parallel architecture of interpolation for high-performance H.264/AVC Motion Compensation in 4K UHD($3840{\times}2160$) video real time processing. To improve throughput, we design $4{\times}4$ block parallel interpolation. For supplying the $9{\times}9$ reference data for interpolation, we design 2D cache buffer which consists of the $9{\times}9$ memory arrays. We minimize redundant storage of the reference pixel by applying the Search Area Stripe Reuse scheme(SASR), and implement high-speed plane interpolator with 3-stage pipeline(Horizontal Vertical 1/2 interpolation, Diagonal 1/2 interpolation, 1/4 interpolation). The proposed architecture was simulated in 0.13um standard cell library. The maximum operation frequency is 150MHz. The gate count is 161Kgates. The proposed H.264/AVC Motion Compensation can support 4K UHD at 72 frames per second by running at 150MHz.

Parallel Structure Design Method for Mass Spring Simulation (질량스프링 시뮬레이션을 위한 병렬 구조 설계 방법)

  • Sung, Nak-Jun;Choi, Yoo-Joo;Hong, Min
    • Journal of the Korea Computer Graphics Society
    • /
    • v.25 no.3
    • /
    • pp.55-63
    • /
    • 2019
  • Recently, the GPU computing method has been utilized to improve the performance of the physics simulation field. In particular, in the case of a deformed object simulation requiring a large amount of computation, a GPU-based parallel processing algorithm is required to guarantee real-time performance. We have studied the parallel structure design method to improve the performance of the mass spring simulation method which is one of the methods of implementing the deformation object simulation. We used OpenGL's GLSL, a graphics library that allows direct access to the GPU, and implemented the GPGPU environment using an independent pipeline, the compute shader. In order to verify the effectiveness of the parallel structure design method, the mass - spring system was implemented based on CPU and GPU. Experimental results show that the proposed method improves computation speed by about 6,000% compared to the CPU Environment. It is expected that the lightweight simulation technology can be effectively applied to the augmented reality and the virtual reality field by using the design method proposed later in this research.

Recognition of the 3-D motion of a human arm with HIGIPS

  • Yao, Feng-Hui;Tamaki, Akikazu;Kato, Kiyoshi
    • 제어로봇시스템학회:학술대회논문집
    • /
    • 1991.10b
    • /
    • pp.1724-1729
    • /
    • 1991
  • This paper gives an overview of HIGIPS design concepts and prototype HIGIPS configuration, and discusses its application to recognition of the 3-D motion of a human arm. HIGIPS which employs the combination of pipeline architecture and multiprocessor architecture, is a high-speed, high-performance and low cost N * M multimicroprocessor parallel machine, where N is the number of pipeline stages and M is the number of processors in each stage. The algorithm to recognize the motion of a human arm with a single TV camera was developed on personal computer (NEC PC9801 series). As a constraint condition, some simple ring marks are used. Each joint of the arm is attached with a ring mark to obtain its centroid position when the arm moves. These centroid positions in the three-dimensional space are linked at each of the successive pictures of the moving arm to recover its overall motion. This algorithm takes about 2 seconds to process one image frame on the general-purpose personal computer. This paper mainly discuses how to partition this algorithm and execute on HIGIPS, and shows the speed up. From this application, it is clear that HIGIPS is an efficient machine for image processing and recognizing.

  • PDF

A Study on Hardware Implementation of 128-bit LEA Encryption Block (128비트 LEA 암호화 블록 하드웨어 구현 연구)

  • Yoon, Gi Ha;Park, Seong Mo
    • Smart Media Journal
    • /
    • v.4 no.4
    • /
    • pp.39-46
    • /
    • 2015
  • This paper describes hardware implementation of the encryption block of the '128 bit block cipher LEA' among various lightweight encryption algorithms for IoT (Internet of Things) security. Round function blocks and key-schedule blocks are designed by parallel circuits for high throughput. The encryption blocks support secret-key of 128 bits, and are designed by FSM method and 24/n stage(n=1, 2, 3, 4, 8, 12) pipeline methods. The LEA-128 encryption blocks are modeled using Verilog-HDL and implemented on FPGA, and according to the synthesis results, minimum area and maximum throughput are provided.

Comparison of Parallelized Network Coding Performance (네트워크 코딩의 병렬처리 성능비교)

  • Choi, Seong-Min;Park, Joon-Sang;Ahn, Sang-Hyun
    • The KIPS Transactions:PartC
    • /
    • v.19C no.4
    • /
    • pp.247-252
    • /
    • 2012
  • Network coding has been shown to improve various performance metrics in network systems. However, if network coding is implemented as software a huge time delay may be incurred at encoding/decoding stage so it is imperative for network coding to be parallelized to reduce time delay when encoding/decoding. In this paper, we compare the performance of parallelized decoders for random linear network coding (RLC) and pipeline network coding (PNC), a recent development in order to alleviate problems of RLC. We also compare multi-threaded algorithms on multi-core CPUs and massively parallelized algorithms on GPGPU for PNC/RLC.