• Title/Summary/Keyword: Pipelined Design

Search Result 195, Processing Time 0.029 seconds

A Scheduling Algorithm for the Synthesis of a Pipelined Datapath using Collision Count (충돌수를 이용한 파이프라인 데이타패스 합성 스케쥴링 알고리즘)

  • Yu, Dong-Jin;Yoo, Hee-Jin;Park, Do-Soon
    • The Transactions of the Korea Information Processing Society
    • /
    • v.5 no.11
    • /
    • pp.2973-2979
    • /
    • 1998
  • As this paper is a scheduling algorithm for the synthesis of a pipelined datapath under resource constraints in high level synthesis, the proposed heuristic algorithm uses a priority function based on the collision count of resourecs. In order to schedule the pipelined datapath under resource constraints, we define the collision count and the priority function based on the collision count, a number of resource, and the mobility of operations to resolve a resource collision. The proposed algorithm supports chaining, multicycling, and structural pipelining to design the realistic hardware. The evaluation of the Performance is compared with other systems using the results of the synthesis for a 16point FIR filter and a 5th order elliptic wave filter, where in most cases, the optimal solution is obtained.

  • PDF

A Design and Implementation of 32-bit RISC-V RV32IM Pipelined Processor in Embedded Systems (임베디드 환경에서의 32-bit RISC-V RV32IM 파이프라인 프로세서 설계 및 구현)

  • Subin Park;Yongwoo Kim
    • Journal of the Semiconductor & Display Technology
    • /
    • v.22 no.4
    • /
    • pp.81-86
    • /
    • 2023
  • Recently, demand for embedded systems requiring low power and high specifications has been increasing, and RISC-V processors are being widely applied. RISC-V, a RISC-based open instruction set architecture (ISA), has been developed and researched by UC Berkeley and other researchers since 2010. RV32I ISA is sufficient to support integer operations such as addition and subtraction instructions, but M-extension should be defined for multiplication and division instructions. This paper proposes an RV32I, RV32IM processor, and indicates benchmark performance scores compared to an existing processor. Additionally, A non-stalling method was proposed to support a 2-stage pipelined DSP multiplier to the 5-stage pipelined RV32IM processor. Proposed RV32I and RV32IM processors satisfied a maximum operating frequency of 50 MHz on Artix-7 FPGA. The performance of the proposed processors was verified using benchmark programs from Dhrystone and Coremark. As a result, the Coremark benchmark results of the proposed processor showed that it outperformed the existing RV32IM processor by 23.91%.

  • PDF

A Pipelined Parallel Optimized Design for Convolution-based Non-Cascaded Architecture of JPEG2000 DWT (JPEG2000 이산웨이블릿변환의 컨볼루션기반 non-cascaded 아키텍처를 위한 pipelined parallel 최적화 설계)

  • Lee, Seung-Kwon;Kong, Jin-Hyeung
    • Journal of the Institute of Electronics Engineers of Korea SD
    • /
    • v.46 no.7
    • /
    • pp.29-38
    • /
    • 2009
  • In this paper, a high performance pipelined computing design of parallel multiplier-temporal buffer-parallel accumulator is present for the convolution-based non-cascaded architecture aiming at the real time Discrete Wavelet Transform(DWT) processing. The convolved multiplication of DWT would be reduced upto 1/4 by utilizing the filter coefficients symmetry and the up/down sampling; and it could be dealt with 3-5 times faster computation by LUT-based DA multiplication of multiple filter coefficients parallelized for product terms with an image data. Further, the reutilization of computed product terms could be achieved by storing in the temporal buffer, which yields the saving of computation as well as dynamic power by 50%. The convolved product terms of image data and filter coefficients are realigned and stored in the temporal buffer for the accumulated addition. Then, the buffer management of parallel aligned storage is carried out for the high speed sequential retrieval of parallel accumulations. The convolved computation is pipelined with parallel multiplier-temporal buffer-parallel accumulation in which the parallelization of temporal buffer and accumulator is optimize, with respect to the performance of parallel DA multiplier, to improve the pipelining performance. The proposed architecture is back-end designed with 0.18um library, which verifies the 30fps throughput of SVGA(800$\times$600) images at 90MHz.

Design and Simulation for Out-of-Order Execution Processor of a Fully Pipelined Scheme (완전한 파이프라인 방식의 비순차실행 프로세서의 설계 및 모의실행)

  • Lee, Jongbok
    • The Journal of the Institute of Internet, Broadcasting and Communication
    • /
    • v.20 no.5
    • /
    • pp.143-149
    • /
    • 2020
  • Currently, a multi-core processor is mainly used as a central processing unit of a computer system, and a high-performance out-of-order processor is adopted as each core to maximize system performance. The early out-of-order execution processor with Tomasulo algorithm aimed at floating-point instructions, and it took several cycles to execute by the use of complex structures such as reorder buffer and reservation station. However, in order for the processor to properly utilize out-of-order execution and increase the throughput of instructions, it must operate in a fully pipelined manner. In this paper, a fully pipelined out-of-order processor with speculative execution is designed with VHDL and verified with GHDL. As a result of the simulation, a program composed of ARM instructions is successfully performed.

Design of Pipelined Parallel CRC Circuits (파이프라인 구조를 적용한 병렬 CRC 회로 설계)

  • Yi, Hyun-Bean;Kim, Ki-Tae;Kwon, Young-Min;Park, Sung-Ju
    • Journal of the Institute of Electronics Engineers of Korea SC
    • /
    • v.43 no.6 s.312
    • /
    • pp.40-47
    • /
    • 2006
  • This paper introduces an efficient CRC logic partitioning algorithm to design pipelined parallel CRC circuits aimed at improving speed performance. Focusing on the cases that the input data width is greater than the polynomial degree, equations are derived to divide the parallel CRC logic and decide the length of the pipeline stage. Through design experiments on different types of parallel CRC circuits, we have found a significant reduction in delay by adopting our approach.

A Design of Full Flash 8-Bit CMOS A/D Converter (Full Flash 8-Bit CMOS A/D 변환기 설계)

  • Choi, Young-Gyu;Yi, Cheon-Hee
    • Journal of the Korean Institute of Telematics and Electronics
    • /
    • v.27 no.11
    • /
    • pp.126-134
    • /
    • 1990
  • In order to implement high-speed data acquistion system in CMOS VLSI technology, means must be found to overcome the relatively low transconductance and large device mismatch characteristic of MOS device. Because of these device limitations, circuit design approaches tradition-ally used in high-speed bipolar analog-to-digital converter(ADC) are suited to CMOS implementation. Also the design of VLSI CMOS comparator wherein voltage comparision is accomplished by means of a pipelined cascade RSA (Regenerative Sense Amplifier). So, in this paper we designed the A/D converter incorporates the pipelined CMOS comparator.

  • PDF

A Design of Pipelined-parallel CABAC Decoder Adaptive to HEVC Syntax Elements (HEVC 구문요소에 적응적인 파이프라인-병렬 CABAC 복호화기 설계)

  • Bae, Bong-Hee;Kong, Jin-Hyeung
    • Journal of the Institute of Electronics and Information Engineers
    • /
    • v.52 no.5
    • /
    • pp.155-164
    • /
    • 2015
  • This paper describes a design and implementation of CABAC decoder, which would handle HEVC syntax elements in adaptively pipelined-parallel computation manner. Even though CABAC offers the high compression rate, it is limited in decoding performance due to context-based sequential computation, and strong data dependency between context models, as well as decoding procedure bin by bin. In order to enhance the decoding computation of HEVC CABAC, the flag-type syntax elements are adaptively pipelined by precomputing consecutive flag-type ones; and multi-bin syntax elements are decoded by processing bins in parallel up to three. Further, in order to accelerate Binary Arithmetic Decoder by reducing the critical path delay, the update and renormalization of context modeling are precomputed parallel for the cases of LPS as well as MPS, and then the context modeling renewal is selected by the precedent decoding result. It is simulated that the new HEVC CABAC architecture could achieve the max. performance of 1.01 bins/cycle, which is two times faster with respect to the conventional approach. In ASIC design with 65nm library, the CABAC architecture would handle 224 Mbins/sec, which could decode QFHD HEVC video data in real time.

Design and Implementation of An I/O System for Irregular Application under Parallel System Environments (병렬 시스템 환경하에서 비정형 응용 프로그램을 위한 입출력 시스템의 설계 및 구현)

  • No, Jae-Chun;Park, Seong-Sun;;Gwon, O-Yeong
    • Journal of KIISE:Computer Systems and Theory
    • /
    • v.26 no.11
    • /
    • pp.1318-1332
    • /
    • 1999
  • 본 논문에서는 입출력 응용을 위해 collective I/O 기법을 기반으로 한 실행시간 시스템의 설계, 구현 그리고 그 성능평가를 기술한다. 여기서는 모든 프로세서가 동시에 I/O 요구에 따라 스케쥴링하며 I/O를 수행하는 collective I/O 방안과 프로세서들이 여러 그룹으로 묶이어, 다음 그룹이 데이터를 재배열하는 통신을 수행하는 동안 오직 한 그룹만이 동시에 I/O를 수행하는 pipelined collective I/O 등의 두 가지 설계방안을 살펴본다. Pipelined collective I/O의 전체 과정은 I/O 노드 충돌을 동적으로 줄이기 위해 파이프라인된다. 이상의 설계 부분에서는 동적으로 충돌 관리를 위한 지원을 제공한다. 본 논문에서는 다른 노드의 메모리 영역에 이미 존재하는 데이터를 재 사용하여 I/O 비용을 줄이기 위해 collective I/O 방안에서의 소프트웨어 캐슁 방안과 두 가지 모형에서의 chunking과 온라인 압축방안을 기술한다. 그리고 이상에서 기술한 방안들이 입출력을 위해 높은 성능을 보임을 기술하는데, 이 성능결과는 Intel Paragon과 ASCI/Red teraflops 기계 상에서 실험한 것이다. 그 결과 응용 레벨에서의 bandwidth는 peak point가 55%까지 측정되었다.Abstract In this paper we present the design, implementation and evaluation of a runtime system based on collective I/O techniques for irregular applications. We present two designs, namely, "Collective I/O" and "Pipelined Collective I/O". In the first scheme, all processors participate in the I/O simultaneously, making scheduling of I/O requests simpler but creating a possibility of contention at the I/O nodes. In the second approach, processors are grouped into several groups, so that only one group performs I/O simultaneously, while the next group performs communication to rearrange data, and this entire process is pipelined to reduce I/O node contention dynamically. In other words, the design provides support for dynamic contention management. Then we present a software caching method using collective I/O to reduce I/O cost by reusing data already present in the memory of other nodes. Finally, chunking and on-line compression mechanisms are included in both models. We demonstrate that we can obtain significantly high-performance for I/O above what has been possible so far. The performance results are presented on an Intel Paragon and on the ASCI/Red teraflops machine. Application level I/O bandwidth up to 55% of the peak is observed.he peak is observed.

A Design of Pipeline Chain Algorithm Based on Circuit Switching for MPI Broadcast Communication System (MPI 브로드캐스트 통신을 위한 서킷 스위칭 기반의 파이프라인 체인 알고리즘 설계)

  • Yun, Heejun;Chung, Wonyoung;Lee, Yong-Surk
    • The Journal of Korean Institute of Communications and Information Sciences
    • /
    • v.37B no.9
    • /
    • pp.795-805
    • /
    • 2012
  • This paper proposes an algorithm and a hardware architecture for a broadcast communication which has the worst bottleneck among multiprocessor using distributed memory architectures. In conventional system, The pipelined broadcast algorithm is an algorithm which takes advantage of maximum bandwidth of communication bus. But unnecessary synchronization process are repeated, because the pipelined broadcast sends the data divided into many parts. In this paper, the MPI unit for pipeline chain algorithm based on circuit switching removing the redundancy of synchronization process was designed, the proposed architecture was evaluated by modeling it with systemC. Consequently, the performance of the proposed architecture was highly improved for broadcast communication up to 3.3 times that of systems using conventional pipelined broadcast algorithm, it can almost take advantage of the maximum bandwidth of transmission bus. Then, it was implemented with VerilogHDL, synthesized with TSMC 0.18um library and implemented into a chip. The area of synthesis results occupied 4,700 gates(2 input NAND gate) and utilization of total area is 2.4%. The proposed architecture achieves improvement in total performance of MPSoC occupying relatively small area.

A design of Giga-bit security module using Fully pipe-lined CTR-AES (Full-pipelined CTR-AES를 이용한 Giga-bit 보안모듈 설계)

  • Vinh, T.Q.;Park, Ju-Hyun;Kim, Young-Chul;Kim, Kwang-Ok
    • Journal of the Korea Institute of Information and Communication Engineering
    • /
    • v.12 no.6
    • /
    • pp.1026-1031
    • /
    • 2008
  • Nowdays, homes and small businesses rely more and more PON(Passive Optical Networks) for financial transactions, private communications and even telemedicine. Thus, encryption for these data transactions is very essential due to the multicast nature of the PON In this parer, we presented our implementation of a counter mode AES based on Virtex4 FPGA. Our design exploits three advanced features; 1) Composite field arithmetic SubByte, 2) efficient MixColumn transformation 3) and on-the-fly key-scheduling for fully pipelined architecture. By pipeling the composite field implementation of the S-box, the area cost is reduced to average 17 percent. By designing the on-the-fly key-scheduling, we implemented an efficient key-expander module which is specialized for a pipelined architecture.