• Title/Summary/Keyword: 파이프라인 아키텍처

Search Result 21, Processing Time 0.023 seconds

CNN Accelerator Architecture using 3D-stacked RRAM Array (3차원 적층 구조 저항변화 메모리 어레이를 활용한 CNN 가속기 아키텍처)

  • Won Joo Lee;Yoon Kim;Minsuk Koo
    • Journal of IKEEE
    • /
    • v.28 no.2
    • /
    • pp.234-238
    • /
    • 2024
  • This paper presents a study on the integration of 3D-stacked dual-tip RRAM with a CNN accelerator architecture, leveraging its low drive current characteristics and scalability in a 3D stacked configuration. The dual-tip structure is utilized in a parallel connection format in a synaptic array to implement multi-level capabilities. It is configured within a Network-on-chip style accelerator along with various hardware blocks such as DAC, ADC, buffers, registers, and shift & add circuits, and simulations were performed for the CNN accelerator. The quantization of synaptic weights and activation functions was assumed to be 16-bit. Simulation results of CNN operations through a parallel pipeline for this accelerator architecture achieved an operational efficiency of approximately 370 GOPs/W, with accuracy degradation due to quantization kept within 3%.

An Implementation of Efficient Quicksort Utilizing SIMD-Based VBP Technique (SIMD 기반의 VBP 기법을 적용한 효율적인 퀵정렬의 구현)

  • Hong, Gilseok;Kim, Hongyeon;Kang, Seonghyeon;Min, Jun-Ki
    • KIISE Transactions on Computing Practices
    • /
    • v.23 no.8
    • /
    • pp.498-503
    • /
    • 2017
  • SIMD (Single Instruction Multiple Data) is a representative parallelization architecture that processes multiple data loaded in a SIMD register with a single instruction. Quicksort is a sorting algorithm that picks an element as a pivot from the array and reorders the array such that all elements having the values less than the pivot value are located in the left side on the pivot as well as all elements having the value greater than the pivot value are located in the right side on the pivot and then the algorithm performs the same task on both sublist recursively. In this paper, we propose an efficient Quicksort algorithm applying the SIMD instructions which minimally invokes conditional branches to avoid the performance degradation incurred by branch misprediction in a pipeline architecture. In addition, we improve the performance of the Quicksort algorithm by fetching data into a SIMD register as a byte unit to apply VBP (Vertical Bit Parallel) and the early pruning technique.

Design of an Automatic Generation System for Cycle-accurate Instruction-set Simulators for DSP Processors (DSP 프로세서용 인스트럭션 셋 시뮬레이터 자동생성기의 설계에 관한 연구)

  • Hong, Sung-Min;Park, Chang-Soo;Hwang, Sun-Young
    • The Journal of Korean Institute of Communications and Information Sciences
    • /
    • v.32 no.9A
    • /
    • pp.931-939
    • /
    • 2007
  • This paper describes the system which automatically generates instruction-set simulators cores using the SMDL. SMDL describes structure and instruction-set information of a target DSP machine. Analyzing behavioral information of each pipeline stage of all instructions on a target ASIPS, the proposed system automatically generates a cycle-accurate instruction set simulator in C++ for a target processor. The proposed system has been tested by generating instruction-set simulators for ARM9E-S, ADSP-TS20x, and TMS320C2x architectures. Experiments were performed by checking the functions of the $4{\times}4$ matrix multiplication, 16-bit IIR filter, 32-bit multiplication, and the FFT using the generated simulators. Experimental results show the functional accuracy of the generated simulators.

Design of A Deblocking Filter Based on Macroblock Overlap Scheme for H.264/AVC (H.264/AVC용 매크로블록 겹침 기법에 기반한 디블록킹 필터의 설계)

  • Kim, Won-Sam;Sonh, Seung-Il
    • Journal of the Korea Institute of Information and Communication Engineering
    • /
    • v.12 no.4
    • /
    • pp.699-706
    • /
    • 2008
  • H.264/AVC is a new international standard for the compression of video images, in which a deblocking filter has been adopted to remoye blocking artifacts. This paper proposes an efficient architecture of deblocking filter in H.264/AVC. By making good use of data dependence between neighboring $4{\times}4$ blocks, the memory sire is reduced and the throughput of the deblocking filter processing is increased. The designed deblocking filter further enhances the parallelism by simultaneously executing horizontal and vertical filtering within a macroblock in pipeline method and adopting overlap between macroblocks. The implementation result shows that the proposed architecture enhances the performance of deblocking filter processing from 1.75 to 4.23 times than that of the conventional deblocking filter. Hence the Proposed architecture of deblocking filter is able to perform real-time deblocking in high-resolution($2048{\times}1024$) video applications.

A Pipelined Parallel Optimized Design for Convolution-based Non-Cascaded Architecture of JPEG2000 DWT (JPEG2000 이산웨이블릿변환의 컨볼루션기반 non-cascaded 아키텍처를 위한 pipelined parallel 최적화 설계)

  • Lee, Seung-Kwon;Kong, Jin-Hyeung
    • Journal of the Institute of Electronics Engineers of Korea SD
    • /
    • v.46 no.7
    • /
    • pp.29-38
    • /
    • 2009
  • In this paper, a high performance pipelined computing design of parallel multiplier-temporal buffer-parallel accumulator is present for the convolution-based non-cascaded architecture aiming at the real time Discrete Wavelet Transform(DWT) processing. The convolved multiplication of DWT would be reduced upto 1/4 by utilizing the filter coefficients symmetry and the up/down sampling; and it could be dealt with 3-5 times faster computation by LUT-based DA multiplication of multiple filter coefficients parallelized for product terms with an image data. Further, the reutilization of computed product terms could be achieved by storing in the temporal buffer, which yields the saving of computation as well as dynamic power by 50%. The convolved product terms of image data and filter coefficients are realigned and stored in the temporal buffer for the accumulated addition. Then, the buffer management of parallel aligned storage is carried out for the high speed sequential retrieval of parallel accumulations. The convolved computation is pipelined with parallel multiplier-temporal buffer-parallel accumulation in which the parallelization of temporal buffer and accumulator is optimize, with respect to the performance of parallel DA multiplier, to improve the pipelining performance. The proposed architecture is back-end designed with 0.18um library, which verifies the 30fps throughput of SVGA(800$\times$600) images at 90MHz.

Design and Implementation of a Six-Stage Pipeline RV32I Processor Based on RISC-V Architecture (RISC-V 아키텍처 기반 6단계 파이프라인 RV32I프로세서의 설계 및 구현)

  • Kyoungjin Min;Seojin Choi;Yubeen Hwang;Sunhee Kim
    • Journal of the Semiconductor & Display Technology
    • /
    • v.23 no.2
    • /
    • pp.76-81
    • /
    • 2024
  • UC Berkeley developed RISC-V, which is an open-source Instruction Set Architecture. This paper proposes a 32-bit 6-stage pipeline architecture based on the RV32I RSIC-V. The performance of the proposed 6-stage pipeline architecture is compared with the existing 32-bit 5-stage pipeline architecture also based on the RV32I processor ISA to determine the impact of the number of pipeline stages on performance. The RISC-V processor is designed in Verilog-HDL and implemented using Quartus Prime 20.1. To compare performance the Dhrystone benchmark is used. Subsequently, peripherals such as GPIO, TIMER, and UART are connected to verify operation through an FPGA. The maximum clock frequency for the 5-stage pipeline processor is 42.02 MHz, while for the 6-stage pipeline processor, it was 49.9MHz, representing an 18.75% increase.

  • PDF

High Throughput Parallel Design of 2-D $8{\times}8$ Integer Transforms for H.264/AVC (H.264/AVC 를 위한 높은 처리량의 2-D $8{\times}8$ integer transforms 병렬 구조 설계)

  • Sharma, Meeturani;Tiwari, Honey;Cho, Yong-Beom
    • Journal of the Institute of Electronics Engineers of Korea SD
    • /
    • v.49 no.8
    • /
    • pp.27-34
    • /
    • 2012
  • In this paper, the implementation of high throughput two-dimensional (2-D) $8{\times}8$ forward and inverse integer DCT transform for H.264 is presented. The forward and inverse transforms are represented using simple shift and addition operations. Matrix decomposition and matrix operation such as the Kronecker product and direct sum are used to reduce the computation complexity. The proposed design uses integer computations and does not use transpose memory and hence, the resource consumption is also reduced. The maximum operating frequency of the proposed pipelined architecture is 1.184 GHz, which achieves 25.27 Gpixels/sec throughput rate with the hardware cost of 44864 gates. High throughput and low hardware makes the proposed design useful for real time H.264/AVC high definition processing.

A Study on the Instruction Set Architecture of Multimedia Extension Processor (멀티미디어 확장 프로세서의 명령어 집합 구조에 관한 연구)

  • O, Myeong-Hun;Lee, Dong-Ik;Park, Seong-Mo
    • Journal of the Institute of Electronics Engineers of Korea SD
    • /
    • v.38 no.6
    • /
    • pp.420-435
    • /
    • 2001
  • As multimedia technology has rapidly grown recently, many researches to process multimedia data efficiently using general-purpose processors have been studied. In this paper, we proposed multimedia instructions which can process multimedia data effectively, and suggested a processor architecture for those instructions. The processor was described with Verilog-HDL in the behavioral level and simulated with CADENCE$^{TM}$ tool. Proposed multimedia instructions are total 48 instructions which can be classified into 7 groups. Multimedia data have 64-bit format and are processed as parallel subwords of 8-bit 8 bytes, 16-bit 4 half words or 32-bit 2 words. Modeled processor is developed based on the Integer Unit of SPARC V.9. It has five-stage pipeline RISC architecture with Harvard principle.e.

  • PDF

Parallel Architecture Design of H.264/AVC CAVLC for UD Video Realtime Processing (UD(Ultra Definition) 동영상 실시간 처리를 위한 H.264/AVC CAVLC 병렬 아키텍처 설계)

  • Ko, Byung Soo;Kong, Jin-Hyeung
    • Journal of the Institute of Electronics and Information Engineers
    • /
    • v.50 no.5
    • /
    • pp.112-120
    • /
    • 2013
  • In this paper, we propose high-performance H.264/AVC CAVLC encoder for UD video real time processing. Statistical values are obtained in one cycle through the parallel arithmetic and logical operations, using non-zero bit stream which represents zero coefficient or non-zero coefficient. To encode codeword per one cycle, we remove recursive operation in level encoding through parallel comparison for coefficient and escape value. In oder to implement high-speed circuit, proposed CAVLC encoder is designed in two-stage {statical scan, codeword encoding} pipeline. Reducing the encoding table, the arithmetic unit is used to encode non-coefficient and to calculate the codeword. The proposed architecture was simulated in 0.13um standard cell library. The gate count is 33.4Kgates. The architecture can support Ultra Definition Video ($3840{\times}2160$) at 100 frames per second by running at 100MHz.

A Design of 4×4 Block Parallel Interpolation Motion Compensation Architecture for 4K UHD H.264/AVC Decoder (4K UHD급 H.264/AVC 복호화기를 위한 4×4 블록 병렬 보간 움직임보상기 아키텍처 설계)

  • Lee, Kyung-Ho;Kong, Jin-Hyeung
    • Journal of the Institute of Electronics and Information Engineers
    • /
    • v.50 no.5
    • /
    • pp.102-111
    • /
    • 2013
  • In this paper, we proposed a $4{\times}4$ block parallel architecture of interpolation for high-performance H.264/AVC Motion Compensation in 4K UHD($3840{\times}2160$) video real time processing. To improve throughput, we design $4{\times}4$ block parallel interpolation. For supplying the $9{\times}9$ reference data for interpolation, we design 2D cache buffer which consists of the $9{\times}9$ memory arrays. We minimize redundant storage of the reference pixel by applying the Search Area Stripe Reuse scheme(SASR), and implement high-speed plane interpolator with 3-stage pipeline(Horizontal Vertical 1/2 interpolation, Diagonal 1/2 interpolation, 1/4 interpolation). The proposed architecture was simulated in 0.13um standard cell library. The maximum operation frequency is 150MHz. The gate count is 161Kgates. The proposed H.264/AVC Motion Compensation can support 4K UHD at 72 frames per second by running at 150MHz.