• Title/Summary/Keyword: Parallel pipeline

An Implementation of Efficient Quicksort Utilizing SIMD-Based VBP Technique (SIMD 기반의 VBP 기법을 적용한 효율적인 퀵정렬의 구현)

  • Hong, Gilseok;Kim, Hongyeon;Kang, Seonghyeon;Min, Jun-Ki
    • KIISE Transactions on Computing Practices / v.23 no.8 / pp.498-503 / 2017
  • SIMD (Single Instruction Multiple Data) is a representative parallel architecture that processes multiple data items loaded in a SIMD register with a single instruction. Quicksort is a sorting algorithm that picks an element of the array as a pivot and reorders the array so that all elements smaller than the pivot lie to its left and all elements greater than the pivot lie to its right, and then recursively performs the same task on both sublists. In this paper, we propose an efficient Quicksort algorithm that applies SIMD instructions and invokes as few conditional branches as possible, to avoid the performance degradation caused by branch misprediction in a pipelined architecture. In addition, we improve the performance of the Quicksort algorithm by fetching data into a SIMD register in byte units so as to apply VBP (Vertical Bit Parallel) and an early pruning technique.
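
The idea of partitioning without a data-dependent branch can be illustrated with a minimal scalar sketch. This is not the paper's SIMD/VBP implementation: the function names `branchless_partition` and `quicksort` are hypothetical, and plain Python is used only to show how the comparison result can be consumed as data (an index increment) instead of steering an `if`.

```python
def branchless_partition(values, pivot):
    """Split values into (< pivot) and (>= pivot) with no `if` in the loop body."""
    n = len(values)
    left = [0] * n    # scratch buffer for elements smaller than the pivot
    right = [0] * n   # scratch buffer for the remaining elements
    li = ri = 0
    for v in values:
        is_less = int(v < pivot)  # 1 or 0, used as data rather than control flow
        left[li] = v              # write to both buffers unconditionally...
        right[ri] = v
        li += is_less             # ...but advance only the matching cursor
        ri += 1 - is_less
    return left[:li], right[:ri]


def quicksort(values):
    """Recursive quicksort built on the branch-free partition above."""
    if len(values) <= 1:
        return list(values)
    mid = len(values) // 2
    pivot = values[mid]
    rest = list(values[:mid]) + list(values[mid + 1:])
    less, not_less = branchless_partition(rest, pivot)
    return quicksort(less) + [pivot] + quicksort(not_less)
```

For example, `branchless_partition([4, 1, 6, 3], 4)` returns `([1, 3], [4, 6])`. A SIMD version would presumably apply the same compare-then-select pattern to many elements per instruction.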

Design of High Speed Binary Arithmetic Encoder for CABAC Encoder (CABAC 부호화기를 위한 고속 이진 산술 부호화기의 설계)

  • Park, Seungyong;Jo, Hyungu;Ryoo, Kwangki
    • Journal of the Korea Institute of Information and Communication Engineering / v.21 no.4 / pp.774-780 / 2017
  • This paper proposes an efficient binary arithmetic encoder (BAE) hardware architecture for CABAC, the entropy coding method used in the HEVC standard. Entropy coding removes statistical redundancy and supports a high image compression ratio. However, the binary arithmetic encoder introduces delay in real-time processing, and parallel processing is difficult because of the strong dependency between data. The proposed CABAC BAE hardware structure separates out the renormalization and processes the conventional iterative algorithm in parallel. The new scheme is designed as a four-stage pipeline that shortens the critical path. The proposed CABAC BAE hardware architecture was designed in Verilog HDL and implemented in 65nm technology. Its gate count is 8.07K and its maximum operating speed is 769MHz. It processes four bins per clock cycle. The maximum processing speed is 26% higher than that of existing hardware architectures.

A Simple Multi-rate Parallel Interference Canceller for the IMT-2000 3GPP System (IMT-2000 3GPP 시스템을 위한 간단한 다중 전송률 병렬형 간섭제거기)

  • Kim, Jin-Kyeom;Oh, Seong-Keun;Sunwoo, Myung-Hoon
    • Journal of the Institute of Electronics Engineers of Korea TC / v.38 no.12 / pp.10-19 / 2001
  • In this paper, we propose an effective but simple multi-rate parallel interference canceller (PIC) for the international mobile telecommunications-2000 (IMT-2000) 3rd generation partnership project (3GPP) system. For effective multi-rate processing, we define the basic block as one symbol period of the dedicated physical control channel (DPCCH), which has the lowest data rate and is common to all users. Decision and interference cancellation are then performed at every basic block. For an asynchronous channel, we propose an advance removal scheme that removes in advance the multiple access interference (MAI) due to the next block of other users with shorter delay. By introducing a sample-based pipeline structure, we can implement the PIC with the advance removal scheme efficiently, with minimal hardware and no extra computations. Through computer simulations, we analyze the bit error rate (BER) performance of the proposed PIC with respect to the signal-to-noise ratio (SNR) and the number of users.

Design and implementation of an interpolator for high speed UWB system (고속 UWB 시스템을 위한 인터폴레이터의 설계 및 구현)

  • Kim, Sang-Dong;Lee, Jong-Hun;Jung, Woo-Young;Chong, Jong-Wha
    • Journal of the Institute of Electronics Engineers of Korea SP / v.44 no.1 / pp.64-69 / 2007
  • This paper designs an interpolator for a high-speed ultra-wideband (UWB) system. UWB wireless technology will play a key role in short-range wireless connectivity, supporting very high bit rates, low power consumption, and location capabilities. Because UWB requires a high operating speed, a cubic interpolator based on variable parameters for UWB must also operate at high speed. To improve the operating speed, the modified cubic interpolator applies both parallel processing and pipelining to the existing interpolator. Experimental results show that the maximum operating speed and period of the proposed interpolator on a Stratix II EP2S60F1020C3 are 102.42MHz and 9.764ns, respectively. Compared to the conventional interpolator, the designed cubic parameter interpolator is improved by more than about 190%.

Implementation of Channel Coding System using Viterbi Decoder of Pipeline-based Multi-Window (파이프라인 기반 다중윈도방식의 비터비 디코더를 이용한 채널 코딩 시스템의 구현)

  • Seo, Young-Ho;Kim, Dong-Wook
    • Journal of the Korea Institute of Information and Communication Engineering / v.9 no.3 / pp.587-594 / 2005
  • In this paper, we propose a Viterbi decoder with multiple buffering and a parallel decoding scheme that expands the time-divided input signal, map it to an FPGA, and implement a channel coding system together with PC-based software. The continuous input signal is buffered in units of the decoding length and decoded in parallel using high-speed Viterbi decoding cells. The output data rate increases linearly with the number of cells forming the Viterbi decoder, and flexible operation is achieved by programming the controller and modifying the input buffer. The Viterbi decoding cell consists of an HD block for calculating the Hamming distance, a CM block for calculating the value of each state, a TB block for the trace-back operation, and a LIFO. The implemented Viterbi decoding cell used 351 LABs (Logic Array Blocks) and operated stably at up to 139MHz in an ALTERA APEX20KC EP20K600CB652-7 FPGA. The whole Viterbi decoder, including the Viterbi decoding cells, input/output buffers, and a controller, occupied 23% of the hardware resources and has an output data rate of 1Gbps.
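
The roles of the HD, CM, TB, and LIFO blocks named above can be sketched in software as a hard-decision Viterbi decoder. The abstract does not state the code parameters, so the rate-1/2, constraint-length-3 code with generators (7, 5) used here is an assumption chosen for illustration, and the function names are hypothetical.

```python
def encode(bits):
    """Rate-1/2 convolutional encoder, generators 7 and 5 (octal); assumed code."""
    state, out = 0, []
    for u in bits:
        s1, s2 = state >> 1, state & 1          # two previous input bits
        out.append((u ^ s1 ^ s2, u ^ s2))       # outputs of generators 111 and 101
        state = (u << 1) | s1
    return out


def hamming(a, b):
    """HD step: number of differing bits between two symbols."""
    return sum(x != y for x, y in zip(a, b))


def viterbi_decode(symbols):
    inf = float("inf")
    metrics = [0, inf, inf, inf]                # encoder starts in state 0
    history = []                                # survivor choices per time step
    for rx in symbols:
        new_metrics = [inf] * 4
        choice = [None] * 4
        for state in range(4):
            if metrics[state] == inf:
                continue
            s1, s2 = state >> 1, state & 1
            for u in (0, 1):
                expected = (u ^ s1 ^ s2, u ^ s2)
                nxt = (u << 1) | s1
                m = metrics[state] + hamming(rx, expected)
                if m < new_metrics[nxt]:        # CM step: add, compare, select
                    new_metrics[nxt] = m
                    choice[nxt] = (state, u)
        metrics = new_metrics
        history.append(choice)
    # TB step: walk the survivors backwards from the best final state.
    state = min(range(4), key=lambda s: metrics[s])
    decoded = []
    for choice in reversed(history):
        state, u = choice[state]
        decoded.append(u)                       # bits come out last-in, first-out
    decoded.reverse()
    return decoded
```

Decoding the unmodified output of `encode` returns the original bit list. The hardware design in the abstract additionally runs several such cells in parallel on buffered windows of the input, which this single-cell sketch does not model.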

Frequency Domain Processor for ADSL G.LITE Modem (ADSL G.LITE모뎀을 위한 주파수 영역 프로세서의 설계)

  • 고우석;기준석;고태호;윤대희
    • The Journal of Korean Institute of Communications and Information Sciences / v.26 no.12C / pp.233-239 / 2001
  • Among the frequency-domain operations that an ADSL G.LITE modem performs, FFT and FEQ are the most computation-intensive, and much research has focused on implementing them efficiently. Previous papers suggested hardware suited to the ADSL G.DMT system, which is not a good fit for the simpler G.LITE system. This paper analyzes the frequency-domain operations and the computational efficiency under different allocations of hardware resources. The suggested processor has one real multiplier and two real adders connected in parallel, and performs the operations efficiently through pipeline- and/or parallel-type job scheduling. The suggested processor uses fewer hardware resources than Kiss's ALU structure or the FFT/IFFT processor suggested by Wang, so it is more suitable for the G.LITE system than previous works.

A Study on the Design of FFT Processor for UWB Ultrafast Wireless Communication Systems (UWB 초고속 무선통신 시스템을 위한 FFT 프로세서 설계에 관한 연구)

  • Lee, Sang-Il;Chun, Young-Il
    • Journal of the Korea Institute of Information and Communication Engineering / v.12 no.12 / pp.2140-2145 / 2008
  • We design and synthesize a 128-point FFT processor for multi-band OFDM, which can be applied to a UWB transceiver. The 128-point FFT processor is based on a radix-2 FFT algorithm and an R2SDF pipeline architecture. The algorithm is modeled in VHDL and simulated using ModelSim. The design is then synthesized on a Xilinx Virtex-II FPGA, where an operating frequency of 18.7MHz is obtained. The proposed 128-point FFT processor is expected to be usable within a complete FFT block as one of several FFTs processed in parallel. To raise the maximum operating frequency, we design an FFT module consisting of four 128-point FFT processors operating in parallel. As a result, we meet the performance requirement of computing the FFT module within the multi-band OFDM symbol timing in a 90nm ASIC process.
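
The radix-2 decomposition that the processor builds on can be shown with a short recursive sketch. This is a software illustration of the butterfly arithmetic only; it does not model the R2SDF pipeline, its delay feedback buffers, or the VHDL design, and the function name `fft_radix2` is hypothetical.

```python
import cmath


def fft_radix2(x):
    """Recursive radix-2 decimation-in-time FFT; len(x) must be a power of two."""
    n = len(x)
    if n == 0 or n & (n - 1):
        raise ValueError("length must be a non-zero power of two")
    if n == 1:
        return [complex(x[0])]
    even = fft_radix2(x[0::2])   # FFT of even-indexed samples
    odd = fft_radix2(x[1::2])    # FFT of odd-indexed samples
    out = [0j] * n
    for k in range(n // 2):
        # Radix-2 butterfly: one twiddle multiply, one add, one subtract.
        t = cmath.exp(-2j * cmath.pi * k / n) * odd[k]
        out[k] = even[k] + t
        out[k + n // 2] = even[k] - t
    return out
```

A 128-point transform is seven such radix-2 levels (2^7 = 128), so a pipelined hardware realization of this algorithm would naturally have seven butterfly stages.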

CNN Accelerator Architecture using 3D-stacked RRAM Array (3차원 적층 구조 저항변화 메모리 어레이를 활용한 CNN 가속기 아키텍처)

  • Won Joo Lee;Yoon Kim;Minsuk Koo
    • Journal of IKEEE / v.28 no.2 / pp.234-238 / 2024
  • This paper presents a study on the integration of 3D-stacked dual-tip RRAM with a CNN accelerator architecture, leveraging its low drive current characteristics and scalability in a 3D stacked configuration. The dual-tip structure is utilized in a parallel connection format in a synaptic array to implement multi-level capabilities. It is configured within a Network-on-chip style accelerator along with various hardware blocks such as DAC, ADC, buffers, registers, and shift & add circuits, and simulations were performed for the CNN accelerator. The quantization of synaptic weights and activation functions was assumed to be 16-bit. Simulation results of CNN operations through a parallel pipeline for this accelerator architecture achieved an operational efficiency of approximately 370 GOPs/W, with accuracy degradation due to quantization kept within 3%.

Multi-Threaded Parallel H.264/AVC Decoder for Multi-Core Systems (멀티코어 시스템을 위한 멀티스레드 H.264/AVC 병렬 디코더)

  • Kim, Won-Jin;Cho, Keol;Chung, Ki-Seok
    • Journal of the Institute of Electronics Engineers of Korea SD / v.47 no.11 / pp.43-53 / 2010
  • The wide deployment of high-resolution video services has led to active study of high-speed video processing. In particular, the prevalence of multi-core systems has accelerated research on high-resolution video processing based on the parallelization of multimedia software. In this paper, we propose a novel parallel H.264/AVC decoding scheme for a multi-core platform. Parallel H.264/AVC decoding is challenging not only because parallelization may incur significant synchronization overhead but also because the software may have complicated dependencies. To overcome these issues, we propose a novel approach called Multi-Threaded Parallelization (MTP). In MTP, a separate thread is allocated to each stage in the pipeline to reduce synchronization overhead. In addition, an efficient memory reuse technique is used to reduce the memory requirement. To verify the effectiveness of the proposed approach, we parallelized the FFmpeg H.264/AVC decoder with the proposed technique using OpenMP, and carried out experiments on an Intel quad-core platform. The proposed design outperforms the FFmpeg H.264/AVC decoder before parallelization by 53%. We also reduced memory usage by 65% and 81% for high-definition (HD) and full high-definition (FHD) video, respectively, compared with a popular existing method called 2Dwave.
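
The thread-per-stage arrangement described above can be sketched generically. The paper's implementation uses OpenMP inside FFmpeg; the version below is only an illustrative analogue built on Python's standard `threading` and `queue` modules, with a hypothetical `run_pipeline` helper and placeholder stage functions rather than real decoding stages.

```python
import queue
import threading

_STOP = object()  # sentinel that tells each stage to shut down


def run_pipeline(stages, items):
    """Run each stage function in its own thread, linked by FIFO queues."""
    queues = [queue.Queue() for _ in range(len(stages) + 1)]

    def worker(fn, q_in, q_out):
        while True:
            item = q_in.get()
            if item is _STOP:
                q_out.put(_STOP)      # forward shutdown to the next stage
                return
            q_out.put(fn(item))

    threads = [
        threading.Thread(target=worker, args=(fn, queues[i], queues[i + 1]))
        for i, fn in enumerate(stages)
    ]
    for t in threads:
        t.start()
    for item in items:
        queues[0].put(item)
    queues[0].put(_STOP)

    results = []
    while True:
        item = queues[-1].get()
        if item is _STOP:
            break
        results.append(item)
    for t in threads:
        t.join()
    return results
```

Because every stage is a single thread reading from a FIFO queue, items leave the pipeline in the order they entered, and the only synchronization points are the queue hand-offs between adjacent stages.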

Construction of the Multiple Processing Unit by De Bruijn Graph (De Bruijn 그래프에 의한 다중처리기 구성)

  • Park, Chun-Myoung
    • Journal of the Korea Institute of Information and Communication Engineering / v.10 no.12 / pp.2187-2192 / 2006
  • This paper presents a method of constructing the universal multiple processing element unit (UMPEU) by De Bruijn graph. The method is as follows. First, we propose transformation operators for constructing the De Bruijn UMPEU using properties of the graph. Second, we construct the transformation table of the De Bruijn graph using these transformation operators. Finally, we construct the De Bruijn graph using the transformation table. The proposed UMPEU is able to construct the De Bruijn graph for any prime number and integer value of finite fields. The UMPEU can also be applied to fault-tolerant computing systems, pipeline classes, parallel processing networks, and switching functions and their circuits.
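
For readers unfamiliar with the graph itself, the sketch below builds a De Bruijn graph from its standard shift-and-append definition. The abstract does not define the paper's transformation operators or table, so this is not the authors' construction; `de_bruijn_graph` and its parameters `k` (alphabet size) and `n` (word length) are names assumed for illustration.

```python
from itertools import product


def de_bruijn_graph(k, n):
    """Return the De Bruijn graph B(k, n) as an adjacency dict.

    Vertices are length-n tuples over the symbols 0..k-1. Each vertex s
    has an edge to every vertex formed by dropping s[0] and appending
    one symbol, so every vertex has out-degree k.
    """
    nodes = product(range(k), repeat=n)
    return {node: [node[1:] + (c,) for c in range(k)] for node in nodes}
```

With `k = 2` and `n = 3` the graph has 8 vertices, and vertex `(0, 1, 1)` points to `(1, 1, 0)` and `(1, 1, 1)`. Every vertex also has in-degree `k`, which is part of what makes the topology attractive for regular processing networks.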