• Title/Summary/Keyword: Pipelining Operation

Search Result 24, Processing Time 0.022 seconds

Implementation of Efficient Channel Decoder for WiBro System (WiBro 시스템을 위한 효율적인 구조의 채널 복호화기 구현)

  • Kim, Jang-Hun;Han, Chul-Hee
    • Proceedings of the IEEK Conference
    • /
    • 2007.07a
    • /
    • pp.177-178
    • /
    • 2007
  • WiBro system provides reliable broadband communication services for mobile and portable subcribers. It allows interference-free reception under the conditions of multipath propagation and transmission errors. Thus, powerful channel-error correction ability Is required. CC/CTC Decoder which Is mandatory for WiBro system needs lots of computations for real-time operation. So, it is desired to design a CC/CTC Decoder having highly optimized hardware scheme for low latency operation under high data rates. This paper proposes an efficient CC/CTC Decoder structure for high data rate WiBro system. Particularly, the proposed CTC Decoder architecture reduces decoding delay by applying pipelining and multiple decoding blocks. Simulation results show that reduction of about 80% of processing time is enabled with the proposed CC/CTC Decoder despite of increase in are.

  • PDF

Delayed Scheduling under Resource Constrains (자원제약하에서의 지연 스케쥴링)

  • Shin, In-Soo;Lee, Keun-Man
    • The Transactions of the Korea Information Processing Society
    • /
    • v.4 no.10
    • /
    • pp.2571-2580
    • /
    • 1997
  • In this paper, we deal with the resource constrain scheduling to execute behavior algorithm under resource limit. Expecially, we proposed a scheduling algorithm, called delayed scheduling, which finds the lower bound control step to assign operation under resource limit. We take in account the actual scheduling problems including multicycle operation and functional pipelining. Integer Linear Programing formulations are used to the scheduling problems in order to get optimal scheduling result. Experiment was done on the DFG model of fifth-order digital wave filter, to show it's effectiveness.

  • PDF

Optimizing 2-stage Tiling-based Matrix Multiplication in FPGA-based Neural Network Accelerator (FPGA기반 뉴럴네트워크 가속기에서 2차 타일링 기반 행렬 곱셈 최적화)

  • Jinse, Kwon;Jemin, Lee;Yongin, Kwon;Jeman, Park;Misun, Yu;Taeho, Kim;Hyungshin, Kim
    • IEMEK Journal of Embedded Systems and Applications
    • /
    • v.17 no.6
    • /
    • pp.367-374
    • /
    • 2022
  • The acceleration of neural networks has become an important topic in the field of computer vision. An accelerator is absolutely necessary for accelerating the lightweight model. Most accelerator-supported operators focused on direct convolution operations. If the accelerator does not provide GEMM operation, it is mostly replaced by CPU operation. In this paper, we proposed an optimization technique for 2-stage tiling-based GEMM routines on VTA. We improved performance of the matrix multiplication routine by maximizing the reusability of the input matrix and optimizing the operation pipelining. In addition, we applied the proposed technique to the DarkNet framework to check the performance improvement of the matrix multiplication routine. The proposed GEMM method showed a performance improvement of more than 2.4 times compared to the non-optimized GEMM method. The inference performance of our DarkNet framework has also improved by at least 2.3 times.

A Fuzzy Microprocessor for Real-time Control Applications

  • Katashiro, Takeshi
    • Proceedings of the Korean Institute of Intelligent Systems Conference
    • /
    • 1993.06a
    • /
    • pp.1394-1397
    • /
    • 1993
  • A Fuzzy Microprocessor(FMP) is presented, which is suitable for real-time control applications. The features include high speed inference of maximum 114K FLIPS at 20MHz system clocks, capability of up to 128-rule construction, and handing of 8 input variables with 8-bit resolution. In order to realize these features, the fuzzifier circuit and the processing element(PE) are well optimized for LSI implementation. The chip fabricated in 1.2$\mu\textrm{m}$ CMOS technology contains 71K transistors in 82.8 $\textrm{mm}^2$ die size and is packaged in 100-pin plastic QFP.

  • PDF

DSP Optimization for Rain Detection and Removal Algorithm (비 검출 및 제거 알고리즘의 DSP 최적화)

  • Choi, Dong Yoon;Seo, Seung Ji;Song, Byung Cheol
    • Journal of the Institute of Electronics and Information Engineers
    • /
    • v.52 no.9
    • /
    • pp.96-105
    • /
    • 2015
  • This paper proposes a DSP optimization solution of rain detection and removal algorithm. We propose rain detection and removal algorithms considering camera motion, and also presents optimization results in algorithm level and DSP level. At algorithm level, this paper utilizes a block level binary pattern analysis, and reduces the operation time by using the fast motion estimation algorithm. Also, the algorithm is optimized at DSP level through inter memory optimization, EDMA, and software pipelining for real-time operation. Experiment results show that the proposed algorithm is superior to the other algorithms in terms of visual quality as well as processing speed.

음성인식용 DTW PE의 IC화를 위한 ADD 및 ABS 회로의 설계

  • 정광재;문홍진;최규훈;김종교
    • The Journal of Korean Institute of Communications and Information Sciences
    • /
    • v.15 no.8
    • /
    • pp.648-658
    • /
    • 1990
  • There are many methods for speed up counting in speech recongition. A multiple processing method is the one way to achieve the aim using systolic array. This arithmetic operation by the array is achieved pipelining skill. And the operation is multiprocessing by processing element(PE) that is incresing counting efficiencies. The DTW PE cell is seperated into three large blocks. "MIN" is the one block for counting accumulated minimum distance, "ADD" block calculated these minimum distances, and "ABS" seeks for the absolut values to the total sum of local distances. We have accomplished circuit design and verification about the "ADD" and "ABS" blocks, and performed total layout '||'&'||' DRC(design rule check) using 3um CMOS N-Well rule base.le check) using 3$\mu$m CMOS N-Well rule base.

  • PDF

A study on the low power architecture of multi-giga bit synchronous DRAM's (Giga Bit급 저전력 synchronous DRAM 구조에 대한 연구)

  • 유회준;이정우
    • Journal of the Korean Institute of Telematics and Electronics C
    • /
    • v.34C no.11
    • /
    • pp.1-11
    • /
    • 1997
  • The transient current components of the dRAM are analyzed and the sensing current, data path operation current and DC leakage current are revealed to be the major curretn components. It is expected that the supply voltage of less than 1.5V with low VT MOS witll be used in multi-giga bit dRAM. A low voltage dual VT self-timed CMOS logic in which the subthreshold leakage current path is blocked by a large high-VT MOS is proposed. An active signal at each node of the nature speeds up the signal propagation and enables the synchronous DRAM to adopt a fast pipelining scheme. The sensing current can be reduced by adopting 8 bit prefetch scheme with 1.2V VDD. Although the total cycle time for the sequential 8 bit read is the same as that of the 3.3V conventional DRAM, the sensing current is loered to 0.7mA or less than 2.3% of the current of 3.3V conventional DRAM. 4 stage pipeline scheme is used to rduce the power consumption in the 4 giga bit DRAM data path of which length and RC delay amount to 3 cm and 23.3ns, respectively. A simple wave pipeline scheme is used in the data path where 4 sequential data pulses of 5 ns width are concurrently transferred. With the reduction of the supply voltage from 3.3V to 1.2V, the operation current is lowered from 22mA to 2.5mA while the operation speed is enhanced more than 4 times with 6 ns cycle time.

  • PDF

Hyperelliptic Curve Crypto-Coprocessor over Affine and Projective Coordinates

  • Kim, Ho-Won;Wollinger, Thomas;Choi, Doo-Ho;Han, Dong-Guk;Lee, Mun-Kyu
    • ETRI Journal
    • /
    • v.30 no.3
    • /
    • pp.365-376
    • /
    • 2008
  • This paper presents the design and implementation of a hyperelliptic curve cryptography (HECC) coprocessor over affine and projective coordinates, along with measurements of its performance, hardware complexity, and power consumption. We applied several design techniques, including parallelism, pipelining, and loop unrolling, in designing field arithmetic units, group operation units, and scalar multiplication units to improve the performance and power consumption. Our affine and projective coordinate-based HECC processors execute in 0.436 ms and 0.531 ms, respectively, based on the underlying field GF($2^{89}$). These results are about five times faster than those for previous hardware implementations and at least 13 times better in terms of area-time products. Further results suggest that neither case is superior to the other when considering the hardware complexity and performance. The characteristics of our proposed HECC coprocessor show that it is applicable to high-speed network applications as well as resource-constrained environments, such as PDAs, smart cards, and so on.

  • PDF

VLSI Design for Folded Wavelet Transform Processor using Multiple Constant Multiplication (MCM과 폴딩 방식을 적용한 웨이블릿 변환 장치의 VLSI 설계)

  • Kim, Ji-Won;Son, Chang-Hoon;Kim, Song-Ju;Lee, Bae-Ho;Kim, Young-Min
    • Journal of Korea Multimedia Society
    • /
    • v.15 no.1
    • /
    • pp.81-86
    • /
    • 2012
  • This paper presents a VLSI design for lifting-based discrete wavelet transform (DWT) 9/7 filter using multiplierless multiple constant multiplication (MCM) architecture. This proposed design is based on the lifting scheme using pattern search for folded architecture. Shift-add operation is adopted to optimize the multiplication process. The conventional serial operations of the lifting data flow can be optimized into parallel ones by employing paralleling and pipelining techniques. This optimized design has simple hardware architecture and requires less computation without performance degradation. Furthermore, hardware utilization reaches 100%, and the number of registers required is significantly reduced. To compare our work with previous methods, we implemented the architecture using Verilog HDL. We also executed simulation based on the logic synthesis using $0.18{\mu}m$ CMOS standard cells. The proposed architecture shows hardware reduction of up to 60.1% and 44.1% respectively at 200 MHz clock compared to previous works. This implementation results indicate that the proposed design performs efficiently in hardware cost, area, and power consumption.

Optimization for H.264/AVC De-blocking Filter on the TMS320C64x+ DSP (TMS320C64x+ DSP에서의 H.264/AVC 디블록킹 필터 최적화)

  • Lee, Jin-Seop;Kang, Dae-Beom;Sim, Dong-Gyu;Lee, Soo-Youn
    • Journal of the Institute of Electronics Engineers of Korea SP
    • /
    • v.48 no.2
    • /
    • pp.41-52
    • /
    • 2011
  • It is important to reduce computational complexity of de-blocking filter for real-time implementation, because it accounts for a great part of total computational complexity of the decoder. Because there are a lot of conditional branches and memory accesses in a decoding loop, it is not easy to speed up the de-blocking filter. Therefore, this paper presents a new algorithm of de-blocking filter minimizing conditional branches and memory accesses. The proposed structure of de-blocking filter enables filter operation to parallelize by software pipelining. The proposed optimization method was implemented on a TMS320DM6467 EVM board and we achieved approximately 46% cycle reduction, compared with that of FFmpeg.