• Title/Summary/Keyword: pipelining

Search Result 141, Processing Time 0.022 seconds

High-Speed Low-Complexity Reed-Solomon Decoder using Pipelined Berlekamp-Massey Algorithm and Its Folded Architecture

  • Park, Jeong-In;Lee, Ki-Hoon;Choi, Chang-Seok;Lee, Han-Ho
    • JSTS:Journal of Semiconductor Technology and Science
    • /
    • v.10 no.3
    • /
    • pp.193-202
    • /
    • 2010
  • This paper presents a high-speed low-complexity pipelined Reed-Solomon (RS) (255,239) decoder using pipelined reformulated inversionless Berlekamp-Massey (pRiBM) algorithm and its folded version (PF-RiBM). Also, this paper offers efficient pipelining and folding technique of the RS decoders. This architecture uses pipelined Galois-Field (GF) multipliers in the syndrome computation block, key equation solver (KES) block, Forney block, Chien search block and error correction block to enhance the clock frequency. A high-speed pipelined RS decoder based on the pRiBM algorithm and its folded version have been designed and implemented with 90-nm CMOS technology in a supply voltage of 1.1 V. The proposed RS(255,239) decoder operates at a clock frequency of 700 MHz using the pRiBM architecture and also operates at a clock frequency of 750 MHz using the PF-RiBM, respectively. The proposed architectures feature high clock frequency and low-complexity.

Design of low-noise II R filter with high-density and low-power properties (고집적, 저전력 특성을 갖는 저잡음 IIR 필터 설계)

  • Bae Sung-hwan;Kim Dae-ik
    • The KIPS Transactions:PartA
    • /
    • v.12A no.1 s.91
    • /
    • pp.7-12
    • /
    • 2005
  • Scattered look-ahead(SLA) pipelining method can be efficiently used for high-speed or low-power applications of digital II R filters. Although the pipelined filters are guaranteed to be stable by this method, these filters suffer from large roundoff noise when the poles are crowded within some critical regions. An angle and radius constrained II R fille. design approach using modified Remez exchange algorithm and least squares algorithm is proposed to avoid tight pole-crowding in pipelined filters, resulting in improved frequency responses and reduced coefficient sensitivities. Experimental results demonstrate that our proposed method leads to chip area reduction by $33{\%}$ and low power by $45{\%}$ against the conventional method.

New High Speed Parallel Multiplier for Real Time Multimedia Systems (실시간 멀티미디어 시스템을 위한 새로운 고속 병렬곱셈기)

  • Cho, Byung-Lok;Lee, Mike-Myung-Ok
    • The KIPS Transactions:PartA
    • /
    • v.10A no.6
    • /
    • pp.671-676
    • /
    • 2003
  • In this paper, we proposed a new First Partial product Addition (FPA) architecture with new compressor (or parallel counter) to CSA tree built in the process of adding partial product for improving speed in the fast parallel multiplier to improve the speed of calculating partial product by about 20% compared with existing parallel counter using full Adder. The new circuit reduces the CLA bit finding final sum by N/2 using the novel FPA architecture. A 5.14nS of multiplication speed of the $16{\times}16$ multiplier is obtained using $0.25\mu\textrm{m}$ CMOS technology. The architecture of the multiplier is easily opted for pipeline design and demonstrates high speed performance.

A Hardware Implementation of Moving Object Detection Algorithm using Gaussian Mixture Model (가우시안 혼합 모델을 이용한 이동 객체 검출 알고리듬의 하드웨어 구현)

  • Kim, Gyeong-hun;An, Hyo-Sik;Shin, Kyung-wook
    • Proceedings of the Korean Institute of Information and Commucation Sciences Conference
    • /
    • 2015.05a
    • /
    • pp.407-409
    • /
    • 2015
  • In this paper, a hardware implementation of MOD(Moving Object Detection) algorithm is described, which is based GMM(Gaussian Mixture Model) and background subtraction. The EGML(Effective Gaussian Mixture Learning) is used to model and update background. Some approximations of EGML calculations are applied to reduce hardware complexity, and pipelining technique is used to improve operating speed. Gaussian parameters are adjustable according to various environment conditions to achieve better MOD performance. MOD processor is verified by using FPGA-in-the-loop verification, and it can operate with 109 MHz clock frequency on XC5VSX95T FPGA device.

  • PDF

Physical-Aware Approaches for Speeding Up Scan Shift Operations in SoCs

  • Lee, Taehee;Chang, Ik Joon;Lee, Chilgee;Yang, Joon-Sung
    • ETRI Journal
    • /
    • v.38 no.3
    • /
    • pp.479-486
    • /
    • 2016
  • System-on-chip (SoC) designs have a number of flip-flops; the more flip-flops an SoC has, the longer the associated scan test application time will be. A scan shift operation accounts for a significant portion of a scan test application time. This paper presents physical-aware approaches for speeding up scan shift operations in SoCs. To improve the speed of a scan shift operation, we propose a layout-aware flip-flop insertion and scan shift operation-aware physical implementation procedure. The proposed combined method of insertion and procedure effectively improves the speed of a scan shift operation. Static timing analyses of state-of-the-art SoC designs show that the proposed approaches help increase the speeds of scan shift operations by up to 4.1 times that reached under a conventional method. The faster scan shift operation speeds help to shorten scan test application times, thus reducing test costs.

High-throughput and low-area implementation of orthogonal matching pursuit algorithm for compressive sensing reconstruction

  • Nguyen, Vu Quan;Son, Woo Hyun;Parfieniuk, Marek;Trung, Luong Tran Nhat;Park, Sang Yoon
    • ETRI Journal
    • /
    • v.42 no.3
    • /
    • pp.376-387
    • /
    • 2020
  • Massive computation of the reconstruction algorithm for compressive sensing (CS) has been a major concern for its real-time application. In this paper, we propose a novel high-speed architecture for the orthogonal matching pursuit (OMP) algorithm, which is the most frequently used to reconstruct compressively sensed signals. The proposed design offers a very high throughput and includes an innovative pipeline architecture and scheduling algorithm. Least-squares problem solving, which requires a huge amount of computations in the OMP, is implemented by using systolic arrays with four new processing elements. In addition, a distributed-arithmetic-based circuit for matrix multiplication is proposed to counterbalance the area overhead caused by the multi-stage pipelining. The results of logic synthesis show that the proposed design reconstructs signals nearly 19 times faster while occupying an only 1.06 times larger area than the existing designs for N = 256, M = 64, and m = 16, where N is the number of the original samples, M is the length of the measurement vector, and m is the sparsity level of the signal.

Programmable Multimedia Platform for Video Processing of UHD TV (UHD TV 영상신호처리를 위한 프로그래머블 멀티미디어 플랫폼)

  • Kim, Jaehyun;Park, Goo-man
    • Journal of Broadcast Engineering
    • /
    • v.20 no.5
    • /
    • pp.774-777
    • /
    • 2015
  • This paper introduces the world's first programmable video-processing platform for the enhancement of the video quality of the 8K(7680x4320) UHD(Ultra High Definition) TV operating up to 60 frames per second. In order to support required computing capacity and memory bandwidth, the proposed platform implemented several key features such as symmetric multi-cluster architecture for parallel data processing, a ring-data path between the clusters for data pipelining and hardware accelerators for computing filter operations. The proposed platform based on RP(Reconfigurable Processor) processes video quality enhancement algorithms and handles effectively new UHD broadcasting standards and display panels.

Parallel 2D-DWT Hardware Architecture for Image Compression Using the Lifting Scheme (이미지 압축을 위한 Lifting Scheme을 이용한 병렬 2D-DWT 하드웨어 구조)

  • Kim, Jong-Woog;Chong, Jong-Wha
    • Journal of IKEEE
    • /
    • v.6 no.1 s.10
    • /
    • pp.80-86
    • /
    • 2002
  • This paper presents a fast hardware architecture to implement a 2-D DWT(Discrete Wavelet Transform) computed by lifting scheme framework. The conventional 2-D DWT hardware architecture has problem in internal memory, hardware resource, and latency. The proposed architecture was based on the 4-way partitioned data set. This architecture is configured with a pipelining parallel architecture for 4-way partitioning method. Due to the use of this architecture, total latency was improved by 50%, and memory size was reduced by using lifting scheme.

  • PDF

DSP Optimization for Rain Detection and Removal Algorithm (비 검출 및 제거 알고리즘의 DSP 최적화)

  • Choi, Dong Yoon;Seo, Seung Ji;Song, Byung Cheol
    • Journal of the Institute of Electronics and Information Engineers
    • /
    • v.52 no.9
    • /
    • pp.96-105
    • /
    • 2015
  • This paper proposes a DSP optimization solution of rain detection and removal algorithm. We propose rain detection and removal algorithms considering camera motion, and also presents optimization results in algorithm level and DSP level. At algorithm level, this paper utilizes a block level binary pattern analysis, and reduces the operation time by using the fast motion estimation algorithm. Also, the algorithm is optimized at DSP level through inter memory optimization, EDMA, and software pipelining for real-time operation. Experiment results show that the proposed algorithm is superior to the other algorithms in terms of visual quality as well as processing speed.

CUDA based Lossless Asynchronous Compression of Ultra High Definition Game Scenes using DPCM-GR (DPCM-GR 방식을 이용한 CUDA 기반 초고해상도 게임 영상 무손실 비동기 압축)

  • Kim, Youngsik
    • Journal of Korea Game Society
    • /
    • v.14 no.6
    • /
    • pp.59-68
    • /
    • 2014
  • Memory bandwidth requirements of UHD (Ultra High Definition $4096{\times}2160$) game scenes have been much more increasing. This paper presents a lossless DPCM-GR based compression algorithm using CUDA for solving the memory bandwidth problem without sacrificing image quality, which is modified from DDPCM-GR [4] to support bit parallel pipelining. The memory bandwidth efficiency increases because of using the shared memory of CUDA. Various asynchronous transfer configurations which can overlap the kernel execution and data transfer between host and CUDA are implemented with the page-locked host memory. Experimental results show that the maximum 31.3 speedup is obtained according to CPU time. The maximum 30.3% decreases in the computation time among various configurations.