• Title/Summary/Keyword: Verilog HDL

Search Result 416, Processing Time 0.024 seconds

Hardware Design of In-loop Filter for High Performance HEVC Encoder (고성능 HEVC 부호기를 위한 루프 내 필터 하드웨어 설계)

  • Park, Seungyong;Im, Junseong;Ryoo, Kwangki
    • Journal of the Korea Institute of Information and Communication Engineering
    • /
    • v.20 no.2
    • /
    • pp.335-342
    • /
    • 2016
  • This paper proposes efficient hardware structure of in-loop filter for a high-performance HEVC (High Efficiency Video Coding) encoder. HEVC uses in-loop filter consisting of deblocking filter and SAO (Sample Adaptive Offset) to improve the picture quality in a reconstructed image due to a quantization error. However, in-loop filter causes an increase in complexity due to the additional encoder and decoder operations. A proposed in-loop filter is implemented as a three-stage pipeline to perform the deblocking filtering and SAO operation with a reduced number of cycles. The proposed deblocking filter is also implemented as a six-stage pipeline to improve efficiency and performs a new filtering order for efficient memory architecture. The proposed SAO processes six pixels parallelly at a time to reduce execution cycles. The proposed in-loop filter encoder architecture is designed by Verilog HDL, and implemented by 131K logic gates in TSMC $0.13{\mu}m$ process. At 164MHz, the proposed in-loop filter encoder can support 4K Ultra HD video encoding at 60fps in real time.

Hardware Design of High Performance In-loop Filter in HEVC Encoder for Ultra HD Video Processing in Real Time (UHD 영상의 실시간 처리를 위한 고성능 HEVC In-loop Filter 부호화기 하드웨어 설계)

  • Im, Jun-seong;Dennis, Gookyi;Ryoo, Kwang-ki
    • Proceedings of the Korean Institute of Information and Commucation Sciences Conference
    • /
    • 2015.10a
    • /
    • pp.401-404
    • /
    • 2015
  • This paper proposes a high-performance in-loop filter in HEVC(High Efficiency Video Coding) encoder for Ultra HD video processing in real time. HEVC uses in-loop filter consisting of deblocking filter and SAO(Sample Adaptive Offset) to solve the problems of quantization error which causes image degradation. In the proposed in-loop filter encoder hardware architecture, the deblocking filter and SAO has a 2-level hybrid pipeline structure based on the $32{\times}32CTU$ to reduce the execution time. The deblocking filter is performed by 6-stage pipeline structure, and it supports minimization of memory access and simplification of reference memory structure using proposed efficient filtering order. Also The SAO is implemented by 2-statge pipeline for pixel classification and applying SAO parameters and it uses two three-layered parallel buffers to simplify pixel processing and reduce operation cycle. The proposed in-loop filter encoder architecture is designed by Verilog HDL, and implemented by 205K logic gates in TSMC 0.13um process. At 110MHz, the proposed in-loop filter encoder can support 4K Ultra HD video encoding at 30fps in realtime.

  • PDF

An Intra Prediction Hardware Design for High Performance HEVC Encoder (고성능 HEVC 부호기를 위한 화면내 예측 하드웨어 설계)

  • Park, Seung-yong;Guard, Kanda;Ryoo, Kwang-ki
    • Proceedings of the Korean Institute of Information and Commucation Sciences Conference
    • /
    • 2015.10a
    • /
    • pp.875-878
    • /
    • 2015
  • In this paper, we propose an intra prediction hardware architecture with less processing time, computations and reduced hardware area for a high performance HEVC encoder. The proposed intra prediction hardware architecture uses common operation units to reduce computational complexity and uses $4{\times}4$ block unit to reduce hardware area. In order to reduce operation time, common operation unit uses one operation unit to generate predicted pixels and filtered pixels in all prediction modes. Intra prediction hardware architecture introduces the $4{\times}4$ PU design processing to reduce the hardware area and uses intemal registers to support $32{\times}32$ PU processmg. The proposed hardware architecture uses ten common operation units which can reduce execution cycles of intra prediction. The proposed Intra prediction hardware architecture is designed using Verilog HDL(Hardware Description Language), and has a total of 41.5k gates in TSMC $0.13{\mu}m$ CMOS standard cell library. At 150MHz, it can support 4K UHD video encoding at 30fps in real time, and operates at a maximum of 200MHz.

  • PDF

Design of Message Passing Engine Based on Processing Node Status for MPI Collective Communication (MPI 집합통신을 위한 프로세싱 노드 상태 기반의 메시지 전달 엔진 설계)

  • Chung, Won-Young;Lee, Yong-Surk
    • The Journal of Korean Institute of Communications and Information Sciences
    • /
    • v.37 no.8B
    • /
    • pp.668-676
    • /
    • 2012
  • In this paper, on the assumption that MPI collective communication function is converted into a group of point-to-point communication functions in the transaction level, an algorithm that optimizes broadcast, scatter and gather function among MPI collective communication is proposed. The MPI hardware engine that operates the proposed algorithm was designed, and it was named the OCC-MPE (Optimized Collective Communication Message Passing Engine). The OCC-MPE operates point-to-point communication by using the standard send mode. The transmission order is arranged according to the algorithm that proposes the most frequently used broadcast, scatter and gather functions among the collective communications, so the whole communication time is reduced. To measure the performance of the proposed algorithm, the OCC-MPE with the Bus Functional Model (BFM) based on SystemC was designed. After evaluating the performance through the BFM based on SystemC, the proposed OCC-MPE is designed by using VerilogHDL. As a result of synthesizing with the TSMC $0.18{\mu}m$, the gate count of each OCC-MPE is approximately 1978.95 with four processing nodes. That occupies approximately 4.15% in the whole system, which means it takes up a relatively small amount. Improved performance is expected with relatively small amounts of area increase if the OCC-MPE operated by the proposed algorithm is added to the MPSoC (Multi-Processor System on a Chip).

Bit-serial Discrete Wavelet Transform Filter Design (비트 시리얼 이산 웨이블렛 변환 필터 설계)

  • Park Tae geun;Kim Ju young;Noh Jun rye
    • The Journal of Korean Institute of Communications and Information Sciences
    • /
    • v.30 no.4A
    • /
    • pp.336-344
    • /
    • 2005
  • Discrete Wavelet Transform(DWT) is the oncoming generation of compression technique that has been selected for MPEG4 and JEPG2000, because it has no blocking effects and efficiently determines frequency property of temporary time. In this paper, we propose an efficient bit-serial architecture for the low-power and low-complexity DWT filter, employing two-channel QMF(Qudracture Mirror Filter) PR(Perfect Reconstruction) lattice filter. The filter consists of four lattices(filter length=8) and we determine the quantization bit for the coefficients by the fixed-length PSNR(peak-signal-to-noise ratio) analysis and propose the architecture of the bit-serial multiplier with the fixed coefficient. The CSD encoding for the coefficients is adopted to minimize the number of non-zero bits, thus reduces the hardware complexity. The proposed folded 1D DWT architecture processes the other resolution levels during idle periods by decimations and its efficient scheduling is proposed. The proposed architecture requires only flip-flops and full-adders. The proposed architecture has been designed and verified by VerilogHDL and synthesized by Synopsys Design Compiler with a Hynix 0.35$\mu$m STD cell library. The maximum operating frequency is 200MHz and the throughput is 175Mbps with 16 clock latencies.

A Design of AES-based WiBro Security Processor (AES 기반 와이브로 보안 프로세서 설계)

  • Kim, Jong-Hwan;Shin, Kyung-Wook
    • Journal of the Institute of Electronics Engineers of Korea SD
    • /
    • v.44 no.7 s.361
    • /
    • pp.71-80
    • /
    • 2007
  • This paper describes an efficient hardware design of WiBro security processor (WBSec) supporting for the security sub-layer of WiBro wireless internet system. The WBSec processor, which is based on AES (Advanced Encryption Standard) block cipher algorithm, performs data oncryption/decryption, authentication/integrity, and key encryption/decryption for packet data protection of wireless network. It carries out the modes of ECB, CTR, CBC, CCM and key wrap/unwrap with two AES cores working in parallel. In order to achieve an area-efficient implementation, two design techniques are considered; First, round transformation block within AES core is designed using a shared structure for encryption/decryption. Secondly, SubByte/InvSubByte blocks that require the largest hardware in AES core are implemented using field transformation technique. It results that the gate count of WBSec is reduced by about 25% compared with conventional LUT (Look-Up Table)-based design. The WBSec processor designed in Verilog-HDL has about 22,350 gates, and the estimated throughput is about 16-Mbps at key wrap mode and maximum 213-Mbps at CCM mode, thus it can be used for hardware design of WiBro security system.

Design of an Improved Anti-Collision Unit for an RFID Reader System Based on Gen2 (Gen2 리더 시스템의 개선된 충돌방지 유닛 설계)

  • Sim, Jae-Hee;Lee, Yong-Joo;Lee, Yong-Surk
    • The Journal of Korean Institute of Communications and Information Sciences
    • /
    • v.34 no.2A
    • /
    • pp.177-183
    • /
    • 2009
  • In this paper, we propose an improved anti-collision algorithm. We have designed an anti-collision unit using this algorithm for the 18000-6 Type C Class 1 Generation 2 standard (Gen2). The Gen2 standard uses a Q-algorithm for incremental method on the Dynamic Slot-Aloha algorithm. It has basically enhanced performance over the Slot-Aloha algorithm. Unfortunately, there are several non-clarified parts: initial $Q_{fp}$ value, weighted C, and the ending point of the algorithm. If an incorrect value is selected, it causes degradation in performance. Thus we propose an improved anti-collision algorithm by clearly defining the vague parts of the existing algorithm. Simulation results showed an improved performance of up to 34.8% using an optimized value of C and the initial $Q_{fp}$ value. With the ending condition, performance is 34.7%. The anti-collision unit is designed using the Verilog HDL. The module was synthesized using Synopsys' Design Compiler and the TSMC $0.2{\mu}m$ standard cell library. The synthesized result yielded 3,847 gates, and was guaranteed under the proposed working frequency of 19.2MHz.

A Design of Memory-efficient 2k/8k FFT/IFFT Processor using R4SDF/R4SDC Hybrid Structure (R4SDF/R4SDC Hybrid 구조를 이용한 메모리 효율적인 2k/8k FFT/IFFT 프로세서 설계)

  • 신경욱
    • Journal of the Korea Institute of Information and Communication Engineering
    • /
    • v.8 no.2
    • /
    • pp.430-439
    • /
    • 2004
  • This paper describes a design of 8192/2048-point FFT/IFFT processor (CFFT8k2k), which performs multi-carrier modulation/demodulation in OFDM-based DVB-T receiver. Since a large size FFT requires a large buffer memory, two design techniques are considered to achieve memory-efficient implementation of 8192-point FFT/IFFT. A hybrid structure, which is composed of radix-4 single-path delay feedback (R4SDF) and radix-4 single-path delay commutator (R4SDC), reduces its memory by 20% compared to R4SDC structure. In addition, a memory reduction of about 24% is achieved by a novel two-step convergent block floating-point scaling. As a result, it requires only 57% of memory used in conventional design, reducing chip area and power consumption. The CFFT8k2k core is designed in Verilog-HDL, and has about 102,000 Bates, RAM of 292k bits, and ROM of 39k bits. Using gate-level netlist with SDF which is synthesized using a $0.25-{\um}m$ CMOS library, timing simulation show that it can safely operate with 50-MHz clock at 2.5-V supply, resulting that a 8192-point FFT/IFFT can be computed every 164-${\mu}\textrm{s}$. The functionality of the core is fully verified by FPGA implementation, and the average SQNR of 60-㏈ is achieved.

Hardware Design of High Performance HEVC Deblocking Filter for UHD Videos (UHD 영상을 위한 고성능 HEVC 디블록킹 필터 설계)

  • Park, Jaeha;Ryoo, Kwangki
    • Journal of the Korea Institute of Information and Communication Engineering
    • /
    • v.19 no.1
    • /
    • pp.178-184
    • /
    • 2015
  • This paper proposes a hardware architecture for high performance Deblocking filter(DBF) in High Efficiency Video Coding for UHD(Ultra High Definition) videos. This proposed hardware architecture which has less processing time has a 4-stage pipelined architecture with two filters and parallel boundary strength module. Also, the proposed filter can be used in low-voltage design by using clock gating architecture in 4-stage pipeline. The segmented memory architecture solves the hazard issue that arises when single port SRAM is accessed. The proposed order of filtering shortens the delay time that arises when storing data into the single port SRAM at the pre-processing stage. The DBF hardware proposed in this paper was designed with Verilog HDL, and was implemented with 22k logic gates as a result of synthesis using TSMC 0.18um CMOS standard cell library. Furthermore, the dynamic frequency can process UHD 8k($7680{\times}4320$) samples@60fps using a frequency of 150MHz with an 8K resolution and maximum dynamic frequency is 285MHz. Result from analysis shows that the proposed DBF hardware architecture operation cycle for one process coding unit has improved by 32% over the previous one.

Comparison among Gamma(${\gamma}$) Line Systems for Non-Linear Gamma Curve (비선형 감마 커브를 위한 감마 라인 시스템의 비교)

  • Jang, Won-Woo;Lee, Sung-Mok;Ha, Joo-Young;Kim, Joo-Hyun;Kim, Sang-Choon;Kang, Bong-Soon
    • Journal of the Korea Institute of Information and Communication Engineering
    • /
    • v.11 no.2
    • /
    • pp.265-272
    • /
    • 2007
  • This proposed gamma (${\gamma}$) correction system is developed to reduce the difference between non-linear gamma curve produced by a typical formula and result produced by the proposed algorithm. In order to reduce the difference, the proposed system is using the Least Squares Polynomial which is calculating the best fitting polynomial through a set of points which is sampled. Each system is consisting of continuous several kinds of equations and having their own overlap sections to get more precise. Based on the algorithm verified by MATLAB, the proposed systems are implemented by using Verilog-HDL. This paper will compare the previous algorithm of gamma system such as Existing system with Seed Table with the latest that such as Proposed system. The former and the latter system have 1, 2 clock latency; each 1 result per clock. Because each of the error range (LSB) is $1{\sim}+1,\;0{\sim}+36$, we can how that Proposed system is improved. Under the condition of SAMSUNG STD90 0.35 worst case, each gate count is 2,063, 2,564 gates and each maximum data arrival time is 29.05[ns], 17.52[ns], respectively.