• Title/Summary/Keyword: clock cycles

Search Result 148, Processing Time 0.02 seconds

Design of Systolic Multipliers in GF(2$^{m}$ ) Using an Irreducible All One Polynomial (기약 All One Polynomial을 이용한 유한체 GF(2$^{m}$ )상의 시스톨릭 곱셈기 설계)

  • Gwon, Sun Hak;Kim, Chang Hun;Hong, Chun Pyo
    • The Journal of Korean Institute of Communications and Information Sciences
    • /
    • v.29 no.8C
    • /
    • pp.1047-1054
    • /
    • 2004
  • In this paper, we present two systolic arrays for computing multiplications in CF(2$\^$m/) generated by an irreducible all one polynomial (AOP). The proposed two systolic mays have parallel-in parallel-out structure. The first systolic multiplier has area complexity of O(㎡) and time complexity of O(1). In other words, the multiplier consists of m(m+1)/2 identical cells and produces multiplication results at a rate of one every 1 clock cycle, after an initial delay of m/2+1 cycles. Compared with the previously proposed related multiplier using AOP, our design has 12 percent reduced hardware complexity and 50 percent reduced computation delay time. The other systolic multiplier, designed for cryptographic applications, has area complexity of O(m) and time complexity of O(m), i.e., it is composed of m+1 identical cells and produces multiplication results at a rate of one every m/2+1 clock cycles. Compared with other linear systolic multipliers, we find that our design has at least 43 percent reduced hardware complexity, 83 percent reduced computation delay time, and has twice higher throughput rate Furthermore, since the proposed two architectures have a high regularity and modularity, they are well suited to VLSI implementations. Therefore, when the proposed architectures are used for GF(2$\^$m/) applications, one can achieve maximum throughput performance with least hardware requirements.

Optimized Hardware Design of Deblocking Filter for H.264/AVC (H.264/AVC를 위한 디블록킹 필터의 최적화된 하드웨어 설계)

  • Jung, Youn-Jin;Ryoo, Kwang-Ki
    • Journal of the Institute of Electronics Engineers of Korea SD
    • /
    • v.47 no.1
    • /
    • pp.20-27
    • /
    • 2010
  • This paper describes a design of 5-stage pipelined de-blocking filter with power reduction scheme and proposes a efficient memory architecture and filter order for high performance H.264/AVC Decoder. Generally the de-blocking filter removes block boundary artifacts and enhances image quality. Nevertheless filter has a few disadvantage that it requires a number of memory access and iterated operations because of filter operation for 4 time to one edge. So this paper proposes a optimized filter ordering and efficient hardware architecture for the reduction of memory access and total filter cycles. In proposed filter parallel processing is available because of structured 5-stage pipeline consisted of memory read, threshold decider, pre-calculation, filter operation and write back. Also it can reduce power consumption because it uses a clock gating scheme which disable unnecessary clock switching. Besides total number of filtering cycle is decreased by new filter order. The proposed filter is designed with Verilog-HDL and functionally verified with the whole H.264/AVC decoder using the Modelsim 6.2g simulator. Input vectors are QCIF images generated by JM9.4 standard encoder software. As a result of experiment, it shows that the filter can make about 20% total filter cycles reduction and it requires small transposition buffer size.

A Novel Globally Asynchronous, Locally Dynamic System Bus Architecture Based on Multitasking Bus (다중처리가 가능한 새로운 Globally Asynchronous, Locally Dynamic System 버스 구조)

  • Choi, Chang-Won;Shin, Hyeon-Chul;Wee, Jae-Kyung
    • Journal of the Institute of Electronics Engineers of Korea SD
    • /
    • v.45 no.5
    • /
    • pp.71-81
    • /
    • 2008
  • In this paper, we propose a novel Globally Asynchronous, Locally Dynamic System(GALDS) bus and demonstrate its performance. The proposed GALDS bus is the bidirectional multitasking bus with the segmented bus architecture supporting the concurrent operation of multi-masters and multi-slaves. By analyzing system tasks, the bus architecture chooses the optimal frequency for each If among multiples of bus frequency and thus we can reduce the overall power consumption. For efficient data communications between IPs operating in different frequencies, we designed an asynchronous and bidirectional FIFO based on an asynchronous wrapper with hand-shaking interface. In addition, since systems can be easily expandable by inserting bus segments, the proposed architecture has advantages in IP reusability and structural flexibility As a test example, a four-segment bus haying four masters and four slaves were designed by using Verilog HDL. We demonstrate multitasking operations with read/write data transfers by simulation when the ratios of operation frequency are 1:1, 1:2, 1:4 and 1:8. The data transfer mode is a 16 burst increment mode compatible with Advanced Microcontroller Bus Architecture(AMBA). The maximum operation latency of the proposed GALDS bus is 22 clock cycles for the bus write operation, and 44 clock cycles for read.

A Small-Area Hardware Implementation of Hash Algorithm Standard HAS-160 (해쉬 알고리듬 표준 HAS-l60의 저면적 하드웨어 구현)

  • Kim, Hae-Ju;Jeon, Heung-Woo;Shin, Kyung-Wook
    • Journal of the Korea Institute of Information and Communication Engineering
    • /
    • v.14 no.3
    • /
    • pp.715-722
    • /
    • 2010
  • This paper describes a hardware design of hash function processor which implements Korean Hash Algorithm Standard HAS-160. The HAS-160 processor compresses a message with arbitrary lengths into a hash code with a fixed length of 160-bit. To achieve high-speed operation with small-area, arithmetic operation for step-operation is implemented by using a hybrid structure of 5:3 and 3:2 carry-save adders and carry-select adder. It computes a 160-bit hash code from a message block of 512 bits in 82 clock cycles, and has 312 Mbps throughput at 50 MHz@3.3-V clock frequency. The designed HAS-160 processor is verified by FPGA implementation, and it has 17,600 gates on a layout area of about $1\;mm^2$ using a 0.35-${\mu}m$ CMOS cell library.

A VLSI Design and Implementation of a Single-Chip Encoder/Decoder with Dictionary Search Processor(DISP) using LZSS Algorithm and Entropy Coding (LZSS 알고리즘과 엔트로피 부호를 이용한 사전탐색처리장치를 갖는 부호기/복호기 단일-칩의 VLSI 설계 및 구현)

  • Kim, Jong-Seop;Jo, Sang-Bok
    • Journal of the Institute of Electronics Engineers of Korea SD
    • /
    • v.38 no.2
    • /
    • pp.103-113
    • /
    • 2001
  • This paper described a design and implementation of a single-chip encoder/decoder using the LZSS algorithm and entropy coding in 0.6${\mu}{\textrm}{m}$ CMOS technology. Dictionary storage for the dictionary search processor(DISP) used a 2K$\times$8bit on-chip memory with 50MHz clock speed. It performs compression on byte-oriented input data at a data rate of one byte per clock cycle except when one out of every 33 cycles is used to update the string window of dictionary. In result, the average compression ratio is 46% by applied entropy coding of the LZSS codeword output. This is to improved on the compression performance of 7% much more then LZSS.

  • PDF

Design of a Dispatch Unit & Operand Selection Unit for Improving the SIMT Based GP-GPU Instruction Performance (SIMT구조 GP-GPU의 명령어 처리 성능 향상을 위한 Dispatch Unit과 Operand Selection Unit설계)

  • Kwak, Jae Chang
    • Journal of IKEEE
    • /
    • v.19 no.3
    • /
    • pp.455-459
    • /
    • 2015
  • This paper proposes a dispatch unit of GP-GPU with SIMT architecture to support the acceleration of general-purpose operation as well as graphics processing. If all the information of an operand used instructions issued from the warp scheduler is decoded, an unnecessary operand load occurs, resulting in register loads. To resolve this problem, this paper proposes a method that can reduce the operand load and the load on the resister by decoding only the information of the operand using a pre-decoding method. The operand information from the dispatch unit is passed to the operand selection unit with preventing register bank collisions. Thus the overall performance are improved. In the simulation test, the total clock cycles required by processing 10,000 arbitrary instructions issued from the wrap scheduler using ModelSim SE 10.0b are measured. It shows that the application of the dispatch unit equipped with the pre-decoding function proposed in this paper can make an improvement of about 12% in processing performance compared to the conventional method.

An Efficient Hardware Implementation of Lightweight Block Cipher LEA-128/192/256 for IoT Security Applications (IoT 보안 응용을 위한 경량 블록암호 LEA-128/192/256의 효율적인 하드웨어 구현)

  • Sung, Mi-Ji;Shin, Kyung-Wook
    • Journal of the Korea Institute of Information and Communication Engineering
    • /
    • v.19 no.7
    • /
    • pp.1608-1616
    • /
    • 2015
  • This paper describes an efficient hardware implementation of lightweight encryption algorithm LEA-128/192/256 which supports for three master key lengths of 128/192/256-bit. To achieve area-efficient and low-power implementation of LEA crypto- processor, the key scheduler block is optimized to share hardware resources for encryption/decryption key scheduling of three master key lengths. In addition, a parallel register structure and novel operating scheme for key scheduler is devised to reduce clock cycles required for key scheduling, which results in an increase of encryption/decryption speed by 20~30%. The designed LEA crypto-processor has been verified by FPGA implementation. The estimated performances according to master key lengths of 128/192/256-bit are 181/162/109 Mbps, respectively, at 113 MHz clock frequency.

A Lightweight Hardware Implementation of ECC Processor Supporting NIST Elliptic Curves over GF(2m) (GF(2m) 상의 NIST 타원곡선을 지원하는 ECC 프로세서의 경량 하드웨어 구현)

  • Lee, Sang-Hyun;Shin, Kyung-Wook
    • Journal of IKEEE
    • /
    • v.23 no.1
    • /
    • pp.58-67
    • /
    • 2019
  • A design of an elliptic curve cryptography (ECC) processor that supports both pseudo-random curves and Koblitz curves over $GF(2^m)$ defined by the NIST standard is described in this paper. A finite field arithmetic circuit based on a word-based Montgomery multiplier was designed to support five key lengths using a datapath of fixed size, as well as to achieve a lightweight hardware implementation. In addition, Lopez-Dahab's coordinate system was adopted to remove the finite field division operation. The ECC processor was implemented in the FPGA verification platform and the hardware operation was verified by Elliptic Curve Diffie-Hellman (ECDH) key exchange protocol operation. The ECC processor that was synthesized with a 180-nm CMOS cell library occupied 10,674 gate equivalents (GEs) and a dual-port RAM of 9 kbits, and the maximum clock frequency was estimated at 154 MHz. The scalar multiplication operation over the 223-bit pseudo-random elliptic curve takes 1,112,221 clock cycles and has a throughput of 32.3 kbps.

A Public-Key Crypto-Core supporting Edwards Curves of Edwards25519 and Edwards448 (에드워즈 곡선 Edwards25519와 Edwards448을 지원하는 공개키 암호 코어)

  • Yang, Hyeon-Jun;Shin, Kyung-Wook
    • Journal of IKEEE
    • /
    • v.25 no.1
    • /
    • pp.174-179
    • /
    • 2021
  • An Edwards curve cryptography (EdCC) core supporting point scalar multiplication (PSM) on Edwards curves of Edwards25519 and Edwards448 was designed. For area-efficient implementation, finite field multiplier based on word-based Montgomery multiplication algorithm was designed, and the extended twisted Edwards coordinates system was adopted to implement point operations without division operation. As a result of synthesizing the EdCC core with 100 MHz clock, it was implemented with 24,073 equivalent gates and 11 kbits RAM, and the maximum operating frequency was estimated to be 285 MHz. The evaluation results show that the EdCC core can compute 299 and 66 PSMs per second on Edwards25519 and Edwards448 curves, respectively. Compared to the ECC core with similar structure, the number of clock cycles required for 256-bit PSM was reduced by about 60%, resulting in 7.3 times improvement in computational performance.

Montgomery Multiplier Supporting Dual-Field Modular Multiplication (듀얼 필드 모듈러 곱셈을 지원하는 몽고메리 곱셈기)

  • Kim, Dong-Seong;Shin, Kyung-Wook
    • Journal of the Korea Institute of Information and Communication Engineering
    • /
    • v.24 no.6
    • /
    • pp.736-743
    • /
    • 2020
  • Modular multiplication is one of the most important arithmetic operations in public-key cryptography such as elliptic curve cryptography (ECC) and RSA, and the performance of modular multiplier is a key factor influencing the performance of public-key cryptographic hardware. An efficient hardware implementation of word-based Montgomery modular multiplication algorithm is described in this paper. Our modular multiplier was designed to support eleven field sizes for prime field GF(p) and binary field GF(2k) as defined by SEC2 standard for ECC, making it suitable for lightweight hardware implementations of ECC processors. The proposed architecture employs pipeline scheme between the partial product generation and addition operation and the modular reduction operation to reduce the clock cycles required to compute modular multiplication by 50%. The hardware operation of our modular multiplier was demonstrated by FPGA verification. When synthesized with a 65-nm CMOS cell library, it was realized with 33,635 gate equivalents, and the maximum operating clock frequency was estimated at 147 MHz.