• Title/Summary/Keyword: systolic array architecture

Search Result 62, Processing Time 0.036 seconds

Design of LSB Multiplier using Cellular Automata (셀룰러 오토마타를 이용한 LSB 곱셈기 설계)

  • 하경주;구교민
    • Journal of Korea Society of Industrial Information Systems
    • /
    • v.7 no.3
    • /
    • pp.1-8
    • /
    • 2002
  • Modular Multiplication in Galois Field GF(2/sup m/) is a basic operation for many applications, particularly for public key cryptography. This paper presents a new architecture that can process modular multiplication on GF(2/sup m/) per m clock cycles using a cellular automata. Proposed architecture is more efficient in terms of the space and time than that of systolic array. Furthermore it can be efficiently used for the hardware design for exponentiation computation.

  • PDF

A Design of Reconfigurable Neural Network Processor (재구성 가능한 신경망 프로세서의 설계)

  • 장영진;이현수
    • Proceedings of the IEEK Conference
    • /
    • 1999.11a
    • /
    • pp.368-371
    • /
    • 1999
  • In this paper, we propose a neural network processor architecture with on-chip learning and with reconfigurability according to the data dependencies of the algorithm applied. For the neural network model applied, the proposed architecture can be configured into either SIMD or SRA(Systolic Ring Array) without my changing of on-chip configuration so as to obtain a high throughput. However, changing of system configuration can be controlled by user program. To process activation function, which needs amount of cycles to get its value, we design it by using PWL(Piece-Wise Linear) function approximation method. This unit has only single latency and the processing ability of non-linear function such as sigmoid gaussian function etc. And we verified the processing mechanism with EBP(Error Back-Propagation) model.

  • PDF

FPGA-Based Acceleration of Range Doppler Algorithm for Real-Time Synthetic Aperture Radar Imaging (실시간 SAR 영상 생성을 위한 Range Doppler 알고리즘의 FPGA 기반 가속화)

  • Jeong, Dongmin;Lee, Wookyung;Jung, Yunho
    • Journal of IKEEE
    • /
    • v.25 no.4
    • /
    • pp.634-643
    • /
    • 2021
  • In this paper, an FPGA-based acceleration scheme of range Doppler algorithm (RDA) is proposed for the real time synthetic aperture radar (SAR) imaging. Hardware architectures of matched filter based on systolic array architecture and a high speed sinc interpolator to compensate range cell migration (RCM) are presented. In addition, the proposed hardware was implemented and accelerated on Xilinx Alveo FPGA. Experimental results for 4096×4096-size SAR imaging showed that FPGA-based implementation achieves 2 times acceleration compared to GPU-based design. It was also confirmed the proposed design can be implemented with 60,247 CLB LUTs, 103,728 CLB registers, 20 block RAM tiles and 592 DPSs at the operating frequency of 312 MHz.

VLSI Implementation of Low-Power Motion Estimation Using Reduced Memory Accesses and Computations (메모리 호출과 연산횟수 감소기법을 이용한 저전력 움직임추정 VLSI 구현)

  • Moon, Ji-Kyung;Kim, Nam-Sub;Kim, Jin-Sang;Cho, Won-Kyung
    • The Journal of Korean Institute of Communications and Information Sciences
    • /
    • v.32 no.5A
    • /
    • pp.503-509
    • /
    • 2007
  • Low-power motion estimation is required for video coding in portable information devices. In this paper, we propose a low-power motion estimation algorithm and 1-D systolic may VLSI architecture using full search block matching algorithm (FSBMA). Main power dissipation sources of FSBMA are complex computations and frequent memory accesses for data in the search area. In the proposed algorithm, memory accesses and computations are reduced by using 1D PE (processing array) array architecture performing motion estimation of two neighboring blocks in parallel and by skipping unnecessary computations during motion estimation. The VLSI implementation results of the algorithm show that the proposed VLSI architecture can save 9.3% power dissipation and can operate two times faster than an existing low-power motion estimator.

Design of an efficient VLSI architecture of SADCT based on systolic array (시스톨릭 어레이에 기반한 SADCT의 효율적 VLSI 구조 설계)

  • Gang, Tae Jun;Jeong, Ui Yun;Ha, Yeong Ho
    • Journal of the Institute of Electronics Engineers of Korea SP
    • /
    • v.38 no.3
    • /
    • pp.46-46
    • /
    • 2001
  • 본 논문에서는 시스톨릭 어레이에 기반한 모양 적응적 이산 여현 변환(SADCT)의 효율적 VLSI 구조를 제안한다. 모양 적응적 이산 여현 변환은 이산 여현 변환과 달리 변환 크기가 각 블록에서의 객체의 모양에 따라 가변적이므로 기존의 시간 순환구조에서는 각 처리소자의 이용도와 처리속도가 모두 저하된다. 본 논문에서는 이러한 단점을 극복하기 위해 메모리를 필요로 하지 않는 시스톨릭 어레이에 기반한 구조를 제안한다. 제안된 구조에서는 1차원 SADCT를 연속적으로 수행함으로 처리속도를 향상시키고 첫 번째 열의 처리소자들을 마지막 열의 처리소자들과 연결하고, 입력 데이터는 각각의 재배열된 블록에서의 최대 데이터 크기에 따라 각 열에 병렬로 입력하여 처리소자의 이용도를 향상시켰다. 제안된 구조는 VHDL로 기술하고 MentorTM를 이용하여 기능검증을 수행하였다. 검증결과, 하드웨어 복잡도가 다소 증가하나, 처리속도는 기존의 방법에 비해 두 배정도 향상되었다.

Design of the Adaptive Systolic Array Architecture for Efficient Sparse Matrix Multiplication (희소 행렬 곱셈을 효율적으로 수행하기 위한 유동적 시스톨릭 어레이 구조 설계)

  • Seo, Juwon;Kong, Joonho
    • Proceedings of the Korea Information Processing Society Conference
    • /
    • 2022.11a
    • /
    • pp.24-26
    • /
    • 2022
  • 시스톨릭 어레이는 DNN training 등 인공지능 연산의 대부분을 차지하는 행렬 곱셈을 수행하기 위한 하드웨어 구조로 많이 사용되지만, sparsity 가 높은 행렬을 연산할 때 불필요한 동작으로 인해 효율성이 크게 떨어진다. 본 논문에서 제안된 유동적 시스톨릭 어레이는 matrix condensing, weight switching, 그리고 direct output path 의 방법과 구조를 통해 sparsity 가 높은 행렬 곱셈의 수행 사이클을 줄일 수 있다. 시뮬레이션을 통해 기존 시스톨릭 어레이와 유동적 시스톨릭 어레이의 성능을 비교하였으며 8×8, 16×16, 32×32 의 크기를 가진 행렬을 동일 크기의 시스톨릭 어레이로 연산하였을 때 필요 사이클 수를 최대 12 사이클 절감할 수 있는 것을 확인하였다.

Low-latency Montgomery AB2 Multiplier Using Redundant Representation Over GF(2m)) (GF(2m) 상의 여분 표현을 이용한 낮은 지연시간의 몽고메리 AB2 곱셈기)

  • Kim, Tai Wan;Kim, Kee-Won
    • IEMEK Journal of Embedded Systems and Applications
    • /
    • v.12 no.1
    • /
    • pp.11-18
    • /
    • 2017
  • Finite field arithmetic has been extensively used in error correcting codes and cryptography. Low-complexity and high-speed designs for finite field arithmetic are needed to meet the demands of wider bandwidth, better security and higher portability for personal communication device. In particular, cryptosystems in GF($2^m$) usually require computing exponentiation, division, and multiplicative inverse, which are very costly operations. These operations can be performed by computing modular AB multiplications or modular $AB^2$ multiplications. To compute these time-consuming operations, using $AB^2$ multiplications is more efficient than AB multiplications. Thus, there are needs for an efficient $AB^2$ multiplier architecture. In this paper, we propose a low latency Montgomery $AB^2$ multiplier using redundant representation over GF($2^m$). The proposed $AB^2$ multiplier has less space and time complexities compared to related multipliers. As compared to the corresponding existing structures, the proposed $AB^2$ multiplier saves at least 18% area, 50% time, and 59% area-time (AT) complexity. Accordingly, it is well suited for VLSI implementation and can be easily applied as a basic component for computing complex operations over finite field, such as exponentiation, division, and multiplicative inverse.

A Fast Inversion for Low-Complexity System over GF(2 $^{m}$) (경량화 시스템에 적합한 유한체 $GF(2^m)$에서의 고속 역원기)

  • Kim, So-Sun;Chang, Nam-Su;Kim, Chang-Han
    • Journal of the Institute of Electronics Engineers of Korea SD
    • /
    • v.42 no.9 s.339
    • /
    • pp.51-60
    • /
    • 2005
  • The design of efficient cryptosystems is mainly appointed by the efficiency of the underlying finite field arithmetic. Especially, among the basic arithmetic over finite field, the rnultiplicative inversion is the most time consuming operation. In this paper, a fast inversion algerian in finite field $GF(2^m)$ with the standard basis representation is proposed. It is based on the Extended binary gcd algorithm (EBGA). The proposed algorithm executes about $18.8\%\;or\;45.9\%$ less iterations than EBGA or Montgomery inverse algorithm (MIA), respectively. In practical applications where the dimension of the field is large or may vary, systolic array sDucture becomes area-complexity and time-complexity costly or even impractical in previous algorithms. It is not suitable for low-weight and low-power systems, i.e., smartcard, the mobile phone. In this paper, we propose a new hardware architecture to apply an area-efficient and a synchronized inverter on low-complexity systems. It requires the number of addition and reduction operation less than previous architectures for computing the inverses in $GF(2^m)$ furthermore, the proposed inversion is applied over either prime or binary extension fields, more specially $GF(2^m)$ and GF(P) .

Design of a Bit-Serial Divider in GF(2$^{m}$ ) for Elliptic Curve Cryptosystem (타원곡선 암호시스템을 위한 GF(2$^{m}$ )상의 비트-시리얼 나눗셈기 설계)

  • 김창훈;홍춘표;김남식;권순학
    • The Journal of Korean Institute of Communications and Information Sciences
    • /
    • v.27 no.12C
    • /
    • pp.1288-1298
    • /
    • 2002
  • To implement elliptic curve cryptosystem in GF(2$\^$m/) at high speed, a fast divider is required. Although bit-parallel architecture is well suited for high speed division operations, elliptic curve cryptosystem requires large m(at least 163) to support a sufficient security. In other words, since the bit-parallel architecture has an area complexity of 0(m$\^$m/), it is not suited for this application. In this paper, we propose a new serial-in serial-out systolic array for computing division operations in GF(2$\^$m/) using the standard basis representation. Based on a modified version of tile binary extended greatest common divisor algorithm, we obtain a new data dependence graph and design an efficient bit-serial systolic divider. The proposed divider has 0(m) time complexity and 0(m) area complexity. If input data come in continuously, the proposed divider can produce division results at a rate of one per m clock cycles, after an initial delay of 5m-2 cycles. Analysis shows that the proposed divider provides a significant reduction in both chip area and computational delay time compared to previously proposed systolic dividers with the same I/O format. Since the proposed divider can perform division operations at high speed with the reduced chip area, it is well suited for division circuit of elliptic curve cryptosystem. Furthermore, since the proposed architecture does not restrict the choice of irreducible polynomial, and has a unidirectional data flow and regularity, it provides a high flexibility and scalability with respect to the field size m.

Trace-Back Viterbi Decoder with Sequential State Transition Control (순서적 역방향 상태천이 제어에 의한 역추적 비터비 디코더)

  • 정차근
    • Journal of the Institute of Electronics Engineers of Korea TC
    • /
    • v.40 no.11
    • /
    • pp.51-62
    • /
    • 2003
  • This paper presents a novel survivor memeory management and decoding techniques with sequential backward state transition control in the trace back Viterbi decoder. The Viterbi algorithm is an maximum likelihood decoding scheme to estimate the likelihood of encoder state for channel error detection and correction. This scheme is applied to a broad range of digital communication such as intersymbol interference removing and channel equalization. In order to achieve the area-efficiency VLSI chip design with high throughput in the Viterbi decoder in which recursive operation is implied, more research is required to obtain a simple systematic parallel ACS architecture and surviver memory management. As a method of solution to the problem, this paper addresses a progressive decoding algorithm with sequential backward state transition control in the trace back Viterbi decoder. Compared to the conventional trace back decoding techniques, the required total memory can be greatly reduced in the proposed method. Furthermore, the proposed method can be implemented with a simple pipelined structure with systolic array type architecture. The implementation of the peripheral logic circuit for the control of memory access is not required, and memory access bandwidth can be reduced Therefore, the proposed method has characteristics of high area-efficiency and low power consumption with high throughput. Finally, the examples of decoding results for the received data with channel noise and application result are provided to evaluate the efficiency of the proposed method.