• Title/Summary/Keyword: 병렬곱셈 연산기

Search Result 70, Processing Time 0.033 seconds

Hardware Implementation of Elliptic Curve Scalar Multiplier over GF(2n) with Simple Power Analysis Countermeasure (SPA 대응 기법을 적용한 이진체 위의 타원곡선 스칼라곱셈기의 하드웨어 구현)

  • 김현익;정석원;윤중철
    • Journal of the Institute of Electronics Engineers of Korea SD
    • /
    • v.41 no.9
    • /
    • pp.73-84
    • /
    • 2004
  • This paper suggests a new scalar multiplication algerian to resist SPA which threatens the security of cryptographic primitive on the hardware recently, and discusses how to apply this algerian Our algorithm is better than other SPA countermeasure algorithms aspect to computational efficiency. Since known SPA countermeasure algorithms have dependency of computation. these are difficult to construct parallel architecture efficiently. To solve this problem our algorithm removes dependency and computes a multiplication and a squaring during inversion with parallel architecture in order to minimize loss of performance. We implement hardware logic with VHDL(VHSIC Hardware Description Language) to verify performance. Synthesis tool is Synplify Pro 7.0 and target chip is Xillinx VirtexE XCV2000EFGl156. Total equivalent gate is 60,508 and maximum frequency is 30Mhz. Our scalar multiplier can be applied to digital signature, encryption and decryption, key exchange, etc. It is applied to a embedded-micom it protects SPA and provides efficient computation.

Digit-Serial Finite Field Multipliers for GF($3^m$) (GF($3^m$)의 Digit-Serial 유한체 곱셈기)

  • Chang, Nam-Su;Kim, Tae-Hyun;Kim, Chang-Han;Han, Dong-Guk;Kim, Ho-Won
    • Journal of the Institute of Electronics Engineers of Korea SD
    • /
    • v.45 no.10
    • /
    • pp.23-30
    • /
    • 2008
  • Recently, a considerable number of studies have been conducted on pairing based cryptosystems. The efficiency of pairing based cryptosystems depends on finite fields, similar to existing public key cryptosystems. In general, pairing based ctyptosystems are defined over finite fields of chracteristic three, GF($3^m$), based on trinomials. A multiplication in GF($3^m$) is the most dominant operation. This paper proposes a new most significant digit(MSD)-first digit- serial multiplier. The proposed MSD-first digit-serial multiplier has the same area complexity compared to previous multipliers, since the modular reduction step is performed in parallel. And the critical path delay is reduced from 1MUL+(log ${\lceil}n{\rceil}$+1)ADD to 1MUL+(log ${\lceil}n+1{\rceil}$)ADD. Therefore, when the digit size is not $2^k$, the time delay is reduced by one addition.

$AB^2$ Semi-systolic Multiplier ($AB^2$ 세미시스톨릭 곱셈기)

  • 이형목;김현성;전준철;유기영
    • Proceedings of the Korean Information Science Society Conference
    • /
    • 2002.04a
    • /
    • pp.892-894
    • /
    • 2002
  • 본 논문은 유한 체 GF(/2 sup m/)상에서 A$B^2$연산을 위해 AOP(All One Polynomial)에 기반한 새로운 MSB(Most Significant bit) 유선 알고리즘을 제시하고, 제시한 알고리즘에 기반하여 병렬 입출력 세미시스톨릭 구조를 제안한다. 제안된 구조는 표준기저(standard basis)에 기반하고 모듈라(modoular) 연산을 위해 다항식의 계수가 모두 1인 m차의 기약다항식 AOP를 사용한다. 제안된 구조에서 AND와 XOR게이트의 딜레이(deray)를 각각 /D sub AND$_2$/와/D sub XOR$_2$/라 하면 각 셀 당 임계경로는 /D sub AND$_2$+D sub XOR/이고 지연시간은 m+1이다. 제안된 구조는 기존의 구조보다 임계경로와 지연시간 면에서 보다 효율적이다. 또한 구조 자체가 정규성, 모듈성, 병렬성을 가지기 때문에 VLSI 구현에 효율적이다. 더욱이 제안된 구조는 유한 체상에서 지수 연산을 필요로 하는 Diffie-Hellman 키 교환 방식, 디지털 서명 알고리즘 및 EIGamal 암호화 방식과 같은 알고리즘을 위한 기본 구조로 사용할 수 있다. 이러한 알고리즘을 응용해서 타원 곡선(elliptic curve)에 기초한 암호화 시스템(Cryptosystem)의 구현에 사용될 수 있다.

  • PDF

An Implementation of a Convolutional Accelerator based on a GPGPU for a Deep Learning (Deep Learning을 위한 GPGPU 기반 Convolution 가속기 구현)

  • Jeon, Hee-Kyeong;Lee, Kwang-yeob;Kim, Chi-yong
    • Journal of IKEEE
    • /
    • v.20 no.3
    • /
    • pp.303-306
    • /
    • 2016
  • In this paper, we propose a method to accelerate convolutional neural network by utilizing a GPGPU. Convolutional neural network is a sort of the neural network learning features of images. Convolutional neural network is suitable for the image processing required to learn a lot of data such as images. The convolutional layer of the conventional CNN required a large number of multiplications and it is difficult to operate in the real-time on the embedded environment. In this paper, we reduce the number of multiplications through Winograd convolution operation and perform parallel processing of the convolution by utilizing SIMT-based GPGPU. The experiment was conducted using ModelSim and TestDrive, and the experimental results showed that the processing time was improved by about 17%, compared to the conventional convolution.

Low Space Complexity Bit Parallel Multiplier For Irreducible Trinomial over GF($2^n$) (삼항 기약다항식을 이용한 GF($2^n$)의 효율적인 저면적 비트-병렬 곱셈기)

  • Cho, Young-In;Chang, Nam-Su;Kim, Chang-Han;Hong, Seok-Hie
    • Journal of the Institute of Electronics Engineers of Korea SD
    • /
    • v.45 no.12
    • /
    • pp.29-40
    • /
    • 2008
  • The efficient hardware design of finite field multiplication is an very important research topic for and efficient $f(x)=x^n+x^k+1$ implementation of cryptosystem based on arithmetic in finite field GF($2^n$). We used special generating trinomial to construct a bit-parallel multiplier over finite field with low space complexity. To reduce processing time, The hardware architecture of proposed multiplier is similar with existing Mastrovito multiplier. The complexity of proposed multiplier is depend on the degree of intermediate term $x^k$ and the space complexity of the new multiplier is $2k^2-2k+1$ lower than existing multiplier's. The time complexity of the proposed multiplier is equal to that of existing multiplier or increased to $1T_X(10%{\sim}12.5%$) but space complexity is reduced to maximum 25%.

Fault Detection Architecture of the Field Multiplication Using Gaussian Normal Bases in GF(2n (가우시안 정규기저를 갖는 GF(2n)의 곱셈에 대한 오류 탐지)

  • Kim, Chang Han;Chang, Nam Su;Park, Young Ho
    • Journal of the Korea Institute of Information Security & Cryptology
    • /
    • v.24 no.1
    • /
    • pp.41-50
    • /
    • 2014
  • In this paper, we proposed an error detection in Gaussian normal basis multiplier over $GF(2^n)$. It is shown that by using parity prediction, error detection can be very simply constructed in hardware. The hardware overheads are only one AND gate, n+1 XOR gates, and one 1-bit register in serial multipliers, and so n AND gates, 2n-1 XOR gates in parallel multipliers. This method are detect in odd number of bit fault in C = AB.

Efficient Bit-Parallel Multiplier for Binary Field Defind by Equally-Spaced Irreducible Polynomials (Equally Spaced 기약다항식 기반의 효율적인 이진체 비트-병렬 곱셈기)

  • Lee, Ok-Suk;Chang, Nam-Su;Kim, Chang-Han;Hong, Seok-Hie
    • Journal of the Korea Institute of Information Security & Cryptology
    • /
    • v.18 no.2
    • /
    • pp.3-10
    • /
    • 2008
  • The choice of basis for representation of element in $GF(2^m)$ affects the efficiency of a multiplier. Among them, a multiplier using redundant representation efficiently supports trade-off between the area complexity and the time complexity since it can quickly carry out modular reduction. So time of a previous multiplier using redundant representation is faster than time of multiplier using others basis. But, the weakness of one has a upper space complexity compared to multiplier using others basis. In this paper, we propose a new efficient multiplier with consideration that polynomial exponentiation operations are frequently used in cryptographic hardwares. The proposed multiplier is suitable fer left-to-right exponentiation environment and provides efficiency between time and area complexity. And so, it has both time delay of $T_A+({\lceil}{\log}_2m{\rceil})T_x$ and area complexity of (2m-1)(m+s). As a result, the proposed multiplier reduces $2(ms+s^2)$ compared to the previous multiplier using equally-spaced polynomials in area complexity. In addition, it reduces $T_A+({\lceil}{\log}_2m+s{\rceil})T_x$ to $T_A+({\lceil}{\log}_2m{\rceil})T_x$ in the time complexity.($T_A$:Time delay of one AND gate, $T_x$:Time delay of one XOR gate, m:Degree of equally spaced irreducible polynomial, s:spacing factor)

Implementation of High-radix Modular Exponentiator for RSA using CRT (CRT를 이용한 하이래딕스 RSA 모듈로 멱승 처리기의 구현)

  • 이석용;김성두;정용진
    • Journal of the Korea Institute of Information Security & Cryptology
    • /
    • v.10 no.4
    • /
    • pp.81-93
    • /
    • 2000
  • In a methodological approach to improve the processing performance of modulo exponentiation which is the primary arithmetic in RSA crypto algorithm, we present a new RSA hardware architecture based on high-radix modulo multiplication and CRT(Chinese Remainder Theorem). By implementing the modulo multiplier using radix-16 arithmetic, we reduced the number of PE(Processing Element)s by quarter comparing to the binary arithmetic scheme. This leads to having the number of clock cycles and the delay of pipelining flip-flops be reduced by quarter respectively. Because the receiver knows p and q, factors of N, it is possible to apply the CRT to the decryption process. To use CRT, we made two s/2-bit multipliers operating in parallel at decryption, which accomplished 4 times faster performance than when not using the CRT. In encryption phase, the two s/2-bit multipliers can be connected to make a s-bit linear multiplier for the s-bit arithmetic operation. We limited the encryption exponent size up to 17-bit to maintain high speed, We implemented a linear array modulo multiplier by projecting horizontally the DG of Montgomery algorithm. The H/W proposed here performs encryption with 15Mbps bit-rate and decryption with 1.22Mbps, when estimated with reference to Samsung 0.5um CMOS Standard Cell Library, which is the fastest among the publications at present.

A Design of Superscalar Digital Signal Processor (다중 명령어 처리 DSP 설계)

  • Park, Sung-Wook
    • Journal of the Korean Institute of Intelligent Systems
    • /
    • v.18 no.3
    • /
    • pp.323-328
    • /
    • 2008
  • This paper presents a Digital Signal Processor achieving high through-put for both decision intensive and computation intensive tasks. The proposed processor employees a multiplier, two ALU and load/store. Unit as operational units. Those four units are controlled and works parallel by superscalar control scheme, which is different from prior DSP architecture. The performance evaluation was done by implementing AC-3 decoding algorithm and 37.8% improvement was achieved. This study is valuable especially for the consumer electronics applications, which require very low cost.

An Optimized Hardware Design for High Performance Residual Data Decoder (고성능 잔여 데이터 복호기를 위한 최적화된 하드웨어 설계)

  • Jung, Hong-Kyun;Ryoo, Kwang-Ki
    • Journal of the Korea Academia-Industrial cooperation Society
    • /
    • v.13 no.11
    • /
    • pp.5389-5396
    • /
    • 2012
  • In this paper, an optimized residual data decoder architecture is proposed to improve the performance in H.264/AVC. The proposed architecture is an integrated architecture that combined parallel inverse transform architecture and parallel inverse quantization architecture with common operation units applied new inverse quantization equations. The equations without division operation can reduce execution time and quantity of operation for inverse quantization process. The common operation unit uses multiplier and left shifter for the equations. The inverse quantization architecture with four common operation units can reduce execution cycle of inverse quantization to one cycle. The inverse transform architecture consists of eight inverse transform operation units. Therefore, the architecture can reduce the execution cycle of inverse transform to one cycle. Because inverse quantization operation and inverse transform operation are concurrency, the execution cycle of inverse transform and inverse quantization operation for one $4{\times}4$ block is one cycle. The proposed architecture is synthesized using Magnachip 0.18um CMOS technology. The gate count and the critical path delay of the architecture are 21.9k and 5.5ns, respectively. The throughput of the architecture can achieve 2.89Gpixels/sec at the maximum clock frequency of 181MHz. As the result of measuring the performance of the proposed architecture using the extracted data from JM 9.4, the execution cycle of the proposed architecture is about 88.5% less than that of the existing designs.