• Title/Summary/Keyword: clock cycles

Search Result 148, Processing Time 0.025 seconds

Bit-Parallel Systolic Divider in Finite Field GF(2m) (유한 필드 GF(2m)상의 비트-패러럴 시스톨릭 나눗셈기)

  • 김창훈;김종진;안병규;홍춘표
    • The KIPS Transactions:PartA
    • /
    • v.11A no.2
    • /
    • pp.109-114
    • /
    • 2004
  • This paper presents a high-speed bit-parallel systolic divider for computing modular division A($\chi$)/B($\chi$) mod G($\chi$) in finite fields GF$(2^m)$. The presented divider is based on the binary GCD algorithm and verified through FPGA implementation. The proposed architecture produces division results at a rate of one every 1 clock cycles after an initial delay of 5m-2. Analysis shows that the proposed divider provides a significant reduction in both chip area and computational delay time compared to previously proposed systolic dividers with the same I/O format. In addition, since the proposed architecture does not restrict the choice of irreducible polynomials and has regularity and modularity, it provides a high flexibility and Scalability with respect to the field size m. Therefore, the proposed divider is well suited to VLSI implementation.

An Efficient Architecture of Transform & Quantization Module in MPEG-4 Video Code (MPEG-4 영상코덱에서 DCTQ module의 효율적인 구조)

  • 서기범;윤동원
    • Journal of the Institute of Electronics Engineers of Korea SD
    • /
    • v.40 no.11
    • /
    • pp.29-36
    • /
    • 2003
  • In this paper, an efficient VLSI architecture for DCTQ module, which consists of 2D-DCT, quantization, AC/DC prediction block, scan conversion, inverse quantization and 2D-IDCT, is presented. The architecture of the module is designed to handle a macroblock data within 1064 cycles and suitable for MPEG-4 video codec handling 30 frame CIF image for both encoder and decoder simultaneously. Only single 1-D DCT/IDCT cores are used for the design instead of 2-D DCT/IDCT, respectively. 1-bit serial distributed arithmetic architecture is adopted for 1-D DCT/IDCT to reduce the hardware area in this architecture. To reduce the power consumption of DCTQ modu1e, we propose the method not to operate the DCTQ modu1e exploiting the SAE(sum of absolute error) value from motion estimation and cbp(coded block pattern). To reduce the AC/DC prediction memory size, the memory architecture and memory access method for AC/DC prediction block is proposed. As the result, the maximum utilization of hardware can be achieved, and power consumption can be minimized. The proposed design is operated on 27MHz clock. The experimental results show that the accuracy of DCT and IDCT meet the IEEE specification.

Design of MSB-First Digit-Serial Multiplier for Finite Fields GF(2″) (유한 필드 $GF(2^m)$상에서의 MSB 우선 디지트 시리얼 곱셈기 설계)

  • 김창훈;한상덕;홍춘표
    • The Journal of Korean Institute of Communications and Information Sciences
    • /
    • v.27 no.6C
    • /
    • pp.625-631
    • /
    • 2002
  • This paper presents a MSB-first digit-serial systolic array for computing modular multiplication of A(x)B(x) mod G(x) in finite fields $GF(2^m)$. From the MSB-first multiplication algorithm in $GF(2^m)$, we obtain a new data dependence graph and design an efficient digit-serial systolic multiplier. For circuit synthesis, we obtain VHDL code for multiplier, If input data come in continuously, the implemented multiplier can produce multiplication results at a rate of one every [m/L] clock cycles, where L is the selected digit size. The analysis results show that the proposed architecture leads to a reduction of computational delay time and it has much more simple structure than existing digit-serial systolic multiplier. Furthermore, since the propose architecture has the features of unidirectional data flow and regularity, it shows good extension characteristics with respect to m and L.

VLIS Design of OCB-AES Cryptographic Processor (OCB-AES 암호 프로세서의 VLSI 설계)

  • Choi Byeong-Yoon;Lee Jong-Hyoung
    • Journal of the Korea Institute of Information and Communication Engineering
    • /
    • v.9 no.8
    • /
    • pp.1741-1748
    • /
    • 2005
  • In this paper, we describe VLSI design and performance evaluation of OCB-AES crytographic algorithm that simulataneously provides privacy and authenticity. The OCB-AES crytographic algorithm sovles the problems such as long operation time and large hardware of conventional crytographic system, because the conventional system must implement the privancy and authenticity sequentially with seqarated algorithms and hardware. The OCB-AES processor with area-efficient modular offset generator and tag generator is designed using IDEC Samsung 0.35um standard cell library and consists of about 55,700 gates. Its cipher rate is about 930Mbps and the number of clock cycles needed to generate the 128-bit tags for authenticity and integrity is (m+2)${\times}$(Nr+1), where m and Nr represent the number of block for message and number of rounds for AES encryption, respectively. The OCB-AES processor can be applicable to soft cryptographic IP of IEEE 802.11i wireless LAN and Mobile SoC.

Implementation of MPEG-4 BSAC Audio Decoder using ARM926EJ-S Processors (ARM926EJ-S 프로세서를 이용한 MPEG-4 BSAC 오디오 복호화기의 구현)

  • Jeon, Young-Taek;Park, Young-Cheol
    • The Journal of Korea Institute of Information, Electronics, and Communication Technology
    • /
    • v.1 no.2
    • /
    • pp.91-98
    • /
    • 2008
  • Domestic standard for Korean T-DMB includes MPEG-4 BSAC (Bit Sliced Arithmetic Coding) audio coding that has been established in 2003. This paper presents an implementation and optimization of MPEG-4 BSAC Audio Decoder on ARM926EJ-S processor. Tools and modules of the BSAC audio decoder were implemented with 32-bit fixed point operations. Further optimization was accomplished using ARM926EJ-S Inline Assembly. The optimization was based on the total number of multiplications and MAC (Multiply and Accumulation) operations causing most of core cycles of ARM926EJ-S, and also based on analysis of ARMv5 instructions. The result of optimization was evaluated on the basis of MIPS (Million Instruction per second). Implementation results show that BSAC bitstream at 96kbps can be decoded in real-time at 65MHz CPU clocks.

  • PDF

Efficient Radix-4 Systolic VLSI Architecture for RSA Public-key Cryptosystem (RSA 공개키 암호화시스템의 효율적인 Radix-4 시스톨릭 VLSI 구조)

  • Park Tae geun
    • The Journal of Korean Institute of Communications and Information Sciences
    • /
    • v.29 no.12C
    • /
    • pp.1739-1747
    • /
    • 2004
  • In this paper, an efficient radix-4 systolic VLSI architecture for RSA public-key cryptosystem is proposed. Due to the simple operation of iterations and the efficient systolic mapping, the proposed architecture computes an n-bit modular exponentiation in n$^{2}$ clock cycles since two modular multiplications for M$_{i}$ and P$_{i}$ in each exponentiation process are interleaved, so that the hardware is fully utilized. We encode the exponent using Radix-4. SD (Signed Digit) number system to reduce the number of modular multiplications for RSA cryptography. Therefore about 20% of NZ (non-zero) digits in the exponent are reduced. Compared to conventional approaches, the proposed architecture shows shorter period to complete the RSA while requiring relatively less hardware resources. The proposed RSA architecture based on the modified Montgomery algorithm has locality, regularity, and scalability suitable for VLSI implementation.

Implementation of RSA Exponentiator Based on Radix-$2^k$ Modular Multiplication Algorithm (Radix-$2^k$ 모듈라 곱셈 알고리즘 기반의 RSA 지수승 연산기 설계)

  • 권택원;최준림
    • Journal of the Korea Institute of Information Security & Cryptology
    • /
    • v.12 no.2
    • /
    • pp.35-44
    • /
    • 2002
  • In this paper, an implementation method of RSA exponentiator based on Radix-$2^k$ modular multiplication algorithm is presented and verified. We use Booth receding algorithm to implement Radix-$2^k$ modular multiplication and implement radix-16 modular multiplier using 2K-byte memory and CSA(carry-save adder) array - with two full adder and three half adder delays. For high speed final addition we use a reduced carry generation and propagation scheme called pseudo carry look-ahead adder. Furthermore, the optimum value of the radix is presented through the trade-off between the operating frequency and the throughput for given Silicon technology. We have verified 1,024-bit RSA processor using Altera FPGA EP2K1500E device and Samsung 0.3$\mu\textrm{m}$ technology. In case of the radix-16 modular multiplication algorithm, (n+4+1)/4 clock cycles are needed and the 1,024-bit modular exponentiation is performed in 5.38ms at 50MHz.

Design of Serial-Parallel Multiplier for GF($2^n$) (GF($2^n$)에서의 직렬-병렬 곱셈기 구조)

  • 정석원;윤중철;이선옥
    • Journal of the Korea Institute of Information Security & Cryptology
    • /
    • v.13 no.3
    • /
    • pp.27-34
    • /
    • 2003
  • Recently, an efficient hardware development for a cryptosystem is concerned. The efficiency of a multiplier for GF($2^n$)is directly related to the efficiency of some cryptosystem. This paper, considering the trade-off between time complexity andsize complexity, proposes a new multiplier architecture having n[n/2] AND gates and n([n/2]+1)- $$\Delta$_n$ = XOR gates, where $$\Delta$_n$=1 if n is even, $$\Delta$_n$=0 otherwise. This size complexity is less than that of existing ${multipliers}^{[5][12]}$which are $n^2$ AND gates and $n^2$-1 XOR gates. While a new multiplier is a serial-parallel multiplier to output a result of multiplication of two elements of GF($2^n$) after 2 clock cycles, the suggested multiplier is more suitable for some cryptographic device having space limitations.

An Efficient Matrix Multiplier Available in Multi-Head Attention and Feed-Forward Network of Transformer Algorithms (트랜스포머 알고리즘의 멀티 헤드 어텐션과 피드포워드 네트워크에서 활용 가능한 효율적인 행렬 곱셈기)

  • Seok-Woo Chang;Dong-Sun Kim
    • Journal of IKEEE
    • /
    • v.28 no.1
    • /
    • pp.53-64
    • /
    • 2024
  • With the advancement of NLP(Natural Language Processing) models, conversational AI such as ChatGPT is becoming increasingly popular. To enhance processing speed and reduce power consumption, it is important to implement the Transformer algorithm, which forms the basis of the latest natural language processing models, in hardware. In particular, the multi-head attention and feed-forward network, which analyze the relationships between different words in a sentence through matrix multiplication, are the most computationally intensive core algorithms in the Transformer. In this paper, we propose a new variable systolic array based on the number of input words to enhance matrix multiplication speed. Quantization maintains Transformer accuracy, boosting memory efficiency and speed. For evaluation purposes, this paper verifies the clock cycles required in multi-head attention and feed-forward network and compares the performance with other multipliers.

Hardware-Based High Performance XML Parsing Technique Using an FPGA (FPGA를 이용한 하드웨어 기반 고성능 XML 파싱 기법)

  • Lee, Kyu-hee;Seo, Byeong-seok
    • The Journal of Korean Institute of Communications and Information Sciences
    • /
    • v.40 no.12
    • /
    • pp.2469-2475
    • /
    • 2015
  • A structured XML has been widely used to present services on various Web-services. The XML is also used for digital documents and digital signatures and for the representation of multimedia files in email systems. The XML document should be firstly parsed to access elements in the XML. The parsing is the most compute-instensive task in the use of XML documents. Most of the previous work has focused on hardware based XML parsers in order to improve parsing performance, while a little work has studied parsing techniques. We present the high performance parsing technique which can be used all of XML parsers and design hardware based XML parser using an FPGA. The proposed parsing technique uses element analyzers instead of the state machine and performs multibyte-based element matching. As a result, our parsing technique can reduce the number of clock cycles per byte(CPB) and does not need to require any preprocessing, such as loading XML data into memory. Compared to other parsers, our parser acheives 1.33~1.82 times improvement in the system performance. Therefore, the proposed parsing technique can process XML documents in real time and is suitable for applying to all of XML parsers.