• Title/Summary/Keyword: parallel multiplier

Search Result 158, Processing Time 0.025 seconds

An Optimized Hardware Design for High Performance Residual Data Decoder (고성능 잔여 데이터 복호기를 위한 최적화된 하드웨어 설계)

  • Jung, Hong-Kyun;Ryoo, Kwang-Ki
    • Journal of the Korea Academia-Industrial cooperation Society
    • /
    • v.13 no.11
    • /
    • pp.5389-5396
    • /
    • 2012
  • In this paper, an optimized residual data decoder architecture is proposed to improve the performance in H.264/AVC. The proposed architecture is an integrated architecture that combined parallel inverse transform architecture and parallel inverse quantization architecture with common operation units applied new inverse quantization equations. The equations without division operation can reduce execution time and quantity of operation for inverse quantization process. The common operation unit uses multiplier and left shifter for the equations. The inverse quantization architecture with four common operation units can reduce execution cycle of inverse quantization to one cycle. The inverse transform architecture consists of eight inverse transform operation units. Therefore, the architecture can reduce the execution cycle of inverse transform to one cycle. Because inverse quantization operation and inverse transform operation are concurrency, the execution cycle of inverse transform and inverse quantization operation for one $4{\times}4$ block is one cycle. The proposed architecture is synthesized using Magnachip 0.18um CMOS technology. The gate count and the critical path delay of the architecture are 21.9k and 5.5ns, respectively. The throughput of the architecture can achieve 2.89Gpixels/sec at the maximum clock frequency of 181MHz. As the result of measuring the performance of the proposed architecture using the extracted data from JM 9.4, the execution cycle of the proposed architecture is about 88.5% less than that of the existing designs.

Design of Partial Product Accumulator using Multi-Operand Decimal CSA and Improved Decimal CLA (다중 피연산자 십진 CSA와 개선된 십진 CLA를 이용한 부분곱 누산기 설계)

  • Lee, Yang;Park, TaeShin;Kim, Kanghee;Choi, SangBang
    • Journal of the Institute of Electronics and Information Engineers
    • /
    • v.53 no.11
    • /
    • pp.56-65
    • /
    • 2016
  • In this paper, in order to reduce the delay and area of the partial product accumulation (PPA) of the parallel decimal multiplier, a tree architecture that composed by multi-operand decimal CSAs and improved CLA is proposed. The proposed tree using multi-operand CSAs reduces the partial product quickly. Since the input range of the recoder of CSA is limited, CSA can get the simplest logic. In addition, using the multi-operand decimal CSAs to add decimal numbers that have limited range in specific locations of the specific architecture can reduce the partial products efficiently. Also, final BCD result can be received faster by improving the logic of the decimal CLA. In order to evaluate the performance of the proposed partial product accumulation, synthesis is implemented by using Design Complier with 180 nm COMS technology library. Synthesis results show the delay of the proposed partial product accumulation is reduced by 15.6% and area is reduced by 16.2% comparing with which uses general method. Also, the total delay and area are still reduced despite the delay and area of the CLA are increased.

Implementation of Hardware Data Prefetcher Adaptable for Various State-of-the-Art Workload (다양한 최신 워크로드에 적용 가능한 하드웨어 데이터 프리페처 구현)

  • Kim, KangHee;Park, TaeShin;Song, KyungHwan;Yoon, DongSung;Choi, SangBang
    • Journal of the Institute of Electronics and Information Engineers
    • /
    • v.53 no.12
    • /
    • pp.20-35
    • /
    • 2016
  • In this paper, in order to reduce the delay and area of the partial product accumulation (PPA) of the parallel decimal multiplier, a tree architecture that composed by multi-operand decimal CSAs and improved CLA is proposed. The proposed tree using multi-operand CSAs reduces the partial product quickly. Since the input range of the recoder of CSA is limited, CSA can get the simplest logic. In addition, using the multi-operand decimal CSAs to add decimal numbers that have limited range in specific locations of the specific architecture can reduce the partial products efficiently. Also, final BCD result can be received faster by improving the logic of the decimal CLA. In order to evaluate the performance of the proposed partial product accumulation, synthesis is implemented by using Design Complier with 180 nm COMS technology library. Synthesis results show the delay of the proposed partial product accumulation is reduced by 15.6% and area is reduced by 16.2% comparing with which uses general method. Also, the total delay and area are still reduced despite the delay and area of the CLA are increased.

90/150 RCA Corresponding to Maximum Weight Polynomial with degree 2n (2n 차 최대무게 다항식에 대응하는 90/150 RCA)

  • Choi, Un-Sook;Cho, Sung-Jin
    • The Journal of the Korea institute of electronic communication sciences
    • /
    • v.13 no.4
    • /
    • pp.819-826
    • /
    • 2018
  • The generalized Hamming weight is one of the important parameters of the linear code. It determines the performance of the code when the linear codes are applied to a cryptographic system. In addition, when the block code is decoded by soft decision using the lattice diagram, it becomes a measure for evaluating the state complexity required for the implementation. In particular, a bit-parallel multiplier on finite fields based on trinomials have been studied. Cellular automata(CA) has superior randomness over LFSR due to its ability to update its state simultaneously by local interaction. In this paper, we deal with the efficient synthesis of the pseudo random number generator, which is one of the important factors in the design of effective cryptosystem. We analyze the property of the characteristic polynomial of the simple 90/150 transition rule block, and propose a synthesis algorithm of the reversible 90/150 CA corresponding to the trinomials $x^2^n+x^{2^n-1}+1$($n{\geq}2$) and the 90/150 reversible CA(RCA) corresponding to the maximum weight polynomial with $2^n$ degree by using this rule block.

Design and Analysis of a Digit-Serial $AB^{2}$ Systolic Arrays in $GF(2^{m})$ ($GF(2^{m})$ 상에서 새로운 디지트 시리얼 $AB^{2}$ 시스톨릭 어레이 설계 및 분석)

  • Kim Nam-Yeun;Yoo Kee-Young
    • Journal of KIISE:Computer Systems and Theory
    • /
    • v.32 no.4
    • /
    • pp.160-167
    • /
    • 2005
  • Among finite filed arithmetic operations, division/inverse is known as a basic operation for public-key cryptosystems over $GF(2^{m})$ and it is computed by performing the repetitive $AB^{2}$ multiplication. This paper presents a digit-serial-in-serial-out systolic architecture for performing the $AB^2$ operation in GF$(2^{m})$. To obtain L×L digit-serial-in-serial-out architecture, new $AB^{2}$ algorithm is proposed and partitioning, index transformation and merging the cell of the architecture, which is derived from the algorithm, are proposed. Based on the area-time product, when the digit-size of digit-serial architecture, L, is selected to be less than about m, the proposed digit-serial architecture is efficient than bit-parallel architecture, and L is selected to be less than about $(1/5)log_{2}(m+1)$, the proposed is efficient than bit-serial. In addition, the area-time product complexity of pipelined digit-serial $AB^{2}$ systolic architecture is approximately $10.9\%$ lower than that of nonpipelined one, when it is assumed that m=160 and L=8. Additionally, since the proposed architecture can be utilized for the basic architecture of crypto-processor and it is well suited to VLSI implementation because of its simplicity, regularity and pipelinability.

Implementation of RSA modular exponentiator using Division Chain (나눗셈 체인을 이용한 RSA 모듈로 멱승기의 구현)

  • 김성두;정용진
    • Journal of the Korea Institute of Information Security & Cryptology
    • /
    • v.12 no.2
    • /
    • pp.21-34
    • /
    • 2002
  • In this paper we propos a new hardware architecture of modular exponentiation using a division chain method which has been proposed in (2). Modular exponentiation using the division chain is performed by receding an exponent E as a mixed form of multiplication and addition with divisors d=2 or $d=2^I +1$ and respective remainders r. This calculates the modular exponentiation in about $1.4log_2$E multiplications on average which is much less iterations than $2log_2$E of conventional Binary Method. We designed a linear systolic array multiplier with pipelining and used a horizontal projection on its data dependence graph. So, for k-bit key, two k-bit data frames can be inputted simultaneously and two modular multipliers, each consisting of k/2+3 PE(Processing Element)s, can operate in parallel to accomplish 100% throughput. We propose a new encoding scheme to represent divisors and remainders of the division chain to keep regularity of the data path. When it is synthesized to ASIC using Samsung 0.5 um CMOS standard cell library, the critical path delay is 4.24ns, and resulting performance is estimated to be abort 140 Kbps for a 1024-bit data frame at 200Mhz clock In decryption process, the speed can be enhanced to 560kbps by using CRT(Chinese Remainder Theorem). Futhermore, to satisfy real time requirements we can choose small public exponent E, such as 3,17 or $2^{16} +1$, in encryption and verification process. in which case the performance can reach 7.3Mbps.

Further Improvement of Direct Solution-based FETI Algorithm (직접해법 기반의 FETI 알고리즘의 개선)

  • Kang, Seung-Hoon;Gong, DuHyun;Shin, SangJoon
    • Journal of the Computational Structural Engineering Institute of Korea
    • /
    • v.35 no.5
    • /
    • pp.249-257
    • /
    • 2022
  • This paper presents an improved computational framework for the direct-solution-based finite element tearing and interconnecting (FETI) algorithm. The FETI-local algorithm is further improved herein, and localized Lagrange multipliers are used to define the interface among its subdomains. Selective inverse entry computation, using a property of the Boolean matrix, is employed for the computation of the subdomain interface stiffness and load, in which the original FETI-local algorithm requires a full matrix inverse computation of a high computational cost. In the global interface computation step, the original serial computation is replaced by a parallel multi-frontal method. The performance of the improved FETI-local algorithm was evaluated using a numerical example with 64 million degrees of freedom (DOFs). The computational time was reduced by up to 97.8% compared to that of the original algorithm. In addition, further stable and improved scalability was obtained in terms of a speed-up indicator. Furthermore, a performance comparison was conducted to evaluate the differences between the proposed algorithm and commercial software ANSYS using a large-scale computation with 432 million DOFs. Although ANSYS is superior in terms of computational time, the proposed algorithm has an advantage in terms of the speed-up increase per processor increase.

Development of a Small Gamma Camera Using NaI(T1)-Position Sensitive Photomultiplier Tube for Breast Imaging (NaI (T1) 섬광결정과 위치민감형 광전자증배관을 이용한 유방암 진단용 소형 감마카메라 개발)

  • Kim, Jong-Ho;Choi, Yong;Kwon, Hong-Seong;Kim, Hee-Joung;Kim, Sang-Eun;Choe, Yearn-Seong;Lee, Kyung-Han;Kim, Moon-Hae;Joo, Koan-Sik;Kim, Byuug-Tae
    • The Korean Journal of Nuclear Medicine
    • /
    • v.32 no.4
    • /
    • pp.365-373
    • /
    • 1998
  • Purpose: The conventional gamma camera is not ideal for scintimammography because of its large detector size (${\sim}500mm$ in width) causing high cost and low image quality. We are developing a small gamma camera dedicated for breast imaging. Materials and Methods: The small gamma camera system consists of a NaI (T1) crystal ($60 mm{\times}60 mm{\times}6 mm$) coupled with a Hamamatsu R3941 Position Sensitive Photomultiplier Tube (PSPMT), a resister chain circuit, preamplifiers, nuclear instrument modules, an analog to digital converter and a personal computer for control and display. The PSPMT was read out using a standard resistive charge division which multiplexes the 34 cross wire anode channels into 4 signals ($X^+,\;X^-,\;Y^+,\;Y^-$). Those signals were individually amplified by four preamplifiers and then, shaped and amplified by amplifiers. The signals were discriminated ana digitized via triggering signal and used to localize the position of an event by applying the Anger logic. Results: The intrinsic sensitivity of the system was approximately 8,000 counts/sec/${\mu}Ci$. High quality flood and hole mask images were obtained. Breast phantom containing $2{\sim}7 mm$ diameter spheres was successfully imaged with a parallel hole collimator The image displayed accurate size and activity distribution over the imaging field of view Conclusion: We have succesfully developed a small gamma camera using NaI(T1)-PSPMT and nuclear Instrument modules. The small gamma camera developed in this study might improve the diagnostic accuracy of scintimammography by optimally imaging the breast.

  • PDF