• Title/Summary/Keyword: radix-4 algorithm

Search Result 89, Processing Time 0.025 seconds

Design and Implementation of Multi-mode Sensor Signal Processor on FPGA Device (다중모드 센서 신호 처리 프로세서의 FPGA 기반 설계 및 구현)

  • Soongyu Kang;Yunho Jung
    • Journal of Sensor Science and Technology
    • /
    • v.32 no.4
    • /
    • pp.246-251
    • /
    • 2023
  • Internet of Things (IoT) systems process signals from various sensors using signal processing algorithms suitable for the signal characteristics. To analyze complex signals, these systems usually use signal processing algorithms in the frequency domain, such as fast Fourier transform (FFT), filtering, and short-time Fourier transform (STFT). In this study, we propose a multi-mode sensor signal processor (SSP) accelerator with an FFT-based hardware design. The FFT processor in the proposed SSP is designed with a radix-2 single-path delay feedback (R2SDF) pipeline architecture for high-speed operation. Moreover, based on this FFT processor, the proposed SSP can perform filtering and STFT operation. The proposed SSP is implemented on a field-programmable gate array (FPGA). By sharing the FFT processor for each algorithm, the required hardware resources are significantly reduced. The proposed SSP is implemented and verified on Xilinxh's Zynq Ultrascale+ MPSoC ZCU104 with 53,591 look-up tables (LUTs), 71,451 flip-flops (FFs), and 44 digital signal processors (DSPs). The FFT, filtering, and STFT algorithm implementations on the proposed SSP achieve 185x average acceleration.

New VLSI Architecture of Parallel Multiplier-Accumulator Based on Radix-2 Modified Booth Algorithm (Radix-2 MBA 기반 병렬 MAC의 VLSI 구조)

  • Seo, Young-Ho;Kim, Dong-Wook
    • Journal of the Institute of Electronics Engineers of Korea SD
    • /
    • v.45 no.4
    • /
    • pp.94-104
    • /
    • 2008
  • In this paper, we propose a new architecture of multiplier-and-accumulator (MAC) for high speed multiplication and accumulation arithmetic. By combining multiplication with accumulation and devising a hybrid type of carry save adder (CSA), the performance was improved. Since the accumulator which has the largest delay in MAC was removed and its function was included into CSA, the overall performance becomes to be elevated. The proposed CSA tree uses 1's complement-based radix-2 modified booth algorithm (MBA) and has the modified array for the sign extension in order to increase the bit density of operands. The CSA propagates the carries by the least significant bits of the partial products and generates the least significant bits in advance for decreasing the number of the input bits of the final adder. Also, the proposed MAC accumulates the intermediate results in the type of sum and carry bits not the output of the final adder for improving the performance by optimizing the efficiency of pipeline scheme. The proposed architecture was synthesized with $250{\mu}m,\;180{\mu}m,\;130{\mu}m$ and 90nm standard CMOS library after designing it. We analyzed the results such as hardware resource, delay, and pipeline which are based on the theoretical and experimental estimation. We used Sakurai's alpha power low for the delay modeling. The proposed MAC has the superior properties to the standard design in many ways and its performance is twice as much than the previous research in the similar clock frequency.

17$\times$17-b Multiplier for 32-bit RISC/DSP Processors (32 비트 RISC/DSP 프로세서를 위한 17 비트 $\times$ 17 비트 곱셈기의 설계)

  • 박종환;문상국;홍종욱;문병인;이용석
    • Proceedings of the IEEK Conference
    • /
    • 1999.06a
    • /
    • pp.914-917
    • /
    • 1999
  • The paper describes a 17 $\times$ 17-b multiplier using the Radix-4 Booth’s algorithm. which is suitable for 32-bit RISC/DSP microprocessors. To minimize design area and achieve improved speed, a 2-stage pipeline structure is adopted to achieve high clock frequency. Each part of circuit is modeled and optimized at the transistor level, verification of functionality and timing is performed using HSPICE simulations. After modeling and validating the circuit at transistor level, we lay it out in a 0.35 ${\mu}{\textrm}{m}$ 1-poly 4-metal CMOS technology and perform LVS test to compare the layout with the schematic. The simulation results show that maximum frequency is 330MHz under worst operating conditions at 55$^{\circ}C$ , 3V, The post simulation after layout results shows 187MHz under worst case conditions. It contains 9, 115 transistors and the area of layout is 0.72mm by 0.97mm.

  • PDF

An area-efficient 256-point FFT design for WiMAX systems

  • Yu, Jian;Cho, Kyung-Ju
    • The Journal of Korea Institute of Information, Electronics, and Communication Technology
    • /
    • v.11 no.3
    • /
    • pp.270-276
    • /
    • 2018
  • This paper presents a low area 256-point pipelined FFT architecture, especially for IEEE 802.16a WiMAX systems. Radix-24 algorithm and single-path delay feedback (SDF) architecture are adopted in the design to reduce the complexity of twiddle factor multiplication. A new cascade canonical signed digit (CSD) complex multipliers are proposed for twiddle factor multiplication, which has lower area and less power consumption than conventional complex multipliers composed of 4 multipliers and 2 adders. Also, the proposed cascade CSD multipliers can remove look-up table for storing coefficient of twiddle factors. In hardware implementation with Cyclone 10LP FPGA, it is shown that the proposed FFT design method achieves about 62% reduction in gate count and 64% memory reduction compared with the previous schemes.

Design of Low-complexity FFT Processor for Narrow-band Interference Signal Cancellation Based Array Antenna (배열 안테나 기반 협대역 간섭신호 제거를 위한 저면적 FFT 프로세서 설계 연구)

  • Yang, Gi-jung;Won, Hyun-Hee;Park, Sungyeol;Ahn, Byoung-Sun;Kang, Haeng-Ik
    • Proceedings of the Korean Institute of Information and Commucation Sciences Conference
    • /
    • 2017.10a
    • /
    • pp.621-622
    • /
    • 2017
  • In this paper, a low-complexity FFT processor is proposed for narrow-band interference signal cancellation based array antenna. The proposed FFT pocessor can support the variable length of 64, 128 and 512. By reducing number of non-tirval multipliers with mixed radix-4/2/4/2/4/2 algorithm and flexible multi-path delay commutator(MDC) architecture, the complexity of the proposed FFT processor is dramatically decreased. The proposed FFT processor was designed in Xilinx system generator and Implemented with Xilinx Virtex-7 FPGA. With the proposed architecture, the number of slices for the processor is 11454, and the number of DSP48s is 194.

  • PDF

A Efficient Architecture of MBA-based Parallel MAC for High-Speed Digital Signal Processing (고속 디지털 신호처리를 위한 MBA기반 병렬 MAC의 효율적인 구조)

  • 서영호;김동욱
    • Journal of the Institute of Electronics Engineers of Korea SD
    • /
    • v.41 no.7
    • /
    • pp.53-61
    • /
    • 2004
  • In this paper, we proposed a new architecture of MAC(Multiplier-Accumulator) to operate high-speed multiplication-accumulation. We used the MBA(Modified radix-4 Booth Algorithm) which is based on the 1's complement number system, and CSA(Carry Save Adder) for addition of the partial products. During the addition of the partial product, the signed numbers with the 1's complement type after Booth encoding are converted in the 2's complement signed number in the CSA tree. Since 2-bit CLA(Carry Look-ahead Adder) was used in adding the lower bits of the partial product, the input bit width of the final adder and whole delay of the critical path were reduced. The proposed MAC was applied into the DWT(Discrete Wavelet Transform) filtering operation for JPEG2000, and it showed the possibility for the practical application. Finally we identified the improved performance according to the comparison with the previous architecture in the aspect of hardware resource and delay.

Efficient programmable power-of-two scaler for the three-moduli set {2n+p, 2n - 1, 2n+1 - 1}

  • Taheri, MohammadReza;Navi, Keivan;Molahosseini, Amir Sabbagh
    • ETRI Journal
    • /
    • v.42 no.4
    • /
    • pp.596-607
    • /
    • 2020
  • Scaling is an important operation because of the iterative nature of arithmetic processes in digital signal processors (DSPs). In residue number system (RNS)-based DSPs, scaling represents a performance bottleneck based on the complexity of intermodulo operations. To design an efficient RNS scaler for special moduli sets, a body of literature has been dedicated to the study of the well-known moduli sets {2n - 1, 2n, 2n + 1} and {2n, 2n - 1, 2n+1 - 1}, and their extension in vertical or horizontal forms. In this study, we propose an efficient programmable RNS scaler for the arithmetic-friendly moduli set {2n+p, 2n - 1, 2n+1 - 1}. The proposed algorithm yields high speed and energy-efficient realization of an RNS programmable scaler based on the effective exploitation of the mixed-radix representation, parallelism, and a hardware sharing technique. Experimental results obtained for a 130 nm CMOS ASIC technology demonstrate the superiority of the proposed programmable scaler compared to the only available and highly effective hybrid programmable scaler for an identical moduli set. The proposed scaler provides 43.28% less power consumption, 33.27% faster execution, and 28.55% more area saving on average compared to the hybrid programmable scaler.

Design and Hardware Implementation of High-Speed Variable-Length RSA Cryptosystem (가변길이 고속 RSA 암호시스템의 설계 및 하드웨어 구현)

  • 박진영;서영호;김동욱
    • The Journal of Korean Institute of Communications and Information Sciences
    • /
    • v.27 no.9C
    • /
    • pp.861-870
    • /
    • 2002
  • In this paper, with targeting on the drawback of RSA of operation speed, a new 1024-bit RSA cryptosystem has been proposed and implemented in hardware to increase the operational speed and perform the variable-length encryption. The proposed cryptosystem mainly consists of the modular exponentiation part and the modular multiplication part. For the modular exponentiation, the RL-binary method, which performs squaring and modular multiplying in parallel, was improved, and then applied. And 4-stage CSA structure and radix-4 booth algorithm were applied to enhance the variable-length operation and reduce the number of partial product in modular multiplication arithmetic. The proposed RSA cryptosystem which can calculate at most 1024 bits at a tittle was mapped into the integrated circuit using the Hynix Phantom Cell Library for Hynix 0.35㎛ 2-Poly 4-Metal CMOS process. Also, the result of software implementation, which had been programmed prior to the hardware research, has been used to verify the operation of the hardware system. The size of the result from the hardware implementation was about 190k gate count and the operational clock frequency was 150㎒. By considering a variable-length of modulus number, the baud rate of the proposed scheme is one and half times faster than the previous works. Therefore, the proposed high speed variable-length RSA cryptosystem should be able to be used in various information security system which requires high speed operation.

Implementation of RSA modular exponentiator using Division Chain (나눗셈 체인을 이용한 RSA 모듈로 멱승기의 구현)

  • 김성두;정용진
    • Journal of the Korea Institute of Information Security & Cryptology
    • /
    • v.12 no.2
    • /
    • pp.21-34
    • /
    • 2002
  • In this paper we propos a new hardware architecture of modular exponentiation using a division chain method which has been proposed in (2). Modular exponentiation using the division chain is performed by receding an exponent E as a mixed form of multiplication and addition with divisors d=2 or $d=2^I +1$ and respective remainders r. This calculates the modular exponentiation in about $1.4log_2$E multiplications on average which is much less iterations than $2log_2$E of conventional Binary Method. We designed a linear systolic array multiplier with pipelining and used a horizontal projection on its data dependence graph. So, for k-bit key, two k-bit data frames can be inputted simultaneously and two modular multipliers, each consisting of k/2+3 PE(Processing Element)s, can operate in parallel to accomplish 100% throughput. We propose a new encoding scheme to represent divisors and remainders of the division chain to keep regularity of the data path. When it is synthesized to ASIC using Samsung 0.5 um CMOS standard cell library, the critical path delay is 4.24ns, and resulting performance is estimated to be abort 140 Kbps for a 1024-bit data frame at 200Mhz clock In decryption process, the speed can be enhanced to 560kbps by using CRT(Chinese Remainder Theorem). Futhermore, to satisfy real time requirements we can choose small public exponent E, such as 3,17 or $2^{16} +1$, in encryption and verification process. in which case the performance can reach 7.3Mbps.