• Title/Summary/Keyword: 8-bit parallel processing

Search Result 45, Processing Time 0.025 seconds

Optimized implementation of block cipher PIPO in parallel-way on 64-bit ARM Processors (64-bit ARM 프로세서 상에서의 블록암호 PIPO 병렬 최적 구현)

  • Eum, Si-Woo;Kwon, Hyeok-Dong;Kim, Hyun-Jun;Jang, Kyung-Bae;Kim, Hyun-Ji;Park, Jae-Hoon;Sim, Min-Joo;Song, Gyeong-Ju;Seo, Hwa-Jeong
    • Proceedings of the Korea Information Processing Society Conference
    • /
    • 2021.05a
    • /
    • pp.163-166
    • /
    • 2021
  • ICISC'20에서 발표된 경량 블록암호 PIPO는 비트 슬라이스 기법 적용으로 효율적인 구현이 되었으며, 부채널 내성을 지니기에 안전하지 않은 환경에서도 안정적으로 사용 가능한 경량 블록암호이다. 본 논문에서는 ARM 프로세서를 대상으로 PIPO의 병렬 최적 구현을 제안한다. 제안하는 구현물은 8평문, 16평문의 병렬 암호화가 가능하다. 구현에는 최적의 명령어 활용, 레지스터 내부 정렬, 로테이션 연산 최적화 기법을 사용하였다. 구현은 A10x fusion 프로세서를 대상으로 한다. 대상 프로세서상에서, 기존 레퍼런스 PIPO 코드는 64/128, 64/256 규격에서 각각 34.6 cpb, 44.7 cpb의 성능을 가지나, 제안하는 기법은 8평문 64/128, 64/256 규격에서 각각 12.0 cpb, 15.6 cpb, 16평문 64/128, 64/256 규격에서 각각 6.3 cpb, 8.1 cpb의 성능을 보여준다. 이는 기존 대비 각 규격별로 8평문 병렬 구현물은 약 65.3%, 66.4%, 16평문 병렬 구현물은 약 81.8%, 82.1% 더 좋은 성능을 보인다.

A New Complex-Number Multiplication Algorithm using Radix-4 Booth Recoding and RB Arithmetic, and a 10-bit CMAC Core Design (Radix-4 Booth Recoding과 RB 연산을 이용한 새로운 복소수 승산 알고리듬 및 10-bit CMAC코어 설계)

  • 김호하;신경욱
    • Journal of the Korean Institute of Telematics and Electronics C
    • /
    • v.35C no.9
    • /
    • pp.11-20
    • /
    • 1998
  • High-speed complex-number arithmetic units are essential to baseband signal processing of modern digital communication systems such as channel equalization, timing recovery, modulation and demodulation. In this paper, a new complex-number multiplication algorithm is proposed, which is based on redundant binary (RB) arithmetic combined with radix-4 Booth recoding scheme. The proposed algorithm reduces the number of partial product by one-half as compared with the conventional direct method using real-number multipliers and adders. It also leads to a highly parallel architecture and simplified circuit, resulting in high-speed operation and low power dissipation. To demonstrate the proposed algorithm, a prototype complex-number multiplier-accumulator (CMAC) core with 10-bit operands has been designed using 0.8-$\mu\textrm{m}$ N-Well CMOS technology. The designed CMAC core contains about 18,000 transistors on the area of about 1.60 ${\times}$ 1.93 $\textrm{mm}^2$. The functional and speed test results show that it can operate with 120-MHz clock at V$\sub$DD/=3.3-V, and its power consumption is given to about 63-mW.

  • PDF

A Study of Integral Image Hardware Design for Memory Size Efficiency (메모리 크기에 효율적인 적분영상 하드웨어 설계 연구)

  • Lee, Su-Hyun;Jeong, Yong-Jin
    • Journal of the Institute of Electronics and Information Engineers
    • /
    • v.51 no.9
    • /
    • pp.75-81
    • /
    • 2014
  • The integral image is the sum of input image pixel values. It is mainly used to speed up processing of a box filter operation, such as Haar-like features. However, large memory for integral image data can be an obstacle on an embedded hardware environment with limited memory resources. Therefore, an efficient method to store the integral image is necessary. In this paper, we propose a memory size reduction hardware design for integral image. The hardware design is used two methods. It is the new integral image memory and modulo calculation for reducing integral image data. The new integral image memory has additional calculation overhead, but it is not obstacle in hardware environment that parallel processing is possible. In the Xilinx Virtex5-LX330T targeted experimental result, integral image memory can be reduced by 50% on a $640{\times}480$ 8-bit gray-scale input image.

A practial design of direct digital frequency synthesizer with multi-ROM configuration (병렬 구조의 직접 디지털 주파수 합성기의 설계)

  • 이종선;김대용;유영갑
    • The Journal of Korean Institute of Communications and Information Sciences
    • /
    • v.21 no.12
    • /
    • pp.3235-3245
    • /
    • 1996
  • A DDFS(Direct Digital Frequency Synthesizer) used in spread spectrum communication systems must need fast switching speed, high resolution(the step size of the synthesizer), small size and low power. The chip has been designed with four parallel sine look-up table to achieve four times throughput of a single DDFS. To achieve a high processing speed DDFS chip, a 24-bit pipelined CMOS technique has been applied to the phase accumulator design. To reduce the size of the ROM, each sine ROM of the DDFS is stored 0-.pi./2 sine wave data by taking advantage of the fact that only one quadrant of the sine needs to be stored, since the sine the sine has symmetric property. And the 8 bit of phase accumulator's output are used as ROM addresses, and the 2 MSBs control the quadrants to synthesis the sine wave. To compensate the spectrum purity ty phase truncation, the DDFS use a noise shaper that structure like a phase accumlator. The system input clock is divided clock, 1/2*clock, and 1/4*clock. and the system use a low frequency(1/4*clock) except MUX block, so reduce the power consumption. A 107MHz DDFS(Direct Digital Frequency Synthesizer) implemented using 0.8.mu.m CMOS gate array technologies is presented. The synthesizer covers a bandwidth from DC to 26.5MHz in steps of 1.48Hz with a switching speed of 0.5.mu.s and a turing latency of 55 clock cycles. The DDFS synthesizes 10 bit sine waveforms with a spectral purity of -65dBc. Power consumption is 276.5mW at 40MHz and 5V.

  • PDF

A Study on the Design of a RISC core with DSP Support (DSP기능을 강화한 RISC 프로세서 core의 ASIC 설계 연구)

  • 김문경;정우경;이용석;이광엽
    • The Journal of Korean Institute of Communications and Information Sciences
    • /
    • v.26 no.11C
    • /
    • pp.148-156
    • /
    • 2001
  • This paper proposed embedded application-specific microprocessor(YS-RDSP) whose structure has an additional DSP processor on chip. The YS-RDSP can execute maximum four instructions in parallel. To make program size shorter, 16-bit and 32-bit instruction lengths are supported in YS-RDSP. The YS-RDSP provides programmability. controllability, DSP processing ability, and includes eight-kilobyte on-chip ROM and eight-kilobyte RAM. System controller on the chip gives three power-down modes for low-power operation, and SLEEP instruction changes operation statue of CPU core and peripherals. YS-RDSP processor was implemented with Verilog HDL on top-down methodology, and it was improved and verified by cycle-based simulator written in C-language. The verified model was synthesized with 0.7um, 3.3V CMOS standard cell library, and the layout size was 10.7mm78.4mm which was implemented by using automatic P&R software.

  • PDF

Design and implementation of the SliM image processor chip (SliM 이미지 프로세서 칩 설계 및 구현)

  • 옹수환;선우명훈
    • Journal of the Korean Institute of Telematics and Electronics A
    • /
    • v.33A no.10
    • /
    • pp.186-194
    • /
    • 1996
  • The SliM (sliding memory plane) array processor has been proposed to alleviate disadvantages of existing mesh-connected SIMD(single instruction stream- multiple data streams) array processors, such as the inter-PE(processing element) communication overhead, the data I/O overhead and complicated interconnections. This paper presents the deisgn and implementation of SliM image processor ASIC (application specific integrated circuit) chip consisting of mesh connected 5 X 5 PE. The PE architecture implemented here is quite different from the originally proposed PE. We have performed the front-end design, such as VHDL (VHSIC hardware description language)modeling, logic synthesis and simulation, and have doen the back-end design procedure. The SliM ASIC chip used the VTI 0.8$\mu$m standard cell library (v8r4.4) has 55,255 gates and twenty-five 128 X 9 bit SRAM modules. The chip has the 326.71 X 313.24mil$^{2}$ die size and is packed using the 144 pin MQFP. The chip operates perfectly at 25 MHz and gives 625 MIPS. For performance evaluation, we developed parallel algorithms and the performance results showed improvement compared with existing image processors.

  • PDF

FPGA-based Implementation of Fast Histogram Equalization for Image Enhancement (영상 품질 개선을 위한 FPGA 기반 고속 히스토그램 평활화 회로 구현)

  • Ryu, Sang-Moon
    • Journal of the Korea Institute of Information and Communication Engineering
    • /
    • v.23 no.11
    • /
    • pp.1377-1383
    • /
    • 2019
  • Histogram equalization is the most frequently used algorithm for image enhancement. Its hardware implementation significantly outperforms in time its software version. The overall performance of FPGA-based implementation of histogram equalization can be improved by applying pipelining in the design and by exploiting the multipliers and a lot of SRAM blocks which are embedded in recent FPGAs. This work proposes how to implement a fast histogram equalization circuit for 8-bit gray level images. The proposed design contains a FIFO to perform equalization on an image while the histogram for next image is being calculated. Because of some overlap in time for histogram equalization, embedded multipliers and pipelined design, the proposed design can perform histogram equalization on a pixel nearly at a clock. And its dual parallel version outperforms in time almost two times over the original one.

A Scalable ECC Processor for Elliptic Curve based Public-Key Cryptosystem (타원곡선 기반 공개키 암호 시스템 구현을 위한 Scalable ECC 프로세서)

  • Choi, Jun-Baek;Shin, Kyung-Wook
    • Journal of the Korea Institute of Information and Communication Engineering
    • /
    • v.25 no.8
    • /
    • pp.1095-1102
    • /
    • 2021
  • A scalable ECC architecture with high scalability and flexibility between performance and hardware complexity is proposed. For architectural scalability, a modular arithmetic unit based on a one-dimensional array of processing element (PE) that performs finite field operations on 32-bit words in parallel was implemented, and the number of PEs used can be determined in the range of 1 to 8 for circuit synthesis. A scalable algorithms for word-based Montgomery multiplication and Montgomery inversion were adopted. As a result of implementing scalable ECC processor (sECCP) using 180-nm CMOS technology, it was implemented with 100 kGEs and 8.8 kbits of RAM when NPE=1, and with 203 kGEs and 12.8 kbits of RAM when NPE=8. The performance of sECCP with NPE=1 and NPE=8 was analyzed to be 110 PSMs/sec and 610 PSMs/sec, respectively, on P256R elliptic curve when operating at 100 MHz clock.

Design of Stereo Image Match Processor for Real Time Stereo Matching (실시간 스테레오 정합을 위한 스테레오 영상 정합 프로세서 설계)

  • Kim, Yeon-Jae;Sim, Deok-Seon
    • Journal of the Institute of Electronics Engineers of Korea SC
    • /
    • v.37 no.2
    • /
    • pp.50-59
    • /
    • 2000
  • Stereo vision is a technique extracting depth information from stereo images, which are two images that view an object or a scene from different locations. The most important procedure in stereo vision, which is called stereo matching, is to find the same points in stereo images. It is difficult to match stereo images in real time because stereo matching requires heavy calculation. In this Paper we design a digital VLSI to Process stereo matching in real time, which we call stereo image match processor (SIMP). For implementation of real time stereo matching, sliding memory and minimum selection tree are presented. SIMP is designed with pipeline architecture and parallel processing. SIMP takes 64 gray level 64$\times$64 stereo images and yields 8 level 64 $\times$64 disparity map by 3 bit disparity and 12 bit address outputs. SIMP can process stereo images with process speed of 240 frames/sec.

  • PDF

Fine-scalable SPIHT Hardware Design for Frame Memory Compression in Video Codec

  • Kim, Sunwoong;Jang, Ji Hun;Lee, Hyuk-Jae;Rhee, Chae Eun
    • JSTS:Journal of Semiconductor Technology and Science
    • /
    • v.17 no.3
    • /
    • pp.446-457
    • /
    • 2017
  • In order to reduce the size of frame memory or bus bandwidth, frame memory compression (FMC) recompresses reconstructed or reference frames of video codecs. This paper proposes a novel FMC design based on discrete wavelet transform (DWT) - set partitioning in hierarchical trees (SPIHT), which supports fine-scalable throughput and is area-efficient. In the proposed design, multi-cores with small block sizes are used in parallel instead of a single core with a large block size. In addition, an appropriate pipelining schedule is proposed. Compared to the previous design, the proposed design achieves the processing speed which is closer to the target system speed, and therefore it is more efficient in hardware utilization. In addition, a scheme in which two passes of SPIHT are merged into one pass called merged refinement pass (MRP) is proposed. As the number of shifters decreases and the bit-width of remained shifters is reduced, the size of SPIHT hardware significantly decreases. The proposed FMC encoder and decoder designs achieve the throughputs of 4,448 and 4,000 Mpixels/s, respectively, and their gate counts are 76.5K and 107.8K. When the proposed design is applied to high efficiency video codec (HEVC), it achieves 1.96% lower average BDBR and 0.05 dB higher average BDPSNR than the previous FMC design.