• Title/Summary/Keyword: PE (Processing Element)

Multithreaded and Overlapped Systolic Array for Depthwise Separable Convolution (깊이별 분리 합성곱을 위한 다중 스레드 오버랩 시스톨릭 어레이)

  • Jongho Yoon;Seunggyu Lee;Seokhyeong Kang
    • Transactions on Semiconductor Engineering / v.2 no.1 / pp.1-8 / 2024
  • When processing depthwise separable convolution, low utilization of processing elements (PEs) is one of the challenges for systolic arrays (SAs). In this study, we propose a new SA architecture that maximizes throughput in depthwise convolution. Moreover, the proposed SA performs the subsequent pointwise convolution on the PEs that are idle during the depthwise convolution, which increases utilization. After the depthwise computation completes, the unused PEs are used to accelerate the remaining pointwise convolution. Consequently, on MobileNetV3, the proposed 128x128 SA achieves speedups of 4.05x and 1.75x and reduces energy consumption by 66.7% and 25.4% compared to the basic SA and RiSA, respectively.
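As background for the abstract above, the following is a minimal sketch of how the multiply-accumulate (MAC) work of a depthwise separable layer splits into a depthwise part and a pointwise part. It uses the standard operation counts for these layer types; the function names and the example layer shape are hypothetical, and nothing here models the proposed SA or its PE scheduling.

```python
def standard_conv_macs(k, c_in, c_out, h_out, w_out):
    """MACs for a standard k x k convolution producing an h_out x w_out output."""
    return k * k * c_in * c_out * h_out * w_out


def depthwise_separable_macs(k, c_in, c_out, h_out, w_out):
    """Return (depthwise MACs, pointwise MACs) for the separable equivalent."""
    # Depthwise: one k x k filter per input channel, no cross-channel sum.
    depthwise = k * k * c_in * h_out * w_out
    # Pointwise: a 1 x 1 convolution that mixes c_in channels into c_out.
    pointwise = c_in * c_out * h_out * w_out
    return depthwise, pointwise


if __name__ == "__main__":
    # Hypothetical layer shape, chosen only for illustration.
    shape = dict(k=3, c_in=32, c_out=64, h_out=56, w_out=56)
    dw, pw = depthwise_separable_macs(**shape)
    std = standard_conv_macs(**shape)
    print(f"standard: {std}, depthwise: {dw}, pointwise: {pw}")
    print(f"separable / standard = {(dw + pw) / std:.4f}")
```

For a layer like this, the pointwise stage carries most of the MACs, which is consistent with the abstract's interest in overlapping pointwise work with the depthwise stage.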

Design of Degree-Computationless Modified Euclidean Algorithm using Polynomial Expression (다항식 표현을 이용한 DCME 알고리즘 설계)

  • Kang, Sung-Jin;Kim, Nam-Yong
    • The Journal of Korean Institute of Communications and Information Sciences / v.36 no.10A / pp.809-815 / 2011
  • In this paper, we propose and implement a novel architecture for efficiently realizing the modified Euclidean (ME) algorithm in the key equation solver (KES) block of a high-speed Reed-Solomon (RS) decoder. By using polynomial expressions of newly defined state variables to control each processing element (PE), the proposed architecture has simple input/output signals and lower hardware complexity, because no degree computation circuits are needed. In addition, since each PE circuit is independent of the error-correcting capability t of the RS code, the hardware complexity of the KES block increases only linearly as t increases. For comparison, a KES block for an RS(255,239,8) decoder was implemented in Verilog HDL and synthesized with a 0.13-µm CMOS cell library. The results show that the proposed architecture can be used for a high-speed RS decoder with a lower gate count.

Design of a High Throughput Parallel Turbo Decoder (고 처리율 병렬 터보 복호기 설계)

  • Lee, Won-Ho;Park, Heemin;Rim, Chong S.
    • Journal of the Institute of Electronics and Information Engineers / v.50 no.11 / pp.50-57 / 2013
  • This paper presents the design of a high-throughput parallel turbo decoder that can decode several packets of various lengths simultaneously. For high-speed communications, designing the turbo decoder as a parallel structure reduces the long decoding time caused by iterative turbo decoding. In addition, a double-buffer structure for input and output packets improves decoder throughput by enabling continuous decoding. Because a parallel turbo decoder is designed to handle packets of the longest length, some processing elements (PEs) are idle when short packets are decoded. The main idea of this paper is to increase PE utilization in the parallel turbo decoder, and thereby improve its throughput, by immediately using the idle PEs to decode subsequent packets. This requires control that enables concurrent decoding of several short packets, and we propose such a control method. Applying the proposed method, we implemented a turbo decoder with 32 PEs that can decode packets of up to 6144 bits. Compared to the conventional turbo decoder, the area increased by about 16%, while the decoder throughput for short packets improved 28-fold.

Parallel Implementation of Nonlinear Analysis Program of PSC Frame Using MPI (MPI를 이용한 PSC 프레임 비선형해석 프로그램의 병렬화)

  • 이재석;최규천
    • Proceedings of the Computational Structural Engineering Institute Conference / 2001.04a / pp.61-68 / 2001
  • A parallel nonlinear analysis program for prestressed concrete (PSC) frames is ported to a PC cluster system and to a massively parallel processing system, the CRAY T3E, using MPI. The PC cluster is configured with Pentium III class PCs and Fast Ethernet. The CRAY T3E is composed of a set of nodes, each containing one processing element (PE), a memory subsystem, and its distributed-memory interconnect network. Parallel computing algorithms are implemented for the element-wise processing parts, including calculation of the stiffness matrix and element stresses, determination of material states, checking for material failure, and calculation of unbalanced loads. The parallel performance of the ported program is evaluated through typical numerical examples.

Optimal Many-core Processor Architecture for Different Ultrasonic Image Resolutions (초음파 영상선호의 크기 변화에 따른 최적의 매니코어 프로세서 구조)

  • Kang, Seong-Mo;Kim, Jong-Myon
    • Journal of the Institute of Convergence Signal Processing / v.13 no.1 / pp.50-55 / 2012
  • This paper proposes an optimal many-core processor architecture that meets the requirements of low power and high performance for different ultrasonic image resolutions in hand-held ultrasonic devices. To identify the optimal many-core architecture, seven different PE configurations are simulated for processing ultrasonic images and compared in terms of execution performance and energy consumption. Experimental results indicate that the highest energy efficiencies are achieved at PEs=1,024, 64, and 256 for ultrasonic images at $256 \times 256$, $320 \times 240$, and $800 \times 480$ resolutions, respectively. In addition, the maximum area efficiencies are obtained at PEs=256 (for the $256 \times 256$ and $800 \times 480$ image resolutions) and PEs=64 (for the $320 \times 240$ image resolution).

A Scalable Montgomery Modular Multiplier (확장 가능형 몽고메리 모듈러 곱셈기)

  • Choi, Jun-Baek;Shin, Kyung-Wook
    • Journal of IKEEE / v.25 no.4 / pp.625-633 / 2021
  • This paper describes a scalable architecture for flexible hardware implementation of Montgomery modular multiplication. Our scalable modular multiplier architecture, which is based on a one-dimensional array of processing elements (PEs), performs word-parallel operation and allows computational performance and hardware complexity to be adjusted through the number of PEs used, NPE. Based on the proposed architecture, we designed a scalable Montgomery modular multiplier (sMM) core supporting the eight field sizes defined in SEC2. Synthesized with a 180-nm CMOS cell library, the sMM core was implemented with 38,317 gate equivalents (GEs) for NPE=1 and 139,390 GEs for NPE=8. When operating with a 100 MHz clock, it was estimated to compute 0.57 million 256-bit modular multiplications per second for NPE=1 and 3.5 million per second for NPE=8. The sMM core allows an optimized implementation in which the number of PEs is chosen according to the computational performance and hardware resources required by the application, and it can be used as an IP (intellectual property) core in scalable hardware designs for elliptic curve cryptography (ECC).
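For readers unfamiliar with the operation named in the abstract above, here is a minimal software sketch of word-serial Montgomery modular multiplication. It is a textbook formulation, not the paper's PE array or its scheduling; the function name `montgomery_multiply` and the 32-bit default word size are assumptions made for illustration. It relies on Python 3.8+ for `pow(m, -1, base)`.

```python
def montgomery_multiply(a, b, m, word_bits=32):
    """Return a * b * R^-1 mod m, where R = 2**(word_bits * n_words).

    Assumes m is odd and 0 <= a, b < m. The operand a is consumed one
    word per iteration, which is the step a hardware design would
    distribute across processing elements.
    """
    if m % 2 == 0:
        raise ValueError("Montgomery multiplication needs an odd modulus")
    n_words = (m.bit_length() + word_bits - 1) // word_bits
    base = 1 << word_bits
    mask = base - 1
    # m_prime = -m^-1 mod 2**word_bits, precomputed once per modulus.
    m_prime = (-pow(m, -1, base)) % base

    t = 0
    for i in range(n_words):
        a_i = (a >> (i * word_bits)) & mask   # i-th word of a
        t += a_i * b
        q = ((t & mask) * m_prime) & mask     # makes t + q*m divisible by base
        t = (t + q * m) >> word_bits          # exact division by the word base
    # The loop keeps t below 2*m, so one conditional subtraction suffices.
    return t - m if t >= m else t


if __name__ == "__main__":
    m = (1 << 255) - 19          # an odd 255-bit modulus, for illustration only
    R = 1 << 256                 # 8 words of 32 bits
    a, b = 123456789, 987654321
    a_mont, b_mont = (a * R) % m, (b * R) % m      # enter the Montgomery domain
    prod_mont = montgomery_multiply(a_mont, b_mont, m)
    prod = montgomery_multiply(prod_mont, 1, m)    # leave the Montgomery domain
    print(prod == (a * b) % m)
```

The result carries an extra factor of R^-1, so operands are usually converted into the Montgomery domain once, multiplied many times, and converted back at the end, as the usage lines show.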

An Efficient Clock Cycle Reducing Architecture in Full-Search Block Matching Motion Estimation VLSI (전탐색 블럭정합 움직임추정 VLSI 에서 클럭사이클수를 줄이는 효율적 구조)

  • 윤종성;장순화
    • Proceedings of the IEEK Conference / 2000.09a / pp.259-262 / 2000
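  • This paper proposes a VLSI architecture for full-search block-matching motion estimation that reduces the number of clock cycles. Processing elements (PEs) that perform two operations per clock, one on the rising edge and one on the falling edge, are wired alternately so that the array operates on the falling edge of the clock as well as the rising edge. The method can be applied directly to existing architectures. It makes the overall hardware more complex, since the supplied data width doubles, the PE hardware complexity increases 1.5-fold, and the complexity of the sum-of-absolute-differences operation doubles, but it is far more efficient than reducing the clock cycle count by doubling the number of PEs. The proposed architecture was applied to the design of an MPEG-2 motion estimator that uses a hierarchical motion estimation algorithm, and its functionality and hardware complexity were verified.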
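The computation that such a PE array accelerates can be sketched in software as follows. This is a plain reference version of full-search block matching with a sum-of-absolute-differences (SAD) cost, not the proposed dual-edge architecture; the function names, the list-of-lists frame layout, and the skip-out-of-frame boundary rule are assumptions made for illustration.

```python
def sad(block_a, block_b):
    """Sum of absolute differences between two equally sized 2-D blocks."""
    return sum(
        abs(p - q)
        for row_a, row_b in zip(block_a, block_b)
        for p, q in zip(row_a, row_b)
    )


def full_search(cur_block, ref_frame, top, left, search_range):
    """Try every displacement in the search window and keep the lowest SAD.

    Returns (best_sad, dy, dx), or None if no candidate fits in the frame.
    Candidates that fall partly outside the reference frame are skipped.
    """
    n = len(cur_block)                      # square n x n block assumed
    height, width = len(ref_frame), len(ref_frame[0])
    best = None
    for dy in range(-search_range, search_range + 1):
        for dx in range(-search_range, search_range + 1):
            y, x = top + dy, left + dx
            if y < 0 or x < 0 or y + n > height or x + n > width:
                continue
            candidate = [row[x:x + n] for row in ref_frame[y:y + n]]
            cost = sad(cur_block, candidate)
            if best is None or cost < best[0]:
                best = (cost, dy, dx)
    return best


if __name__ == "__main__":
    # Hypothetical 8 x 8 reference frame with a distinct value per pixel.
    ref = [[y * 8 + x for x in range(8)] for y in range(8)]
    block = [row[4:6] for row in ref[3:5]]  # true position: row 3, column 4
    print(full_search(block, ref, top=2, left=2, search_range=2))
```

Every candidate displacement requires a full SAD accumulation, which is why hardware designs spread the absolute-difference operations over many PEs and why the cycle count per block is a central design metric.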

Improvement of reconfiguration rate using pseudo faulty processing elements on the single track 2-D systolic array (의사결함처리요소를 이용한 단일트랙 이차원 시스토릭 어레이에서 재구성율의 향상)

  • 신동석;우종호
    • Journal of the Korean Institute of Telematics and Electronics A / v.33A no.2 / pp.163-172 / 1996
  • In the reconfiguration of systolic arrays, a potential disadvantage is that, in the presence of consecutive faulty PEs, logically connected PEs may be far apart, which requires a lower clock speed and thus reduces the throughput of the array. It is therefore fundamental to keep the locality of interconnections as high as possible even after reconfiguration, and to implement reconfiguration with simple routing devices. However, the requirements of locality and simplicity limit the reconfiguration capability. This paper addresses the development of an efficient reconfiguration method for 2-D systolic arrays that achieves a high reconfiguration rate while satisfying both conditions, using the concept of pseudo-faulty processing elements. Applying this concept to the reconfiguration of systolic arrays, we have found a similar condition. Simulation shows that the reconfiguration rates are 97% and 84% when N faults occur on an $N \times N$ array for N=5 and N=8, respectively.

A Scalable ECC Processor for Elliptic Curve based Public-Key Cryptosystem (타원곡선 기반 공개키 암호 시스템 구현을 위한 Scalable ECC 프로세서)

  • Choi, Jun-Baek;Shin, Kyung-Wook
    • Journal of the Korea Institute of Information and Communication Engineering / v.25 no.8 / pp.1095-1102 / 2021
  • A scalable ECC architecture that offers flexibility in trading off performance against hardware complexity is proposed. For architectural scalability, a modular arithmetic unit was implemented based on a one-dimensional array of processing elements (PEs) that perform finite field operations on 32-bit words in parallel, and the number of PEs can be set in the range of 1 to 8 at circuit synthesis. Scalable algorithms for word-based Montgomery multiplication and Montgomery inversion were adopted. The scalable ECC processor (sECCP) was implemented in 180-nm CMOS technology with 100 kGEs and 8.8 kbits of RAM when NPE=1, and with 203 kGEs and 12.8 kbits of RAM when NPE=8. When operating at a 100 MHz clock, the performance of the sECCP on the P256R elliptic curve was analyzed to be 110 PSMs/sec with NPE=1 and 610 PSMs/sec with NPE=8.

Implementation of an Optimal SIMD-based Many-core Processor for Sound Synthesis of Guitar (기타 음 합성을 위한 최적의 SIMD기반 매니코어 프로세서 구현)

  • Choi, Ji-Won;Kang, Myeong-Su;Kim, Jong-Myon
    • Journal of the Korea Society of Computer and Information / v.17 no.1 / pp.1-10 / 2012
  • Improving the operating frequency of processors is no longer the central issue today; instead, multiprocessor techniques that integrate many processors have received increasing attention. Currently, high-performance processors that integrate 64 or 128 cores, beyond 2, 4, or 8 processor cores, are being developed for large-scale data processing. This paper proposes an optimal many-core processor for synthesizing guitar sounds. Unlike previous research, in which one processing element (PE) was assigned to each guitar string, this paper evaluates the impact of mapping different numbers of PEs to one guitar string in terms of performance, area efficiency, and energy efficiency, using architectural and workload simulations. Experimental results show that the maximum area and energy efficiencies were achieved at PEs=24 and 96, respectively, for synthesizing guitar sounds with a sampling rate of 44.1 kHz and 16-bit quantization. The spectra of the synthesized sounds were very similar to those of the original guitar sounds. In addition, the proposed many-core processor was 1,235 and 22 times better than the TI TMS320C6416 in area and energy efficiency, respectively.