• Title/Summary/Keyword: PE(Processing Element

Search Result 72, Processing Time 0.039 seconds

Design Space Exploration of Many-Core Processors for Ultrasonic Image Processing at Different Resolutions (다양한 해상도의 초음파 영상처리를 위한 매니코어 프로세서의 디자인 공간 탐색)

  • Kang, Sung-Mo;Kim, Jong-Myon
    • The KIPS Transactions:PartA
    • /
    • v.19A no.3
    • /
    • pp.121-128
    • /
    • 2012
  • This paper explores the optimal processing element (PE) configuration for ultrasonic image processing at different resolutions ($256{\times}256$, $768{\times}1,024$, and $1,024{\times}1,280$). To determine the optimal PE configuration, this paper evaluates the impacts of a data-per-processing element (DPE) ratio that is defined as the amount of image data directly mapped to each PE on system performance and both energy and area efficiencies using architectural and workload simulations. This paper illustrates the correlation between DPE ratio and PE architecture for a target implementation in 130nm technology. To identify the most efficient PE structure, seven different PE configurations were simulated for ultrasonic image processing. Experimental results indicate that the highest energy efficiencies were achieved at PEs=1,024, 4,096, and 16,384 for ultrasonic images at $256{\times}256$, $768{\times}1,024$, $1,024{\times}1,280$ resolutions, respectively. Furthermore, the maximum area efficiencies were yielded at PEs=256 ($256{\times}256$ image) and 4,096 ($768{\times}1,024$ and $1,024{\times}1,280$ images), respectively.

Design of an Image Processing ASIC Architecture using Parallel Approach with Zero or Little (통신부담을 감소시킨 영상처리를 위한 병렬처리 방식 ASIC구조 설계)

  • 안병덕;정지원;선우명훈
    • The Journal of Korean Institute of Communications and Information Sciences
    • /
    • v.19 no.10
    • /
    • pp.2043-2052
    • /
    • 1994
  • This paper proposes a new parallel ASIC architecture for real-time image processing to reduce inter-processing element (inter-PE) communication overhead, called a Sliding Memory Plane (SliM) Image Processor. The Slim Image Processor consists of $3\times3$ processing elements (PEs) connected by a mesh topology. With easy scalability due to the topology. a set of SliM Image Processors can form a mesh-connected SIMD parallel architecture. called the SliM Array Processor. The idea of sliding means that all pixels are slided into all neighboring PEs without interrupting PEs and without a coprocessor or a DMA controller. Since the inter-PE communication and computation occur simultaneously. the inter-PE communication overhead, significant disadvantage of existing machines greatly diminishes. Two I/O planes provide a buffering capability and reduce the date I/O overhead. In addition, using the by-passing path provides eight-way connectivity even with four links. with these salient features. SliM shows a significant performance improvement. This paper presents architectures of a PE and the SliM Image Processor, and describes the design of an instruction set.

  • PDF

Power-Efficient DCNN Accelerator Mapping Convolutional Operation with 1-D PE Array (1-D PE 어레이로 컨볼루션 연산을 수행하는 저전력 DCNN 가속기)

  • Lee, Jeonghyeok;Han, Sangwook;Choi, Seungwon
    • Journal of Korea Society of Digital Industry and Information Management
    • /
    • v.18 no.2
    • /
    • pp.17-26
    • /
    • 2022
  • In this paper, we propose a novel method of performing convolutional operations on a 2-D Processing Element(PE) array. The conventional method [1] of mapping the convolutional operation using the 2-D PE array lacks flexibility and provides low utilization of PEs. However, by mapping a convolutional operation from a 2-D PE array to a 1-D PE array, the proposed method can increase the number and utilization of active PEs. Consequently, the throughput of the proposed Deep Convolutional Neural Network(DCNN) accelerator can be increased significantly. Furthermore, the power consumption for the transmission of weights between PEs can be saved. Based on the simulation results, the performance of the proposed method provides approximately 4.55%, 13.7%, and 2.27% throughput gains for each of the convolutional layers of AlexNet, VGG16, and ResNet50 using the DCNN accelerator with a (weights size) x (output data size) 2-D PE array compared to the conventional method. Additionally the proposed method provides approximately 63.21%, 52.46%, and 39.23% power savings.

음성인식용 DTW PE의 IC화를 위한 ADD 및 ABS 회로의 설계

  • 정광재;문홍진;최규훈;김종교
    • The Journal of Korean Institute of Communications and Information Sciences
    • /
    • v.15 no.8
    • /
    • pp.648-658
    • /
    • 1990
  • There are many methods for speed up counting in speech recongition. A multiple processing method is the one way to achieve the aim using systolic array. This arithmetic operation by the array is achieved pipelining skill. And the operation is multiprocessing by processing element(PE) that is incresing counting efficiencies. The DTW PE cell is seperated into three large blocks. "MIN" is the one block for counting accumulated minimum distance, "ADD" block calculated these minimum distances, and "ABS" seeks for the absolut values to the total sum of local distances. We have accomplished circuit design and verification about the "ADD" and "ABS" blocks, and performed total layout '||'&'||' DRC(design rule check) using 3um CMOS N-Well rule base.le check) using 3$\mu$m CMOS N-Well rule base.

  • PDF

The Design of a Code-String Matching Processor using an EWLD Algorithm (EWLD 알고리듬을 이용한 코드열 정합 프로세서의 설계)

  • 조원경;홍성민;국일호
    • Journal of the Korean Institute of Telematics and Electronics A
    • /
    • v.31A no.4
    • /
    • pp.127-135
    • /
    • 1994
  • In this paper we propose an EWLD(Enhanced Weighted Levenshtein Distance) algorithm to organize code-string pattern matching linear array processor based on the mappting to an one-dimensional array from a two-dimensional matching matrix, and design a processing element(PE) of the processor, N PEs are required instead of NS02T in the processor because of the mapping. Data input and output between PEs and all internal operations of each PE are performed in bit-serial fashion. The bit-serial operation consists of the computing of word distance (WD) by comparison and the selection of optimal code transformation path, and takes 22 clocks as a cycle. The layout of a PE is designed based on the double metal $1.5\mu$m CMOS rule. About 1,800 transistors consistute a processing element and 2 PEs are integrated on a 3mm$\times$3mm sized chip.

  • PDF

A New Systolic Array Architecture for the OS CFAR Processor (OS CFAR 프로세서에 대한 새로운 시스톨릭 어레이 구조)

  • 송재필
    • Proceedings of the Acoustical Society of Korea Conference
    • /
    • 1991.06a
    • /
    • pp.163-168
    • /
    • 1991
  • In this paper, we propose a new systolic architecture for the order statistics(OS) constant false alarm rate(CFAR) processor. In the proposed architecture, each processing element(PE) can compare two reference data cells with one test cell simultaneously in each clock cycle. So the utilization of each PE in this architecture is 100% whereas the utilization of each PE in the systolic architecture previously reported by Ritcey and Hwang is 50% because of one clock delay between two adjacent PE's active in computation. This can speed up the data processing rate by a factor of two. With this architecture, we can obtain the reduced number of communication links between adjacent PE's and reduction of the latency by half in comparison with the one proposed by Ritcey and Hwang.

  • PDF

An Optimized Design of RS(23,17) Decoder for UWB (UWB 시스템을 위한 RS(23,17) 복호기 최적 설계)

  • Kang, Sung-Jin;Kim, Han-Jong
    • The Journal of Korean Institute of Communications and Information Sciences
    • /
    • v.33 no.8A
    • /
    • pp.821-828
    • /
    • 2008
  • In this paper, we present an optimized design of RS(23,17) decoder for UWB, which uses the pipeline structured-modified Euclidean(PS-ME) algorithm. Firstly, the modified processing element(PE) block is presented in order to get rid of degree comparison circuits, registers and MUX at the final PE stage. Also, a degree computationless decoding algorithm is proposed, so that the hardware complexity of the decoder can be reduced and high-speed decoder can be implemented. Additionally, we optimize Chien search algorithm, Forney algorithm, and FIFO size for UWB specification. Using Verilog HDL, the proposed decoder is implemented and synthesized with Samsung 65nm library. From synthesis results, it can operate at clock frequency of 250MHz, and gate count is 17,628.

A Study of a Task Mapping Technique for heterogeneous MPSoCs (이기종 MPSoC 를 위한 태스크 매핑 기법 연구)

  • Cho, Jungseok;Jung, Youjin;Cho, Doosan
    • Proceedings of the Korea Information Processing Society Conference
    • /
    • 2014.04a
    • /
    • pp.18-19
    • /
    • 2014
  • 멀티프로세서 시스템 온칩 (MPSoC) 플랫폼은 고성능 임베디드 시스템을 위한 핵심 구성요소이다. MPSoC 를 구성하는 각각의 처리요소 (processing element, PE)는 대응하는 태스크의 연산 특징에 맞춤으로 최적화되어 있어야 한다. 갈수록 증가하는 고성능의 요구에 따라 동종 MPSoC 는 각각의 태스크 연산 특징에 최적화된 다양한 PE 를 보유한 이기종 MPSoC 로 발전되어 왔다. 따라서 이기종 MPSoC 의 코어들은 응용에 특화된 맞춤형 명령어 세트로 설계된다. 하지만 이러한 이기종성은 다양한 태스크로 구성된 응용들을 어떻게 서로 다른 특성을 지닌 PE 들에 매핑해야 최적의 시스템을 구성할 지를 결정해야 하는 부담을 컴파일러와 같은 툴에 지우고 있다. 잘못된 매핑은 시스템 성능을 현저히 저하시킬 소지가 있다. 본 연구에서는 멀티미디어 응용 태스크의 연산 패턴을 분석하여 최적의 태스크 매핑을 결정하는 기법을 제안하고 있다.

Implementation of an Optimal Many-core Processor for Beamforming Algorithm of Mobile Ultrasound Image Signals (모바일 초음파 영상신호의 빔포밍 기법을 위한 최적의 매니코어 프로세서 구현)

  • Choi, Byong-Kook;Kim, Jong-Myon
    • Journal of the Korea Society of Computer and Information
    • /
    • v.16 no.8
    • /
    • pp.119-128
    • /
    • 2011
  • This paper introduces design space exploration of many-core processors that meet high performance and low power required by the beamforming algorithm of image signals of mobile ultrasound. For the design space exploration of the many-core processor, we mapped different number of ultrasound image data to each processing element of many-core, and then determined an optimal many-core processor architecture in terms of execution time, energy efficiency and area efficiency. Experimental results indicate that PE=4096 and 1024 provide the highest energy efficiency and area efficiency, respectively. In addition, PE=4096 achieves 46x and 10x better than TI DSP C6416, which is widely used for ultrasound image devices, in terms of energy efficiency and area efficiency, respectively.

Design of Dynamic Time Warp Element for Speech Recognition (음성인식을 위한 Dynamic Time Warp 소자의 설계)

  • 최규훈;김종민
    • The Journal of Korean Institute of Communications and Information Sciences
    • /
    • v.19 no.3
    • /
    • pp.543-552
    • /
    • 1994
  • Dynamic Time Warp(DTW) needs for iterative calculations and the design of PE cell suitable for the operations is very important. Accordingly, this paper aims at real time recognition design enables large dictionary hardware realization using DTW algorithm. The DTW PE cell separated into three large blocks. "MIN" is the one block for counting accumulated minimum distance. "ADD" block calculates these minimum distances, and "ABS" seeks for the absolute values to the total sum of local distances. Circuit design and verification about the three block have been accomplished, and performed layout '||'&'||' DRC(design rule check) using 1.2 m CMOS N-Well rule base.CMOS N-Well rule base.

  • PDF