• Title/Summary/Keyword: PE (Processing Element)

Performance Evaluation and Analysis for Discrete Wavelet Transform on Many-Core Processors (매니코어 프로세서 상에서 이산 웨이블릿 변환을 위한 성능 평가 및 분석)

  • Park, Yong-Hun;Kim, Jong-Myon
    • IEMEK Journal of Embedded Systems and Applications / v.7 no.5 / pp.277-284 / 2012
  • To meet the demand for the discrete wavelet transform (DWT) on portable devices, this paper implements a 2-level DWT using a reference many-core processor architecture and determines the optimal many-core processor configuration. To explore the optimal many-core processor, we evaluate the impact of the data-per-processing-element ratio, defined as the amount of data mapped directly to each processing element (PE), on system performance, energy efficiency, and area efficiency. This paper uses five PE configurations (PEs=16, 64, 256, 1,024, and 4,096) implemented in 130 nm CMOS technology with a 720 MHz clock frequency. Experimental results indicate that maximum energy and area efficiencies are achieved at PEs=1,024. However, the system area must be limited to 140 $\textrm{mm}^2$ and the power should not exceed 3 watts in order to implement the 2-level DWT on portable devices. Under these restrictions, the most reasonable energy and area efficiencies are achieved at PEs=256.

A Fuzzy Microprocessor for Real-time Control Applications

  • Katashiro, Takeshi
    • Proceedings of the Korean Institute of Intelligent Systems Conference / 1993.06a / pp.1394-1397 / 1993
  • A Fuzzy Microprocessor (FMP) suitable for real-time control applications is presented. Its features include high-speed inference of up to 114K FLIPS at a 20 MHz system clock, support for up to 128 rules, and handling of 8 input variables with 8-bit resolution. To realize these features, the fuzzifier circuit and the processing element (PE) are optimized for LSI implementation. The chip, fabricated in 1.2 $\mu\textrm{m}$ CMOS technology, contains 71K transistors in an 82.8 $\textrm{mm}^2$ die and is packaged in a 100-pin plastic QFP.

New systolic arrays for computation of the 1-D and 2-D discrete wavelet transform (1차원 및 2차원 이산 웨이브렛 변환 계산을 위한 새로운 시스톨릭 어레이)

  • 반성범;박래홍
    • Journal of the Korean Institute of Telematics and Electronics S / v.34S no.10 / pp.132-140 / 1997
  • This paper proposes systolic array architectures for computation of the 1-D and 2-D discrete wavelet transform (DWT). The proposed systolic array for computation of the 1-D DWT consists of L processing element (PE) arrays, where a PE array denotes the systolic array that computes one level of the DWT. The proposed PE array computes only the product terms that are required for further computation, and the outputs of the low and high frequency filters are computed in alternate clock cycles. Therefore, the proposed architecture can compute the low and high frequency outputs using a single architecture. The proposed systolic array for computation of the 2-D DWT consists of two systolic array architectures for computation of the 1-D DWT and a memory unit. The required time and hardware cost of the proposed systolic arrays are comparable to those of the conventional architectures. However, the conventional architectures need extra processing units whereas the proposed architectures do not. The proposed architectures can be applied to subband decomposition by simply changing the filter coefficients.
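
The structure described in this abstract (one filtering stage per DWT level, with only the outputs that survive decimation being computed, and the filter coefficients as a swappable parameter) can be sketched in software. This is a minimal illustration, not the paper's systolic hardware: the function names `dwt_level` and `dwt`, the choice of Haar coefficients, and the periodic signal extension are all assumptions made for the example.

```python
import math

def dwt_level(signal, lowpass, highpass):
    """One DWT level: filter, keeping only every second output (decimation by 2)."""
    n = len(signal)
    if n % 2:
        raise ValueError("signal length must be even at every level")
    approx, detail = [], []
    for k in range(0, n, 2):  # skip the outputs that decimation would discard
        # Periodic extension at the boundary is an assumption of this sketch.
        window = [signal[(k + i) % n] for i in range(len(lowpass))]
        approx.append(sum(h * x for h, x in zip(lowpass, window)))   # low-frequency output
        detail.append(sum(g * x for g, x in zip(highpass, window)))  # high-frequency output
    return approx, detail

def dwt(signal, levels, lowpass, highpass):
    """Cascade `levels` stages; each stage re-filters the previous approximation."""
    approx, details = list(signal), []
    for _ in range(levels):
        approx, d = dwt_level(approx, lowpass, highpass)
        details.append(d)
    return approx, details

# Haar coefficients (assumed for illustration; the abstract names no specific filter).
S = 1 / math.sqrt(2)
HAAR_LOW, HAAR_HIGH = [S, S], [S, -S]

approx, details = dwt([4.0, 6.0, 10.0, 12.0, 8.0, 6.0, 5.0, 5.0], 2, HAAR_LOW, HAAR_HIGH)
print(approx)   # level-2 approximation coefficients
print(details)  # detail coefficients, one list per level
```

Passing a different pair of coefficient lists to `dwt` changes the decomposition without touching the loop structure, which mirrors the abstract's remark about subband decomposition.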

Design Space Exploration of Many-Core Processor for High-Speed Cluster Estimation (고속의 클러스터 추정을 위한 매니코어 프로세서의 디자인 공간 탐색)

  • Seo, Jun-Sang;Kim, Cheol-Hong;Kim, Jong-Myon
    • Journal of the Korea Society of Computer and Information / v.19 no.10 / pp.1-12 / 2014
  • This paper implements the computationally intensive subtractive clustering algorithm on a single instruction, multiple data (SIMD) based many-core processor and improves its performance. In addition, this paper implements five different processing element (PE) architectures (PEs=16, 64, 256, 1,024, and 4,096) and selects the optimal PE architecture for the subtractive clustering algorithm by estimating execution time and energy efficiency. Experimental results using two different medical images at three different resolutions ($128{\times}128$, $256{\times}256$, $512{\times}512$) show that PEs=4,096 achieves the highest performance and energy efficiency in all cases.

Exploration of Optimal Multi-Core Processor Architecture for Physical Modeling of Plucked-String Instruments (현악기의 물리적 모델링을 위한 최적의 멀티코어 프로세서 아키텍처 탐색)

  • Kang, Myeong-Su;Choi, Ji-Won;Kim, Yong-Min;Kim, Jong-Myon
    • The Journal of the Acoustical Society of Korea / v.30 no.5 / pp.281-294 / 2011
  • Physics-based sound synthesis usually incurs high computational costs, which restricts its use in real-time applications. This motivates us to implement the sound synthesis algorithm of plucked-string instruments using multi-core processor architectures and to determine the optimal processing element (PE) configuration for the target instruments. To determine the optimal PE configuration, we evaluate the impact of the sample-per-processing-element (SPE) ratio, defined as the amount of sample data directly mapped to each PE, on system performance and on both area and energy efficiencies using architectural and workload simulations. For the acoustic guitar, the highest area and energy efficiencies are achieved at SPE ratios of 5,513 and 2,756, respectively, for the synthesis of musical sounds sampled at 44.1 kHz. For the classical guitar, the maximum area and energy efficiencies are achieved at SPE ratios of 22,050 and 5,513, respectively. In addition, the synthetic sounds were very similar to the original sounds in their spectra. Furthermore, we conducted a MUSHRA subjective listening test with ten subjects, including nine graduate students and one professor from the University of Ulsan, and the synthetic sounds were rated excellent.

An Integrated MIN Circuit Design of DTW PE for Speech Recognition (음성인식용 DTW PE의 IC화를 위한 MIN회로의 설계)

  • 정광재;문홍진;최규훈;김종교
    • The Journal of Korean Institute of Communications and Information Sciences / v.15 no.8 / pp.639-647 / 1990
  • Dynamic time warping (DTW) requires iterative calculations, so the design of a PE cell suited to these operations is very important. Accordingly, this paper aims at a real-time recognition design that enables hardware realization of a large dictionary using the DTW algorithm. The DTW PE cell is separated into three large blocks: the "MIN" block computes the accumulated minimum distance, the "ADD" block calculates these minimum distances, and the "ABS" block finds the absolute values for the total sum of local distances. We have completed circuit design and verification for the MIN block, and performed MIN layout and DRC (design rule check) using a 3 $\mu\textrm{m}$ CMOS N-well rule base.
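
The iterative calculation that a DTW PE cell repeats can be sketched as the standard DTW recurrence, with one absolute-difference, one minimum, and one addition per grid cell. This is a minimal sketch under assumptions: the function name `dtw_distance` is hypothetical, features are treated as scalars rather than speech feature vectors, and the ABS/MIN/ADD labels in the comments are a loose reading of the abstract, not the paper's circuit.

```python
def dtw_distance(reference, test):
    """Accumulated DTW distance between two sequences of scalar features."""
    n, m = len(reference), len(test)
    inf = float("inf")
    # acc[i][j] = minimum accumulated distance aligning reference[:i] with test[:j]
    acc = [[inf] * (m + 1) for _ in range(n + 1)]
    acc[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            local = abs(reference[i - 1] - test[j - 1])       # "ABS": local distance
            best_prev = min(acc[i - 1][j],                    # "MIN": best of the three
                            acc[i][j - 1],                    #        neighbouring cells
                            acc[i - 1][j - 1])
            acc[i][j] = local + best_prev                     # "ADD": accumulate
    return acc[n][m]

print(dtw_distance([1, 2, 3, 4], [1, 2, 2, 3, 4]))  # time-stretched copy of the same pattern
print(dtw_distance([1, 2, 3], [2, 3, 4]))
```

Every cell depends only on its three neighbours, which is why the computation maps naturally onto an array of identical PE cells.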

Design Space Exploration of Many-Core Architecture for Sound Synthesis of Guitar on Portable Device (휴대 장치용 기타 음 합성을 위한 매니코어 아키텍처의 디자인 공간 탐색)

  • Kang, Myeongsu;Kim, Jong-Myon
    • Proceedings of the Korean Society of Computer Information Conference / 2014.01a / pp.1-4 / 2014
  • Although physical modeling synthesis is becoming increasingly effective at producing rich, natural, high-quality sound, its high computational complexity limits its use in portable devices. This constraint has motivated research on single-instruction multiple-data many-core architectures that support the tremendous amount of computation by exploiting the massive parallelism inherent in physical modeling synthesis. Since no general consensus has been reached on which grain sizes of many-core processors and memories provide the most efficient operation for sound synthesis, design space exploration is conducted for seven processing element (PE) configurations. To find an optimal PE configuration, each PE configuration is evaluated in terms of execution time and area and energy efficiencies. Experimental results show that all PE configurations satisfy the system requirements for implementation in portable devices.

A Study on the CAM Designed by Adopting Best-Match Method using Parallel Processing Architecture (병렬 처리 구조를 이용한 최적 정합 방식 CAM 설계에 관한 연구)

  • 김상복;박노경;차균현
    • The Journal of Korean Institute of Communications and Information Sciences / v.19 no.6 / pp.1056-1063 / 1994
  • In this paper, a content addressable memory (CAM) is designed using the best-match method. It has a single processing element (PE) architecture with high computational efficiency and throughput. It is composed of three main functional blocks (input MUX, best-match CAM, and control part) and supports fully parallel processing. Logic simulation was performed using QUICKSIM, and circuit simulation was performed using HSPICE. Its layout is based on the ETRI 3 $\mu\textrm{m}$ n-well process design rules. Its maximum operating frequency is 20 MHz.
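
The behaviour of a best-match lookup can be sketched as follows: instead of requiring an exact hit, the memory returns the stored word closest to the query. This is only an illustration under assumptions: the abstract does not state the distance metric, so Hamming distance is assumed here, and the names `hamming_distance` and `best_match` are hypothetical. Software evaluates the comparisons one after another, whereas the hardware described performs them in parallel.

```python
def hamming_distance(a, b):
    """Number of bit positions in which two integers differ."""
    return bin(a ^ b).count("1")

def best_match(stored_words, query):
    """Return (index, word, distance) of the stored word closest to `query`."""
    if not stored_words:
        raise ValueError("the memory holds no words")
    # One distance per stored word; a CAM would compute these simultaneously.
    distances = [hamming_distance(word, query) for word in stored_words]
    index = min(range(len(stored_words)), key=distances.__getitem__)
    return index, stored_words[index], distances[index]

stored = [0b10110010, 0b01101100, 0b11110000]
index, word, distance = best_match(stored, 0b10110110)
print(index, format(word, "08b"), distance)
```

When several words tie for the smallest distance, this sketch returns the one with the lowest index; how the actual design resolves ties is not stated in the abstract.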

Design Space Exploration of Many-Core Processors for Mobile Ultrasound Image Signal Processing (모바일 초음파 영상신호처리를 위한 매니코어 프로세서 디자인 공간 탐색)

  • Choi, Byong-Kook;Kim, Jong-Myon
    • Proceedings of the Korea Information Processing Society Conference / 2011.04a / pp.183-186 / 2011
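  • This paper introduces a design space exploration method for many-core processors that satisfy the high-performance and low-power requirements of the beamforming algorithm for mobile ultrasound image signals. For the design space exploration, we measure execution time, energy efficiency, and system area efficiency in experiments that vary the amount of ultrasound image signal data per processing element (PE) of the many-core processor, and we select the optimal many-core processor architecture based on the measured results.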

Architecture Exploration of Optimal Many-Core Processors for a Vector-based Rasterization Algorithm (래스터화 알고리즘을 위한 최적의 매니코어 프로세서 구조 탐색)

  • Son, Dong-Koo;Kim, Cheol-Hong;Kim, Jong-Myon
    • IEMEK Journal of Embedded Systems and Applications / v.9 no.1 / pp.17-24 / 2014
  • In this paper, we implement and evaluate the performance of a vector-based rasterization algorithm for 3D graphics using a SIMD (single instruction, multiple data) many-core processor architecture. In addition, we evaluate the impact of the data-per-processing-element (DPE) ratio, defined as the amount of data directly mapped to each processing element (PE) within the many-core processor, in terms of performance, energy efficiency, and area efficiency. For the experiment, we use seven different PE configurations obtained by varying the DPE ratio (or the number of PEs), all implemented in the same 130 nm CMOS technology with a 500 MHz clock frequency. Experimental results indicate that the optimal PE configuration is achieved when the DPE ratio is in the range from 16,384 to 256 (or the number of PEs is in the range from 16 to 1,024), which meets the requirements of mobile devices in terms of optimal performance and efficiency.
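
The relationship between the DPE ratio and the number of PEs quoted in this abstract can be checked with simple arithmetic. The sketch below assumes the data set is divided evenly across PEs; the total of 262,144 data elements is not stated in the abstract but is inferred from its two endpoints (16 PEs at a DPE ratio of 16,384), and the function name `dpe_ratio` is hypothetical.

```python
def dpe_ratio(total_data, num_pes):
    """Amount of data mapped to each PE, assuming an even split."""
    if num_pes <= 0 or total_data % num_pes:
        raise ValueError("total_data must divide evenly across a positive number of PEs")
    return total_data // num_pes

# Inferred, not stated in the abstract: 16 PEs x 16,384 data elements per PE.
TOTAL_DATA = 16 * 16_384

# Only the two endpoint configurations named in the abstract are checked here.
for pes in (16, 1_024):
    print(f"PEs={pes:>5}  DPE ratio={dpe_ratio(TOTAL_DATA, pes):>6}")
```

Both endpoints yield the same total, so the two ranges given in the abstract are consistent with each other: increasing the PE count by a factor of 64 lowers the DPE ratio by the same factor.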