• Title/Summary/Keyword: PE (Processing Element)

Search Result 72, Processing Time 0.025 seconds

Multimedia Extension Instructions and Optimal Many-core Processor Architecture Exploration for Portable Ultrasonic Image Processing (휴대용 초음파 영상처리를 위한 멀티미디어 확장 명령어 및 최적의 매니코어 프로세서 구조 탐색)

  • Kang, Sung-Mo;Kim, Jong-Myon
    • Journal of the Korea Society of Computer and Information / v.17 no.8 / pp.1-10 / 2012
  • This paper proposes a design space exploration methodology for many-core processors with multimedia-specific instructions that support high-performance, low-power ultrasound imaging on portable devices. To assess the impact of the multimedia instructions, we compare programs that use them against baseline programs on the same many-core processor in terms of execution time, energy efficiency, and area efficiency. Experimental results with a $256{\times}256$ ultrasound image indicate that programs using multimedia instructions achieve 3.16 times better execution time, 8.13 times better energy efficiency, and 3.16 times better area efficiency than the baseline programs. Likewise, programs using multimedia instructions outperform the baseline programs with a $240{\times}320$ image (2.16 times in execution time, 4.04 times in energy efficiency, 2.16 times in area efficiency) and with a $240{\times}400$ image (2.25 times in execution time, 4.34 times in energy efficiency, 2.25 times in area efficiency). In addition, we explore the optimal PE architecture for many-core processors with multimedia instructions by varying the number of PEs and the memory size.

Implementation of High-radix Modular Exponentiator for RSA using CRT (CRT를 이용한 하이래딕스 RSA 모듈로 멱승 처리기의 구현)

  • 이석용;김성두;정용진
    • Journal of the Korea Institute of Information Security & Cryptology / v.10 no.4 / pp.81-93 / 2000
  • To improve the performance of modular exponentiation, the primary arithmetic operation in the RSA cryptographic algorithm, we present a new RSA hardware architecture based on high-radix modular multiplication and the Chinese Remainder Theorem (CRT). By implementing the modular multiplier with radix-16 arithmetic, we reduced the number of processing elements (PEs) to one quarter of that required by the binary arithmetic scheme. This in turn reduces both the number of clock cycles and the delay of the pipelining flip-flops to one quarter. Because the receiver knows p and q, the factors of N, the CRT can be applied to decryption. To use the CRT, we built two s/2-bit multipliers that operate in parallel during decryption, which is 4 times faster than decryption without the CRT. During encryption, the two s/2-bit multipliers can be connected to form an s-bit linear multiplier for s-bit arithmetic. We limited the encryption exponent to at most 17 bits to maintain high speed. We implemented a linear array modular multiplier by projecting the dependence graph (DG) of the Montgomery algorithm horizontally. Estimated with the Samsung 0.5 µm CMOS standard cell library, the proposed hardware performs encryption at 15 Mbps and decryption at 1.22 Mbps, which is the fastest among results published to date.
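The CRT step described in this abstract can be sketched in software as follows. This is a minimal illustration under stated assumptions, not the paper's hardware design: the function name `rsa_decrypt_crt`, the recombination by Garner's formula, and the toy key (p = 61, q = 53, e = 17) are all chosen here for illustration, and Python's built-in `pow` stands in for the Montgomery multiplier array.

```python
def rsa_decrypt_crt(c, p, q, d):
    """Decrypt ciphertext c using the CRT (hypothetical helper name).

    Two half-size exponentiations (mod p and mod q) replace one
    full-size exponentiation mod n = p * q. In hardware the two
    could run in parallel; here they simply run one after the other.
    """
    dp = d % (p - 1)          # reduced private exponent for p
    dq = d % (q - 1)          # reduced private exponent for q
    q_inv = pow(q, -1, p)     # q^{-1} mod p (requires Python 3.8+)

    mp = pow(c, dp, p)        # half-size exponentiation mod p
    mq = pow(c, dq, q)        # half-size exponentiation mod q

    # Garner's recombination: find m with m = mp (mod p), m = mq (mod q).
    h = (q_inv * (mp - mq)) % p
    return mq + h * q


# Toy key, far too small for real use (illustrative values only).
p, q, e = 61, 53, 17
n = p * q
d = pow(e, -1, (p - 1) * (q - 1))   # private exponent

message = 65
ciphertext = pow(message, e, n)
recovered = rsa_decrypt_crt(ciphertext, p, q, d)
```

The CRT result agrees with the direct computation `pow(ciphertext, d, n)`; the benefit is that each exponentiation works on operands and exponents of roughly half the bit length.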

Fast Array Architecture with Improved Reconfigurability (향상된 재구성능력을 가진 고속 어레이 구조)

  • Lee Jae-Ic;Kim Jinsang;Cho Won-Kyung;Kim Youngsoo
    • Proceedings of the IEEK Conference / 2004.06b / pp.451-454 / 2004
  • Reconfigurable architectures are increasingly important for the design of multi-mode communication systems and computation-intensive DSP systems. The proposed coarse-grain architecture is based on a reconfigurable processing element consisting of a MAC unit, a register file, a context data register, and PE interconnect control blocks. The main feature of the proposed architecture is the loop context, which enables faster configuration. We also propose another, area-efficient reconfigurable architecture with improved reconfigurability. SystemC modeling results show that the proposed architecture saves 9 clock cycles on the 2D DCT compared to existing architectures.

Conflict-Free Memory System for Subarray Access (서브어레이 접근을 위한 충돌회피 기억장치)

  • 박춘자;박종원
    • Proceedings of the Korean Information Science Society Conference / 2002.04a / pp.43-45 / 2002
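  • This paper proposes a conflict-free memory system that reduces memory access time in a SIMD processor with pq processing elements (PEs). The memory system supports simultaneous access to pq data elements arranged as a block or as a line in any of eight directions, with a constant interval, starting at an arbitrary position within an M×N array. The number of memory modules is a prime number greater than pq, and the interval is a positive integer that is not a multiple of the number of memory modules. To keep the address calculation and data routing circuits simple and fast, the addresses of the requested data are split into the base address of the first element and pq address differences; the address differences are sorted in ascending order of memory module number, starting from the module number of the first data element, and stored in fast memory modules. As a result, all nine subarray types with an interval require only that the base address of the first element be added to the m address differences, followed by a right rotation by the memory module number of the first element. We also propose an efficient data routing circuit that reduces the nine data routing patterns to one by means of multiplexing and rotation. Compared with previous memory systems, the proposed conflict-free memory system improves on the supported access types, the interval, the restrictions on data array size, hardware cost, speed, and complexity.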
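The role of a prime module count can be illustrated with a deliberately simplified model. The sketch below assumes a one-dimensional address space in which address `a` is stored in module `a mod m`; this mapping and the function names are assumptions for illustration and are not the address function of the paper. Under that assumption, any access of at most m elements whose interval is not a multiple of the prime m touches each module at most once.

```python
def modules_hit(base, interval, count, num_modules):
    """Module index touched by each of `count` accesses (simplified model).

    Assumes address a lives in module a % num_modules.
    """
    return [(base + t * interval) % num_modules for t in range(count)]


def is_conflict_free(base, interval, count, num_modules):
    """True if no two of the simultaneous accesses share a module."""
    hits = modules_hit(base, interval, count, num_modules)
    return len(set(hits)) == count


PQ = 4           # number of PEs, i.e. simultaneous accesses (example value)
M_PRIME = 5      # prime number of modules, greater than PQ

# Every interval that is not a multiple of M_PRIME is conflict-free
# for every starting address.
prime_ok = all(
    is_conflict_free(base, interval, PQ, M_PRIME)
    for interval in range(1, 50) if interval % M_PRIME != 0
    for base in range(100)
)

# An interval that is a multiple of the module count always collides.
multiple_collides = not is_conflict_free(0, 5, PQ, M_PRIME)

# Contrast: with 4 modules (not prime), an interval of 2 collides.
power_of_two_collides = not is_conflict_free(0, 2, PQ, 4)
```

The price of the prime module count in this model is one extra module beyond the pq that are strictly needed.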

Design and Implementation of Motion Estimation VLSI Processor using Block Matching Algorithm (완전탐색 블럭정합 알고리듬을 이용한 움직임 추정기의 VLSI 설계 및 구현)

  • 이용훈;권용무;박호근;류근장;김형곤;이문기
    • Journal of the Korean Institute of Telematics and Electronics B / v.31B no.9 / pp.76-84 / 1994
  • This paper presents a new high-performance VLSI architecture and its VLSI implementation for the full-search block matching algorithm. The proposed architecture features two-directional parallel and pipelined processing, which reduces the PE idle time that occurs when the direction of the block matching operation changes within the search area. Therefore, the proposed architecture is faster than existing architectures at the same clock frequency. HSPICE circuit simulation verifies that the implemented processing element operates correctly within 13 ns, supporting 75 MHz operation.
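For reference, the full-search block matching algorithm that this architecture accelerates can be sketched in plain software. The version below is a sequential sketch only: the sum of absolute differences (SAD) as the matching cost, the function names, and the synthetic 8×8 frames are assumptions for illustration, and nothing here models the paper's parallel, pipelined PE array.

```python
def sad(cur, ref, bx, by, dx, dy, n):
    """Sum of absolute differences between the n x n block of `cur`
    at (bx, by) and the block of `ref` displaced by (dx, dy)."""
    total = 0
    for y in range(n):
        for x in range(n):
            total += abs(cur[by + y][bx + x] - ref[by + dy + y][bx + dx + x])
    return total


def full_search(cur, ref, bx, by, n, search_range):
    """Try every displacement in [-search_range, search_range]^2 and
    return (best_cost, dx, dy). Candidates outside the frame are skipped."""
    height, width = len(ref), len(ref[0])
    best = None
    for dy in range(-search_range, search_range + 1):
        for dx in range(-search_range, search_range + 1):
            inside = (0 <= bx + dx and bx + dx + n <= width and
                      0 <= by + dy and by + dy + n <= height)
            if not inside:
                continue
            cost = sad(cur, ref, bx, by, dx, dy, n)
            if best is None or cost < best[0]:
                best = (cost, dx, dy)
    return best


# Synthetic 8 x 8 reference frame, and a current frame whose content
# equals the reference shifted by (2, 1); pixels with no source are 0.
SIZE = 8
ref = [[x * x + 10 * y for x in range(SIZE)] for y in range(SIZE)]
cur = [[ref[y + 1][x + 2] if y + 1 < SIZE and x + 2 < SIZE else 0
        for x in range(SIZE)] for y in range(SIZE)]

result = full_search(cur, ref, bx=2, by=2, n=2, search_range=2)
```

Each candidate displacement is independent of the others, which is what makes the algorithm a natural fit for arrays of processing elements.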

Two-dimensional systolic arrays for DCT/DST/DHT hardware implementation (DCT/DST/DHT 하드웨어 구현을 위한 2차원 시스톨릭 어레이)

  • 판성범;박래홍
    • Journal of the Korean Institute of Telematics and Electronics B / v.31B no.10 / pp.11-20 / 1994
  • We propose two architectures using two-dimensional systolic arrays for the DCT/DST/DHT. One decomposes the N-point DCT/DST/DHT into even- and odd-numbered frequency samples and then computes them independently at the same time. This architecture can also be used for the IDCT/IDST/IDHT. The other is a modified version for the DHT/IDHT. Both proposed architectures generate outputs sequentially using real multiplications and additions. Compared to conventional methods, the proposed systolic arrays offer advantages in terms of the simplicity of the processing element (PE), latency, and throughput. Simulation results using VHDL, the international standard language for hardware description, show the effectiveness of the proposed architecture.
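The even/odd frequency split mentioned in this abstract can be checked numerically for the DCT. The sketch below uses the unnormalized DCT-II and the standard symmetry identity (sums of mirrored inputs feed the even outputs, differences feed the odd outputs); whether the paper's arrays use exactly this formulation is an assumption, and the function names are illustrative.

```python
import math


def dct_direct(x):
    """Unnormalized N-point DCT-II computed straight from the definition."""
    n = len(x)
    return [sum(x[i] * math.cos(math.pi * (2 * i + 1) * k / (2 * n))
                for i in range(n))
            for k in range(n)]


def dct_even_odd(x):
    """Same transform, split into even- and odd-numbered frequency samples.

    Even outputs depend only on s[i] = x[i] + x[N-1-i], and odd outputs
    only on d[i] = x[i] - x[N-1-i], so the two halves are independent
    and could be computed at the same time. N must be even.
    """
    n = len(x)
    half = n // 2
    s = [x[i] + x[n - 1 - i] for i in range(half)]
    d = [x[i] - x[n - 1 - i] for i in range(half)]
    out = [0.0] * n
    for k in range(half):
        out[2 * k] = sum(
            s[i] * math.cos(math.pi * (2 * i + 1) * (2 * k) / (2 * n))
            for i in range(half))
        out[2 * k + 1] = sum(
            d[i] * math.cos(math.pi * (2 * i + 1) * (2 * k + 1) / (2 * n))
            for i in range(half))
    return out


samples = [1.0, 4.0, -2.0, 0.5, 3.0, 7.0, -1.5, 2.0]
direct = dct_direct(samples)
split = dct_even_odd(samples)
max_error = max(abs(a - b) for a, b in zip(direct, split))
```

Each half sums over N/2 terms instead of N, so the split halves the multiplications per output in this formulation.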

An Improving Motion Estimator based on multi arithmetic Architecture (고밀도 성능향상을 위한 다중연산구조기반의 움직임추정 프로세서)

  • Lee, Kang-Whan
    • Proceedings of the IEEK Conference / 2006.06a / pp.631-632 / 2006
  • In this paper, to obtain a design better suited to SoC implementation of fast hierarchical motion estimation, we exploit a foreground and background search algorithm (FBSA) based on the dual arithmetic processor element (DAPE). It can estimate motion displacement over a large search area using half the number of PEs required by general methods. The proposed MHME architecture improves the VLSI hardware design because the proposed FBSA structure with DAPE removes the local memory. The proposed FBSA, which uses bit array processing in the search area, can improve the structure in a manner similar to a multiple processor array unit (MPAU).

AN ARTIFICIAL NEURAL NETWORK BASED SENSOR SYSTEMS FOR GAS LEAKAGE MONITORING

  • Ahn, Hyung-Il;Kim, Eung-Sik;Lee, June-Ho
    • Proceedings of the Korea Institute of Fire Science and Engineering Conference / 1997.11a / pp.282-288 / 1997
  • The purpose of this paper is to predict gas leak conditions in a closed space using an artificial neural network (ANN). Existing systems cannot monitor the whole situation with on/off signals. In particular, the initial-stage data that determine the leak spot and intensity are disregarded in gas accidents. To remedy these shortcomings, a new prototype monitoring system is proposed. The system is composed of a sensing system, a data acquisition system, a computer, and an ANN implemented in software, and it is capable of identifying the leak spot and intensity in a closed space. The gas concentration is measured at 4 different places. The network has 3 layers composed of 4 input processing elements (PEs), 24 hidden PEs, and 4 output PEs. The ANN was tuned through several experiments, and as a result a recognition rate of 93.75% was achieved.
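The 4-24-4 layer structure described here can be sketched as a plain forward pass. Everything beyond the layer sizes is an assumption for illustration: the sigmoid activation, the random untrained weights, the sample sensor readings, and the reading of the largest output as the leak spot are not taken from the paper.

```python
import math
import random


def make_layer(n_in, n_out, rng):
    """Random weights for one fully connected layer.
    Each row holds n_in weights followed by one bias term."""
    return [[rng.uniform(-1.0, 1.0) for _ in range(n_in + 1)]
            for _ in range(n_out)]


def forward_layer(weights, inputs):
    """Apply one layer with a sigmoid activation (assumed here)."""
    outputs = []
    for row in weights:
        # zip stops after len(inputs) weights, so the bias row[-1]
        # is added exactly once.
        z = row[-1] + sum(w * x for w, x in zip(row, inputs))
        outputs.append(1.0 / (1.0 + math.exp(-z)))
    return outputs


rng = random.Random(0)                 # fixed seed for repeatability
hidden_layer = make_layer(4, 24, rng)  # 4 input PEs -> 24 hidden PEs
output_layer = make_layer(24, 4, rng)  # 24 hidden PEs -> 4 output PEs

# Hypothetical normalized gas concentrations from the 4 sensors.
readings = [0.1, 0.7, 0.3, 0.2]

hidden = forward_layer(hidden_layer, readings)
scores = forward_layer(output_layer, hidden)

# Assumed interpretation: the output PE with the largest score
# names the most likely leak spot.
leak_spot = max(range(len(scores)), key=lambda i: scores[i])
```

With trained weights in place of the random ones, the same forward pass would produce the classification whose recognition rate the abstract reports.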

Implementation of SIMD-based Many-Core Processor for Efficient Image Data Processing (효율적인 영상데이터 처리를 위한 SIMD기반 매니코어 프로세서 구현)

  • Choi, Byong-Kook;Kim, Cheol-Hong;Kim, Jong-Myon
    • Journal of the Korea Society of Computer and Information / v.16 no.1 / pp.1-9 / 2011
  • As mobile multimedia devices become more widely used, the need for high-performance, low-energy multimedia processors is increasing. Application-specific integrated circuits (ASICs) can deliver the high performance that mobile multimedia requires, but they provide limited, if any, generality for varied application requirements. DSP-based systems can be used for many types of applications because of their generality, but they cost more, consume more energy, and deliver lower performance than ASICs. To solve this problem, this paper proposes a single instruction multiple data (SIMD) based many-core processor that supports high-performance, low-power image data processing while retaining generality. The proposed SIMD-based many-core processor, composed of 16 processing elements (PEs), exploits the large data parallelism inherent in image data processing. Experimental results indicate that the proposed SIMD-based many-core processor achieves higher performance (22 times better), energy efficiency (7 times better), and area efficiency (3 times better) than conventional commercial high-performance processors.

A Minimum Wavelength Assignment Technique for Wavelength-routed Optical Network-on-Chip (파장 라우팅 광학 네트워크-온-칩에서의 최소 개수 파장 할당 기법)

  • Kim, Youngseok;Lee, Jae Hun;Cui, Di;Han, Tae Hee
    • Journal of the Institute of Electronics and Information Engineers / v.50 no.10 / pp.82-90 / 2013
  • An optical network-on-chip (ONoC) based on silicon photonics is one of the promising technologies for next-generation exascale computing architectures. Recent research on ONoCs focuses on further improving bandwidth and avoiding path collisions by using wavelength division multiplexing (WDM). However, in existing ONoCs that adopt a centralized routing architecture, the number of wavelengths used for WDM increases linearly with the number of processing elements (PEs). This also raises the cost of optical devices such as optical switches and light sources, and it limits the scalability of the ONoC because of the signal loss caused by interference among distinct light sources. In this paper, we propose a distributed routing architecture for ONoCs based on a 2D-mesh structure using WDM, and we present a method that minimizes the required number of wavelengths by exploiting the connectivity of communication. Compared with existing centralized routing architectures, the results show a 56% reduction in the number of wavelengths and a 21% reduction in the number of optical switches in $8{\times}8$ networks.