• Title/Summary/Keyword: PE (Processing Element)


A study on the implementation of a digital programmable CNN with variable template memory (가변적 템플릿 메모리를 갖는 디지털 프로그래머블 CNN 구현에 관한 연구)

  • 윤유권;문성룡
    • Journal of the Korean Institute of Telematics and Electronics C
    • /
    • v.34C no.10
    • /
    • pp.59-66
    • /
    • 1997
  • Neural networks have been widely used in practical applications such as speech recognition, image processing, and pattern recognition. A cellular neural network (CNN), often realized with voltage-controlled current sources, has the key feature that each cell is connected only to its local neighbors. Because the interconnection between cells is very simple and space-invariant, CNNs are well suited to VLSI implementation. In this paper, a processing element of a digital programmable CNN with variable template memory was implemented as a CMOS circuit. The CNN PE circuit was designed to control gain so that optimal solutions can be obtained at the CNN output. The performance of a 4x4 CNN circuit with a fixed template and with a variable template was analyzed through HSPICE simulations. The simulation results verify that the proposed variable-template method improves performance compared with the fixed-template method.

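The cellular neural network in the entry above updates every cell from its 3x3 neighborhood through a feedback template A and a control template B, and those template values are what the variable template memory holds. The following Python sketch shows the generic CNN update rule only; it is not the paper's CMOS circuit, and the array size, template values, and bias are illustrative assumptions.

```python
import numpy as np

def cnn_step(x, u, A, B, z, dt=0.1):
    """One Euler step of the cellular neural network dynamics.

    x : cell state array, u : input array (same shape),
    A : 3x3 feedback template, B : 3x3 control template, z : bias.
    Each cell interacts only with its 3x3 neighborhood.
    """
    y = 0.5 * (np.abs(x + 1) - np.abs(x - 1))   # standard CNN output nonlinearity
    pad_y = np.pad(y, 1)
    pad_u = np.pad(u, 1)
    dx = -x + z
    for r in range(3):
        for c in range(3):
            dx += A[r, c] * pad_y[r:r + x.shape[0], c:c + x.shape[1]]
            dx += B[r, c] * pad_u[r:r + u.shape[0], c:c + u.shape[1]]
    return x + dt * dx

# Illustrative 4x4 array with an edge-detection-like template (assumed values).
x = np.zeros((4, 4))
u = np.random.rand(4, 4)
A = np.array([[0, 0, 0], [0, 2.0, 0], [0, 0, 0]])
B = np.array([[-1, -1, -1], [-1, 8.0, -1], [-1, -1, -1]])
for _ in range(50):
    x = cnn_step(x, u, A, B, z=-0.5)
```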

A High Speed Bit-level Viterbi Decoder

  • Kim Min-U;Jo Jun-Dong
    • Proceedings of the Korea Intelligent Information System Society Conference
    • /
    • 2006.06a
    • /
    • pp.311-315
    • /
    • 2006
  • A Viterbi decoder consists mainly of BM (branch metric), ACS (add-compare-select), and SM (survivor memory) blocks. Among these, the ACS unit (ACSU) has been the bottleneck for high-speed data processing, and much research has been devoted to resolving it. The look-ahead technique processes the ACSU in M steps and proposes a new comparison algorithm using CS (carry-save) numbers to pursue high throughput, while the minimized method performs block processing with decoding in both forward and backward directions, completely removing the feedback in the ACSU to pursue extremely high throughput. In this work, the core block of the minimized-method algorithm is implemented at the bit level based on the basic PE (processing element) of the look-ahead technique; a code converter removes the redundant number (1) from the CS numbers to simplify the comparator. The design was synthesized with Synopsys Design Compiler and a TSMC 0.18 um library.

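The add-compare-select step that this entry identifies as the throughput bottleneck is, for each trellis state, an addition of branch metrics to predecessor path metrics followed by a comparison and a selection. A minimal software sketch of one ACS recursion follows; the 4-state trellis and metric values are illustrative assumptions, and the paper's bit-level carry-save hardware is not reproduced here.

```python
def acs_step(path_metrics, branch_metrics, prev_states):
    """One add-compare-select recursion over all trellis states.

    path_metrics   : dict state -> accumulated path metric
    branch_metrics : dict (prev_state, state) -> metric of that transition
    prev_states    : dict state -> list of predecessor states
    Returns new path metrics and the surviving predecessor per state.
    """
    new_metrics, survivors = {}, {}
    for state, preds in prev_states.items():
        candidates = [(path_metrics[p] + branch_metrics[(p, state)], p) for p in preds]
        new_metrics[state], survivors[state] = min(candidates)   # compare & select
    return new_metrics, survivors

# Illustrative 4-state trellis (constraint length 3); metric values are assumptions.
prev_states = {0: [0, 1], 1: [2, 3], 2: [0, 1], 3: [2, 3]}
path_metrics = {0: 0, 1: 10, 2: 10, 3: 10}
branch_metrics = {(p, s): abs(p - s) for s in prev_states for p in prev_states[s]}
path_metrics, survivors = acs_step(path_metrics, branch_metrics, prev_states)
```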

Multi-Core Processor for Real-Time Sound Synthesis of Gayageum (가야금의 실시간 음 합성을 위한 멀티코어 프로세서 구현)

  • Choi, Ji-Won;Cho, Sang-Jin;Kim, Cheol-Hong;Kim, Jong-Myon;Chong, Ui-Pil
    • The KIPS Transactions:PartA
    • /
    • v.18A no.1
    • /
    • pp.1-10
    • /
    • 2011
  • Physical modeling has been widely used for sound synthesis because it produces high-quality sound close to the real sound of a musical instrument. However, physical modeling requires a large number of parameters to synthesize many sounds simultaneously, which prevents real-time processing. To solve this problem, this paper proposes a single instruction, multiple data (SIMD) based multi-core processor that supports real-time sound synthesis of the gayageum, a representative Korean traditional musical instrument. The proposed SIMD-based multi-core processor consists of 12 processing elements (PEs) to control the 12 strings of the gayageum, where each PE models the corresponding string. The processor generates the synthesized sounds of all 12 strings simultaneously after receiving the excitation signal and parameters of each string as input. Experimental results using a 44.1 kHz sampling rate and 16-bit quantization show that the sound synthesized by the proposed multi-core processor is very similar to the original sound. In addition, the proposed multi-core processor outperforms commercial processors (TI's TMS320C6416, ARM926EJ-S, ARM1020E) in execution time (5.6 to 11.4 times better) and energy efficiency (about 553 to 1,424 times better).
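
Physical modeling of a plucked string, as done per string by each PE above, is commonly implemented with a digital-waveguide or Karplus-Strong style delay-line filter. The paper's exact gayageum model is not specified here, so the following Python sketch is only an illustrative plucked-string synthesizer; the frequency, damping, and filter choice are assumed values.

```python
import numpy as np

def pluck_string(freq_hz, duration_s, fs=44100, damping=0.996):
    """Karplus-Strong style plucked-string synthesis (illustrative, not the paper's model).

    A noise burst (the excitation signal) circulates in a delay line whose
    length sets the pitch; a simple averaging filter models string losses.
    """
    n = int(fs / freq_hz)                      # delay-line length ~ one period
    delay = np.random.uniform(-1, 1, n)        # excitation: initial noise burst
    out = np.empty(int(fs * duration_s))
    for i in range(out.size):
        out[i] = delay[0]
        # loss filter: average of the two oldest samples, scaled by damping
        new_sample = damping * 0.5 * (delay[0] + delay[1])
        delay = np.roll(delay, -1)
        delay[-1] = new_sample
    return out

# One string at ~220 Hz for half a second; twelve such calls would mimic 12 PEs.
samples = pluck_string(220.0, 0.5)
```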

Functionality-based Processing-In-Memory Accelerator for Deep Neural Networks (딥뉴럴네트워크를 위한 기능성 기반의 핌 가속기)

  • Kim, Min-Jae;Kim, Shin-Dug
    • Proceedings of the Korea Information Processing Society Conference
    • /
    • 2020.11a
    • /
    • pp.8-11
    • /
    • 2020
  • With the arrival of the fourth industrial revolution and the ongoing convergence of AI and ICT technologies, AI services are now requested even on user-level devices. Image-related AI services are used for subject identification, defect inspection, autonomous driving, and more, and the Deep Convolutional Neural Network (DCNN) in particular shows excellent performance at capturing image features. However, as images grow larger and networks deeper, computation suffers from low data locality and frequent memory references. Consequently, conventional hierarchical system architectures are limited in processing DCNNs scalably and quickly. This study proposes a Processing-In-Memory (PIM) accelerator with a 3D memory structure for scalable and fast DCNN processing. To this end, hardware and software modules were added to the Hybrid Memory Cube (HMC), an existing 3D memory: specifically, a shared cache and software stack that allow data sharing between Processing Elements (PEs), pipelined multipliers, and dual prefetch buffers. Evaluated on the well-known DCNN models LeNet, AlexNet, ZFNet, VGGNet, GoogLeNet, and ResNet, the design showed a 40.3% speed improvement and a 29.4% bandwidth improvement over the baseline HMC.
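
The low data locality this entry mentions comes from the convolution loop nest itself: every output pixel re-reads an overlapping input window plus the layer's full filter set. The Python sketch below shows that generic access pattern only; it is not the paper's PIM hardware, and the layer shapes are illustrative assumptions.

```python
import numpy as np

def conv2d_direct(inp, weights):
    """Naive direct convolution: out[co, y, x] = sum over ci, ky, kx.

    inp     : (Cin, H, W) input feature map
    weights : (Cout, Cin, K, K) filter bank
    Each output pixel re-reads a K*K*Cin input window and a whole filter,
    the access pattern that strains a conventional memory hierarchy.
    """
    cin, h, w = inp.shape
    cout, _, k, _ = weights.shape
    out = np.zeros((cout, h - k + 1, w - k + 1))
    for co in range(cout):
        for y in range(out.shape[1]):
            for x in range(out.shape[2]):
                out[co, y, x] = np.sum(inp[:, y:y + k, x:x + k] * weights[co])
    return out

# Illustrative layer: 16 input channels, 32 filters of size 3x3 on a 32x32 map.
fmap = np.random.rand(16, 32, 32).astype(np.float32)
filt = np.random.rand(32, 16, 3, 3).astype(np.float32)
result = conv2d_direct(fmap, filt)
```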

Design of a Pipelined High Performance RSA Crypto_chip (파이프라인 구조의 고속 RSA 암호화 칩 설계)

  • Lee, Seok-Yong;Kim, Seong-Du;Jeong, Yong-Jin
    • Journal of KIISE:Computer Systems and Theory
    • /
    • v.28 no.6
    • /
    • pp.301-309
    • /
    • 2001
  • This paper presents a new hardware architecture for modular exponentiation, the core operation of the RSA cryptosystem. The design uses the Montgomery multiplication algorithm and, unlike existing approaches that map the dependence graph (DG) vertically, maps it horizontally to form a one-dimensional linear array. As a result, intermediate values during exponentiation emerge sequentially and can feed directly into the next multiplication, achieving 100% throughput; the computation also takes half as many clock cycles as the vertical mapping, and the control logic becomes simpler. Each PE (Processing Element) consists of two full adders and three multiplexers, and for a key length of k bits the pipeline is built from only k+3 PEs. Encrypting or decrypting 1024-bit RSA data takes 2k^2+12k+19 clock cycles, giving about 50 kbps at a 100 MHz clock frequency. Moreover, the locality, regularity, and modularity of the internal computation structure make the proposed hardware suitable for VLSI implementation for real-time high-speed processing.

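The modular exponentiation described above repeatedly applies Montgomery multiplication inside a square-and-multiply loop. The Python sketch below is a word-level illustration of that algorithm, not the paper's bit-level systolic array; the function and variable names are assumptions made for the example.

```python
def mont_mul(a, b, n, k):
    """Radix-2 Montgomery multiplication: returns a*b*2^(-k) mod n (n odd, k bits)."""
    s = 0
    for i in range(k):
        s += ((a >> i) & 1) * b        # add b if bit i of a is set
        if s & 1:                      # make the partial sum even
            s += n
        s >>= 1                        # divide by 2 (exact, since s is now even)
    return s - n if s >= n else s

def mont_exp(m, e, n):
    """Modular exponentiation m^e mod n using Montgomery multiplication."""
    k = n.bit_length()
    r2 = (1 << (2 * k)) % n            # 2^(2k) mod n, for conversion into Montgomery form
    mm = mont_mul(m, r2, n, k)         # m in Montgomery form
    acc = mont_mul(1, r2, n, k)        # 1 in Montgomery form
    for bit in bin(e)[2:]:             # left-to-right square-and-multiply
        acc = mont_mul(acc, acc, n, k)
        if bit == '1':
            acc = mont_mul(acc, mm, n, k)
    return mont_mul(acc, 1, n, k)      # convert back out of Montgomery form

# Tiny illustrative check against Python's built-in modular exponentiation.
assert mont_exp(7, 65537, 3233) == pow(7, 65537, 3233)
```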

Architecture Design of Line based Lifting-DWT for JPEG2000 Image Compression (JPEG2000영상압축을 위한 라인 기반의 리프팅 DWT 구조 설계)

  • 정갑천;박성모
    • Journal of the Institute of Electronics Engineers of Korea SD
    • /
    • v.41 no.11
    • /
    • pp.97-104
    • /
    • 2004
  • This paper proposes an efficient VLSI architecture for the 9/7 and 5/3 lifting DWT filters used for lossy or lossless compression in JPEG2000. The proposed architecture uses only internal line memories to compute the lifting DWT, and its PE (Processing Element) has a critical path of one multiplier and one adder. To reduce the number of PEs, the vertical filter responsible for the column operations of the first level also performs both the row and column operations of the second and following levels. As a result, the architecture has a smaller hardware cost than other architectures. It was modeled at the RTL level in VHDL and implemented on an Altera APEX 20K FPGA.
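
Lifting factors the wavelet transform into alternating predict and update steps on the even and odd samples, which is what makes a line-based design with a one-multiplier, one-adder critical path per PE possible. Below is a minimal sketch of the reversible 5/3 lifting step defined in JPEG2000; it is the integer transform only, not the paper's VHDL design, and the simple border mirroring and even-length assumption are mine.

```python
def lifting_53_forward(x):
    """One level of the JPEG2000 reversible 5/3 lifting DWT on a 1-D signal.

    Predict: each odd sample becomes a detail coefficient d.
    Update : each even sample becomes an approximation coefficient s.
    Borders use simple mirroring; len(x) is assumed even here.
    """
    x = list(x)
    n = len(x)
    d = []
    for i in range(1, n, 2):                        # predict step
        left = x[i - 1]
        right = x[i + 1] if i + 1 < n else x[i - 1]
        d.append(x[i] - (left + right) // 2)
    s = []
    for j, i in enumerate(range(0, n, 2)):          # update step
        dl = d[j - 1] if j > 0 else d[0]
        dr = d[j]
        s.append(x[i] + (dl + dr + 2) // 4)
    return s, d

# Illustrative 8-sample line.
approx, detail = lifting_53_forward([10, 12, 14, 13, 11, 9, 8, 8])
```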

A Novel Spiral-Type Motion Estimation Architecture for H.264/AVC

  • Hirai, Naoyuki;Song, Tian;Liu, Yizhong;Shimamoto, Takashi
    • JSTS:Journal of Semiconductor Technology and Science
    • /
    • v.10 no.1
    • /
    • pp.37-44
    • /
    • 2010
  • New motion compensation features such as variable block sizes and multiple reference frames were introduced in H.264/AVC, but they significantly increase implementation complexity. In this paper, an efficient architecture for spiral-type motion estimation is proposed. First, a hardware-friendly spiral search order is proposed. Then, an efficient processing element (PE) architecture for ME is proposed to realize that search order. The improved PE allows one-pixel moves of the reference pixel data up, down, right, and left through four input/output ports. Moreover, a parallel calculation architecture that derives all block sizes from 4x4 SADs is introduced. Hardware implementation results show a cost of about 145k gates and a maximum clock frequency of 134 MHz for an FPGA (Xilinx Virtex-5) implementation.
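
Building every H.264/AVC block size from 4x4 SADs works because a larger partition's SAD is simply the sum of the 4x4 SADs it covers. The short sketch below illustrates that reuse in plain arithmetic; it is not the paper's PE array, and the random test data is an assumption.

```python
import numpy as np

def sad_4x4_grid(cur, ref):
    """SAD of each aligned 4x4 sub-block between a 16x16 current block and a candidate."""
    grid = np.zeros((4, 4), dtype=np.int64)
    for by in range(4):
        for bx in range(4):
            c = cur[4 * by:4 * by + 4, 4 * bx:4 * bx + 4].astype(int)
            r = ref[4 * by:4 * by + 4, 4 * bx:4 * bx + 4].astype(int)
            grid[by, bx] = np.abs(c - r).sum()
    return grid

cur = np.random.randint(0, 256, (16, 16), dtype=np.uint8)
ref = np.random.randint(0, 256, (16, 16), dtype=np.uint8)
g = sad_4x4_grid(cur, ref)

# Larger partitions reuse the 4x4 results instead of re-reading pixels:
sad_8x8_topleft = g[0:2, 0:2].sum()   # one 8x8 partition
sad_16x8_top    = g[0:2, :].sum()     # one 16x8 partition
sad_16x16       = g.sum()             # the whole macroblock
```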

Optimal Design Space Exploration of Multi-core Architecture for Real-time Lane Detection Algorithm (실시간 차선인식 알고리즘을 위한 최적의 멀티코어 아키텍처 디자인 공간 탐색)

  • Jeong, Inkyu;Kim, Jongmyon
    • Asia-pacific Journal of Multimedia Services Convergent with Art, Humanities, and Sociology
    • /
    • v.7 no.3
    • /
    • pp.339-349
    • /
    • 2017
  • This paper proposes a four-stage algorithm for detecting lanes from a driving car. The first stage extracts a region of interest from the image. The second stage applies a median filter to remove noise. In the third stage, a binarization algorithm classifies the input image into background and foreground. Finally, an image erosion algorithm removes the noise and edges remaining after binarization to obtain clean lanes. However, the proposed lane detection algorithm is computationally expensive. To address this, the paper presents a parallel implementation of the real-time lane detection algorithm on a multi-core architecture. In addition, eight different processing element (PE) architectures are implemented and simulated to select the optimal PE architecture for the target application. Experimental results indicate that the 40×40 PE architecture shows the best performance, energy efficiency, and area efficiency.
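
The four stages above map onto per-pixel neighborhood operations, which is why the algorithm parallelizes well across a PE array. The compact Python sketch below chains a median filter, fixed-threshold binarization, and a 3x3 erosion; the ROI bounds and threshold value are illustrative assumptions, not the paper's parameters.

```python
import numpy as np

def detect_lanes(gray, roi_rows=(60, 120), threshold=200):
    """Illustrative four-stage lane detection on a grayscale frame (uint8)."""
    # Stage 1: region of interest (e.g. the lower part of the frame).
    roi = gray[roi_rows[0]:roi_rows[1], :].astype(np.uint8)

    # Stage 2: 3x3 median filter to suppress noise.
    p = np.pad(roi, 1, mode='edge')
    stack = np.stack([p[dy:dy + roi.shape[0], dx:dx + roi.shape[1]]
                      for dy in range(3) for dx in range(3)])
    filtered = np.median(stack, axis=0)

    # Stage 3: binarization into foreground (bright lane marks) and background.
    binary = (filtered >= threshold).astype(np.uint8)

    # Stage 4: 3x3 erosion to remove isolated foreground noise.
    pb = np.pad(binary, 1, mode='constant')
    eroded = np.ones_like(binary)
    for dy in range(3):
        for dx in range(3):
            eroded &= pb[dy:dy + binary.shape[0], dx:dx + binary.shape[1]]
    return eroded

frame = np.random.randint(0, 256, (120, 160), dtype=np.uint8)  # dummy 160x120 frame
lanes = detect_lanes(frame)
```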

VLSI Implementation of Low-Power Motion Estimation Using Reduced Memory Accesses and Computations (메모리 호출과 연산횟수 감소기법을 이용한 저전력 움직임추정 VLSI 구현)

  • Moon, Ji-Kyung;Kim, Nam-Sub;Kim, Jin-Sang;Cho, Won-Kyung
    • The Journal of Korean Institute of Communications and Information Sciences
    • /
    • v.32 no.5A
    • /
    • pp.503-509
    • /
    • 2007
  • Low-power motion estimation is required for video coding in portable information devices. In this paper, we propose a low-power motion estimation algorithm and a 1-D systolic array VLSI architecture based on the full search block matching algorithm (FSBMA). The main power dissipation sources in FSBMA are its complex computations and frequent memory accesses to data in the search area. The proposed algorithm reduces memory accesses and computations by using a 1-D PE (processing element) array architecture that performs motion estimation of two neighboring blocks in parallel and by skipping unnecessary computations during motion estimation. VLSI implementation results show that the proposed architecture saves 9.3% of power dissipation and operates two times faster than an existing low-power motion estimator.
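
Full-search block matching evaluates the SAD of the current block against every candidate position in the search window, and a common way to skip unnecessary computations is to abandon a candidate as soon as its partial SAD already exceeds the best SAD found so far. The sketch below illustrates that idea in software; the row-wise early-termination criterion is an assumption, and the paper's specific skipping scheme and two-block parallelism are not reproduced.

```python
import numpy as np

def full_search(cur, ref, block_xy, block=16, search=8):
    """Full-search block matching with row-wise partial-SAD early termination."""
    bx, by = block_xy
    cur_blk = cur[by:by + block, bx:bx + block].astype(int)
    best_sad, best_mv = None, (0, 0)
    for dy in range(-search, search + 1):
        for dx in range(-search, search + 1):
            y, x = by + dy, bx + dx
            if y < 0 or x < 0 or y + block > ref.shape[0] or x + block > ref.shape[1]:
                continue                                   # candidate outside the frame
            cand = ref[y:y + block, x:x + block].astype(int)
            sad, skipped = 0, False
            for row in range(block):                       # accumulate SAD row by row
                sad += np.abs(cur_blk[row] - cand[row]).sum()
                if best_sad is not None and sad >= best_sad:
                    skipped = True                         # cannot beat the best: skip rest
                    break
            if not skipped and (best_sad is None or sad < best_sad):
                best_sad, best_mv = sad, (dx, dy)
    return best_mv, best_sad

cur = np.random.randint(0, 256, (64, 64))
ref = np.random.randint(0, 256, (64, 64))
mv, sad = full_search(cur, ref, block_xy=(16, 16))
```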

Design and Implementation of a Digital Neuron Processor for Real-Time Object Recognition in a Production Automation System (생산자동화 시스템에서 실시간 물체인식을 위한 디지털 뉴런프로세서의 설계 및 구현)

  • Hong, Bong-Wha;Joo, Hae-Jong
    • Journal of the Korea Society of Computer and Information
    • /
    • v.12 no.3
    • /
    • pp.37-50
    • /
    • 2007
  • In this paper, we designed and implemented a high-speed neuron processor for real-time object recognition in a production automation system. The PE (Processing Element) was designed with a residue number system, which has no carry propagation, for high-speed operation. It consists of a MAC (multiplication and accumulation) unit based on the residue number system and a sigmoid function unit based on MRC (mixed radix conversion). The designed circuits were described in C and VHDL (Very High Speed Integrated Circuit Hardware Description Language), synthesized with Compass tools, and finally fabricated in a 0.8 um CMOS process. In simulation, the designed MAC unit and sigmoid processing unit achieved a run time of 0.6 ns, improving speed by about three times and reducing hardware size by about 50%, respectively. The designed neuron processor can perform object recognition in a production automation system with the required real-time processing.

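In a residue number system, a value is represented by its remainders modulo a set of pairwise coprime moduli, so a multiply-accumulate can be done independently in each residue channel with no carries crossing between channels; mixed radix conversion (or the Chinese remainder theorem) recovers the ordinary value. The Python sketch below illustrates that idea only; the moduli set and the CRT-based reverse conversion are illustrative assumptions, not the paper's design.

```python
from math import prod

MODULI = (7, 11, 13, 15)          # pairwise coprime; dynamic range = 7*11*13*15 = 15015

def to_rns(x):
    """Forward conversion: represent x by its residue for each modulus."""
    return tuple(x % m for m in MODULI)

def rns_mac(acc, a, b):
    """Multiply-accumulate done independently in every residue channel (no carries)."""
    return tuple((r + x * y) % m for r, x, y, m in zip(acc, to_rns(a), to_rns(b), MODULI))

def from_rns(res):
    """Reverse conversion via the Chinese remainder theorem (MRC would also work)."""
    M = prod(MODULI)
    total = 0
    for r, m in zip(res, MODULI):
        Mi = M // m
        # each term contributes r modulo m and 0 modulo the other moduli
        total += r * Mi * pow(Mi, -1, m)   # pow(Mi, -1, m): modular inverse (Python 3.8+)
    return total % M

# Illustrative neuron dot product: sum of w[i]*x[i] accumulated channel-wise in RNS.
weights, inputs = [3, 5, 2], [10, 4, 7]
acc = to_rns(0)
for w, x in zip(weights, inputs):
    acc = rns_mac(acc, w, x)
assert from_rns(acc) == sum(w * x for w, x in zip(weights, inputs))   # 3*10+5*4+2*7 = 64
```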