• Title/Summary/Keyword: Convolution Neural Network Accelerator

Search Result 9, Processing Time 0.024 seconds

An Implementation of a Convolutional Accelerator based on a GPGPU for a Deep Learning (Deep Learning을 위한 GPGPU 기반 Convolution 가속기 구현)

  • Jeon, Hee-Kyeong;Lee, Kwang-yeob;Kim, Chi-yong
    • Journal of IKEEE
    • /
    • v.20 no.3
    • /
    • pp.303-306
    • /
    • 2016
  • In this paper, we propose a method to accelerate convolutional neural network by utilizing a GPGPU. Convolutional neural network is a sort of the neural network learning features of images. Convolutional neural network is suitable for the image processing required to learn a lot of data such as images. The convolutional layer of the conventional CNN required a large number of multiplications and it is difficult to operate in the real-time on the embedded environment. In this paper, we reduce the number of multiplications through Winograd convolution operation and perform parallel processing of the convolution by utilizing SIMT-based GPGPU. The experiment was conducted using ModelSim and TestDrive, and the experimental results showed that the processing time was improved by about 17%, compared to the conventional convolution.

Research on the Main Memory Access Count According to the On-Chip Memory Size of an Artificial Neural Network (인공 신경망 가속기 온칩 메모리 크기에 따른 주메모리 접근 횟수 추정에 대한 연구)

  • Cho, Seok-Jae;Park, Sungkyung;Park, Chester Sungchung
    • Journal of IKEEE
    • /
    • v.25 no.1
    • /
    • pp.180-192
    • /
    • 2021
  • One widely used algorithm for image recognition and pattern detection is the convolution neural network (CNN). To efficiently handle convolution operations, which account for the majority of computations in the CNN, we use hardware accelerators to improve the performance of CNN applications. In using these hardware accelerators, the CNN fetches data from the off-chip DRAM, as the massive computational volume of data makes it difficult to derive performance improvements only from memory inside the hardware accelerator. In other words, data communication between off-chip DRAM and memory inside the accelerator has a significant impact on the performance of CNN applications. In this paper, a simulator for the CNN is developed to analyze the main memory or DRAM access with respect to the size of the on-chip memory or global buffer inside the CNN accelerator. For AlexNet, one of the CNN architectures, when simulated with increasing the size of the global buffer, we found that the global buffer of size larger than 100kB has 0.8x as low a DRAM access count as the global buffer of size smaller than 100kB.

Low Power ADC Design for Mixed Signal Convolutional Neural Network Accelerator (혼성신호 컨볼루션 뉴럴 네트워크 가속기를 위한 저전력 ADC설계)

  • Lee, Jung Yeon;Asghar, Malik Summair;Arslan, Saad;Kim, HyungWon
    • Journal of the Korea Institute of Information and Communication Engineering
    • /
    • v.25 no.11
    • /
    • pp.1627-1634
    • /
    • 2021
  • This paper introduces a low-power compact ADC circuit for analog Convolutional filter for low-power neural network accelerator SOC. While convolutional neural network accelerators can speed up the learning and inference process, they have drawback of consuming excessive power and occupying large chip area due to large number of multiply-and-accumulate operators when implemented in complex digital circuits. To overcome these drawbacks, we implemented an analog convolutional filter that consists of an analog multiply-and-accumulate arithmetic circuit along with an ADC. This paper is focused on the design optimization of a low-power 8bit SAR ADC for the analog convolutional filter accelerator We demonstrate how to minimize the capacitor-array DAC, an important component of SAR ADC, which is three times smaller than the conventional circuit. The proposed ADC has been fabricated in CMOS 65nm process. It achieves an overall size of 1355.7㎛2, power consumption of 2.6㎼ at a frequency of 100MHz, SNDR of 44.19 dB, and ENOB of 7.04bit.

Microcode based Controller for Compact CNN Accelerators Aimed at Mobile Devices (모바일 디바이스를 위한 소형 CNN 가속기의 마이크로코드 기반 컨트롤러)

  • Na, Yong-Seok;Son, Hyun-Wook;Kim, Hyung-Won
    • Journal of the Korea Institute of Information and Communication Engineering
    • /
    • v.26 no.3
    • /
    • pp.355-366
    • /
    • 2022
  • This paper proposes a microcode-based neural network accelerator controller for artificial intelligence accelerators that can be reconstructed using a programmable architecture and provide the advantages of low-power and ultra-small chip size. In order for the target accelerator to support various neural network models, the neural network model can be converted into microcode through microcode compiler and mounted on accelerator to control the operators of the accelerator such as datapath and memory access. While the proposed controller and accelerator can run various CNN models, in this paper, we tested them using the YOLOv2-Tiny CNN model. Using a system clock of 200 MHz, the Controller and accelerator achieved an inference time of 137.9 ms/image for VOC 2012 dataset to detect object, 99.5ms/image for mask detection dataset to detect wearing mask. When implementing an accelerator equipped with the proposed controller as a silicon chip, the gate count is 618,388, which corresponds to 65.5% reduction in chip area compared with an accelerator employing a CPU-based controller (RISC-V).

Design and Implementation of Human and Object Classification System Using FMCW Radar Sensor (FMCW 레이다 센서 기반 사람과 사물 분류 시스템 설계 및 구현)

  • Sim, Yunsung;Song, Seungjun;Jang, Seonyoung;Jung, Yunho
    • Journal of IKEEE
    • /
    • v.26 no.3
    • /
    • pp.364-372
    • /
    • 2022
  • This paper proposes the design and implementation results for human and object classification systems utilizing frequency modulated continuous wave (FMCW) radar sensor. Such a system requires the process of radar sensor signal processing for multi-target detection and the process of deep learning for the classification of human and object. Since deep learning requires such a great amount of computation and data processing, the lightweight process is utmost essential. Therefore, binary neural network (BNN) structure was adopted, operating convolution neural network (CNN) computation in a binary condition. In addition, for the real-time operation, a hardware accelerator was implemented and verified via FPGA platform. Based on performance evaluation and verified results, it is confirmed that the accuracy for multi-target classification of 90.5%, reduced memory usage by 96.87% compared to CNN and the run time of 5ms are achieved.

Optimizing 2-stage Tiling-based Matrix Multiplication in FPGA-based Neural Network Accelerator (FPGA기반 뉴럴네트워크 가속기에서 2차 타일링 기반 행렬 곱셈 최적화)

  • Jinse, Kwon;Jemin, Lee;Yongin, Kwon;Jeman, Park;Misun, Yu;Taeho, Kim;Hyungshin, Kim
    • IEMEK Journal of Embedded Systems and Applications
    • /
    • v.17 no.6
    • /
    • pp.367-374
    • /
    • 2022
  • The acceleration of neural networks has become an important topic in the field of computer vision. An accelerator is absolutely necessary for accelerating the lightweight model. Most accelerator-supported operators focused on direct convolution operations. If the accelerator does not provide GEMM operation, it is mostly replaced by CPU operation. In this paper, we proposed an optimization technique for 2-stage tiling-based GEMM routines on VTA. We improved performance of the matrix multiplication routine by maximizing the reusability of the input matrix and optimizing the operation pipelining. In addition, we applied the proposed technique to the DarkNet framework to check the performance improvement of the matrix multiplication routine. The proposed GEMM method showed a performance improvement of more than 2.4 times compared to the non-optimized GEMM method. The inference performance of our DarkNet framework has also improved by at least 2.3 times.

A Real-Time Hardware Design of CNN for Vehicle Detection (차량 검출용 CNN 분류기의 실시간 처리를 위한 하드웨어 설계)

  • Bang, Ji-Won;Jeong, Yong-Jin
    • Journal of IKEEE
    • /
    • v.20 no.4
    • /
    • pp.351-360
    • /
    • 2016
  • Recently, machine learning algorithms, especially deep learning-based algorithms, have been receiving attention due to its high classification performance. Among the algorithms, Convolutional Neural Network(CNN) is known to be efficient for image processing tasks used for Advanced Driver Assistance Systems(ADAS). However, it is difficult to achieve real-time processing for CNN in vehicle embedded software environment due to the repeated operations contained in each layer of CNN. In this paper, we propose a hardware accelerator which enhances the execution time of CNN by parallelizing the repeated operations such as convolution. Xilinx ZC706 evaluation board is used to verify the performance of the proposed accelerator. For $36{\times}36$ input images, the hardware execution time of CNN is 2.812ms in 100MHz clock frequency and shows that our hardware can be executed in real-time.

An Efficient FPGA Based TDC Accelerator for Deconvolutional Neural Networks (효율적인 DCNN 연산을 위한 FPGA 기반 TDC 가속기)

  • Jang, Hyerim;Moon, Byungin
    • Proceedings of the Korea Information Processing Society Conference
    • /
    • 2021.05a
    • /
    • pp.457-458
    • /
    • 2021
  • 딥러닝 알고리즘 중 DCNN(DeConvolutional Neural Network)은 이미지 업스케일링과 생성·복원 등 다양한 분야에서 뛰어난 성능을 보여주고 있다. DCNN은 많은 양의 데이터를 병렬로 처리할 수 있기 때문에 하드웨어로 설계하는 것이 유용하다. 최근 DCNN의 하드웨어 구조 연구에서는 overlapping sum 문제를 해결하기 위해 deconvolution 필터를 convolution 필터로 변환하는 TDC(Transforming the Deconvolutional layer into the Convolutional layer) 알고리즘이 제안되었다. 하지만 TDC를 CPU(Central Processing Unit)로 수행하기 때문에 연산의 최적화가 어려우며, 외부 메모리를 사용하기에 추가적인 전력이 소모된다. 이에 본 논문에서는 저전력으로 구동할 수 있는 FPGA 기반 TDC 하드웨어 구조를 제안한다. 제안하는 하드웨어 구조는 자원 사용량이 적어 저전력으로 구동 가능할 뿐만 아니라, 병렬 처리 구조로 설계되어 빠른 연산 처리 속도를 보인다.

Cascade CNN with CPU-FPGA Architecture for Real-time Face Detection (실시간 얼굴 검출을 위한 Cascade CNN의 CPU-FPGA 구조 연구)

  • Nam, Kwang-Min;Jeong, Yong-Jin
    • Journal of IKEEE
    • /
    • v.21 no.4
    • /
    • pp.388-396
    • /
    • 2017
  • Since there are many variables such as various poses, illuminations and occlusions in a face detection problem, a high performance detection system is required. Although CNN is excellent in image classification, CNN operatioin requires high-performance hardware resources. But low cost low power environments are essential for small and mobile systems. So in this paper, the CPU-FPGA integrated system is designed based on 3-stage cascade CNN architecture using small size FPGA. Adaptive Region of Interest (ROI) is applied to reduce the number of CNN operations using face information of the previous frame. We use a Field Programmable Gate Array(FPGA) to accelerate the CNN computations. The accelerator reads multiple featuremap at once on the FPGA and performs a Multiply-Accumulate (MAC) operation in parallel for convolution operation. The system is implemented on Altera Cyclone V FPGA in which ARM Cortex A-9 and on-chip SRAM are embedded. The system runs at 30FPS with HD resolution input images. The CPU-FPGA integrated system showed 8.5 times of the power efficiency compared to systems using CPU only.