• Title/Summary/Keyword: FPGA Accelerator (FPGA 가속기)

A Study on Hardware Accelerator for Transformer Encoder (Transformer Encoder 의 Hardware Accelerator 에 관한 연구)

  • Ye-Song Yu;Chae-Yoon Kim;Hye-Ryeong Park;Chae-Won Ahn
    • Annual Conference of KIPS / 2024.10a / pp.928-929 / 2024
  • As data grows in scale and AI models become structurally more complex, the performance of AI hardware accelerators has become increasingly important. The Transformer model, which lies at the core of LLMs, has attracted particular attention, but research on hardware accelerators for the Transformer started relatively late compared to other models. The reasons include its complex computations and memory access patterns, which are difficult to optimize. The Transformer uses the Self-Attention mechanism to compute the relationships among all elements of the input sequence [1], which demands a very large amount of computation and memory. As NLP technology has become an irreplaceable tool throughout everyday life, Transformer accelerators need to be researched and developed more actively [2]. In this study, a hardware-optimized Transformer encoder is implemented in Verilog HDL, then synthesized and uploaded to an FPGA chip. By building an accelerator tailored to the Transformer encoder, we aim to promote the emergence and development of various NLP models, and we also establish a research pipeline for building specialized arithmetic units for each model.
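
The abstract above describes self-attention as computing the relationships among all elements of the input sequence. As background for what such an accelerator must implement, here is a minimal NumPy sketch of standard scaled dot-product self-attention; it illustrates the general formulation only, not the paper's Verilog design, and the matrix names and sizes are illustrative.

```python
import numpy as np

def self_attention(x, w_q, w_k, w_v):
    """Scaled dot-product self-attention over a sequence.

    x: (seq_len, d_model) input embeddings
    w_q, w_k, w_v: (d_model, d_k) projection matrices
    """
    q = x @ w_q                                   # queries
    k = x @ w_k                                   # keys
    v = x @ w_v                                   # values
    scores = q @ k.T / np.sqrt(k.shape[-1])       # all pairwise relations: (seq_len, seq_len)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)  # row-wise softmax
    return weights @ v                            # weighted sum of values

# toy usage
rng = np.random.default_rng(0)
x = rng.standard_normal((4, 8))                   # 4 tokens, d_model = 8
w = [rng.standard_normal((8, 8)) for _ in range(3)]
out = self_attention(x, *w)                       # (4, 8)
```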

Embedded SoC Design for H.264/AVC Decoder (H.264/AVC 디코더를 위한 Embedded SoC 설계)

  • Kim, Jin-Wook;Park, Tae-Geun
    • Journal of the Institute of Electronics Engineers of Korea SD / v.45 no.9 / pp.71-78 / 2008
  • In this paper, we implement an H.264/AVC baseline decoder through hardware-software partitioning on embedded Linux kernel 2.4.26 and an FPGA-based target board with an ARM926EJ-S core. We design several IPs for the time-critical blocks, such as motion compensation, the deblocking filter, and YUV-to-RGB conversion, and they communicate with the host through the AMBA bus protocol. We also try to minimize the number of memory accesses between the IPs and the reference software (JM 11.0), which is ported to the embedded Linux. The proposed IPs and the overall system have been designed and verified in several stages. The proposed system decodes the QCIF sample video at 2 frames per second with a 24 MHz system clock, and we expect better performance if the proposed system is implemented as an ASIC.
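
The abstract lists YUV-to-RGB conversion among the time-critical blocks moved into hardware IPs. As a reference for the arithmetic involved, below is a minimal sketch of a standard BT.601-style per-pixel conversion; the exact coefficients and fixed-point scaling used in the paper's IP are not given in the abstract, so this is generic background rather than the authors' design.

```python
def yuv_to_rgb(y, u, v):
    """Convert one YCbCr pixel to RGB (BT.601, full-range approximation)."""
    d = u - 128                      # Cb offset
    e = v - 128                      # Cr offset
    r = y + 1.402 * e
    g = y - 0.344136 * d - 0.714136 * e
    b = y + 1.772 * d
    clamp = lambda c: max(0, min(255, int(round(c))))
    return clamp(r), clamp(g), clamp(b)

print(yuv_to_rgb(128, 128, 128))     # mid-grey maps to (128, 128, 128)
```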

Design and Implementation of CNN-based HMI System using Doppler Radar and Voice Sensor (도플러 레이다 및 음성 센서를 활용한 CNN 기반 HMI 시스템 설계 및 구현)

  • Oh, Seunghyun;Bae, Chanhee;Kim, Seryeong;Cho, Jaechan;Jung, Yunho
    • Journal of IKEEE / v.24 no.3 / pp.777-782 / 2020
  • In this paper, we propose a CNN-based HMI system using a Doppler radar and a voice sensor, and present its hardware design and implementation results. To overcome the limitations of single-sensor monitoring, the proposed HMI system combines data from the two sensors to improve performance. In a noisy environment, the proposed system improves performance by 3.5% and 12% over classifiers based on a single radar sensor and a single voice sensor, respectively. In addition, hardware that accelerates the computationally intensive units of the CNN is implemented and verified on an FPGA test system. Performance evaluation shows that the proposed HMI acceleration platform reduces computation time by 95% compared to a software-only design.
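
The abstract states only that the system combines data from the radar and the voice sensor to outperform either single-sensor classifier; the fusion method itself is not described. The sketch below shows one common option, feature-level fusion by concatenation ahead of a shared classifier, purely as an illustration; all array names and sizes are assumptions.

```python
import numpy as np

def fuse_features(radar_feat, voice_feat):
    """Feature-level fusion: concatenate per-sensor feature vectors.

    radar_feat, voice_feat: 1-D feature vectors produced by each
    sensor-specific front end (e.g. flattened CNN feature maps).
    """
    return np.concatenate([radar_feat, voice_feat])

# toy usage: a linear classifier over the fused vector
rng = np.random.default_rng(1)
radar_feat = rng.standard_normal(64)
voice_feat = rng.standard_normal(32)
fused = fuse_features(radar_feat, voice_feat)        # (96,)
w, b = rng.standard_normal((4, 96)), np.zeros(4)     # 4 output classes (assumed)
scores = w @ fused + b
print(int(np.argmax(scores)))                        # predicted class index
```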

Design of a Digital LLRF for the MC50 Cyclotron (MC50 사이클로트론을 위한 디지털 LLRF의 설계)

  • Jo, Seong-Jin;Choe, Jun-Yong;Hong, Seung-Pyo;Kim, Gye-Hong;Park, Yeon-Su
    • Proceedings of the Korean Vacuum Society Conference / 2013.02a / pp.515-515 / 2013
  • RF is used in a cyclotron to accelerate the beam to the desired energy. The MC50 cyclotron has two DEEs, each controlled through its own independent LLRF module and amplifier. The main control variables are the voltages of DEE1 and DEE2 and the phase between them; these are controlled by modulating, in the RF modulator, the amplitude and phase of the RF signal generated at a specific frequency by the RF generator. The current modulator is old, so the DEE voltage is not well controlled and its connections occasionally cause problems, which led us to build a new modulator. The existing LLRF is an analog design, but analog control is difficult to operate externally and hard to extend, so we designed a digital control scheme. The new LLRF consists of two parts, a low-speed section and a high-speed section. The low-speed section, which monitors the status of the final amplifier and the cavity, will be handled by a PLC, while the high-speed section, which controls the amplitude and phase of the RF signal, will be handled by an FPGA.
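
The translated abstract identifies two control variables handled by the high-speed FPGA path: the amplitude and the phase of the RF signal produced at a fixed frequency. The sketch below simply restates what those two setpoints do to a sinusoidal carrier; the frequency and setpoint values are illustrative, and this is generic signal math, not the paper's modulator design.

```python
import numpy as np

def modulate(freq_hz, amplitude, phase_rad, t):
    """RF carrier with the two LLRF control variables applied:
    an amplitude scaling and a phase offset."""
    return amplitude * np.sin(2 * np.pi * freq_hz * t + phase_rad)

t = np.linspace(0, 1e-6, 1000)             # 1 us observation window
dee1 = modulate(25.0e6, 1.0, 0.0, t)       # DEE1 setpoints (values illustrative)
dee2 = modulate(25.0e6, 1.0, np.pi, t)     # DEE2 shifted in phase relative to DEE1 (assumed)
print(dee1[:3], dee2[:3])
```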

Optimizing 2-stage Tiling-based Matrix Multiplication in FPGA-based Neural Network Accelerator (FPGA기반 뉴럴네트워크 가속기에서 2차 타일링 기반 행렬 곱셈 최적화)

  • Jinse, Kwon;Jemin, Lee;Yongin, Kwon;Jeman, Park;Misun, Yu;Taeho, Kim;Hyungshin, Kim
    • IEMEK Journal of Embedded Systems and Applications / v.17 no.6 / pp.367-374 / 2022
  • The acceleration of neural networks has become an important topic in the field of computer vision, and an accelerator is essential for running even lightweight models efficiently. Most accelerator-supported operators have focused on direct convolution; if the accelerator does not provide a GEMM operation, it is usually replaced by CPU computation. In this paper, we propose an optimization technique for 2-stage tiling-based GEMM routines on VTA. We improve the performance of the matrix multiplication routine by maximizing the reusability of the input matrix and by optimizing the operation pipelining. In addition, we apply the proposed technique to the DarkNet framework to verify the performance improvement of the matrix multiplication routine. The proposed GEMM method shows a performance improvement of more than 2.4 times over the non-optimized GEMM method, and the inference performance of our DarkNet framework also improves by at least 2.3 times.
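
The abstract's optimization maximizes reuse of the input matrix in a tiling-based GEMM routine. The NumPy sketch below shows the general idea with a single level of tiling: each tile of A is loaded once and reused against every tile of B in the same block row. The VTA-specific second tiling stage and pipelining are not shown, and the tile sizes are illustrative.

```python
import numpy as np

def tiled_gemm(a, b, tile_m=16, tile_n=16, tile_k=16):
    """Blocked matrix multiplication C = A @ B.

    Each tile of A is loaded once and reused against every tile of B
    along the same block row, which is the reuse pattern a tiling-based
    accelerator routine tries to maximize.
    """
    m, k = a.shape
    k2, n = b.shape
    assert k == k2
    c = np.zeros((m, n), dtype=a.dtype)
    for i0 in range(0, m, tile_m):
        for k0 in range(0, k, tile_k):
            a_tile = a[i0:i0 + tile_m, k0:k0 + tile_k]      # load once ...
            for j0 in range(0, n, tile_n):                  # ... reuse across all B tiles
                c[i0:i0 + tile_m, j0:j0 + tile_n] += (
                    a_tile @ b[k0:k0 + tile_k, j0:j0 + tile_n]
                )
    return c

a = np.random.rand(48, 64)
b = np.random.rand(64, 32)
assert np.allclose(tiled_gemm(a, b), a @ b)
```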

Accuracy Analysis of Fixed Point Arithmetic for Hardware Implementation of Binary Weight Network (이진 가중치 신경망의 하드웨어 구현을 위한 고정소수점 연산 정확도 분석)

  • Kim, Jong-Hyun;Yun, SangKyun
    • Journal of IKEEE / v.22 no.3 / pp.805-809 / 2018
  • In this paper, we analyze the change in accuracy when fixed-point arithmetic is used instead of floating-point arithmetic in a binary weight network (BWN). We observe the change in accuracy while varying the total bit width and the fraction bit width. If the integer part is unchanged after fixed-point approximation, there is no significant drop in accuracy compared to floating-point operation. When overflow occurs in the integer part, saturating to the maximum or minimum of the fixed-point representation minimizes the accuracy loss. The results of this paper can be applied to minimizing the memory and hardware resource requirements of an FPGA-based BWN accelerator.
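
The abstract's key finding is that, when the integer part overflows, saturating to the maximum or minimum of the fixed-point representation minimizes the accuracy loss. A minimal sketch of such a saturating quantizer is given below, parameterized by total bit width and fraction bit width as in the paper; the function name and default values are illustrative.

```python
def to_fixed_point(x, total_bits=8, frac_bits=4):
    """Quantize a real value to signed fixed-point with saturation.

    total_bits: overall word length (including sign)
    frac_bits:  bits after the binary point
    Values outside the representable range saturate to the max/min
    rather than wrapping, which the paper reports as the better choice.
    """
    scale = 1 << frac_bits
    max_q = (1 << (total_bits - 1)) - 1        # largest representable integer code
    min_q = -(1 << (total_bits - 1))
    q = int(round(x * scale))
    q = max(min_q, min(max_q, q))              # saturate on overflow
    return q / scale                           # back to a real value for comparison

print(to_fixed_point(1.3, 8, 4))    # 1.3125 (nearest representable value)
print(to_fixed_point(20.0, 8, 4))   # 7.9375 (saturated to the format maximum)
```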

FPGA Implementation of the Levenberg-Marquardt Algorithm (LM(Levenberg-Marquardt) 알고리즘의 FPGA 구현)

  • Lee, Myung-Jin;Jung, Yong-Jin
    • Journal of the Institute of Electronics and Information Engineers / v.51 no.11 / pp.73-82 / 2014
  • The LM algorithm is used to solve nonlinear least-squares problems and is applied in various fields. However, when the target function of the application is complicated and high-dimensional, the inner matrix and vector operations take a long time to compute. In such cases, the LM algorithm is unsuitable for embedded environments and requires a hardware accelerator. In this paper, we implement the LM algorithm in hardware. In the implementation, we use pipeline stages to divide the target function computation and reduce the data input period of the matrix and vector operations in order to increase speed. To measure the performance of the implemented hardware, we apply it to refining the fundamental matrix (RFM), which is part of a 3D reconstruction application. As a result, the implemented system shows results similar to the software version, while the execution speed increases by a factor of 74.3.
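
The matrix and vector operations the abstract refers to are those inside the LM iteration, whose update step is delta = (J^T J + lambda I)^{-1} J^T r. Below is a minimal NumPy sketch of that step applied to a toy curve fit; the fixed damping value and the example problem are simplifications, not taken from the paper.

```python
import numpy as np

def lm_step(jac, residual, lam):
    """One Levenberg-Marquardt update: solve (J^T J + lam*I) delta = J^T r."""
    jtj = jac.T @ jac
    rhs = jac.T @ residual
    return np.linalg.solve(jtj + lam * np.eye(jtj.shape[0]), rhs)

# toy problem: fit y = a * exp(b * x) to noisy data
rng = np.random.default_rng(2)
x = np.linspace(0, 1, 50)
y = 2.0 * np.exp(1.5 * x) + 0.01 * rng.standard_normal(50)

params, lam = np.array([1.0, 1.0]), 1e-2
for _ in range(20):
    model = params[0] * np.exp(params[1] * x)
    r = y - model
    jac = np.column_stack([np.exp(params[1] * x),                      # d(model)/da
                           params[0] * x * np.exp(params[1] * x)])     # d(model)/db
    params = params + lm_step(jac, r, lam)

print(params)   # approaches (2.0, 1.5)
```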

Design and Implementation of CNN-Based Human Activity Recognition System using WiFi Signals (WiFi 신호를 활용한 CNN 기반 사람 행동 인식 시스템 설계 및 구현)

  • Chung, You-shin;Jung, Yunho
    • Journal of Advanced Navigation Technology / v.25 no.4 / pp.299-304 / 2021
  • Existing human activity recognition systems detect activities through devices such as wearable sensors and cameras. However, these methods require additional devices and costs, and cameras in particular raise privacy issues. Using WiFi signals from infrastructure that is already installed can solve this problem. In this paper, we propose a CNN-based human activity recognition system using the channel state information of WiFi signals, and present the design and implementation results of accelerated hardware structures. The system defines four possible behaviors during studying in an indoor environment and classifies the WiFi channel state information with a convolutional neural network (CNN), achieving an average accuracy of 91.86%. In addition, for acceleration, we present the design of an accelerated hardware structure for the fully connected layer, which has the highest computational load in the CNN classifier. Performance evaluation on an FPGA device shows a computation time 4.28 times faster than the software-based system.
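
The block accelerated in this work is the fully connected layer, which is essentially one matrix-vector product per inference. The sketch below shows that computation for a four-class output, matching the four behaviors mentioned in the abstract; the feature size and the ReLU activation are assumptions.

```python
import numpy as np

def fully_connected(x, w, b):
    """Fully connected layer: a matrix-vector product plus bias, which is
    the computation the FPGA accelerator parallelizes."""
    return np.maximum(w @ x + b, 0.0)        # ReLU activation (common choice, assumed)

rng = np.random.default_rng(3)
x = rng.standard_normal(256)                 # flattened CNN feature vector (size assumed)
w = rng.standard_normal((4, 256))            # 4 activity classes, as in the paper
b = np.zeros(4)
print(int(np.argmax(fully_connected(x, w, b))))
```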

Design and Implementation of Human and Object Classification System Using FMCW Radar Sensor (FMCW 레이다 센서 기반 사람과 사물 분류 시스템 설계 및 구현)

  • Sim, Yunsung;Song, Seungjun;Jang, Seonyoung;Jung, Yunho
    • Journal of IKEEE / v.26 no.3 / pp.364-372 / 2022
  • This paper presents the design and implementation results of a human and object classification system utilizing a frequency modulated continuous wave (FMCW) radar sensor. Such a system requires radar signal processing for multi-target detection and deep learning for classifying humans and objects. Since deep learning demands a large amount of computation and data processing, a lightweight model is essential. Therefore, a binary neural network (BNN) structure was adopted, performing the convolutional neural network (CNN) computations on binary values. In addition, for real-time operation, a hardware accelerator was implemented and verified on an FPGA platform. The performance evaluation confirms a multi-target classification accuracy of 90.5%, a 96.87% reduction in memory usage compared to the CNN, and a run time of 5 ms.
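
The abstract adopts a BNN so that the CNN computations run on binary values. A common way to realize a binarized dot product in hardware is XNOR followed by popcount; the sketch below demonstrates the equivalence using {-1, +1} integers. The paper's exact binarization and accumulation scheme are not given in the abstract, so treat this as generic background.

```python
import numpy as np

def binarize(x):
    """Map real values to {-1, +1} by sign (0 treated as +1)."""
    return np.where(x >= 0, 1, -1).astype(np.int8)

def binary_dot(a_bin, w_bin):
    """Dot product of two {-1,+1} vectors.

    With bit-packed operands this becomes an XNOR plus popcount:
    matches = popcount(~(a ^ w)) and result = 2*matches - n.
    Here it is written with +/-1 integers for clarity.
    """
    n = a_bin.size
    matches = int(np.sum(a_bin == w_bin))
    return 2 * matches - n               # equals the ordinary dot product

rng = np.random.default_rng(4)
a = binarize(rng.standard_normal(64))
w = binarize(rng.standard_normal(64))
assert binary_dot(a, w) == int(np.dot(a.astype(np.int32), w.astype(np.int32)))
```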

A Design of Floating-Point Geometry Processor for Embedded 3D Graphics Acceleration (내장형 3D 그래픽 가속을 위한 부동소수점 Geometry 프로세서 설계)

  • Nam Ki hun;Ha Jin Seok;Kwak Jae Chang;Lee Kwang Youb
    • Journal of the Institute of Electronics Engineers of Korea SD / v.43 no.2 s.344 / pp.24-33 / 2006
  • An efficient geometry-processing IP architecture for mobile SoCs, providing real-time 3D graphics acceleration for mobile information systems, is proposed. Based on the proposed IP architecture, we design the floating-point arithmetic unit needed for geometry processing and a floating-point geometry processor supporting the international 3D graphics standard OpenGL-ES. The geometry processor is implemented in a 160k-gate area on a Xilinx Virtex FPGA, and we measure its performance with actual 3D graphics data at an 80 MHz operating frequency. The experimental result shows a processing performance of 1.5M polygons/sec. The power consumption is measured at 83.6 mW in Hynix 0.25 um CMOS at 50 MHz.
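
The floating-point workload of an OpenGL-ES geometry stage is dominated by 4x4 matrix transforms and the perspective divide applied to each vertex. The sketch below shows that per-vertex computation as generic pipeline math; it is not a description of the paper's processor microarchitecture.

```python
import numpy as np

def transform_vertex(mvp, vertex):
    """Apply a 4x4 model-view-projection matrix to a 3-D vertex and perform
    the perspective divide: the core per-vertex work of the geometry stage."""
    v = np.append(vertex, 1.0)            # homogeneous coordinate
    clip = mvp @ v                        # 4x4 matrix-vector product
    return clip[:3] / clip[3]             # normalized device coordinates

# toy usage: an identity MVP leaves the vertex unchanged
mvp = np.eye(4)
print(transform_vertex(mvp, np.array([0.5, -0.25, 1.0])))
```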