• Title/Summary/Keyword: Hardware Accelerator

Search Result 111, Processing Time 0.034 seconds

Design and Implementation of a Hardware-based Transmission/Reception Accelerator for a Hybrid TCP/IP Offload Engine (하이브리드 TCP/IP Offload Engine을 위한 하드웨어 기반 송수신 가속기의 설계 및 구현)

  • Jang, Han-Kook;Chung, Sang-Hwa;Yoo, Dae-Hyun
    • Journal of KIISE:Computer Systems and Theory
    • /
    • v.34 no.9
    • /
    • pp.459-466
    • /
    • 2007
  • TCP/IP processing imposes a heavy load on the host CPU when it is processed by the host CPU on a very high-speed network. Recently the TCP/IP Offload Engine (TOE), which processes TCP/IP on a network adapter instead of the host CPU, has become an attractive solution to reduce the load in the host CPU. There have been two approaches to implement TOE. One is the software TOE in which TCP/IP is processed by an embedded processor and the other is the hardware TOE in which TCP/IP is processed by a dedicated ASIC. The software TOE has poor performance and the hardware TOE is neither flexible nor expandable enough to add new features. In this paper we designed and implemented a hybrid TOE architecture, in which TCP/IP is processed by cooperation of hardware and software, based on an FPGA that has two embedded processor cores. The hybrid TOE can have high performance by processing time-critical operations such as making and processing data packets in hardware. The software based on the embedded Linux performs operations that are not time-critical such as connection establishment, flow control and congestions, thus the hybrid TOE can have enough flexibility and expandability. To improve the performance of the hybrid TOE, we developed a hardware-based transmission/reception accelerator that processes important operations such as creating data packets. In the experiments the hybrid TOE shows the minimum latency of about $19{\mu}s$. The CPU utilization of the hybrid TOE is below 6 % and the maximum bandwidth of the hybrid TOE is about 675 Mbps.

Toward Optimal FPGA Implementation of Deep Convolutional Neural Networks for Handwritten Hangul Character Recognition

  • Park, Hanwool;Yoo, Yechan;Park, Yoonjin;Lee, Changdae;Lee, Hakkyung;Kim, Injung;Yi, Kang
    • Journal of Computing Science and Engineering
    • /
    • v.12 no.1
    • /
    • pp.24-35
    • /
    • 2018
  • Deep convolutional neural network (DCNN) is an advanced technology in image recognition. Because of extreme computing resource requirements, DCNN implementation with software alone cannot achieve real-time requirement. Therefore, the need to implement DCNN accelerator hardware is increasing. In this paper, we present a field programmable gate array (FPGA)-based hardware accelerator design of DCNN targeting handwritten Hangul character recognition application. Also, we present design optimization techniques in SDAccel environments for searching the optimal FPGA design space. The techniques we used include memory access optimization and computing unit parallelism, and data conversion. We achieved about 11.19 ms recognition time per character with Xilinx FPGA accelerator. Our design optimization was performed with Xilinx HLS and SDAccel environment targeting Kintex XCKU115 FPGA from Xilinx. Our design outperforms CPU in terms of energy efficiency (the number of samples per unit energy) by 5.88 times, and GPGPU in terms of energy efficiency by 5 times. We expect the research results will be an alternative to GPGPU solution for real-time applications, especially in data centers or server farms where energy consumption is a critical problem.

Power Operation Accelerator to speed up lighting in 3D graphics

  • Young-Su Kwon;In-
    • Proceedings of the IEEK Conference
    • /
    • 1998.10a
    • /
    • pp.1129-1132
    • /
    • 1998
  • This paper presents a design of special hardware developed for enhancing the floating-point power operations which are actively used at the lighting stage to calculate the specular term in 3D graphics geometry engines. The power operation takes just 4 cycles in our floating-point multiplier while it takes about 100-200 cycles in conventional floating-point units. Although an approximation algorithm is employed in the power operation to reduce the hardware complexity required, the error of power value from the developed floatingpoint multiplier is so minimal that no difference can be found by human eyes.

  • PDF

An OpenVG Vector Graphics Accelerator (OpenVG 기반 벡터 그래픽 가속기)

  • Choi, Y.;Hong, E.K.;Lee, G.H.;Shen, Y.L.;Kim, T.G.;Kim, H.G.;Oh, H.C.
    • Proceedings of the IEEK Conference
    • /
    • 2008.06a
    • /
    • pp.761-762
    • /
    • 2008
  • This paper presents a hardware accelerator for accelerating vector graphics applications based on the OpenVG standard. Since our design mainly targets embedded applications, we focus on efficient uses of limited resources, especially the memory bandwidth. The designed accelerator can process the images of $640{\times}240$ pixels with moderate complexity at the rate of 30 frames per second.

  • PDF

TCP Engine Design for TCP/IP Hardware Accelerator (TCP/IP Hardware Accelerator를 위한 TCP Engine 설계)

  • 이보미;정여진;임혜숙
    • The Journal of Korean Institute of Communications and Information Sciences
    • /
    • v.29 no.5B
    • /
    • pp.465-475
    • /
    • 2004
  • Transport Control Protocol (TCP) has been implemented in software running on CPU in end systems, and the protocol processing has appeared as a new bottleneck due to advanced link technology. TCP processing is a critical issue in Storage Area Network (SAN) such as iSCSL, and the overall performance of the Storage Area Network heavily depends on speed of TCP processing. TCP Engine implemented in hardware reduces the load of CPU in end systems as well as accelerates the protocol processing, and hence high speed data processing is achieved. In this paper, we have proposed a hardware engine for TCP processing. TCP engine consists of three major block, TCP Connection block Rx TCP block and Tx TCP block TCP Connection block is responsible for managing TCP connection states. Rx TCP block is responsible for receive flow which receives packets from network and sends to CPU. Rx TCP performs header and data processing and sends header information to TCP connection block and Tx TCP block It also assembles out-of-ordered data to in-ordered before it transfers data to CPU. Tx TCP block is responsible for transmit flow which transfers data from CPU to network. Tx TCP performs retransmission for reliable data transfer and management of transmit window and sequence number. Various test-cases are used to verify the TCP functions. The TCP Engine is synthesized using 0.18 micron technology and results in 51K gates not including buffers for temporal data storage.

Conceptual Design of PLS-II Control System for PLS (가속기 제어시스템의 성능향상을 위한 연구)

  • Yoon, J.C.;Lee, J.W.;Lee, E.H.;Ha, H.G.;Kim, J.M.;Park, S.J.;Kim, K.R.
    • Proceedings of the KIEE Conference
    • /
    • 2009.07a
    • /
    • pp.1658_1659
    • /
    • 2009
  • PLS(Pohang Light Source) will begin the PLS-II project that has been funded by the KOREA Government in order to further upgrade the PLS which has operated since 1992. The control system of the PLS-II has distributed control architecture, with two layers of hierarchy; operator interface computer (OIC) layer and machine interface computer (MIC) layer. The OIC layer is based on SUN workstation with UNIX. A number of PC-based consoles allow to remotely operating the machine from the control room. PC-based consoles use the Linux or Windows operation system. Similar consoles in the experimental hall are used to control experiments. The MIC layer is directly interfaced to individual machine devices for low-level data acquisition and control. MIC layer is based on VMEbus standard with vxWorks real-time operating system. Executable application software modules are downloaded from host computers at the system start-up time. The MIC's and host computers are linked through Ethernet network. It should enable the use of hardware and software already developed for specific light source requirements. The core of the EPICS (Experimental Physics and Industrial Control System)[1] has been chosen as the basis for the control system software.

  • PDF

Implementation of FPGA-based Accelerator for GRU Inference with Structured Compression (구조적 압축을 통한 FPGA 기반 GRU 추론 가속기 설계)

  • Chae, Byeong-Cheol
    • Journal of the Korea Institute of Information and Communication Engineering
    • /
    • v.26 no.6
    • /
    • pp.850-858
    • /
    • 2022
  • To deploy Gate Recurrent Units (GRU) on resource-constrained embedded devices, this paper presents a reconfigurable FPGA-based GRU accelerator that enables structured compression. Firstly, a dense GRU model is significantly reduced in size by hybrid quantization and structured top-k pruning. Secondly, the energy consumption on external memory access is greatly reduced by the proposed reuse computing pattern. Finally, the accelerator can handle a structured sparse model that benefits from the algorithm-hardware co-design workflows. Moreover, inference tasks can be flexibly performed using all functional dimensions, sequence length, and number of layers. Implemented on the Intel DE1-SoC FPGA, the proposed accelerator achieves 45.01 GOPs in a structured sparse GRU network without batching. Compared to the implementation of CPU and GPU, low-cost FPGA accelerator achieves 57 and 30x improvements in latency, 300 and 23.44x improvements in energy efficiency, respectively. Thus, the proposed accelerator is utilized as an early study of real-time embedded applications, demonstrating the potential for further development in the future.

SW-HW Co-design of a High-performance Dehazing System Using OpenCL-based High-level Synthesis Technique (OpenCL 기반의 상위 수준 합성 기술을 이용한 고성능 안개 제거 시스템의 소프트웨어-하드웨어 통합 설계)

  • Park, Yongmin;Kim, Minsang;Kim, Byung-O;Kim, Tae-Hwan
    • Journal of the Institute of Electronics and Information Engineers
    • /
    • v.54 no.8
    • /
    • pp.45-52
    • /
    • 2017
  • This paper presents a high-performance software-hardware dehazing system based on a dedicated hardware accelerator for the haze removal. In the proposed system, the dedicated hardware accelerator performs the dark-channel-prior-based dehazing process, and the software performs the other control processes. For this purpose, the dehazing process is realized as an OpenCL kernel by finding the inherent parallelism in the algorithm and is synthesized into a hardware by employing a high-level-synthesis technique. The proposed system executes the dehazing process much faster than the previous software-only dehazing system: the performance improvement is up to 96.3% in terms of the execution time.

A Real-Time Hardware Design of CNN for Vehicle Detection (차량 검출용 CNN 분류기의 실시간 처리를 위한 하드웨어 설계)

  • Bang, Ji-Won;Jeong, Yong-Jin
    • Journal of IKEEE
    • /
    • v.20 no.4
    • /
    • pp.351-360
    • /
    • 2016
  • Recently, machine learning algorithms, especially deep learning-based algorithms, have been receiving attention due to its high classification performance. Among the algorithms, Convolutional Neural Network(CNN) is known to be efficient for image processing tasks used for Advanced Driver Assistance Systems(ADAS). However, it is difficult to achieve real-time processing for CNN in vehicle embedded software environment due to the repeated operations contained in each layer of CNN. In this paper, we propose a hardware accelerator which enhances the execution time of CNN by parallelizing the repeated operations such as convolution. Xilinx ZC706 evaluation board is used to verify the performance of the proposed accelerator. For $36{\times}36$ input images, the hardware execution time of CNN is 2.812ms in 100MHz clock frequency and shows that our hardware can be executed in real-time.

Design of IPSec Hardware Accelerator IP

  • Ha Chang-Soo;Kim Joo-Hong;Cho Hyun-Sook;Park Myoung-Soo;Choi Byeong-Yoon
    • Proceedings of the Korean Institute of Communication Sciences Conference
    • /
    • 2004.07a
    • /
    • pp.341-341
    • /
    • 2004
  • PDF