• Title/Summary/Keyword: Hardware accelerator

Search Result 115, Processing Time 0.025 seconds

Motion Estimation Specific Instructions and Their Hardware Architecture for ASIP (ASIP을 위한 움직임 추정 전용 연산기 구조 및 명령어 설계)

  • Hwang, Sung-Jo;SunWoo, Myung-Hoon
    • Journal of the Institute of Electronics Engineers of Korea SP
    • /
    • v.48 no.3
    • /
    • pp.106-111
    • /
    • 2011
  • This paper presents an ASIP (Application-specific Instruction Processor) for motion estimation that employs specific IME instructions and its programmable and reconfigurable hardware architecture for various video codecs, such as H.264/AVC, MPEG4, etc. With the proposed specific instructions and hardware accelerator, it can handle the real-time processing requirement of High Definition (HD) video. With the parallel operations and SAD unit control using pattern information, the proposed IME instruction supports not only full search algorithm but also other fast search algorithms. The hardware size is 77K gates for each Processing Element Group (PEG) which has 256 SAD PEs. The proposed ASIP runs at 160MHz with sixteen PEGs and it can handle 1080p@30 frame in real time.

An Efficient Hardware Implementation of CABAC Using H/W-S/W Co-design (H/W-S/W 병행설계를 이용한 CABAC의 효율적인 하드웨어 구현)

  • Cho, Young-Ju;Ko, Hyung-Hwa
    • Journal of Advanced Navigation Technology
    • /
    • v.18 no.6
    • /
    • pp.600-608
    • /
    • 2014
  • In this paper, CABAC H/W module is developed using co-design method. After entire H.264/AVC encoder was developed with C using reference SW(JM), CABAC H/W IP is developed as a block in H.264/AVC encoder. Context modeller of CABAC is included on the hardware to update the changed value during binary encoding, which enables the efficient usage of memory and the efficient design of I/O stream. Hardware IP is co-operated with the reference software JM of H.264/AVC, and executed on Virtex-4 FX60 FPGA on ML410 board. Functional simulation is done using Modelsim. Compared with existing H/W module of CABAC with register-level design, the development time is reduced greatly and software engineer can design H/W module more easily. As a result, the used amount of slice in CABAC is less than 1/3 of that of CAVLC module. The proposed co-design method is useful to provide hardware accelerator in need of speed-up of high efficient video encoder in embedded system.

FPGA Implementation of Levenverg-Marquardt Algorithm (LM(Levenberg-Marquardt) 알고리즘의 FPGA 구현)

  • Lee, Myung-Jin;Jung, Yong-Jin
    • Journal of the Institute of Electronics and Information Engineers
    • /
    • v.51 no.11
    • /
    • pp.73-82
    • /
    • 2014
  • The LM algorithm is used in solving the least square problem in a non linear system, and is used in various fields. However, in cases the applied field's target functionis complicated and high-dimensional, it takes a lot of time solving the inner matrix and vector operations. In such cases, the LM algorithm is unsuitable in embedded environment and requires a hardware accelerator. In this paper, we implemented the LM algorithm in hardware. In the implementation, we used pipeline stages to divide the target function operation, and reduced the period of data input of the matrix and vector operations in order to accelerate the speed. To measure the performance of the implemented hardware, we applied the refining fundamental matrix(RFM), which is a part of 3D reconstruction application. As a result, the implemented system showed similar performance compared to software, and the execution speed increased in a product of 74.3.

Performance Assessment of a Lithium-Polymer Battery for HEV Utilizing Pack-Level Battery Hardware-in-the-Loop-Simulation System

  • Han, Sekyung;Lim, Jawhwan
    • Journal of Electrical Engineering and Technology
    • /
    • v.8 no.6
    • /
    • pp.1431-1438
    • /
    • 2013
  • A pack-level battery hardware-in-the-loop simulation (B-HILS) platform is implemented. It consists of dynamic vehicle models using PSAT and multiple control interfaces including real-time 3D driving and GPS mode. In real-time 3D driving mode, user can drive a virtual vehicle using actual drive equipment such as steering wheel and accelerator to generate the cycle profile of the battery. In GPS mode, actual road traffic and terrain effects can be simulated using GPS data while the trajectory is displayed on Google map. In the latter part of the paper, several performance tests of an actual lithium-polymer battery pack are carried out utilizing the developed system. All experiments are conducted as parts of actual development process of a commercial battery pack adopting 2nd generation Prius as a target vehicle model. Through the experiments, the low temperature performance and fuel efficiency of the battery are quantitatively investigated in comparison with the original nickel-metal hydride (NiMH) pack of the Prius.

Comparison of Artificial Neural Networks for Low-Power ECG-Classification System

  • Rana, Amrita;Kim, Kyung Ki
    • Journal of Sensor Science and Technology
    • /
    • v.29 no.1
    • /
    • pp.19-26
    • /
    • 2020
  • Electrocardiogram (ECG) classification has become an essential task of modern day wearable devices, and can be used to detect cardiovascular diseases. State-of-the-art Artificial Intelligence (AI)-based ECG classifiers have been designed using various artificial neural networks (ANNs). Despite their high accuracy, ANNs require significant computational resources and power. Herein, three different ANNs have been compared: multilayer perceptron (MLP), convolutional neural network (CNN), and spiking neural network (SNN) only for the ECG classification. The ANN model has been developed in Python and Theano, trained on a central processing unit (CPU) platform, and deployed on a PYNQ-Z2 FPGA board to validate the model using a Jupyter notebook. Meanwhile, the hardware accelerator is designed with Overlay, which is a hardware library on PYNQ. For classification, the MIT-BIH dataset obtained from the Physionet library is used. The resulting ANN system can accurately classify four ECG types: normal, atrial premature contraction, left bundle branch block, and premature ventricular contraction. The performance of the ECG classifier models is evaluated based on accuracy and power. Among the three AI algorithms, the SNN requires the lowest power consumption of 0.226 W on-chip, followed by MLP (1.677 W), and CNN (2.266 W). However, the highest accuracy is achieved by the CNN (95%), followed by MLP (76%) and SNN (90%).

Design and Implementation of Multi-mode Sensor Signal Processor on FPGA Device (다중모드 센서 신호 처리 프로세서의 FPGA 기반 설계 및 구현)

  • Soongyu Kang;Yunho Jung
    • Journal of Sensor Science and Technology
    • /
    • v.32 no.4
    • /
    • pp.246-251
    • /
    • 2023
  • Internet of Things (IoT) systems process signals from various sensors using signal processing algorithms suitable for the signal characteristics. To analyze complex signals, these systems usually use signal processing algorithms in the frequency domain, such as fast Fourier transform (FFT), filtering, and short-time Fourier transform (STFT). In this study, we propose a multi-mode sensor signal processor (SSP) accelerator with an FFT-based hardware design. The FFT processor in the proposed SSP is designed with a radix-2 single-path delay feedback (R2SDF) pipeline architecture for high-speed operation. Moreover, based on this FFT processor, the proposed SSP can perform filtering and STFT operation. The proposed SSP is implemented on a field-programmable gate array (FPGA). By sharing the FFT processor for each algorithm, the required hardware resources are significantly reduced. The proposed SSP is implemented and verified on Xilinxh's Zynq Ultrascale+ MPSoC ZCU104 with 53,591 look-up tables (LUTs), 71,451 flip-flops (FFs), and 44 digital signal processors (DSPs). The FFT, filtering, and STFT algorithm implementations on the proposed SSP achieve 185x average acceleration.

An Implementation of 3D Graphic Accelerator for Phong Shading (퐁 음영법을 위한 3차원 그래픽 가속기의 구현)

  • Lee, Hyung;Park, Youn-Ok;Park, Jong-Won
    • Journal of Korea Multimedia Society
    • /
    • v.3 no.5
    • /
    • pp.526-534
    • /
    • 2000
  • There have been many researches on the 3D graphic accelerator for high speed by needs of CAD/CAM,3D modeling, virtual reality or medical image. In this paper, an SIMD processor architecture for 3D graphic accelerator is proposed in order to improve the processing time of the 3D graphics, and a parallel Phong shading algorithm is presented to estimate performance of the proposed architecture. The proposed SIMD processor architecture for 3D graphic accelerator consists of PCI local bus interface, 16 Processing Elements (PE's), and Park's multi-access memory system (NAMS) that has 17 memory modules. A serial algorithm for Phong shading is modified for the architecture and the main key is to divide a polygon into $4\times{4}$ squares. And, for processing a square, 4 PE's are regarded as a PE Grou logically. Since MAMS can support block access type with interval 1, it is possible that 4 PE Groups process a square at a time. In consequence, 16 pixels are processed simultaneously. The proposed SIMD processor architecture is simulated by CADENCE Verilog-XL that is a package for the hardware simulation. With the same simulated results as that of the serial algorithm, the speed enhancement by the parallel algorithm to the serial one is 5.68.

  • PDF

Hardware Design of Arccosine Function for Mobile Vector Graphics Processor (모바일 벡터 그래픽 프로세서용 역코사인 함수의 하드웨어 설계)

  • Choi, Byeong-Yoon;Lee, Jong-Hyoung
    • Journal of the Korea Institute of Information and Communication Engineering
    • /
    • v.13 no.4
    • /
    • pp.727-736
    • /
    • 2009
  • In this paper, the $arccos(cos^{-1})$ arithmetic unit for mobile graphics accelerator is designed. The mobile vector graphics applications need tight area, execution time, power dissipation, and accuracy constraints compared to desktop PC applications. The designed processor adopts 2nd-order polynomial approximation scheme based on IEEE floating point data format to satisfy speed and accuracy conditions and reduces area via hardware sharing structure. The arccosine processor consists of 15,280 gates and its estimated operating frequency is about 125Mhz at operating condition of $0.35{\mu}m$ CMOS technology. Because the processor can execute arccosine function within 7 clock cycles, it has about 17 MOPS(million arccos operations per second) execution rate and can be applicable to mobile OpenVG processor. And because of its flexible architecture, it can be applicable to the various transcendental functions such as exponential, trigonometric and logarithmic functions via replacement of ROM and minor hardware modification.

ASIP Design for Real-Time Processing of H.264 (실시간 H.264/AVC 처리를 위한 ASIP설계)

  • Kim, Jin-Soo;SunWoo, Myung-Hoon
    • Journal of the Institute of Electronics Engineers of Korea CI
    • /
    • v.44 no.5
    • /
    • pp.12-19
    • /
    • 2007
  • This paper presents an ASIP(Application Specific Instruction Set Processor) for implementation of H.264/AVC, called VSIP(Video Specific Instruction-set Processor). The proposed VSIP has novel instructions and optimized hardware architectures for specific applications, such as intra prediction, in-loop deblocking filter, integer transform, etc. Moreover, VSIP has hardware accelerators for computation intensive parts in video signal processing, such as inter prediction and entropy coding. The VSIP has much smaller area and can dramatically reduce the number of memory access compared with commercial DSP chips, which result in low power consumption. The proposed VSIP can efficiently perform in real-time video processing and it can support various profiles and standards.

A Real-time Single-Pass Visibility Culling Method Based on a 3D Graphics Accelerator Architecture (실시간 단일 패스 가시성 선별 기법 기반의 3차원 그래픽스 가속기 구조)

  • Choo, Catherine;Choi, Moon-Hee;Kim, Shin-Dug
    • The KIPS Transactions:PartA
    • /
    • v.15A no.1
    • /
    • pp.1-8
    • /
    • 2008
  • An occlusion culling method, one of visibility culling methods, excludes invisible objects or triangles which are covered by other objects. As it reduces computation quantity, occlusion culling is an effective method to handle complex scenes in real-time. But an existing common occlusion culling method, such as hardware occlusion query method, sends objects' data twice to GPU and this causes processing overheads once for occlusion culling test and the other is for rendering. And another existing hardware occlusion culling method, VCBP, can test objects' visibility quickly, but it neither test bounding volume nor return test result to application stage. In this paper, we propose a single pass occlusion culling method which uses temporal and spatial coherency, with effective occlusion culling hardware architecture. In our approach, the hardware performs occlusion culling test rapidly with cache on the rasterization stage where triangles are transformed into fragments. At the same time, hardware sends each primitive's visibility information to application stage. As a result, the application stage reduces data transmission quantity by excluding covered objects using the visibility information on previous frame and hierarchical spatial tree. Our proposed method improved maximum 44%, minimum 14% compared with S&W method based on hardware occlusion query. And the performance is increased 25% and 17% respectively, compared to maximum and minimum performance of CHC method which is based on occlusion culling method.