• Title/Summary/Keyword: SIMD (Single Instruction Multiple Data) Technology

Improving the speed of deep neural networks using the multi-core and single instruction multiple data technology (다중 코어 및 single instruction multiple data 기술을 이용한 심층 신경망 속도 향상)

  • Chung, Ik Joo;Kim, Seung Hi
    • The Journal of the Acoustical Society of Korea / v.36 no.6 / pp.425-435 / 2017
  • In this paper, we propose optimization methods for speeding up the feedforward network of deep neural networks using NEON SIMD (Single Instruction Multiple Data) parallel instructions and multi-core parallelization on a multi-core ARM processor. For the optimization using SIMD parallel instructions, we present the speed improvement and arithmetic precision stage by stage. Through the optimization using SIMD parallel instructions on a single core, we obtain a $2.6{\times}$ speedup over the baseline implementation built with a C compiler. Furthermore, by parallelizing the single-core implementation across multiple cores, we obtain a $5.7{\times}{\sim}7.7{\times}$ speedup. The results show that arithmetic-intensive deep neural network technology can be applied to applications on mobile devices.

An Implementation of Efficient Quicksort Utilizing SIMD-Based VBP Technique (SIMD 기반의 VBP 기법을 적용한 효율적인 퀵정렬의 구현)

  • Hong, Gilseok;Kim, Hongyeon;Kang, Seonghyeon;Min, Jun-Ki
    • KIISE Transactions on Computing Practices / v.23 no.8 / pp.498-503 / 2017
  • SIMD (Single Instruction Multiple Data) is a representative parallelization architecture that processes multiple data items loaded in a SIMD register with a single instruction. Quicksort is a sorting algorithm that picks an element of the array as a pivot and reorders the array so that all elements less than the pivot are located to its left and all elements greater than the pivot are located to its right; the algorithm then performs the same task recursively on both sublists. In this paper, we propose an efficient Quicksort algorithm that applies SIMD instructions and invokes as few conditional branches as possible, to avoid the performance degradation caused by branch misprediction in a pipelined architecture. In addition, we improve the performance of the Quicksort algorithm by fetching data into a SIMD register in byte units so that VBP (Vertical Bit Parallel) and the early pruning technique can be applied.
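The branch-avoidance idea in the abstract above can be illustrated without SIMD. The following is a minimal scalar sketch, not the paper's VBP-based algorithm: it assumes a Lomuto-style partition in which the result of the comparison (0 or 1) is added to the store index instead of being tested with an `if`. The function names are hypothetical, and whether the compiler actually emits branch-free machine code depends on the compiler and target.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper (not from the paper): Lomuto-style partition whose
 * loop body has no data-dependent branch. The pivot is the last element.
 * Returns the final index of the pivot. */
size_t partition_branch_free(int32_t *a, size_t n)
{
    if (n == 0)
        return 0;

    int32_t pivot = a[n - 1];
    size_t store = 0;   /* a[0..store) holds elements smaller than pivot */

    for (size_t i = 0; i + 1 < n; i++) {
        int32_t v = a[i];
        /* Unconditional exchange of a[i] and a[store]. */
        a[i] = a[store];
        a[store] = v;
        /* The comparison result is used as a number, not as a branch. */
        store += (size_t)(v < pivot);
    }

    /* Move the pivot between the two regions. */
    a[n - 1] = a[store];
    a[store] = pivot;
    return store;
}

/* Quicksort built on the partition above; recurses on the left part and
 * loops on the right part. */
void quicksort_branch_free(int32_t *a, size_t n)
{
    while (n > 1) {
        size_t p = partition_branch_free(a, n);
        quicksort_branch_free(a, p);
        a += p + 1;
        n -= p + 1;
    }
}
```

The only branches left in the partition loop are the loop-control tests, which are easy to predict; the hard-to-predict "is this element smaller than the pivot" decision is folded into an addition.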

Scalable Application Mapping for SIMD Reconfigurable Architecture

  • Kim, Yongjoo;Lee, Jongeun;Lee, Jinyong;Paek, Yunheung
    • JSTS:Journal of Semiconductor Technology and Science / v.15 no.6 / pp.634-646 / 2015
  • Coarse-Grained Reconfigurable Architecture (CGRA) is a very promising platform that provides fast turnaround time as well as very high energy efficiency for multimedia applications. One of the problems with CGRAs, however, is application mapping, which currently does not scale well with geometrically increasing numbers of cores. To mitigate the scalability problem, this paper discusses how to use the SIMD (Single Instruction Multiple Data) paradigm for CGRAs. While the idea of SIMD is not new, SIMD can complicate the mapping problem by adding a dimension of iteration mapping to the already complex problems of operation mapping and data mapping. All three are interdependent and can significantly affect performance through memory bank conflicts. In this paper, based on a new architecture called SIMD reconfigurable architecture, which allows SIMD execution at multiple levels of granularity, we present how to minimize bank conflicts considering all three related sub-problems, for various RA organizations. We also present data tiling and evaluate a conflict-free scheduling algorithm as a way to eliminate bank conflicts for a certain class of mapping problems.

A Fast SAD Algorithm for Area-based Stereo Matching Methods (영역기반 스테레오 영상 정합을 위한 고속 SAD 알고리즘)

  • Lee, Woo-Young;Kim, Cheong Ghil
    • Journal of Satellite, Information and Communications / v.7 no.2 / pp.8-12 / 2012
  • Area-based stereo matching algorithms are widely used in image analysis for stereo vision. The SAD (Sum of Absolute Differences) algorithm is one of the well-known area-based stereo matching algorithms and is a data-intensive computing application. It therefore requires very high computation capability, and its processing speed is very slow when implemented in software. This paper proposes a fast SAD algorithm that uses SSE (Streaming SIMD Extensions) instructions based on SIMD (Single Instruction Multiple Data) parallelism. A CPU supporting SSE instructions has 16 XMM registers of 128 bits each. For the performance evaluation of the proposed scheme, we compare the processing speed of SAD with and without SSE instructions. The proposed scheme achieves a fourfold performance improvement over the general SAD, which shows the possibility of a software implementation of a real-time SAD algorithm.
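As a rough illustration of how SSE can accelerate SAD, the sketch below compares a scalar loop with a version that processes 16 pixels per iteration using the SSE2 intrinsic `_mm_sad_epu8`. This is an assumption about one reasonable way to do it, not the implementation from the paper; the function names are hypothetical. When the compiler does not define `__SSE2__`, the second function falls back to the scalar loop so the code stays portable.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>
#if defined(__SSE2__)
#include <emmintrin.h>   /* SSE2 intrinsics */
#endif

/* Reference version: one absolute difference per iteration. */
uint32_t sad_scalar(const uint8_t *a, const uint8_t *b, size_t n)
{
    uint32_t sum = 0;
    for (size_t i = 0; i < n; i++)
        sum += (uint32_t)abs((int)a[i] - (int)b[i]);
    return sum;
}

/* SIMD version: 16 absolute differences per iteration where SSE2 is
 * available, then a scalar loop for the remaining bytes. */
uint32_t sad_simd(const uint8_t *a, const uint8_t *b, size_t n)
{
    uint32_t sum = 0;
    size_t i = 0;

#if defined(__SSE2__)
    __m128i acc = _mm_setzero_si128();
    for (; i + 16 <= n; i += 16) {
        __m128i va = _mm_loadu_si128((const __m128i *)(a + i));
        __m128i vb = _mm_loadu_si128((const __m128i *)(b + i));
        /* _mm_sad_epu8 yields two partial sums, one per 64-bit half. */
        acc = _mm_add_epi64(acc, _mm_sad_epu8(va, vb));
    }
    /* Combine the low and high partial sums. */
    sum = (uint32_t)_mm_cvtsi128_si32(acc)
        + (uint32_t)_mm_cvtsi128_si32(_mm_srli_si128(acc, 8));
#endif

    for (; i < n; i++)
        sum += (uint32_t)abs((int)a[i] - (int)b[i]);
    return sum;
}
```

Both functions return the same value for the same input; only the number of pixels handled per instruction differs. The 32-bit return type assumes windows small enough that the total stays below 2^32.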

Implementation of Pixel Subword Parallel Processing Instructions for Embedded Parallel Processors (임베디드 병렬 프로세서를 위한 픽셀 서브워드 병렬처리 명령어 구현)

  • Jung, Yong-Bum;Kim, Jong-Myon
    • The KIPS Transactions:PartA / v.18A no.3 / pp.99-108 / 2011
  • Processor technology is currently moving toward parallel processing techniques rather than only increasing the clock frequency of a single processor, because of high technology cost and power consumption. In this paper, a SIMD (Single Instruction Multiple Data) based parallel processor is introduced that efficiently processes the massive data inherent in multimedia. In addition, this paper proposes pixel subword parallel processing instructions for the SIMD parallel processor architecture that operate efficiently on image and video pixels. The proposed pixel subword parallel processing instructions store and process four 8-bit pixels in four partitioned 12-bit registers in a 48-bit datapath architecture. This solves the overflow problem inherent in existing multimedia extensions and reduces the use of many packing/unpacking instructions. Experimental results using the same SIMD-based parallel processor architecture indicate that the proposed pixel subword parallel processing instructions achieve a speedup of $2.3{\times}$ over the baseline SIMD array performance. This is in contrast to MMX-type instructions (a representative Intel multimedia extension), which achieve a speedup of only $1.4{\times}$ over the same baseline SIMD array performance. In addition, the proposed instructions achieve $2.5{\times}$ better energy efficiency than the baseline program, while MMX-type instructions achieve only $1.8{\times}$ better energy efficiency than the baseline program.
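The overflow argument in the abstract above can be made concrete with a small software emulation. The sketch below is not the proposed hardware or its instruction set: it assumes the 48-bit datapath is modeled in the low 48 bits of a `uint64_t`, with four 12-bit lanes each holding one 8-bit pixel, and all function names are hypothetical. Because each lane has 4 bits of headroom, the sum of two pixels (at most 510) cannot carry into the neighboring lane, so one ordinary integer addition adds all four pixels.

```c
#include <stdint.h>

#define LANE_BITS 12
#define LANE_MASK 0xFFFu

/* Pack four 8-bit pixels into four 12-bit lanes (48 bits in total). */
uint64_t pack4(const uint8_t p[4])
{
    uint64_t w = 0;
    for (int i = 0; i < 4; i++)
        w |= (uint64_t)p[i] << (i * LANE_BITS);
    return w;
}

/* Read lane i (0..3) as an unsigned 12-bit value. */
unsigned lane(uint64_t w, int i)
{
    return (unsigned)((w >> (i * LANE_BITS)) & LANE_MASK);
}

/* Add four pixel pairs with one addition. Valid as long as every lane
 * sum stays below 4096, which holds for the sum of two 8-bit pixels. */
uint64_t add4(uint64_t x, uint64_t y)
{
    return x + y;
}

/* Convert back to 8-bit pixels, clamping each lane to 255. */
void unpack4_saturate(uint64_t w, uint8_t out[4])
{
    for (int i = 0; i < 4; i++) {
        unsigned v = lane(w, i);
        out[i] = (uint8_t)(v > 255u ? 255u : v);
    }
}
```

For comparison, adding the pixels 200 and 100 in a plain 8-bit lane wraps around to 44, whereas the 12-bit lane keeps the intermediate value 300 until the final clamp.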

Low-latency SAO Architecture and its SIMD Optimization for HEVC Decoder

  • Kim, Yong-Hwan;Kim, Dong-Hyeok;Yi, Joo-Young;Kim, Je-Woo
    • IEIE Transactions on Smart Processing and Computing / v.3 no.1 / pp.1-9 / 2014
  • This paper proposes a low-latency Sample Adaptive Offset (SAO) filter architecture and its Single Instruction Multiple Data (SIMD) optimization scheme to achieve fast High Efficiency Video Coding (HEVC) decoding in a multi-core environment. According to the HEVC standard and its Test Model (HM), the SAO operation is performed only at the picture level. Most real-time decoders, however, execute their sub-modules on a Coding Tree Unit (CTU) basis to reduce latency and memory bandwidth. The proposed low-latency SAO architecture has the following advantages over picture-based SAO: 1) significantly lower memory requirements, and 2) a low-latency property that enables efficient pipelined multi-core decoding. In addition, SIMD optimization of SAO filtering can reduce the SAO filtering time significantly. The simulation results showed that the proposed low-latency SAO architecture, which uses significantly less memory, produces a decoding time similar to that of picture-based SAO in single-core decoding. Furthermore, the SIMD optimization scheme reduces the SAO filtering time by approximately 509% and increases the total decoding speed by approximately 7% compared to the existing look-up table approach of HM.

Improvement of H.264 Encoder Using MMX (MMX를 이용한 H.264 인코더 성능 개선)

  • Kim, Sang-Ho;Lee, June-Hwan;Rhee, Sang-Burm
    • Proceedings of the IEEK Conference / 2006.06a / pp.729-730 / 2006
  • Multimedia applications have been a target for the single instruction multiple data (SIMD) extensions added to the instruction set architectures of most modern microprocessors. In this paper, a baseline-profile decoder for the newest video coding standard, H.264/AVC, is implemented and optimized using Intel MMX technology to show the overall system speedup obtained by SIMD-style coding.

Implementation of Mobile WiMAX Receiver using Mobile Computing Platform for SDR System (모바일 컴퓨팅 플랫폼을 이용한 SDR 기반 MOBILE WIMAX 수신기 구현)

  • Kim, Han Taek;Ahn, Chi Young;Kim, June;Choi, Seung Won
    • Journal of Korea Society of Digital Industry and Information Management / v.8 no.1 / pp.117-123 / 2012
  • This paper implements a mobile Worldwide Interoperability for Microwave Access (WiMAX) receiver using Software Defined Radio (SDR) technology. An SDR system is difficult to implement on a mobile handset because of limited computing power and space constraints. The implemented receiver runs the mobile WiMAX software modem on an Open Multimedia Application Platform (OMAP) System on Chip (SoC) and a Field Programmable Gate Array (FPGA). The OMAP SoC is composed of an ARM processor and a Digital Signal Processor (DSP). The ARM processor supports Single Instruction Multiple Data (SIMD) instructions, which operate on a vector of data with a single instruction, and the DSP is a powerful image and video accelerator. For this reason, we suggest that SDR technology is feasible in the mobile handset. To verify the performance of the mobile WiMAX receiver, we measure the runtime of the software modem. The experimental results show that the proposed receiver is capable of real-time signal processing.

TDES CODER USING SSE2 TECHNOLOGY

  • Koo, In-Hoi;Kim, Tae-Hoon;Ahn, Sang-Il
    • Proceedings of the KSRS Conference / 2007.10a / pp.114-117 / 2007
  • DES is an improvement of the Lucifer algorithm developed by IBM in the 1970s. IBM, the National Security Agency (NSA), and the National Bureau of Standards (NBS, now the National Institute of Standards and Technology, NIST) developed the DES algorithm. DES has been extensively studied since its publication and is the most widely used symmetric algorithm in the world. Nowadays, however, Triple DES (TDES) is more widely used than DES, especially in applications that require a high level of data security. Even though TDES can be implemented from the standard algorithm, very high-speed TDES codec performance is required when encrypted high-resolution satellite image data is down-linked at high speed. In this paper, Intel SSE2 (Streaming SIMD (Single Instruction Multiple Data) Extensions 2) is applied to the TDES decryption algorithm, and its effectiveness in reducing processing time is demonstrated by comparing the time consumed in two cases: the original TDES decryption and TDES decryption with SSE2.

Performance Comparison of Implementation Technologies for Image Quality Enhancement Operations on Android Platforms (Android 플랫폼에서 구현 기술에 따른 화질 개선 연산 성능 비교)

  • Lee, Ju-Ho;Lee, Goo-Yeon;Jeong, Choong-Kyo
    • Journal of Digital Contents Society / v.14 no.1 / pp.7-14 / 2013
  • As mobile devices with built-in high-spec cameras are widely used, visual quality enhancement of high-resolution images has become one of the key capabilities of mobile devices. Because of the limited computational resources of mobile devices and the size of high-resolution images, we should choose an image processing algorithm that is not too complex and use an efficient implementation technology. One of the simple and widely used image quality enhancement algorithms is contrast stretching. Java libraries running on a virtual machine, JNI (Java Native Interface) based native C/C++, and NEON™ SIMD (Single Instruction Multiple Data) are common implementation technologies on Android smartphones. Using these three implementation technologies, we implemented two image contrast stretching algorithms, linear and equalized, and compared the computation times. The native C/C++ and the NEON™ SIMD implementations were 56-78 and 50-76 times faster than the Java implementation, respectively.
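For readers unfamiliar with linear contrast stretching, a minimal scalar sketch in C follows. The abstract does not give the exact formula used in the paper, so this assumes the common min-max definition on 8-bit grayscale pixels, where the darkest input value maps to 0 and the brightest to 255. The function name is hypothetical.

```c
#include <stddef.h>
#include <stdint.h>

/* Linear (min-max) contrast stretching, in place, on 8-bit pixels.
 * Assumed formula: out = round((in - lo) * 255 / (hi - lo)). */
void contrast_stretch_linear(uint8_t *px, size_t n)
{
    if (n == 0)
        return;

    /* Pass 1: find the darkest and brightest pixel values. */
    uint8_t lo = 255, hi = 0;
    for (size_t i = 0; i < n; i++) {
        if (px[i] < lo) lo = px[i];
        if (px[i] > hi) hi = px[i];
    }

    /* A flat image has no contrast to stretch. */
    if (hi == lo)
        return;

    /* Pass 2: rescale every pixel with rounded integer arithmetic. */
    unsigned range = (unsigned)(hi - lo);
    for (size_t i = 0; i < n; i++)
        px[i] = (uint8_t)(((unsigned)(px[i] - lo) * 255u + range / 2) / range);
}
```

Each output pixel in the second pass depends only on its own input value and on two constants, which is the kind of loop that maps naturally onto SIMD instructions such as NEON.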