• Title/Summary/Keyword: single-instruction multiple-data


SIMD instruction-based fast HEVC interpolation filter for high bit-depth (High bit-depth 를 위한 SIMD 명령어 기반 HEVC 보간 필터 고속화)

  • Mok, Jung-Soo;Ahn, Yong-Jo;Ryu, Hochan;Sim, Dong-Gyu
    • Proceedings of the Korean Society of Broadcast Engineers Conference / 2014.11a / pp.200-202 / 2014
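  • This paper proposes a method for accelerating the interpolation filter for high bit-depth video using SIMD (Single Instruction, Multiple Data) instructions. Interpolation filtering, which is based on pixel operations, accounts for a large share of the complexity of an HEVC decoder, but because it performs repetitive arithmetic operations its structure is well suited to SIMD acceleration. For this reason, this paper proposes a method that accelerates the interpolation filter operations by using SIMD instructions and using memory efficiently. In an in-house ANSI C HEVC RExt decoder based on the HEVC reference software HM 12.0-RExt 4.1, the proposed technique improved decoding speed by 8.5% on average and improved the execution time of the interpolation filter by 24.8% on average.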
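
To illustrate why interpolation filtering maps well onto SIMD, here is a minimal sketch in C with SSE2 intrinsics. It is not the authors' implementation. It assumes samples are stored as signed 16-bit values (bit depths up to 15 bits), uses the HEVC 8-tap luma coefficients for the half-sample position, and leaves out the rounding and shift that a real decoder applies afterwards. The names `filter8_scalar`, `filter8_sse2`, and `interp_row_sse2` are hypothetical.

```c
#include <assert.h>
#include <emmintrin.h>  /* SSE2 intrinsics */
#include <stdint.h>

/* HEVC 8-tap luma coefficients for the half-sample position (sum = 64). */
static const int16_t kHalfPel[8] = { -1, 4, -11, 40, 40, -11, 4, -1 };

/* Scalar reference: unnormalised output for the position halfway
 * between src[0] and src[1]. Reads src[-3] .. src[4]. */
int32_t filter8_scalar(const int16_t *src)
{
    int32_t sum = 0;
    for (int k = 0; k < 8; ++k)
        sum += kHalfPel[k] * src[k - 3];
    return sum;
}

/* SSE2 version: all eight multiplies happen in one instruction. */
int32_t filter8_sse2(const int16_t *src)
{
    __m128i taps = _mm_loadu_si128((const __m128i *)kHalfPel);
    __m128i pix  = _mm_loadu_si128((const __m128i *)(src - 3));
    /* Multiply 16-bit lanes and add adjacent pairs: four 32-bit sums. */
    __m128i sum  = _mm_madd_epi16(pix, taps);
    /* Horizontal add: fold the upper two lanes onto the lower two ... */
    sum = _mm_add_epi32(sum, _mm_srli_si128(sum, 8));
    /* ... then fold lane 1 onto lane 0, which now holds the total. */
    sum = _mm_add_epi32(sum, _mm_srli_si128(sum, 4));
    return _mm_cvtsi128_si32(sum);
}

/* Filter one row. The caller must provide 3 valid samples before
 * src[0] and 4 after src[width - 1] (assumed padding convention). */
void interp_row_sse2(const int16_t *src, int32_t *dst, int width)
{
    assert(width >= 0);
    for (int x = 0; x < width; ++x)
        dst[x] = filter8_sse2(src + x);
}
```

The 32-bit accumulation in `_mm_madd_epi16` is what keeps the sketch safe for high bit-depth input: products of 16-bit samples and coefficients would overflow a 16-bit lane.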


A Novel Reconfigurable Processor Using Dynamically Partitioned SIMD for Multimedia Applications

  • Lyuh, Chun-Gi;Suk, Jung-Hee;Chun, Ik-Jae;Roh, Tae-Moon
    • ETRI Journal / v.31 no.6 / pp.709-716 / 2009
  • In this paper, we propose a novel reconfigurable processor using dynamically partitioned single-instruction multiple-data (DP-SIMD) which is able to process multimedia data. The SIMD processor and parallel SIMD (P-SIMD) processor, which is composed of a number of SIMD processors, are usually used these days. But these processors are inefficient because all processing units (PUs) should process the same operations all the time. Moreover, the PUs can process different operations only when every SIMD group operation is predefined. We propose a processor control method which can partition parallel processors into multiple SIMD-based processors dynamically to enhance efficiency. For performance evaluation of the proposed method, we carried out the inverse transform, inverse quantization, and motion compensation operations of H.264 using processors based on SIMD, P-SIMD, and DP-SIMD. Experimental results show that the DP-SIMD control method is more efficient than SIMD and P-SIMD control methods by about 15% and 14%, respectively.

Software-based Real-time GNSS Signal Generation and Processing Using a Graphic Processing Unit (GPU)

  • Im, Sung-Hyuck;Jee, Gyu-In
    • Journal of Positioning, Navigation, and Timing / v.3 no.3 / pp.99-105 / 2014
  • A graphics processing unit (GPU) can perform the same calculation on multiple data (SIMD: single instruction multiple data) using hundreds to thousands of special-purpose processors for graphics processing. Thus, high efficiency is expected when a GPU is used for the generation and correlation of satellite navigation signals, which apply the same calculation procedure to tens of millions of discrete signal samples per second. In this study, the structure of a GPU-based GNSS simulator for the generation and processing of satellite navigation signals was designed, developed, and verified. To verify the developed satellite navigation signal generator, generated signals were applied to the OEM-V3 receiver of Novatel Inc., and the measured values were examined. To verify the satellite navigation signal processor, the performance was examined by collecting and processing actual GNSS intermediate frequency signals. The results of the verification indicated that satellite navigation signals could be generated and processed in real time using two GPUs.

Performance Comparison of Implementation Technologies for Image Quality Enhancement Operations on Android Platforms (Android 플랫폼에서 구현 기술에 따른 화질 개선 연산 성능 비교)

  • Lee, Ju-Ho;Lee, Goo-Yeon;Jeong, Choong-Kyo
    • Journal of Digital Contents Society / v.14 no.1 / pp.7-14 / 2013
  • As mobile devices with built-in high-spec cameras become widely used, visual quality enhancement of high-resolution images has become one of the key capabilities of mobile devices. Because of the limited computational resources of mobile devices and the size of high-resolution images, we should choose an image processing algorithm that is not too complex and use an efficient implementation technology. One of the simple and widely used image quality enhancement algorithms is contrast stretching. Java libraries running on a virtual machine, JNI (Java Native Interface) based native C/C++, and NEON™ SIMD (Single Instruction Multiple Data) are common implementation technologies on Android smartphones. Using these three implementation technologies, we implemented two image contrast stretching algorithms, linear and equalized, and compared the computation times. The native C/C++ and the NEON™ SIMD implementations outperformed the Java implementation by factors of 56-78 and 50-76, respectively.
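
For readers unfamiliar with linear contrast stretching, the following C sketch shows the scalar form of the operation on an 8-bit grayscale buffer. It is an illustration under stated assumptions, not the code evaluated in the paper: the function name `stretch_linear` is hypothetical, and the stretch uses the buffer's own minimum and maximum as the input range.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Linearly map the observed range [lo, hi] of an 8-bit image onto
 * [0, 255], in place. A flat image (lo == hi) is left unchanged. */
void stretch_linear(uint8_t *pix, size_t n)
{
    assert(pix != NULL || n == 0);
    if (n == 0)
        return;

    /* Pass 1: find the darkest and brightest pixel values. */
    uint8_t lo = pix[0], hi = pix[0];
    for (size_t i = 1; i < n; ++i) {
        if (pix[i] < lo) lo = pix[i];
        if (pix[i] > hi) hi = pix[i];
    }
    if (hi == lo)
        return;

    /* Pass 2: out = (in - lo) * 255 / (hi - lo), rounded to nearest. */
    const unsigned range = (unsigned)(hi - lo);
    for (size_t i = 0; i < n; ++i)
        pix[i] = (uint8_t)(((unsigned)(pix[i] - lo) * 255u + range / 2u) / range);
}
```

Every pixel in the second pass goes through the same arithmetic with no dependence on its neighbours, which is the property a SIMD implementation such as NEON exploits.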

Design of Multiprocess Models for Parallel Protocol Implementation (병렬 프로토콜 구현을 위한 다중 프로세스 모델의 설계)

  • Choi, Sun-Wan;Chung, Kwang-Sue
    • The Transactions of the Korea Information Processing Society / v.4 no.10 / pp.2544-2552 / 1997
  • This paper presents three multiprocess models for parallel protocol implementation: (1) the channel communication model, (2) the fork-join model, and (3) the event polling model. The parallelism of each model is specified in a parallel programming language, Par. C System. To measure the performance of the multiprocess models, we implemented the Internet Protocol (IP) of the Internet Protocol Suite (IPS) for each model in this parallel language on the Transputer. After decomposing the IP functions into two parts, the sending side and the receiving side, the parallelism in both sides is exploited in the form of Multiple Instruction Single Data (MISD). The three models are evaluated and compared at the sending side and the receiving side on the basis of various run-time overheads: event sending via channels in the channel communication model, process creation in the fork-join model, and context switching in the event polling model. At the sending side, the event polling model has lower processing delays, by about 77% and 9%, in comparison with the channel communication model and the fork-join model, respectively. At the receiving side, the fork-join model has lower processing delays, by about 55% and 107%, in comparison with the channel communication model and the event polling model, respectively.


An Efficient Technique for Processing of Spatial Data Using GPU (GPU를 사용한 효율적인 공간 데이터 처리)

  • Lee, Jae-Il;Oh, Byoung-Woo
    • Spatial Information Research / v.17 no.3 / pp.371-379 / 2009
  • Recently, the GPU (Graphics Processing Unit) has improved rapidly, driven by the need for speed in gaming. As a result, a GPU contains multiple ALUs (Arithmetic Logic Units) for parallel processing of large amounts of graphics data, such as transforms and ray tracing. This paper therefore proposes a technique for parallel processing of spatial data using a GPU. Spatial data consists of multiple coordinates, and each coordinate contains an x and a y value. To display spatial data, graphics operations have to be applied to a large number of coordinates. Because the graphics operation is identical and the coordinates are multiple data, the SIMD (Single Instruction Multiple Data) parallel processing of the GPU can be used to improve the performance of spatial data processing. This paper implemented SIMD parallel processing of spatial data using two SDKs (Software Development Kits): CUDA for NVIDIA GPUs and ATI Stream for ATI GPUs. Experiments measuring the calculation time of graphics operations were carried out to observe the performance improvement. The experimental results show that the proposed method can enhance performance by up to 1,162% for graphics operations. The proposed method of parallel processing spatial data on a GPU can be used generally to enhance the performance of applications that deal with large amounts of spatial data.
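
The "identical operation, multiple coordinates" pattern can be sketched on a CPU as well. The C example below uses SSE intrinsics rather than the CUDA or ATI Stream kernels the paper describes, so it illustrates only the data-parallel idea, not the paper's method. The function name `transform_points`, the interleaved x/y layout, and the scale-and-translate operation are assumptions made for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <xmmintrin.h>  /* SSE intrinsics */

/* Scale and translate n points stored interleaved as x0 y0 x1 y1 ...,
 * in place: x' = x * sx + tx, y' = y * sy + ty. */
void transform_points(float *xy, size_t n,
                      float sx, float sy, float tx, float ty)
{
    assert(xy != NULL || n == 0);

    /* Lanes alternate x and y factors to match the interleaved layout. */
    const __m128 scale = _mm_setr_ps(sx, sy, sx, sy);
    const __m128 shift = _mm_setr_ps(tx, ty, tx, ty);

    size_t i = 0;
    for (; i + 2 <= n; i += 2) {        /* two points (four floats) per step */
        __m128 v = _mm_loadu_ps(xy + 2 * i);
        v = _mm_add_ps(_mm_mul_ps(v, scale), shift);
        _mm_storeu_ps(xy + 2 * i, v);
    }
    for (; i < n; ++i) {                /* scalar tail for an odd point count */
        xy[2 * i]     = xy[2 * i] * sx + tx;
        xy[2 * i + 1] = xy[2 * i + 1] * sy + ty;
    }
}
```

A GPU applies the same idea at a much larger width, with one thread per coordinate instead of four floats per instruction.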


Efficient Thread Allocation Method of Convolutional Neural Network based on GPGPU (GPGPU 기반 Convolutional Neural Network의 효율적인 스레드 할당 기법)

  • Kim, Mincheol;Lee, Kwangyeob
    • Asia-pacific Journal of Multimedia Services Convergent with Art, Humanities, and Sociology / v.7 no.10 / pp.935-943 / 2017
  • CNN (Convolutional Neural Network), which is used for image classification and speech recognition among neural networks learning based on positive data, has been continuously developed to have a high-performance structure to date. It is difficult, however, to use in an embedded system with limited resources. Therefore, we use GPGPU (General-Purpose Computing on Graphics Processing Units), which applies the GPU to general-purpose operations, to solve the problem, using pre-learned weights, but there are still limitations. Since CNN performs simple and iterative operations, the computation speed varies greatly depending on the thread allocation and utilization method in the Single Instruction Multiple Thread (SIMT) based GPGPU. To solve this problem, there is a thread that needs to be relaxed when performing Convolution and Pooling operations with threads. The remaining threads have increased the operation speed by using the method used in the following feature maps and kernel calculations.

CPU-GPU² Trigeneous Computing for Iterative Reconstruction in Computed Tomography

  • Oh, Chanyoung;Yi, Youngmin
    • IEIE Transactions on Smart Processing and Computing / v.5 no.4 / pp.294-301 / 2016
  • In this paper, we present methods to efficiently parallelize iterative 3D image reconstruction by exploiting trigeneous devices (three different types of device) at the same time: a CPU, an integrated GPU, and a discrete GPU. We first present a technique that exploits single instruction multiple data (SIMD) architectures in GPUs. Then, we propose a performance estimation model, based on which we can easily find the optimal data partitioning on trigeneous devices. We found that the performance significantly varies by up to 6.23 times, depending on how SIMD units in GPUs are accessed. Then, by using trigeneous devices and the proposed estimation models, we achieve optimal partitioning and throughput, which corresponds to a 9.4% further improvement, compared to discrete GPU-only execution.

Hardware Implementation of High Speed CODEC for PACS (PACS를 위한 고속 CODEC의 하드웨어 구현)

  • 유선국;박성욱
    • Journal of Biomedical Engineering Research / v.15 no.4 / pp.475-480 / 1994
  • For the effective management of medical images, the use of computing systems in medical practice, namely PACS, has become popular. However, the amount of image data is so large that storage space runs short. Data compression techniques are usually used to save storage, but the processing speed of the machines is not fast enough to meet surgical requirements. A special hardware system that processes medical images faster is therefore more important than ever. To meet the demand for high-speed image processing, especially image compression and decompression, we designed and implemented a medical image CODEC (COder/DECoder) based on the MISD (Multiple Instruction Single Data stream) architecture to exploit parallelism. Because there is no standard scheme for medical image compression/decompression, the CODEC is designed to be programmable and general. In this paper, we use the JPEG (Joint Photographic Experts Group) algorithm to process images and evaluate the CODEC.


TDES CODER USING SSE2 TECHNOLOGY

  • Koo, In-Hoi;Kim, Tae-Hoon;Ahn, Sang-Il
    • Proceedings of the KSRS Conference / 2007.10a / pp.114-117 / 2007
  • DES is an improvement on the Lucifer algorithm developed by IBM in the 1970s. IBM, the National Security Agency (NSA), and the National Bureau of Standards (NBS, now the National Institute of Standards and Technology, NIST) developed the DES algorithm. DES has been extensively studied since its publication and is the most widely used symmetric algorithm in the world. Nowadays, however, Triple DES (TDES) is more widely used than DES, especially in applications that require a high level of data security. Even though TDES can be implemented from the standard algorithm, very high-speed TDES codec performance is required when encrypted high-resolution satellite image data is down-linked at high speed. In this paper, Intel SSE2 (Streaming SIMD (Single-Instruction Multiple-Data) Extensions 2) is applied to the TDES decryption algorithm, and its effectiveness in reducing processing time is demonstrated by comparing the time consumed in two cases: original TDES decryption and TDES decryption with SSE2.
