• Title/Summary/Keyword: SIMD instruction

High Performance Coprocessor Architecture for Real-Time Dense Disparity Map

  • Kim, Cheong-Ghil; Srini, Vason P.; Kim, Shin-Dug
    • The KIPS Transactions:PartA / v.14A no.5 / pp.301-308 / 2007
  • This paper proposes a high-performance coprocessor architecture for real-time dense disparity computation based on a phase-based binocular stereo matching technique called local weighted phase correlation (LWPC). The algorithm, which consists of four stages, combines the robustness of wavelet-based phase-difference methods with the basic control strategy of phase-correlation methods. For parallel and efficient hardware implementation, the proposed architecture employs a SIMD (Single Instruction, Multiple Data stream) architecture for each functional stage, and all stages operate in pipelined mode. The newly devised pipelined linear array processor is thus optimized for row-column image processing, eliminating the need for transposed memory while preserving generality and high throughput. The proposed architecture is implemented with a Xilinx HDL tool, and the required hardware resources are reported in terms of look-up tables, flip-flops, slices, and amount of memory. The results indicate that the proposed architecture can be integrated into a single chip while maintaining video-rate processing speed.

Performance Comparison of Implementation Technologies for Image Quality Enhancement Operations on Android Platforms

  • Lee, Ju-Ho; Lee, Goo-Yeon; Jeong, Choong-Kyo
    • Journal of Digital Contents Society / v.14 no.1 / pp.7-14 / 2013
  • As mobile devices with built-in high-spec cameras have come into wide use, visual quality enhancement of high-resolution images has become one of the key capabilities of mobile devices. Because of the limited computational resources of mobile devices and the size of high-resolution images, one should choose an image processing algorithm that is not too complex and use an efficient implementation technology. One simple and widely used image quality enhancement algorithm is contrast stretching. On Android smartphones, the common implementation technologies are Java libraries running on a virtual machine, JNI (Java Native Interface) based native C/C++, and NEON™ SIMD (Single Instruction Multiple Data). Using these three implementation technologies, we implemented two image contrast stretching algorithms, linear and equalized, and compared the computation times. The native C/C++ and NEON™ SIMD implementations were 56-78 times and 50-76 times faster, respectively, than the Java implementation.
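The two contrast stretching variants named in this abstract can be sketched as follows. This is a minimal plain-Python illustration of the textbook formulations (linear min-max stretching and histogram equalization), not the paper's Java, JNI, or NEON code; the function names and the integer rounding are assumptions made for the example.

```python
def stretch_linear(pixels, out_min=0, out_max=255):
    """Linearly map the observed intensity range onto [out_min, out_max]."""
    lo, hi = min(pixels), max(pixels)
    if hi == lo:
        # Flat image: there is no range to stretch, so return a copy.
        return list(pixels)
    span = out_max - out_min
    # Integer arithmetic keeps the result deterministic (floor rounding).
    return [out_min + (p - lo) * span // (hi - lo) for p in pixels]


def stretch_equalized(pixels, levels=256):
    """Histogram equalization: map each level through the cumulative histogram."""
    hist = [0] * levels
    for p in pixels:
        hist[p] += 1
    cdf, running = [0] * levels, 0
    for level, count in enumerate(hist):
        running += count
        cdf[level] = running
    cdf_min = cdf[min(pixels)]      # cumulative count at the darkest level present
    total = len(pixels)
    if total == cdf_min:
        # Only one intensity level present: nothing to equalize.
        return list(pixels)
    return [(cdf[p] - cdf_min) * (levels - 1) // (total - cdf_min) for p in pixels]
```

Both functions apply the same per-pixel mapping to every element, which is the property that makes this kind of operation a natural fit for SIMD units.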

NTGST-Based Parallel Computer Vision Inspection for High Resolution BLU

  • 김복만; 서경석; 최흥문
    • Journal of the Institute of Electronics Engineers of Korea SP / v.41 no.6 / pp.19-24 / 2004
  • A novel fast parallel NTGST is proposed for high-resolution computer vision inspection of BLUs on an LCD production line. The conventional, computation-intensive NTGST algorithm is modified and its C code is optimized into a fast NTGST suited to SIMD parallel architectures. The input inspection image is then partitioned and allocated to each of the P processors in a multi-threaded implementation, and within each thread the NTGST operates on N data items simultaneously using the SIMD architecture. The proposed inspection system can thus achieve a speedup of O(NP). Experiments on a dual Pentium III processor with MMX and extended MMX SIMD technology show that the proposed parallel NTGST is about 8 times faster (Sp = 8) than the conventional NTGST, demonstrating the scalability of the proposed implementation for fast, high-resolution computer vision inspection of BLUs of various sizes on LCD production lines.
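The O(NP) claim can be illustrated with a small idealized step-count model: the work is split across P threads, and each thread consumes N items per SIMD step. The sketch below ignores memory traffic, synchronization, and any serial portion of the algorithm; the function names and the example values of N and P are hypothetical and are not taken from the paper.

```python
import math


def vector_steps(num_items, simd_width=1, num_threads=1):
    """Count lockstep steps when items are split across threads and SIMD lanes."""
    per_thread = math.ceil(num_items / num_threads)  # size of the largest partition
    return math.ceil(per_thread / simd_width)        # SIMD steps for that partition


def ideal_speedup(num_items, simd_width, num_threads):
    """Ratio of scalar single-thread steps to SIMD multi-thread steps."""
    return vector_steps(num_items) / vector_steps(num_items, simd_width, num_threads)


if __name__ == "__main__":
    # Arbitrary illustrative values: N = 8 lanes, P = 4 threads.
    print(ideal_speedup(1024, simd_width=8, num_threads=4))
```

When the item count divides evenly, the model gives exactly N times P; otherwise the rounding in each partition keeps the speedup slightly below that bound.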

Performance Analysis of Implementation on Image Processing Algorithm for Multi-Access Memory System Including 16 Processing Elements

  • Lee, You-Jin; Kim, Jea-Hee; Park, Jong-Won
    • Journal of the Institute of Electronics Engineers of Korea CI / v.49 no.3 / pp.8-14 / 2012
  • The demand for faster image processing is growing with the spread of high-quality visual media and large-scale image applications such as 3D TV, movies, and augmented reality (AR). A SIMD computer attached to a host computer can accelerate various image processing and massive data operations. MAMS is a multi-access memory system that, together with multiple processing elements (PEs), is well suited to building a high-performance pipelined SIMD machine. MAMS supports simultaneous access to pq data elements within a horizontal, vertical, or block subarray, with a constant interval, at an arbitrary position in an $M{\times}N$ array of data elements, where the number of memory modules (MMs), m, is a prime number greater than pq. MAMS-PP4, the first realization of the MAMS architecture, consists of four PEs in a single chip and five MMs. This paper presents the implementation of image processing algorithms and a performance analysis for MAMS-PP16, which consists of 16 PEs with 17 MMs and extends the prior work, MAMS-PP4. The newly designed MAMS-PP16 has a 64-bit instruction format and an application-specific instruction set. The authors developed a simulator of the MAMS-PP16 system on which the implemented algorithms can be executed, and carried out the performance analysis by running the implemented image processing algorithms on this simulator. Using the pyramid operation as the test case and a Pentium-based serial processor for comparison, the analysis shows that MAMS-PP16 delivers consistent processing times, whereas response times on the serial processor vary randomly.
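Why a prime number of memory modules helps can be checked with a short sketch. The abstract does not state the module assignment function, so the code assumes a simple linear skewing function, `(i*q + j) mod m`, purely for illustration; the function names are hypothetical. Under that assumption, the pq elements of a horizontal, vertical, or block subarray land in pq different modules whenever the interval is not a multiple of m.

```python
def module_index(i, j, q, m):
    """Assumed skewing function: memory module holding element (i, j)."""
    return (i * q + j) % m


def subarray(kind, i, j, r, p, q):
    """Coordinates of the pq elements anchored at (i, j) with interval r."""
    n = p * q
    if kind == "horizontal":   # 1 x pq
        return [(i, j + b * r) for b in range(n)]
    if kind == "vertical":     # pq x 1
        return [(i + a * r, j) for a in range(n)]
    if kind == "block":        # p x q
        return [(i + a * r, j + b * r) for a in range(p) for b in range(q)]
    raise ValueError(f"unknown subarray kind: {kind}")


def is_conflict_free(kind, i, j, r, p, q, m):
    """True when every element of the subarray maps to a different module."""
    modules = [module_index(y, x, q, m) for y, x in subarray(kind, i, j, r, p, q)]
    return len(set(modules)) == len(modules)


if __name__ == "__main__":
    # MAMS-PP16 parameters from the abstract: 16 PEs (p = q = 4) and 17 MMs.
    print(is_conflict_free("block", 3, 5, 2, p=4, q=4, m=17))
    # With 16 modules (not prime), an interval of 2 already causes conflicts.
    print(is_conflict_free("horizontal", 0, 0, 2, p=4, q=4, m=16))
```

With m = 17 every nonzero interval below 17 is invertible modulo m, so the pq offsets stay distinct; with a composite module count such as 16, even intervals collide.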

Optimal Economic Load Dispatch using Parallel Genetic Algorithms in Large Scale Power Systems

  • Kim, Tae-Kyun; Kim, Kyu-Ho; Yu, Seok-Ku
    • The Transactions of the Korean Institute of Electrical Engineers A / v.48 no.4 / pp.388-394 / 1999
  • This paper is concerned with the application of parallel genetic algorithms (PGA) to optimal economic load dispatch (ELD) in power systems. The ELD problem is to minimize the total generation fuel cost of the power outputs of all generating units while satisfying load-balancing constraints. Genetic algorithms (GA) are good candidates for effective parallelization because of their inherent principle of evolving a population of individuals in parallel. Each individual in a population evaluates the fitness function without exchanging data with other individuals. When applying parallel processing to GA, it is possible to use a Single Instruction stream, Multiple Data stream (SIMD) system, a kind of parallel system. The SIMD architecture does not require data communication between the assigned processors. The proposed ELD solution, together with C code, is implemented for parallel processing in SIMSCRIPT, a powerful, free-form, and versatile computer simulation programming language. The proposed algorithm has been tested on a 38-unit system and compared with sequential quadratic programming (SQP).
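The independence of fitness evaluations, which is what makes the GA a good fit for SIMD-style parallelism, can be sketched as below. The abstract does not give the cost model, so the example assumes the quadratic fuel cost commonly used in ELD studies (a + bP + cP² per unit) and a simple penalty term for the load-balance constraint; the function names, the penalty weight, and any coefficients passed in are hypothetical.

```python
def fuel_cost(outputs, coeffs):
    """Total fuel cost, assuming a quadratic cost a + b*P + c*P^2 per unit."""
    return sum(a + b * p + c * p * p for p, (a, b, c) in zip(outputs, coeffs))


def penalized_cost(outputs, coeffs, demand, penalty=1000.0):
    """Fuel cost plus a penalty proportional to the load-balance violation."""
    imbalance = abs(sum(outputs) - demand)
    return fuel_cost(outputs, coeffs) + penalty * imbalance


def evaluate_population(population, coeffs, demand):
    """Score every individual; no individual needs data from any other."""
    return [penalized_cost(individual, coeffs, demand) for individual in population]
```

Because each call to `penalized_cost` reads only its own individual, the loop in `evaluate_population` could be distributed over processors with no communication between them, which matches the property the abstract points out.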

Fast implementation of HEVC inverse DCT using AVX2 instructions

  • Kim, Woori; Jo, Hyunho; Ahn, Yong-Jo; Sim, Dong-Gyu
    • Proceedings of the Korean Society of Broadcast Engineers Conference / 2014.06a / pp.206-208 / 2014
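  • This paper proposes a method for accelerating the IDCT (Inverse Discrete Cosine Transform) module of HEVC (High Efficiency Video Coding) using the AVX2 (Advanced Vector Extensions 2) instruction set. The proposed method loads four $4{\times}4$ blocks into AVX2 registers and then performs the IDCT on all of them at once with AVX2 instructions. Compared with performing the IDCT sequentially, one $4{\times}4$ block at a time, with a SIMD (Single Instruction Multiple Data) instruction set, the proposed method maximizes instruction-level parallelism. In the experiments, applying a SIMD instruction set to the $4{\times}4$ IDCT of the HEVC decoder improved execution speed by an average of 3.35 times over the existing HM-12.1, whereas the proposed method improved execution speed by an average of 9.50 times over HM-12.1.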
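The idea of transforming several 4×4 blocks in lockstep can be sketched in plain Python, with the innermost loop over blocks standing in for SIMD lanes. This is a reference-style illustration, not the authors' AVX2 code: the transform matrix and the two-stage shifts (7, then 12 for 8-bit video) are written here as they are commonly described for the HEVC core transform and should be checked against the standard, and the function names are hypothetical.

```python
# HEVC 4x4 core transform matrix; each row is one basis vector.
M = [
    [64, 64, 64, 64],
    [83, 36, -36, -83],
    [64, -64, -64, 64],
    [36, -83, 83, -36],
]


def clip16(value):
    """Clamp an intermediate result to the signed 16-bit range."""
    return max(-32768, min(32767, value))


def idct4x4_batch(blocks, bit_depth=8):
    """Inverse-transform several 4x4 coefficient blocks in lockstep."""
    shift1, shift2 = 7, 20 - bit_depth
    lanes = range(len(blocks))
    # Stage 1: tmp = M^T * C, rounded and shifted.
    tmp = [[[0] * 4 for _ in range(4)] for _ in lanes]
    for i in range(4):
        for j in range(4):
            for b in lanes:  # same operation applied to every block ("lane")
                acc = sum(M[k][i] * blocks[b][k][j] for k in range(4))
                tmp[b][i][j] = clip16((acc + (1 << (shift1 - 1))) >> shift1)
    # Stage 2: out = tmp * M, rounded and shifted.
    out = [[[0] * 4 for _ in range(4)] for _ in lanes]
    for i in range(4):
        for j in range(4):
            for b in lanes:
                acc = sum(tmp[b][i][k] * M[k][j] for k in range(4))
                out[b][i][j] = clip16((acc + (1 << (shift2 - 1))) >> shift2)
    return out
```

In a vectorized implementation the loop over `lanes` disappears: the coefficients of the four blocks sit side by side in one register, so each multiply-accumulate instruction advances all four blocks at once.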

CPU-GPU² Trigeneous Computing for Iterative Reconstruction in Computed Tomography

  • Oh, Chanyoung; Yi, Youngmin
    • IEIE Transactions on Smart Processing and Computing / v.5 no.4 / pp.294-301 / 2016
  • In this paper, we present methods to efficiently parallelize iterative 3D image reconstruction by exploiting trigeneous devices (three different types of device) at the same time: a CPU, an integrated GPU, and a discrete GPU. We first present a technique that exploits single instruction multiple data (SIMD) architectures in GPUs. Then, we propose a performance estimation model, based on which we can easily find the optimal data partitioning on trigeneous devices. We found that the performance significantly varies by up to 6.23 times, depending on how SIMD units in GPUs are accessed. Then, by using trigeneous devices and the proposed estimation models, we achieve optimal partitioning and throughput, which corresponds to a 9.4% further improvement, compared to discrete GPU-only execution.
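The data partitioning step can be illustrated with a deliberately simple stand-in for the estimation model. The abstract does not describe the model itself, so the sketch assumes each device has a fixed, known throughput and splits the work in proportion to it, so that all devices would finish at about the same time; the function name, device labels, and throughput figures are hypothetical.

```python
def partition(total_items, throughputs):
    """Split total_items across devices in proportion to their throughput."""
    total_rate = sum(throughputs.values())
    shares = {
        device: int(total_items * rate // total_rate)
        for device, rate in throughputs.items()
    }
    # Give the rounding remainder to the fastest device so nothing is dropped.
    fastest = max(throughputs, key=throughputs.get)
    shares[fastest] += total_items - sum(shares.values())
    return shares


if __name__ == "__main__":
    # Hypothetical relative throughputs for a CPU, an integrated GPU, and a discrete GPU.
    print(partition(1000, {"cpu": 1, "igpu": 3, "dgpu": 6}))
```

A real estimation model would also have to account for transfer costs and for how the SIMD units are accessed, which the abstract reports can change performance by up to 6.23 times.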

TDES CODER USING SSE2 TECHNOLOGY

  • Koo, In-Hoi; Kim, Tae-Hoon; Ahn, Sang-Il
    • Proceedings of the KSRS Conference / 2007.10a / pp.114-117 / 2007
  • DES is an improvement on Lucifer, an algorithm developed by IBM in the 1970s. IBM, the National Security Agency (NSA), and the National Bureau of Standards (NBS, now the National Institute of Standards and Technology, NIST) developed the DES algorithm. DES has been studied extensively since its publication and is the most widely used symmetric algorithm in the world. Nowadays, however, Triple DES (TDES) is more widely used than DES, especially in applications that require a high level of data security. Although TDES can be implemented from the standard algorithm, very high-speed TDES codec performance is required when encrypted high-resolution satellite image data is down-linked at high speed. In this paper, Intel SSE2 (Streaming SIMD (Single-Instruction Multiple-Data) Extensions 2) is applied to the TDES decryption algorithm, and its effectiveness in reducing processing time is demonstrated by comparing the time consumed in two cases: the original TDES decryption and TDES decryption with SSE2.

A Study on Application Method of Parallel Processing for Performance Improvement of Sonar-based Undersea Simulation

  • Back, Seoung-Jea; Lee, Keon-Pyo; Ha, Ok-Kyoon
    • Proceedings of the Korean Society of Computer Information Conference / 2018.07a / pp.1-2 / 2018
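  • Sound-wave-based sonar, which suffers comparatively little attenuation in the undersea environment, is widely used to accurately detect undersea objects and obstacles for the safety of ships at sea. In existing sonar image simulation, however, the massive data processing involved (high-resolution images, noise processing, seabed terrain and object data, and so on) greatly increases the processing time and cost of object detection and identification. To minimize this problem, parallel processing techniques such as multi-threading and SIMD are applied to the seabed terrain, object generation, and noise processing models to optimize processing speed. This paper applies a mixed parallel processing method to improve the performance of generating simulated signals for sonar-based undersea environment simulation. The performance gained through parallel processing is compared experimentally with the speed of sequential processing.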

A Study on Tools for Implementing High-speed Neural Network

  • Kim, Pyong-Kun; Kim, Doo-Sik; Lee, Sang-Ho
    • Proceedings of the Korea Information Processing Society Conference / 2002.11a / pp.377-380 / 2002
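  • Neural networks are widely used in many fields, such as character recognition and automatic control. Implementing a neural network, however, involves a large amount of computation, which makes real-time implementation difficult. This paper examines the operations needed to implement a neural network and compares and analyzes methods for implementing them. Neural networks were implemented using a DSP (Digital Signal Processor), the FPU (Floating Point Unit) of a PC, and the SIMD (Single Instruction Multiple Data) technology supported by Intel's Pentium family of processors, and the results were compared and analyzed. In experiments on the MLP (Multi Layer Perceptron) computation, which is the core of the neural network, the method using SIMD technology gave results more than twice as good as the other methods.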
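The MLP computation that dominates the workload is a series of multiply-accumulate loops, which can be sketched as follows. This is a minimal plain-Python illustration, not the paper's DSP, FPU, or SIMD implementation; the sigmoid activation and the function names are assumptions made for the example.

```python
import math


def dense_layer(inputs, weights, biases):
    """One fully connected layer: sigmoid(W x + b), assuming a sigmoid activation."""
    outputs = []
    for row, bias in zip(weights, biases):
        acc = bias
        for w, x in zip(row, inputs):  # multiply-accumulate: the part SIMD units speed up
            acc += w * x
        outputs.append(1.0 / (1.0 + math.exp(-acc)))
    return outputs


def mlp_forward(inputs, layers):
    """Feed the inputs through a list of (weights, biases) layers in order."""
    for weights, biases in layers:
        inputs = dense_layer(inputs, weights, biases)
    return inputs
```

The inner loop performs the same multiply and add on consecutive elements of `row` and `inputs`, so a SIMD instruction set can process several of those element pairs per instruction.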
