• Title/Summary/Keyword: single-instruction multiple-data

Search Result 77, Processing Time 0.019 seconds

NTGST-Based Parallel Computer Vision Inspection for High Resolution BLU (NTGST 병렬화를 이용한 고해상도 BLU 검사의 고속화)

  • 김복만;서경석;최흥문
    • Journal of the Institute of Electronics Engineers of Korea SP
    • /
    • v.41 no.6
    • /
    • pp.19-24
    • /
    • 2004
  • A novel fast parallel NTGST is proposed for high resolution computer vision inspection of the BLUs in a LCD production line. The conventional computation- intensive NTGST algorithm is modified and its C codes are optimized into fast NTGST to be adapted to the SIMD parallel architecture. And then, the input inspection image is partitioned and allocated to each of the P processors in multi-threaded implementation, and the NTGST is executed on SIMD architecture of N data items simultaneously in each thread. Thus, the proposed inspection system can achieve the speedup of O(NP). Experiments using Dual-Pentium III processor with its MMX and extended MMX SIMD technology show that the proposed parallel NTGST is about Sp=8 times faster than the conventional NTGST, which shows the scalability of the proposed system implementation for the fast, high resolution computer vision inspection of the various sized BLUs in LCD production lines.

Design Space Exploration of Many-Core Processor for High-Speed Cluster Estimation (고속의 클러스터 추정을 위한 매니코어 프로세서의 디자인 공간 탐색)

  • Seo, Jun-Sang;Kim, Cheol-Hong;Kim, Jong-Myon
    • Journal of the Korea Society of Computer and Information
    • /
    • v.19 no.10
    • /
    • pp.1-12
    • /
    • 2014
  • This paper implements and improves the performance of high computational subtractive clustering algorithm using a single instruction, multiple data (SIMD) based many-core processor. In addition, this paper implements five different processing element (PE) architectures (PEs=16, 64, 256, 1,024, 4,096) to select an optimal PE architecture for the subtractive clustering algorithm by estimating execution time and energy efficiency. Experimental results using two different medical images and three different resolutions ($128{\times}128$, $256{\times}256$, $512{\times}512$) show that PEs=4,096 achieves the highest performance and energy efficiency for all the cases.

Hardware Design and Implementation of a Parallel Processor for High-Performance Multimedia Processing (고성능 멀티미디어 처리용 병렬프로세서 하드웨어 설계 및 구현)

  • Kim, Yong-Min;Hwang, Chul-Hee;Kim, Cheol-Hong;Kim, Jong-Myon
    • Journal of the Korea Society of Computer and Information
    • /
    • v.16 no.5
    • /
    • pp.1-11
    • /
    • 2011
  • As the use of mobile multimedia devices is increasing in the recent year, the needs for high-performance multimedia processors are increasing. In this regard, we propose a SIMD (Single Instruction Multiple Data) based parallel processor that supports high-performance multimedia applications with low energy consumption. The proposed parallel processor consists of 16 processing elements (PEs) and operates on a 3-stage pipelining. Experimental results indicated that the proposed parallel processor outperforms conventional parallel processors in terms of performance. In addition, our proposed parallel processor outperforms commercial high-performance TI C6416 DSP in terms of performance (1.4-31.4x better) and energy efficiency (5.9-8.1x better) with same 130nm technology and 720 clock frequency. The proposed parallel processor was developed with verilog HDL and verified with a FPGA prototype system.

Optimal Design Space Exploration of Multi-core Architecture for Real-time Lane Detection Algorithm (실시간 차선인식 알고리즘을 위한 최적의 멀티코어 아키텍처 디자인 공간 탐색)

  • Jeong, Inkyu;Kim, Jongmyon
    • Asia-pacific Journal of Multimedia Services Convergent with Art, Humanities, and Sociology
    • /
    • v.7 no.3
    • /
    • pp.339-349
    • /
    • 2017
  • This paper proposes a four-stage algorithm for detecting lanes on a driving car. In the first stage, it extracts region of interests in an image. In the second stage, it employs a median filter to remove noise. In the third stage, a binary algorithm is used to classify two classes of backgrond and foreground of an input image. Finally, an image erosion algorithm is utilized to obtain clear lanes by removing noises and edges remained after the binary process. However, the proposed lane detection algorithm requires high computational time. To address this issue, this paper presents a parallel implementation of a real-time line detection algorithm on a multi-core architecture. In addition, we implement and simulate 8 different processing element (PE) architectures to select an optimal PE architecture for the target application. Experimental results indicate that 40×40 PE architecture show the best performance, energy efficiency and area efficiency.

Run-time Memory Optimization Algorithm for the DDMB Architecture (DDMB 구조에서의 런타임 메모리 최적화 알고리즘)

  • Cho, Jeong-Hun;Paek, Yun-Heung;Kwon, Soo-Hyun
    • The KIPS Transactions:PartA
    • /
    • v.13A no.5 s.102
    • /
    • pp.413-420
    • /
    • 2006
  • Most vendors of digital signal processors (DSPs) support a Harvard architecture, which has two or more memory buses, one for program and one or more for data and allow the processor to access multiple words of data from memory in a single instruction cycle. We already addressed how to efficiently assign data to multi-memory banks in our previous work. This paper reports on our recent attempt to optimize run-time memory. The run-time environment for dual data memory banks (DBMBs) requires two run-time stacks to control activation records located in two memory banks corresponding to calling procedures. However, activation records of two memory banks for a procedure are able to have different size. As a consequence, dual run-time stacks can be unbalanced whenever a procedure is called. This unbalance between two memory banks causes that usage of one memory bank can exceed the extent of on-chip memory area although there is free area in the other memory bank. We attempt balancing dual run-time slacks to enhance efficiently utilization of on-chip memory in this paper. The experimental results have revealed that although our algorithm is relatively quite simple, it still can utilize run-time memories efficiently; thus enabling our compiler to run extremely fast, yet minimizing the usage of un-time memory in the target code.

Color Media Instructions for Embedded Parallel Processors (임베디드 병렬 프로세서를 위한 칼라미디어 명령어 구현)

  • Kim, Cheol-Hong;Kim, Jong-Myon
    • Journal of KIISE:Computer Systems and Theory
    • /
    • v.35 no.7
    • /
    • pp.305-317
    • /
    • 2008
  • As a mobile computing environment is rapidly changing, increasing user demand for multimedia-over-wireless capabilities on embedded processors places constraints on performance, power, and sire. In this regard, this paper proposes color media instructions (CMI) for single instruction, multiple data (SIMD) parallel processors to meet the computational requirements and cost goals. While existing multimedia extensions store and process 48-bit pixels in a 32-bit register, CMI, which considers that color components are perceptually less significant, supports parallel operations on two-packed compressed 16-bit YCbCr (6 bit Y and 5 bits Cb, Cr) data in a 32-bit datapath processor. This provides greater concurrency and efficiency for YCbCr data processing. Moreover, the ability to reduce data format size reduces system cost. The reduction in data bandwidth also simplifies system design. Experimental results on a representative SIMD parallel processor architecture show that CMI achieves an average speedup of 6.3x over the baseline SIMD parallel processor performance. This is in contrast to MMX (a representative Intel's multimedia extensions), which achieves an average speedup of only 3.7x over the same baseline SIMD architecture. CMI also outperforms MMX in both area efficiency (a 52% increase versus a 13% increase) and energy efficiency (a 50% increase versus an 11% increase). CMI improves the performance and efficiency with a mere 3% increase in the system area and a 5% increase in the system power, while MMX requires a 14% increase in the system area and a 16% increase in the system power.

Fast and Efficient Implementation of Neural Networks using CUDA and OpenMP (CUDA와 OPenMP를 이용한 빠르고 효율적인 신경망 구현)

  • Park, An-Jin;Jang, Hong-Hoon;Jung, Kee-Chul
    • Journal of KIISE:Software and Applications
    • /
    • v.36 no.4
    • /
    • pp.253-260
    • /
    • 2009
  • Many algorithms for computer vision and pattern recognition have recently been implemented on GPU (graphic processing unit) for faster computational times. However, the implementation has two problems. First, the programmer should master the fundamentals of the graphics shading languages that require the prior knowledge on computer graphics. Second, in a job that needs much cooperation between CPU and GPU, which is usual in image processing and pattern recognition contrary to the graphic area, CPU should generate raw feature data for GPU processing as much as possible to effectively utilize GPU performance. This paper proposes more quick and efficient implementation of neural networks on both GPU and multi-core CPU. We use CUDA (compute unified device architecture) that can be easily programmed due to its simple C language-like style instead of GPU to solve the first problem. Moreover, OpenMP (Open Multi-Processing) is used to concurrently process multiple data with single instruction on multi-core CPU, which results in effectively utilizing the memories of GPU. In the experiments, we implemented neural networks-based text extraction system using the proposed architecture, and the computational times showed about 15 times faster than implementation on only GPU without OpenMP.