• Title/Summary/Keyword: SIMD 병렬 프로세서

Search Result 42, Processing Time 0.024 seconds

Parallel Speedup of NTGST on SIMD type Multiprocessor (SIMD 구조의 다중 프로세서를 이용한 NTGST의 병렬고속화)

  • 김복만;서경석;김종화;최흥문
    • Proceedings of the IEEK Conference
    • /
    • 2001.06d
    • /
    • pp.127-130
    • /
    • 2001
  • 본 논문에서는 SIMD (Single Instruction stream and Multiple Data stream)형 병렬 구조의 다중 프로세서를 이용하여 NTGST (noise-tolerant generalized symmetry transform)를 병렬 고속화하였다. 먼저 NTGST의 화소 및 영상 영역간의 계산 독립성을 이용하여 영상을 분할하여 P개의 프로세서에 할당하고, 이들 각각을 N개의 데이터를 한번에 처리하는 SIMD 구조로 병렬화하여 NP에 비례하는 속도 향상을 얻었다. 실험에서 MMX 기술의 펜티엄 Ⅲ 프로세서를 2개 사용하여 제안한 알고리즘이 기존의 NTGST 보다 8배 가까이 고속으로 처리됨을 확인하였다.

  • PDF

Implementation of Pixel Subword Parallel Processing Instructions for Embedded Parallel Processors (임베디드 병렬 프로세서를 위한 픽셀 서브워드 병렬처리 명령어 구현)

  • Jung, Yong-Bum;Kim, Jong-Myon
    • The KIPS Transactions:PartA
    • /
    • v.18A no.3
    • /
    • pp.99-108
    • /
    • 2011
  • Processor technology is currently continued to parallel processing techniques, not by only increasing clock frequency of a single processor due to the high technology cost and power consumption. In this paper, a SIMD (Single Instruction Multiple Data) based parallel processor is introduced that efficiently processes massive data inherent in multimedia. In addition, this paper proposes pixel subword parallel processing instructions for the SIMD parallel processor architecture that efficiently operate on the image and video pixels. The proposed pixel subword parallel processing instructions store and process four 8-bit pixels on the partitioned four 12-bit registers in a 48-bit datapath architecture. This solves the overflow problem inherent in existing multimedia extensions and reduces the use of many packing/unpacking instructions. Experimental results using the same SIMD-based parallel processor architecture indicate that the proposed pixel subword parallel processing instructions achieve a speedup of $2.3{\times}$ over the baseline SIMD array performance. This is in contrast to MMX-type instructions (a representative Intel multimedia extension), which achieve a speedup of only $1.4{\times}$ over the same baseline SIMD array performance. In addition, the proposed instructions achieve $2.5{\times}$ better energy efficiency than the baseline program, while MMX-type instructions achieve only $1.8{\times}$ better energy efficiency than the baseline program.

Performance Evaluation and Verification of MMX-type Instructions on an Embedded Parallel Processor (임베디드 병렬 프로세서 상에서 MMX타입 명령어의 성능평가 및 검증)

  • Jung, Yong-Bum;Kim, Yong-Min;Kim, Cheol-Hong;Kim, Jong-Myon
    • Journal of the Korea Society of Computer and Information
    • /
    • v.16 no.10
    • /
    • pp.11-21
    • /
    • 2011
  • This paper introduces an SIMD(Single Instruction Multiple Data) based parallel processor that efficiently processes massive data inherent in multimedia. In addition, this paper implements MMX(MultiMedia eXtension)-type instructions on the data parallel processor and evaluates and analyzes the performance of the MMX-type instructions. The reference data parallel processor consists of 16 processors each of which has a 32-bit datapath. Experimental results for a JPEG compression application with a 1280x1024 pixel image indicate that MMX-type instructions achieves a 50% performance improvement over the baseline instructions on the same data parallel architecture. In addition, MMX-type instructions achieves 100% and 51% improvements over the baseline instructions in energy efficiency and area efficiency, respectively. These results demonstrate that multimedia specific instructions including MMX-type have potentials for widely used many-core GPU(Graphics Processing Unit) and any types of parallel processors.

Color Media Instructions for Embedded Parallel Processors (임베디드 병렬 프로세서를 위한 칼라미디어 명령어 구현)

  • Kim, Cheol-Hong;Kim, Jong-Myon
    • Journal of KIISE:Computer Systems and Theory
    • /
    • v.35 no.7
    • /
    • pp.305-317
    • /
    • 2008
  • As a mobile computing environment is rapidly changing, increasing user demand for multimedia-over-wireless capabilities on embedded processors places constraints on performance, power, and sire. In this regard, this paper proposes color media instructions (CMI) for single instruction, multiple data (SIMD) parallel processors to meet the computational requirements and cost goals. While existing multimedia extensions store and process 48-bit pixels in a 32-bit register, CMI, which considers that color components are perceptually less significant, supports parallel operations on two-packed compressed 16-bit YCbCr (6 bit Y and 5 bits Cb, Cr) data in a 32-bit datapath processor. This provides greater concurrency and efficiency for YCbCr data processing. Moreover, the ability to reduce data format size reduces system cost. The reduction in data bandwidth also simplifies system design. Experimental results on a representative SIMD parallel processor architecture show that CMI achieves an average speedup of 6.3x over the baseline SIMD parallel processor performance. This is in contrast to MMX (a representative Intel's multimedia extensions), which achieves an average speedup of only 3.7x over the same baseline SIMD architecture. CMI also outperforms MMX in both area efficiency (a 52% increase versus a 13% increase) and energy efficiency (a 50% increase versus an 11% increase). CMI improves the performance and efficiency with a mere 3% increase in the system area and a 5% increase in the system power, while MMX requires a 14% increase in the system area and a 16% increase in the system power.

Implementation of Parallel Processor for Sound Synthesis of Guitar (기타의 음 합성을 위한 병렬 프로세서 구현)

  • Choi, Ji-Won;Kim, Yong-Min;Cho, Sang-Jin;Kim, Jong-Myon;Chong, Ui-Pil
    • The Journal of the Acoustical Society of Korea
    • /
    • v.29 no.3
    • /
    • pp.191-199
    • /
    • 2010
  • Physical modeling is a synthesis method of high quality sound which is similar to real sound for musical instruments. However, since physical modeling requires a lot of parameters to synthesize sound of a musical instrument, it prevents real-time processing for the musical instrument which supports a large number of sounds simultaneously. To solve this problem, this paper proposes a single instruction multiple data (SIMD) parallel processor that supports real-time processing of sound synthesis of guitar, a representative plucked string musical instrument. To control six strings of guitar, we used a SIMD parallel processor which consists of six processing elements (PEs). Each PE supports modeling of the corresponding string. The proposed SIMD processor can generate synthesized sounds of six strings simultaneously when a parallel synthesis algorithm receives excitation signals and parameters of each string as an input. Experimental results using a sampling rate 44.1 kHz and 16 bits quantization indicate that synthesis sounds using the proposed parallel processor were very similar to original sound. In addition, the proposed parallel processor outperforms commercial TI's TMS320C6416 in terms of execution time (8.9x better) and energy efficiency (39.8x better).

Implementation of SIMD-based Many-Core Processor for Efficient Image Data Processing (효율적인 영상데이터 처리를 위한 SIMD기반 매니코어 프로세서 구현)

  • Choi, Byong-Kook;Kim, Cheol-Hong;Kim, Jong-Myon
    • Journal of the Korea Society of Computer and Information
    • /
    • v.16 no.1
    • /
    • pp.1-9
    • /
    • 2011
  • Recently, as mobile multimedia devices are used more and more, the needs for high-performance and low-energy multimedia processors are increasing. Application-specific integrated circuits (ASIC) can meet the needed high performance for mobile multimedia, but they provide limited, if any, generality needed for various application requirements. DSP based systems can used for various types of applications due to their generality, but they require higher cost and energy consumption as well as less performance than ASICs. To solve this problem, this paper proposes a single instruction multiple data (SIMD) based many-core processor which supports high-performance and low-power image data processing while keeping generality. The proposed SIMD based many-core processor composed of 16 processing elements (PEs) exploits large data parallelism inherent in image data processing. Experimental results indicate that the proposed SIMD-based many-core processor higher performance (22 times better), energy efficiency (7 times better), and area efficiency (3 times better) than conversional commercial high-performance processors.

Hardware Design and Implementation of a Parallel Processor for High-Performance Multimedia Processing (고성능 멀티미디어 처리용 병렬프로세서 하드웨어 설계 및 구현)

  • Kim, Yong-Min;Hwang, Chul-Hee;Kim, Cheol-Hong;Kim, Jong-Myon
    • Journal of the Korea Society of Computer and Information
    • /
    • v.16 no.5
    • /
    • pp.1-11
    • /
    • 2011
  • As the use of mobile multimedia devices is increasing in the recent year, the needs for high-performance multimedia processors are increasing. In this regard, we propose a SIMD (Single Instruction Multiple Data) based parallel processor that supports high-performance multimedia applications with low energy consumption. The proposed parallel processor consists of 16 processing elements (PEs) and operates on a 3-stage pipelining. Experimental results indicated that the proposed parallel processor outperforms conventional parallel processors in terms of performance. In addition, our proposed parallel processor outperforms commercial high-performance TI C6416 DSP in terms of performance (1.4-31.4x better) and energy efficiency (5.9-8.1x better) with same 130nm technology and 720 clock frequency. The proposed parallel processor was developed with verilog HDL and verified with a FPGA prototype system.

Parallel Factorization using Quadratic Sieve Algorithm on SIMD machines (SIMD상에서의 이차선별법을 사용한 병렬 소인수분해 알고리즘)

  • Kim, Yang-Hee
    • The KIPS Transactions:PartA
    • /
    • v.8A no.1
    • /
    • pp.36-41
    • /
    • 2001
  • In this paper, we first design an parallel quadratic sieve algorithm for factoring method. We then present parallel factoring algorithm for factoring a large odd integer by repeatedly using the parallel quadratic sieve algorithm based on the divide-and-conquer strategy on SIMD machines with DMM. We show that this algorithm is optimal in view of the product of time and processor numbers.

  • PDF

NTGST-Based Parallel Computer Vision Inspection for High Resolution BLU (NTGST 병렬화를 이용한 고해상도 BLU 검사의 고속화)

  • 김복만;서경석;최흥문
    • Journal of the Institute of Electronics Engineers of Korea SP
    • /
    • v.41 no.6
    • /
    • pp.19-24
    • /
    • 2004
  • A novel fast parallel NTGST is proposed for high resolution computer vision inspection of the BLUs in a LCD production line. The conventional computation- intensive NTGST algorithm is modified and its C codes are optimized into fast NTGST to be adapted to the SIMD parallel architecture. And then, the input inspection image is partitioned and allocated to each of the P processors in multi-threaded implementation, and the NTGST is executed on SIMD architecture of N data items simultaneously in each thread. Thus, the proposed inspection system can achieve the speedup of O(NP). Experiments using Dual-Pentium III processor with its MMX and extended MMX SIMD technology show that the proposed parallel NTGST is about Sp=8 times faster than the conventional NTGST, which shows the scalability of the proposed system implementation for the fast, high resolution computer vision inspection of the various sized BLUs in LCD production lines.

Parallel Sorting Algorithm by Median-Median (중위수의 중위수에 의한 병렬 분류 알고리즘)

  • Min, Yong-Sik
    • The Journal of the Acoustical Society of Korea
    • /
    • v.14 no.1E
    • /
    • pp.14-21
    • /
    • 1995
  • This paper presents a parallel sorting algorithm suitable for the SIMD multiprocessor. The algorithm finds pivots for partitioning the data into ordered subsets. The data can be evenly distributed to be sorted since it uses the probability theory. For n data elements to be sorted on p processors, when $n{\geq}p^2$, the algorithm is shown to be asymptotically optimal. In practice, sorting 8 million data items on 64 processors achieved a 48.43-fold speedup, while the PSRS required a 44.4-fold speedup. On a variety of shared and distributed memory machines, the algorithm achieved better than half-linear speedups.

  • PDF