• Title/Summary/Keyword: SIMD instruction

Search Result 81, Processing Time 0.044 seconds

Optimization of H.264 Encoder using SIMD Instructions (SIMD 명령어를 이용한 H.264 인코더 최적화)

  • 김용환;김제우;김태완;최병호
    • Proceedings of the Korea Multimedia Society Conference
    • /
    • 2003.11a
    • /
    • pp.175-178
    • /
    • 2003
  • 최근에 표준화가 완료된 차세대 비디오 코딩 표준인 H.264 는 적은 비트율에서 높은 품질의 비디오 압축을 목표로 하기 때문에, H.263+ 및 MPEG-2/4 와 같은 이전의 표준들보다 훨씬 더 많은 연산을 필요로 한다. 본 논문은 SIMD (Single Instruction Multiple Data) 명령어를 가지는 범용 프로세서(예를 들면, 펜티엄 4)에서 H.264 S/W 인코더의 속도 최적화를 위한 알고리듬 및 구현 기술을 제안한다. 화질 저하 없이 RDO (Rate Distortion Optimization) 의 속도를 높일 수 있는 효율적인 모드 검색 건너뛰기 알고리듬을 제안하고, SIMD 명령어를 이용하여 1/4 화소 보간, SAD(Sum of Absolute Difference), SATD(Sum of Absolute Transformed Difference), SSD (Sum of Squared Difference) 등의 개별 루틴의 속도를 최적화한다. 일련의 최적화 후에 인코더는 화질 저하 없이 H.264 레퍼런스 인코더보다 평균 3배 정도의 속도 향상이 이루어진다.

  • PDF

Optimization of Image Averaging Filter Using SIMD Instructions (SIMD 명령어를 이용한 영상의 평균화 필터 최적화)

  • Kim, Jun-Chul;Cui, Xuenan;Park, Eun-Soo;Kim, Hak-Il
    • Proceedings of the KIEE Conference
    • /
    • 2007.10a
    • /
    • pp.431-432
    • /
    • 2007
  • 본 연구에서는 영상의 잡음 제거에 우수한 평균화 필터 알고리즘을 SIMD(Single Instruction Multiple Data) 명령어를 이용하여 고속화하였다. 먼저 순차 알고리즘의 속도 향상을 위한 최적화를 수행하고, SIMD에 적합 하도록 변환 하여 4 또는 16 개의 데이터를 동시에 연산이 가능하도록 하였다. 또한 나누기 연산을 줄이기 위하여 8비트의 데이터 타입과 함께 룩업 테이블(Lookup Table)을 이용하였다. 각각의 데이터 타입에 적합하게 알고리즘을 변환하여 비교하였고 실험을 통하여 기존의 순차 처리방식에 비해 평균적으로 2.5배 이상의 속도 향상을 보았다.

  • PDF

Multi-Core Processor for Real-Time Sound Synthesis of Gayageum (가야금의 실시간 음 합성을 위한 멀티코어 프로세서 구현)

  • Choi, Ji-Won;Cho, Sang-Jin;Kim, Cheol-Hong;Kim, Jong-Myon;Chong, Ui-Pil
    • The KIPS Transactions:PartA
    • /
    • v.18A no.1
    • /
    • pp.1-10
    • /
    • 2011
  • Physical modeling has been widely used for sound synthesis since it synthesizes high quality sound which is similar to real-sound for musical instruments. However, physical modeling requires a lot of parameters to synthesize a large number of sounds simultaneously for the musical instrument, preventing its real-time processing. To solve this problem, this paper proposes a single instruction, multiple data (SIMD) based multi-core processor that supports real-time processing of sound synthesis of gayageum which is a representative Korean traditional musical instrument. The proposed SIMD-base multi-core processor consists of 12 processing elements (PE) to control 12 strings of gayageum in which each PE supports modeling of the corresponding string. The proposed SIMD-based multi-core processor can generate synthesized sounds of 12 strings simultaneously after receiving excitation signals and parameters of each string as an input. Experimental results using a sampling reate 44.1 kHz and 16 bits quantization show that synthesis sound using the proposed multi-core processor was very similar to the original sound. In addition, the proposed multi-core processor outperforms commercial processors(TI's TMS320C6416, ARM926EJ-S, ARM1020E) in terms of execution time ($5.6{\sim}11.4{\times}$ better) and energy efficiency (about $553{\sim}1,424{\times}$ better).

Implementation of Parallel Processor for Sound Synthesis of Guitar (기타의 음 합성을 위한 병렬 프로세서 구현)

  • Choi, Ji-Won;Kim, Yong-Min;Cho, Sang-Jin;Kim, Jong-Myon;Chong, Ui-Pil
    • The Journal of the Acoustical Society of Korea
    • /
    • v.29 no.3
    • /
    • pp.191-199
    • /
    • 2010
  • Physical modeling is a synthesis method of high quality sound which is similar to real sound for musical instruments. However, since physical modeling requires a lot of parameters to synthesize sound of a musical instrument, it prevents real-time processing for the musical instrument which supports a large number of sounds simultaneously. To solve this problem, this paper proposes a single instruction multiple data (SIMD) parallel processor that supports real-time processing of sound synthesis of guitar, a representative plucked string musical instrument. To control six strings of guitar, we used a SIMD parallel processor which consists of six processing elements (PEs). Each PE supports modeling of the corresponding string. The proposed SIMD processor can generate synthesized sounds of six strings simultaneously when a parallel synthesis algorithm receives excitation signals and parameters of each string as an input. Experimental results using a sampling rate 44.1 kHz and 16 bits quantization indicate that synthesis sounds using the proposed parallel processor were very similar to original sound. In addition, the proposed parallel processor outperforms commercial TI's TMS320C6416 in terms of execution time (8.9x better) and energy efficiency (39.8x better).

Parallel Simulation of Bounded Petri Nets using Data Packing Scheme (데이터 중첩을 통한 페트리네트의 병렬 시뮬레이션)

  • 김영찬;김탁곤
    • Journal of the Korea Society for Simulation
    • /
    • v.11 no.2
    • /
    • pp.67-75
    • /
    • 2002
  • This paper proposes a parallel simulation algorithm for bounded Petri nets in a single processor, which exploits the SIMD(Single Instruction Multiple Data)-type parallelism. The proposed algorithm is based on a data packing scheme which packs multiple bytes data in a single register, thereby being manipulated simultaneously. The parallelism can reduce simulation time of bounded Petri nets in a single processor environment. The effectiveness of the algorithm is demonstrated by presenting speed-up of simulation time for two bounded Petri nets.

  • PDF

Low-latency SAO Architecture and its SIMD Optimization for HEVC Decoder

  • Kim, Yong-Hwan;Kim, Dong-Hyeok;Yi, Joo-Young;Kim, Je-Woo
    • IEIE Transactions on Smart Processing and Computing
    • /
    • v.3 no.1
    • /
    • pp.1-9
    • /
    • 2014
  • This paper proposes a low-latency Sample Adaptive Offset filter (SAO) architecture and its Single Instruction Multiple Data (SIMD) optimization scheme to achieve fast High Efficiency Video Coding (HEVC) decoding in a multi-core environment. According to the HEVC standard and its Test Model (HM), SAO operation is performed only at the picture level. Most realtime decoders, however, execute their sub-modules on a Coding Tree Unit (CTU) basis to reduce the latency and memory bandwidth. The proposed low-latency SAO architecture has the following advantages over picture-based SAO: 1) significantly less memory requirements, and 2) low-latency property enabling efficient pipelined multi-core decoding. In addition, SIMD optimization of SAO filtering can reduce the SAO filtering time significantly. The simulation results showed that the proposed low-latency SAO architecture with significantly less memory usage, produces a similar decoding time as a picture-based SAO in single-core decoding. Furthermore, the SIMD optimization scheme reduces the SAO filtering time by approximately 509% and increases the total decoding speed by approximately 7% compared to the existing look-up table approach of HM.

Novel Motion Estimation Scheme to Integer Pixel with a Search Box based on SIMD for Fast HEVC encoding (HEVC 고속 부호화를 위한 SIMD 기반 Search Box 기법의 정수 화소 단위 움직임 추정 방법)

  • Seok, Jinwuk;Kim, Younhee;Ki, MyungSeok;Kim, Hui Yong
    • Proceedings of the Korean Society of Broadcast Engineers Conference
    • /
    • 2015.11a
    • /
    • pp.203-206
    • /
    • 2015
  • 본 논문은 4K UHD 입력 영상에 대한 HEVC 고속 부호화를 위하여 대부분의 상용 CPU 및 AP 에서 사용되고 있는 SIMD (Single Instruction Mutiple Data) 명령어를 사용한 고속의 정수 화소 단위 움직임 추정 방법에 대한 연구이다. 특히, IT 기기에서의 고속 동영상 부호화를 위해 기존의 SIMD 명령어를 개량하여 동일한 CPU 실행시간에 다수의 움직임 추정을 수행할 수 있는 SIMD 명령어를 사용하여 보다 같은 실행시간에 보다 넓은 영역에 대한 움직임 벡터 탐색을 수행할 수 있도록 Search Box 기법을 새로이 개발하고 이를 토대로 기존 HEVC 에서 사용되고 있는 움직임 추정 방법에 대하여 연산시간을 줄이는 동시에 화질 열화를 최소화 시킬 수 있는 방법에 대하여 논한다.

  • PDF

SIMD MAC Unit Design for Multimedia Data Processing (멀티미디어 데이터 처리에 적합한 SIMD MAC 연산기의 설계)

  • Hong, In-Pyo;Jeong, Woo-Kyong;Jeong Jae-Won;Lee Yong-Surk
    • Journal of the Institute of Electronics Engineers of Korea SD
    • /
    • v.38 no.12
    • /
    • pp.44-55
    • /
    • 2001
  • MAC(Multiply and ACcumulate) is the core operation of multimedia data processing. Because MAC units implemented on traditional DSP units or embedded processors have latency of three cycles and cannot operate on multiple data simultaneously, then, performances are seriously limited. Many high end general purpose microprocessors have SIMD MAC unit as a functional unit. But these high end MAC units must support pipeline structure for various operation modes and high clock frequency, which makes control logic complex and increases chip area. In this paper, a 64bit SIMD MAC unit for embedded processors is designed. It is implemented to have a latency of one clock cycle to remove pipeline control logics and a minimal area overhead for SIMD support is added to existing Booth multipliers.

  • PDF

A Partial Access Mechanism on a Register for Low-cost Embedded Multimedia ASIP (저비용 내장형 멀티미디어 프로세서를 위한 분할 레지스터 접근 구조)

  • Joe, Min-Young;Jeong, Ha-Young;Lee, Yong-Surk
    • Journal of the Institute of Electronics Engineers of Korea SD
    • /
    • v.45 no.9
    • /
    • pp.50-56
    • /
    • 2008
  • In this paper, we propose a partial access mechanism for low cost multimedia processors. Due to the cost increase of adding the SIMD register files and the execution blocks, we experience difficulties applying the SIMD instructions to low cost multimedia embedded processors. The proposed mechanism has the advantages of decreasing the cost burden of the additional hardware and enhancing total performance of the SIMD operation. We designed the ASIP in which the mechanism is applied and compared the latency of the SIMD operation regarding the use of instruction sets in the DSP benchmark. Then, we analyzed the total performance enhancement and the reduction in area burden by synthesizing the ASIP using 0.25um TSMC CMOS technology. As a result, there are approximately a 38% of performance increase and a 13.4% of area increase according to the proposed mechanism simulation.

Performance Evaluation and Verification of MMX-type Instructions on an Embedded Parallel Processor (임베디드 병렬 프로세서 상에서 MMX타입 명령어의 성능평가 및 검증)

  • Jung, Yong-Bum;Kim, Yong-Min;Kim, Cheol-Hong;Kim, Jong-Myon
    • Journal of the Korea Society of Computer and Information
    • /
    • v.16 no.10
    • /
    • pp.11-21
    • /
    • 2011
  • This paper introduces an SIMD(Single Instruction Multiple Data) based parallel processor that efficiently processes massive data inherent in multimedia. In addition, this paper implements MMX(MultiMedia eXtension)-type instructions on the data parallel processor and evaluates and analyzes the performance of the MMX-type instructions. The reference data parallel processor consists of 16 processors each of which has a 32-bit datapath. Experimental results for a JPEG compression application with a 1280x1024 pixel image indicate that MMX-type instructions achieves a 50% performance improvement over the baseline instructions on the same data parallel architecture. In addition, MMX-type instructions achieves 100% and 51% improvements over the baseline instructions in energy efficiency and area efficiency, respectively. These results demonstrate that multimedia specific instructions including MMX-type have potentials for widely used many-core GPU(Graphics Processing Unit) and any types of parallel processors.