• Title/Summary/Keyword: SIMD 명령어

Search Result 49, Processing Time 0.031 seconds

Optimization Technique for Vertex Programming on Programmable GPU (프로그래밍이 가능한 GPU 상에서의 버텍스 프로그래밍의 최적화 기법)

  • Oh, Jinsang;Ihm, Insung
    • Journal of the Korea Computer Graphics Society
    • /
    • v.8 no.3
    • /
    • pp.25-34
    • /
    • 2002
  • 최근 프로그래밍이 가능한 그래픽스 프로세서(GPU)의 등장은 렌더링 속도의 향상은 물론 기존의 GPU가 할 수 없었던 다양한 그래픽스 계산을 효과적으로 수행할 수 있도록 해주고 있다. 이로 인하여 기존에 CPU 상에서 수행해야만 했던 그래픽스 계산들의 일부를 GPU 상에서 수행하도록 해주는 기법들에 대한 연구가 활발히 진행되고 있다. 본 논문에서는 선형식에 기반을 둔 여러 응용 문제들을 GPU 상에서 효율적으로 구현할 수 있도록 도와주는 쉐이더 코드 최적화 기법을 제안한다. 이 기법은 SIMD 형태의 병렬 처리 능력을 가진 버텍스 쉐이더의 명령어에 맞게 고안되었다. 본 기법의 활용 가능성을 보이기 위하여 미분 방정식을 풀기 위한 4차 런지-쿠타 방법, 선형방정식을 풀기 위한 가우스-자이델 방법, 자연스러운 유체 모델링을 위한 파동 방정식 등의 문제에 적용하여 보았다. 본 논문에서 제안한 최적화 기법은 버텍스 쉐이더 용 컴파일러 구현에 쓰일 수 있으며, 향후 프로그래밍이 가능한 GPU 상에서의 실시간 그래픽스 소프트웨어 개발에 유용하게 사용될 수 있을 것이다.

  • PDF

High Speed Implementation of LEA on ARMv8 (ARMv8 상에서 LEA 암호화 고속 구현)

  • Seo, Hwa-jeong
    • Journal of the Korea Institute of Information and Communication Engineering
    • /
    • v.21 no.10
    • /
    • pp.1929-1934
    • /
    • 2017
  • Lightweight block cipher (Lightweight Encryption Algorithm, LEA), is the most promising block cipher algorithm due to its efficient implementation feature and high security level. The LEA block cipher is widely used in real-field applications and there are many efforts to enhance the performance of LEA in terms of execution timing to achieve the high availability under any circumstances. In this paper, we enhance the performance of LEA block cipher, particularly on ARMv8 processors. The LEA implementation is optimized by using new SIMD instructions namely NEON engine and 24 LEA encryption operations are simultaneously performed in parallel way. In order to reduce the number of memory access, we utilized the all NEON registers to retain the intermediate results. Finally, we evaluated the performance of the LEA implementation, and the proposed implementations on Apple A7 and Apple A9 achieved the 2.4 cycles/byte and 2.2 cycles/byte, respectively.

Accelerating Symmetric and Asymmetric Cryptographic Algorithms with Register File Extension for Multi-words or Long-word Operation (다수 혹은 긴 워드 연산을 위한 레지스터 파일 확장을 통한 대칭 및 비대칭 암호화 알고리즘의 가속화)

  • Lee Sang-Hoon;Choi Lynn
    • Journal of the Institute of Electronics Engineers of Korea CI
    • /
    • v.43 no.2 s.308
    • /
    • pp.1-11
    • /
    • 2006
  • In this paper, we propose a new register file architecture called the Register File Extension for Multi-words or Long-word Operation (RFEMLO) to accelerate both symmetric and asymmetric cryptographic algorithms. Based on the idea that most of cryptographic algorithms heavily use multi-words or long-word operations, RFEMLO allows multiple contiguous registers to be specified as a single operand. Thus, a single instruction can specify a SIMD-style multi-word operation or a long-word operation. RFEMLO can be applied to general purpose processors by adding instruction set for multi-words or long-word operands and functional units for additional instruction set. To evaluate the performance of RFEMLO, we use Simplescalar/ARM 3.0 (with gcc 2.95.2) and run detailed simulations on various symmetric and asymmetric cryptographic algorithms. By applying RFEMLO, we could get maximum 62% and 70% reductions in the total instruction count of symmetric and asymmetric cryptographic algorithms respectively. Also, performance results show that a speedup of 1.4 to 2.6 can be obtained in symmetric cryptographic algorithms and a speedup of 2.5 to 3.3 can be obtained for asymmetric cryptographic algorithms when we apply RFEMLO to a processor with an in-order pipeline. We also found that RFEMLO can effectively improve the performance of these cryptographic algorithms with much less cost compared to issue-width increase available in Superscalar implementations. Moreover, the RFEMLO can also be applied to Superscalar processor, leading to additional 83% and 138% performance gain in symmetric and asymmetric cryptographic algorithms.

Study of parallelization methods for real-time HEVC encoder implementation (실시간 HEVC 인코더 구현을 위한 병렬화 기법에 관한 연구)

  • Ahn, Yongjo;Hwang, Taejin;Lee, Dongkyu;Kim, Sangmin;Oh, Seoung-Jun;Sim, Dong-Gyu
    • Proceedings of the Korean Society of Broadcast Engineers Conference
    • /
    • 2013.06a
    • /
    • pp.119-122
    • /
    • 2013
  • ITU-T VCEG 과 ISO/IEC MPEG 이 공동으로 구성한 JCT-VC (Joint Collaborative Team on Video Coding)이 표준화를 진행 중인 HEVC (High Efficiency Video Coding)은 H.264/AVC 대비 약 2 배의 압축효율을 갖는다. 하지만, 계층적 구조를 갖는 가변크기 블록의 사용과 재귀적 부호화 구조에 따른 인코더의 복잡도 증가는 개선해야 할 문제점으로 지적되고 있다. 본 논문에서는 현재 표준화가 진행 중인 HEVC 인코더의 실시간 구현을 위한 SIMD 명령어를 이용한 data-level 병렬화 기법, CPU 및 GPU 를 이용한 multi-threading 기법과 같은 다양한 병렬화 기법을 소개한다. 또한, 이러한 병렬화 기법들을 HEVC 인코더에 적용하기 위해 적합한 연산 및 기능 모듈에 대하여 소개한다. 본 연구를 통하여 HM (HEVC reference model)에 적용한 결과 $832{\times}480$ 영상의 경우 20-30fps 의 부호화 속도를 나타냈으며, $1920{\times}1080$ 영상의 경우 5-10fps 의 부호화 속도를 나타내었다.

  • PDF

Design and Implementation of Realtime MPEG-2 to MPEG-4 Transcoder (실시간 MPEG-2 to MPEG-4 트랜스코더의 설계 및 구현)

  • Kim Je Woo;Kim Yong-Hwan;Kim Tae-Wan;Choi Beong-Ho
    • Proceedings of the Korean Society of Broadcast Engineers Conference
    • /
    • 2003.11a
    • /
    • pp.143-146
    • /
    • 2003
  • 최근 디지털 당송과 이동통신 단말기의 대중화가 이루어짐에 따라 고화질 고해상도의 멀티미디어 컨텐츠의 이동통신 단말기에서의 재생 서비스에 대한 수요가 증가하고 있다 이동통신 단말기에서 멀티미디어 컨텐츠 재생 서비스를 제공하기 위해서는 디지털 방송 컨텐츠를 단말기에 적합한 컨텐츠로 변환할 필요가 있다. 본 논문은 디지털 방송 규격인 MPEG-2 컨텐츠를 이동통신 단말기에서 지원하는 MPEG-4 SP(Simple Profile) 컨텐츠로 실시간으로 변환하는 트랜스 코더에 대한 설계와 구현 기술을 제안한다. 구현된 트래스코더는 화질 유지와 계산량 감소를 위한 적응적 움직임벡터 재구성, 매크로블록 모드 선택, 그리고 움직임벡터 scaling 등의 알고리즘을 포함하고, 인텔사에서 제공하는 SIMD(Single Instruction Multiple Data) 명령어를 이용하여 최적화되었다. 트랜스코더는 30fps, 8Mbps, $720\times480$ 해상도의 멀티미디어 컨텐츠를 다양한 비트율의 30fps, $352\times240$ 해상도의 MPEG-4 컨텐츠로 실시간 변환할 수 있다.

  • PDF

Study of Parallelization Methods for Software based Real-time HEVC Encoder Implementation (소프트웨어 기반 실시간 HEVC 인코더 구현을 위한 병렬화 기법에 관한 연구)

  • Ahn, Yong-Jo;Hwang, Tae-Jin;Lee, Dongkyu;Kim, Sangmin;Oh, Seoung-Jun;Sim, Dong-Gyu
    • Journal of Broadcast Engineering
    • /
    • v.18 no.6
    • /
    • pp.835-849
    • /
    • 2013
  • Joint Collaborative Team on Video Coding (JCT-VC), which have founded ISO/IEC MPEG and ITU-T VCEG, has standardized High Efficiency Video Coding (HEVC). Standardization of HEVC has started with purpose of twice or more coding performance compared to H.264/AVC. However, flexible and hierarchical coding block and recursive coding structure are problems to overcome of HEVC standard. Many fast encoding algorithms for reducing computational complexity of HEVC encoder have been proposed. However, it is hard to implement a real-time HEVC encoder only with those fast encoding algorithms. In this paper, for implementation of software-based real-time HEVC encoder, data-level parallelism using SIMD instructions and CPU/GPU multi-threading methods are proposed. And we also proposed appropriate operations and functional modules to apply the proposed methods on HM 10.0 software. Evaluation of the proposed methods implemented on HM 10.0 software showed 20-30fps for $832{\times}480$ sequences and 5-10fps for $1920{\times}1080$ sequences, respectively.

Design of an Image Processing ASIC Architecture using Parallel Approach with Zero or Little (통신부담을 감소시킨 영상처리를 위한 병렬처리 방식 ASIC구조 설계)

  • 안병덕;정지원;선우명훈
    • The Journal of Korean Institute of Communications and Information Sciences
    • /
    • v.19 no.10
    • /
    • pp.2043-2052
    • /
    • 1994
  • This paper proposes a new parallel ASIC architecture for real-time image processing to reduce inter-processing element (inter-PE) communication overhead, called a Sliding Memory Plane (SliM) Image Processor. The Slim Image Processor consists of $3\times3$ processing elements (PEs) connected by a mesh topology. With easy scalability due to the topology. a set of SliM Image Processors can form a mesh-connected SIMD parallel architecture. called the SliM Array Processor. The idea of sliding means that all pixels are slided into all neighboring PEs without interrupting PEs and without a coprocessor or a DMA controller. Since the inter-PE communication and computation occur simultaneously. the inter-PE communication overhead, significant disadvantage of existing machines greatly diminishes. Two I/O planes provide a buffering capability and reduce the date I/O overhead. In addition, using the by-passing path provides eight-way connectivity even with four links. with these salient features. SliM shows a significant performance improvement. This paper presents architectures of a PE and the SliM Image Processor, and describes the design of an instruction set.

  • PDF

Design Space Exploration of Many-Core Processor for High-Speed Cluster Estimation (고속의 클러스터 추정을 위한 매니코어 프로세서의 디자인 공간 탐색)

  • Seo, Jun-Sang;Kim, Cheol-Hong;Kim, Jong-Myon
    • Journal of the Korea Society of Computer and Information
    • /
    • v.19 no.10
    • /
    • pp.1-12
    • /
    • 2014
  • This paper implements and improves the performance of high computational subtractive clustering algorithm using a single instruction, multiple data (SIMD) based many-core processor. In addition, this paper implements five different processing element (PE) architectures (PEs=16, 64, 256, 1,024, 4,096) to select an optimal PE architecture for the subtractive clustering algorithm by estimating execution time and energy efficiency. Experimental results using two different medical images and three different resolutions ($128{\times}128$, $256{\times}256$, $512{\times}512$) show that PEs=4,096 achieves the highest performance and energy efficiency for all the cases.

Design to Chip with Multi-Access Memory System and Parallel Processor for 16 Processing Elements of Image Processing Purpose (영상처리용 16개의 처리기를 위한 다중접근기억장치 및 병렬처리기의 칩 설계)

  • Lim, Jae-Ho;Park, Seong-Mi;Park, Jong-Won
    • Journal of Korea Multimedia Society
    • /
    • v.14 no.11
    • /
    • pp.1401-1408
    • /
    • 2011
  • This dissertation present a chip with Multi-Access Memory System(MAMS) and parallel processor for 16 Processing Elements of image processing purpose. MAMS is a kind of parallel access memory system and can simultaneously access to random pixel datas with eight types. It is possible to set a interval about pixel datas to access, too. The parallel processor built-in MAMS actually has been realized in 2003 but its performance fell short of a real time process for high-definition images. I designed a improved parallel processing system by means of addition and expansion of Memory Modules and Processing Elements of previous one. It is feasible to perform a Morphological Closing at the speed of 3 times of the previous one and 6 times of serial system.