• Title/Summary/Keyword: SIMT Architecture

Search Result 8

An Implementation of a Memory Operation System Architecture for Memory Latency Penalty Reduction in SIMT Based Stream Processor (Memory Latency Penalty를 개선한 SIMT 기반 Stream Processor의 Memory Operation System Architecture 설계)

  • Lee, Kwang-Yeob
    • Journal of IKEEE / v.18 no.3 / pp.392-397 / 2014
  • This paper proposes a memory operation system architecture that reduces the memory latency penalty in a SIMT-based stream processor. The proposed architecture uses a non-blocking cache to reduce the cache miss penalty incurred by a blocking cache. We verified that the proposed architecture improves the performance of the stream processor by comparing the processing performance of various algorithms, and we measured how the improvement varies with the ratio of memory instructions in each algorithm. The stream processor's performance improved by a minimum of 8.2% and a maximum of 46.5%.

Design of a SIMT architecture GP-GPU Using Tile based on Graphic Pipeline Structure (타일 기반 그래픽 파이프라인 구조를 사용한 SIMT 구조 GP-GPU 설계)

  • Kim, Do-Hyun;Kim, Chi-Yong
    • Journal of IKEEE / v.20 no.1 / pp.75-81 / 2016
  • This paper proposes a tile-based graphics pipeline design to improve graphics application performance in a SIMT-based GP-GPU. The proposed tile-based graphics pipeline avoids unnecessary graphics processing operations and processes the rasterization step in parallel. Massively parallel data processing through the SIMT architecture improves computational performance, which in turn improves 3D graphics pipeline performance. The more vertex data a 3D model has, the higher the performance. The proposed structure was confirmed to improve processing performance by about 1.18 times to as much as 3 times compared with 'RAMP' and previous studies.

An implementation of a unified ALU in multi-core GPGPU based on SIMT architecture (SIMT 구조 기반 멀티코어 GPGPU의 통합 ALU 설계)

  • Kyung, Gyu-taek;Kwak, Jae-Chang;Lee, Kwang-yeob
    • Proceedings of the Korean Institute of Information and Communication Sciences Conference / 2013.10a / pp.540-543 / 2013
  • This paper describes an implementation of a unified ALU for a multi-core GPGPU based on the SIMT architecture. The unified ALU can execute conditional branch, data movement, integer arithmetic, and floating-point arithmetic instructions. Because a multi-core GPGPU contains many ALUs for parallel processing of various types, the main aim of this paper is to design a minimum-size ALU by unifying the similar processing of each operation at the circuit level. All instructions were tested with a test program, and the results were compared with the results of CPU operations to verify the ALU. The unified ALU's gate size is approximately 20,000 and its maximum operating frequency is 430 MHz.

Implementation of the SIMT based Image Signal Processor for the Image Processing (영상처리를 위한 SIMT 기반 Image Signal Processor 구현)

  • Hwang, Yun-Seop;Jeon, Hee-Kyeong;Lee, Kwan-ho;Lee, Kwang-yeob
    • Journal of IKEEE / v.20 no.1 / pp.89-93 / 2016
  • This paper proposes a SIMT-based Image Signal Processor (ISP) that can apply various image preprocessing algorithms and supports parallel processing of application programs such as image recognition. A conventional ISP has a hard-wired image enhancement algorithm, which is fast but makes it difficult to optimize performance for different image processing algorithms. The proposed ISP improves processing time by applying the SIMT architecture and, as an instruction-based processor, handles a variety of image processing algorithms. Using a Xilinx Virtex-7 board, processing time was reduced by about 71 percent, 63 percent, and 33 percent compared with a cell multicore processor, an ARM Cortex-A9, and an ARM Cortex-A15, respectively.

Design of a High-Performance Mobile GPGPU with SIMT Architecture based on a Small-size Warp Scheduler (작은 크기의 Warp 스케쥴러 기반 SIMT구조 고성능 모바일 GPGPU 설계)

  • Lee, Kwang-Yeob
    • Journal of IKEEE / v.25 no.3 / pp.479-484 / 2021
  • This paper proposes and designs a structure that achieves high performance with a small number of cores in a GPGPU with a SIMT structure. A GPGPU for mobile devices requires a structure that increases performance relative to power consumption. To reduce power consumption, the number of cores was decreased; to improve performance, the size of the warp scheduler for managing threads was set to 4, far smaller than the 32 of a general GPGPU. Reducing the warp size can reduce the number of idle cycles in the pipelines and handle memory latency efficiently, reducing the miss penalty when accessing cache memory. Computational performance was measured with a test program that includes floating-point operations, and power consumption was measured for a 28 nm CMOS process, giving a performance per power of 104.5 GFlops/Watt. This is about four times better performance per power than Nvidia's Tegra K1.
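
One way to see why a smaller warp can leave fewer lanes idle is a toy branch-divergence model. The sketch below is not taken from the paper: the function name `idle_lane_slots`, the one-slot-per-branch-side cost, and the thread pattern are all assumptions made for illustration.

```python
def idle_lane_slots(branch_taken, warp_size):
    """Count lane-slots left idle when a warp must run both sides of a branch.

    Toy model (an assumption, not the paper's method): each side of the
    branch costs one slot per lane, and a warp whose threads disagree on
    the branch executes both sides with the non-participating lanes idle.
    """
    idle = 0
    for start in range(0, len(branch_taken), warp_size):
        warp = branch_taken[start:start + warp_size]
        taken = sum(warp)
        not_taken = len(warp) - taken
        if taken and not_taken:
            # Taken side runs with `not_taken` lanes idle, then the
            # other side runs with `taken` lanes idle.
            idle += not_taken + taken
    return idle


# Hypothetical workload: 16 threads take the branch, 16 do not.
threads = [True] * 16 + [False] * 16
wide = idle_lane_slots(threads, 32)   # one divergent warp of 32
narrow = idle_lane_slots(threads, 4)  # eight uniform warps of 4
```

In this pattern the 32-wide warp diverges while every 4-wide warp is uniform. The model ignores scheduling overhead and memory effects, so it only illustrates the direction of the effect.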

Design of a Dispatch Unit & Operand Selection Unit for Improving the SIMT Based GP-GPU Instruction Performance (SIMT구조 GP-GPU의 명령어 처리 성능 향상을 위한 Dispatch Unit과 Operand Selection Unit설계)

  • Kwak, Jae Chang
    • Journal of IKEEE / v.19 no.3 / pp.455-459 / 2015
  • This paper proposes a dispatch unit for a GP-GPU with SIMT architecture that supports acceleration of general-purpose operations as well as graphics processing. If all the operand information of the instructions issued by the warp scheduler is decoded, unnecessary operand loads occur, placing load on the registers. To resolve this problem, the paper proposes a pre-decoding method that decodes only the operand information, reducing the operand loads and the load on the registers. The operand information from the dispatch unit is passed to the operand selection unit while preventing register bank collisions, which improves overall performance. In simulation with ModelSim SE 10.0b, the total clock cycles required to process 10,000 arbitrary instructions issued from the warp scheduler were measured. The dispatch unit with the proposed pre-decoding function improved processing performance by about 12% compared with the conventional method.

A Study on Architecture Improving Performance of openCV (openCV 의 성능 향상을 위한 아키텍처 연구)

  • Cho, Yeongpil;Heo, Ingoo;Kim, Yongjoo;Paek, Yunheung
    • Proceedings of the Korea Information Processing Society Conference / 2011.11a / pp.18-20 / 2011
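  • As the range of computer vision applications has grown in recent years, use of openCV, a representative computer vision library, has also been increasing. Because of the nature of computer vision algorithms, openCV contains many parts that must perform massive computation. This paper studies architectures that improve openCV performance by reducing this computational burden. Since the massive computation in openCV arises in the inner loops of its library functions, the paper analyzes the characteristics of those loops and investigates which architectures can accelerate them. In conclusion, when the iterations of a loop are independent, SIMD (Single Instruction Multiple Data) and SIMT (Single Instruction Multiple Thread) are suitable; when the iterations are dependent, a pipeline architecture based on MIMD (Multiple Instruction Multiple Data) is suitable.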
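
The distinction between independent and dependent loop iterations can be sketched with two small loops. Both functions are hypothetical illustrations and are not taken from openCV or from the paper.

```python
def scale_pixels(pixels, gain):
    """Independent iterations: each output depends only on its own input,
    so every iteration could map to a separate SIMD lane or SIMT thread."""
    return [min(255, int(p * gain)) for p in pixels]


def running_sum(pixels):
    """Dependent iterations: each step needs the previous step's result,
    so the loop carries a dependence from one iteration to the next."""
    out = []
    acc = 0
    for p in pixels:
        acc += p          # reads the value produced by the prior iteration
        out.append(acc)
    return out


scaled = scale_pixels([10, 100, 200], 1.5)
sums = running_sum([1, 2, 3])
```

In `scale_pixels` the iterations can run in any order or all at once. In `running_sum`, as written, each iteration must wait for the one before it, which is the kind of loop the abstract assigns to a pipelined design.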

A Design of a High Performance Stream Processor without Superscalar Architecture (슈퍼스칼라 구조를 갖지 않는 고성능 Stream Processor 설계)

  • Lee, Kwan-Ho;Kim, Chi-Yong
    • Journal of IKEEE / v.21 no.1 / pp.77-80 / 2017
  • This paper proposes a way to improve GP-GPU performance by removing superscalar issue from the original design. First, the structure of the stream processor was simplified to eliminate superscalar issue. Under this condition, keeping the hardware size constant and increasing the number of threads led to a functional improvement of the GP-GPU. As the number of threads grew, we proposed a new warp scheduler model that adjusts the thread groups. This warp scheduler, without superscalar issue, dispatches instructions to the warp activated by round-robin scheduling. Performance was compared using Gaussian filtering, and the newly designed GP-GPU performed 7.89 times better than the original.
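
Round-robin warp selection, as mentioned in the abstract above, can be sketched in a few lines. This is a minimal software model under assumptions: the function name `round_robin_issue`, the one-instruction-per-cycle issue rate, and the data layout are illustrative and do not describe the paper's hardware.

```python
from collections import deque


def round_robin_issue(warps, cycles):
    """Issue one instruction per cycle, rotating over warps that have work left.

    `warps` maps a warp id to its list of pending instructions
    (a hypothetical layout chosen for this sketch).
    Returns the issue order as a list of (warp_id, instruction) pairs.
    """
    ready = deque(w for w in warps if warps[w])  # warps with pending work
    next_index = {w: 0 for w in warps}           # per-warp program position
    issued = []
    for _ in range(cycles):
        if not ready:
            break
        w = ready.popleft()
        issued.append((w, warps[w][next_index[w]]))
        next_index[w] += 1
        if next_index[w] < len(warps[w]):
            ready.append(w)                      # back of the line
    return issued


order = round_robin_issue({0: ["a", "b"], 1: ["c"], 2: ["d", "e"]}, cycles=10)
```

Each warp gets one issue slot per rotation and drops out of the queue once its instruction list is exhausted.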