• Title/Summary/Keyword: latency hiding

An Efficient Latency Hiding Method Using an Accumulation Buffer

  • Lee, Min-Woo; Han, Tack-Don
    • Proceedings of the Korean Society of Computer Information Conference / 2012.07a / pp.297-300 / 2012
  • Many techniques have been proposed to improve cache performance, and latency hiding has likewise been studied extensively for the efficient use of caches. Various approaches have been investigated, such as write latency hiding using a write buffer and latency hiding using multithreading, and research on latency hiding continues today. This paper likewise proposes an accumulation buffer for efficient latency hiding. We examine the utilization of the accumulation buffer, focusing on how effectively it hides latency and on the additional benefits gained by using the buffer.
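
The abstract gives no implementation details, so the following is only a toy model of the general buffering idea it builds on: writes retire into a small buffer immediately and drain to slow memory in the background, so their latency is exposed only when the buffer fills. All names and parameters are illustrative assumptions, not the paper's accumulation buffer design.

```
#include <cstdio>
#include <cstddef>
#include <queue>

constexpr int MEM_LATENCY = 20;         // cycles per buffered write (assumed)
constexpr std::size_t BUFFER_SIZE = 8;  // buffer entries (assumed)

int main() {
    std::queue<int> buffer;             // pending buffered writes
    int drain_timer = 0, stalls = 0, stores = 0, cycle = 0;
    while (stores < 100) {              // issue 100 stores
        ++cycle;
        // Background drain: one buffered write completes every MEM_LATENCY cycles.
        if (!buffer.empty() && ++drain_timer == MEM_LATENCY) {
            buffer.pop();
            drain_timer = 0;
        }
        // The processor retires a store instantly unless the buffer is full,
        // so memory latency is exposed only on buffer overflow.
        if (buffer.size() < BUFFER_SIZE) buffer.push(stores++);
        else ++stalls;
    }
    std::printf("100 stores took %d cycles, %d of them stalled\n", cycle, stalls);
    return 0;
}
```

With bursty stores a buffer this size hides the write latency completely; a sustained stream eventually fills it, which is why buffer utilization, as studied in the paper, matters.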

Analysis of the GPGPU Performance for Various Combinations of Workloads Executed Concurrently

  • Kim, Dongwhan; Eom, Hyeonsang
    • KIISE Transactions on Computing Practices / v.23 no.3 / pp.165-170 / 2017
  • Many studies have utilized the GPGPU (General-Purpose Graphics Processing Unit) and its high computing power to carry out complex computations. GPGPU programs inherently require memory copies between the host and the device, and the high latency of these copies can dominate program performance, so optimizing them is essential for fast GPGPU programs. When multiple GPGPU programs execute simultaneously, the memory copies of one program can overlap with the computation of another, yielding a latency hiding effect. This paper presents the results of analyzing this latency hiding effect for memory copy operations. Furthermore, we propose a performance prediction model and an algorithm that addresses the limitations of using pinned memory, and we show that the proposed algorithm yields a 41% performance increase.
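
The copy/compute overlap the authors measure is the standard CUDA multi-stream pattern. A minimal sketch, assuming a trivial element-wise kernel and two streams; the chunk count, sizes, and kernel are illustrative, not the paper's workloads.

```
#include <cstdio>
#include <cuda_runtime.h>

// Stand-in compute kernel; the paper's actual workloads are not specified here.
__global__ void scale(float* d, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) d[i] *= 2.0f;
}

int main() {
    const int N = 1 << 22, CHUNKS = 4, CHUNK = N / CHUNKS;
    float *h, *d;
    cudaMallocHost(&h, N * sizeof(float)); // pinned memory: needed for true async copies
    cudaMalloc(&d, N * sizeof(float));
    for (int i = 0; i < N; ++i) h[i] = 1.0f;

    cudaStream_t s[2];
    cudaStreamCreate(&s[0]);
    cudaStreamCreate(&s[1]);

    // Pipelining chunks across two streams lets the copy of one chunk
    // overlap the kernel of another, hiding memory-copy latency.
    for (int c = 0; c < CHUNKS; ++c) {
        cudaStream_t st = s[c % 2];
        float* dp = d + c * CHUNK;
        cudaMemcpyAsync(dp, h + c * CHUNK, CHUNK * sizeof(float),
                        cudaMemcpyHostToDevice, st);
        scale<<<(CHUNK + 255) / 256, 256, 0, st>>>(dp, CHUNK);
        cudaMemcpyAsync(h + c * CHUNK, dp, CHUNK * sizeof(float),
                        cudaMemcpyDeviceToHost, st);
    }
    cudaDeviceSynchronize();
    std::printf("h[0] = %.1f\n", h[0]);    // 2.0 if the pipeline ran correctly
    cudaStreamDestroy(s[0]); cudaStreamDestroy(s[1]);
    cudaFree(d); cudaFreeHost(h);
    return 0;
}
```

Pinned (page-locked) host memory is what makes cudaMemcpyAsync genuinely asynchronous, which is also why the paper's model has to account for the limits of using it.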

Latency Hiding based Warp Scheduling Policy for High Performance GPUs

  • Kim, Gwang Bok; Kim, Jong Myon; Kim, Cheol Hong
    • Journal of the Korea Society of Computer and Information / v.24 no.4 / pp.1-9 / 2019
  • The LRR (Loose Round Robin) warp scheduling policy for GPU architectures yields high warp-level parallelism and balanced loads across multiple warps. However, the traditional LRR policy lets multiple warps reach long-latency operations at the same time; when no more warps can be issued during a long-latency operation, GPU throughput may degrade significantly. In this paper, we propose a new warp scheduling policy that exploits latency hiding, leading to better-utilized memory resources in high-performance GPUs. The proposed warp scheduler prioritizes memory instructions based on the GTO (Greedy Then Oldest) policy in order to reduce memory stalls. When no warp can execute a memory instruction, the scheduler selects a warp for a computation instruction in a round-robin manner. Furthermore, the proposed technique achieves high performance by using additional information about recently committed warps. According to our experimental results, the proposed technique improves GPU performance by 12.7% and 5.6% on average over LRR and GTO, respectively.
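
A rough sketch of the selection logic the abstract describes (greedy on memory instructions, oldest-first among memory-ready warps, round-robin fallback for compute), assuming simple per-warp ready flags rather than real scoreboard state; this is not the paper's hardware design.

```
#include <cstdint>
#include <vector>

struct Warp {
    int id;            // assumes warps[i].id == i
    bool ready_mem;    // next instruction is a ready memory instruction
    bool ready_alu;    // next instruction is a ready compute instruction
    uint64_t age;      // smaller value = older warp
};

// Returns the warp to issue this cycle, or -1 if none is ready.
int select_warp(const std::vector<Warp>& warps, int& rr_cursor, int last_greedy) {
    // 1) Greedy: keep issuing memory instructions from the same warp.
    if (last_greedy >= 0 && warps[last_greedy].ready_mem)
        return last_greedy;
    // 2) Then Oldest: among memory-ready warps, pick the oldest so memory
    //    requests reach the memory system as early as possible.
    int best = -1;
    for (const Warp& w : warps)
        if (w.ready_mem && (best < 0 || w.age < warps[best].age))
            best = w.id;
    if (best >= 0) return best;
    // 3) Fallback: round-robin over compute-ready warps.
    const int n = static_cast<int>(warps.size());
    for (int k = 0; k < n; ++k) {
        int i = (rr_cursor + k) % n;
        if (warps[i].ready_alu) { rr_cursor = (i + 1) % n; return i; }
    }
    return -1;
}
```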

Memory Latency Hiding Techniques

  • Ki, An-Do
    • Electronics and Telecommunications Trends / v.13 no.3 s.51 / pp.61-70 / 1998
  • The obvious way to make a computer system more powerful is to make the processor as fast as possible, and the next step is to adopt a large number of such fast processors. Such a multiprocessor system is useful only if it distributes the workload uniformly and its processors are fully utilized. To achieve high processor utilization, memory access latency must be reduced as much as possible, and whatever latency remains must be hidden. The actual latency can be reduced by using fast logic, and the effective latency can be reduced by using caches. This article discusses what the memory latency problem is, shows how serious it is through analytical and simulation results, and surveys existing techniques for coping with it: write buffers, relaxed consistency models, multithreading, data locality optimization, data forwarding, and data prefetching.
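
Of the surveyed techniques, data prefetching is the easiest to show in source code. A minimal sketch using the GCC/Clang __builtin_prefetch hint; the prefetch distance of 16 elements is an assumed tuning parameter, not a value from the article.

```
#include <cstddef>

// Software data prefetching: request the cache line a fixed distance ahead
// so its load latency overlaps with the work on the current elements.
float sum_with_prefetch(const float* a, std::size_t n) {
    constexpr std::size_t DIST = 16;           // elements ahead (tune per machine)
    float s = 0.0f;
    for (std::size_t i = 0; i < n; ++i) {
        if (i + DIST < n)
            __builtin_prefetch(&a[i + DIST]);  // hint: start the load early
        s += a[i];                             // useful work hides the miss
    }
    return s;
}
```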

Analysis of Programming Techniques for Creating Optimized CUDA Software

  • Kim, Sung-Soo; Kim, Dong-Heon; Woo, Sang-Kyu; Ihm, In-Sung
    • Journal of KIISE: Computing Practices and Letters / v.16 no.7 / pp.775-787 / 2010
  • Unlike general-purpose CPUs, GPUs have been specialized as many-core streaming processors and are replacing CPUs in an increasing range of computations thanks to their outstanding parallel computing capacity. In response to this trend, NVIDIA introduced CUDA (Compute Unified Device Architecture), a parallel computing architecture that offers a flexible GPU programming environment for GPGPU (General-Purpose GPU) computing. To produce efficient parallel software with the CUDA API, programmers must clearly understand many aspects of the GPU's computing architecture. In this article, we explain several optimization techniques for CUDA programming that we have verified through extensive experimentation and trial and error, and we review how those techniques affect the performance of code execution. In particular, we use a specific problem as an example to analyze several factors that affect performance, such as effective access to the hierarchical memory system, processor occupancy, and latency hiding. In conclusion, we present several guidelines that can be applied effectively in CUDA-based parallel programming.
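
A canonical example of the kind of hierarchical-memory optimization the article analyzes (not code from the article itself) is a tiled matrix transpose, where staging through shared memory makes both the global read and the global write coalesced. TILE and the +1 padding are standard choices, not values from the article.

```
#define TILE 32

// Tiled transpose of a w-by-h matrix: each block stages a TILE x TILE tile
// in shared memory so threads read and write global memory in coalesced
// row-major order. The +1 padding avoids shared-memory bank conflicts.
__global__ void transpose_tiled(float* out, const float* in, int w, int h) {
    __shared__ float tile[TILE][TILE + 1];
    int x = blockIdx.x * TILE + threadIdx.x;
    int y = blockIdx.y * TILE + threadIdx.y;
    if (x < w && y < h)
        tile[threadIdx.y][threadIdx.x] = in[y * w + x];   // coalesced read
    __syncthreads();
    x = blockIdx.y * TILE + threadIdx.x;                  // transposed block origin
    y = blockIdx.x * TILE + threadIdx.y;
    if (x < h && y < w)
        out[y * h + x] = tile[threadIdx.x][threadIdx.y];  // coalesced write
}
```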

Study of Cache Performance on GPGPU

  • Choi, Kyu Hyun; Kim, Seon Wook
    • IEIE Transactions on Smart Processing and Computing / v.4 no.2 / pp.78-82 / 2015
  • General-purpose graphics processing units (GPGPUs) provide tremendous computational and processing power. Despite the latency hiding mechanism, a GPU architecture requires high memory bandwidth and low latency between computational units and the memory system. For this reason, the current GPU architecture has a private L1 cache in each core and a shared L2 cache, increasing performance by reducing memory latency. In some cases, however, this CPU-like cache design is not suitable for GPGPUs. In this paper, we analyze cache performance in detail in relation to GPGPU application characteristics, and we suggest technical alternatives for the GPGPU architecture as future work.

Branch Prediction Latency Hiding Scheme using Branch Pre-Prediction and Modified BTB

  • Kim, Ju-Hwan; Kwak, Jong-Wook; Jhon, Chu-Shik
    • Journal of the Korea Society of Computer and Information / v.14 no.10 / pp.1-10 / 2009
  • A precise branch predictor has a profound impact on system performance in modern processor architectures. Recent work shows that prediction latency, as well as prediction accuracy, has a critical impact on overall system performance; nevertheless, prediction latency tends to be overlooked. In this paper, we propose a branch pre-prediction policy to tolerate branch prediction latency. The proposed solution allows the branch predictor to proceed with its prediction without any information from the fetch engine, separating the prediction engine from the fetch stage. In addition, we propose a newly modified BTB structure to support our solution. Simulation results show that the proposed solution hides most of the prediction latency while providing the same level of prediction accuracy. Furthermore, the proposed solution performs even better than the ideal case, that is, a predictor that always completes its prediction in a single cycle. In our experiments, the IPC improvement is up to 11.92%, and 5.15% on average, compared to the conventional predictor system.
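
A minimal software model of the decoupling idea, assuming the predictor can walk its own predicted stream through the BTB and queue fetch addresses ahead of the fetch stage; the structures are illustrative, not the paper's modified BTB.

```
#include <cstdint>
#include <deque>
#include <unordered_map>

struct BTBEntry { uint64_t target; bool taken_hint; };

// The predictor runs ahead of fetch, so its multi-cycle latency is taken
// off the fetch critical path: fetch only dequeues finished predictions.
struct PrePredictor {
    std::unordered_map<uint64_t, BTBEntry> btb;  // branch PC -> prediction
    std::deque<uint64_t> queue;                  // pre-predicted fetch addresses
    uint64_t run_ahead_pc = 0;

    // Advances one step per cycle, independently of the fetch stage.
    void run_ahead_step() {
        auto it = btb.find(run_ahead_pc);
        run_ahead_pc = (it != btb.end() && it->second.taken_hint)
                           ? it->second.target   // predicted taken
                           : run_ahead_pc + 4;   // fall through (RISC-style)
        queue.push_back(run_ahead_pc);
    }

    // Fetch consumes a ready-made prediction with no added latency.
    uint64_t next_fetch_address() {
        if (queue.empty()) run_ahead_step();     // predictor fell behind
        uint64_t a = queue.front();
        queue.pop_front();
        return a;
    }
};
```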

Garbage Collection Synchronization Technique for Improving Tail Latency of Cloud Databases

  • Han, Seungwook; Hahn, Sangwook Shane; Kim, Jihong
    • Journal of KIISE / v.44 no.8 / pp.767-773 / 2017
  • In a distributed system environment such as a cloud database, tail latency needs to be kept short to ensure uniform quality of service. In this paper, through experiments on a Cassandra database, we show that long tail latency is caused by a lack of memory space: the database cannot accept any requests until free space is reclaimed by writing the buffered data to the storage device. We observed that, since the performance of the storage device determines how long writing the buffered data takes, performance degradation of a solid-state drive (SSD) due to garbage collection results in longer tail latency. We propose a garbage collection synchronization technique, called SyncGC, that performs garbage collection in the Java virtual machine and garbage collection in the SSD concurrently, thereby hiding the garbage collection overhead of the SSD. Our evaluations on real SSDs show that SyncGC reduces the 99.9th- and 99.99th-percentile tail latencies by 31% and 36%, respectively.

Early Start Branch Prediction to Resolve Prediction Delay

  • Kwak, Jong-Wook; Kim, Ju-Hwan
    • The KIPS Transactions: Part A / v.16A no.5 / pp.347-356 / 2009
  • Precise branch prediction is a critical factor in the IPC improvement of modern microprocessor architectures. In addition to prediction accuracy, branch prediction delay has a profound impact on overall system performance, yet it tends to be overlooked when architects design branch predictors. To tolerate branch prediction delay, this paper proposes the Early Start Prediction (ESP) technique. The proposed solution dynamically identifies the start instruction of a basic block, called the Basic Block Start Address (BB_SA), and uses BB_SA instead of the branch instruction address itself when predicting the branch direction. The performance of the proposed scheme can be further improved by combining it with a short-interval hiding technique applied between BB_SA and the branch instruction. Simulation results show that the proposed solution hides the prediction latency while providing the same level of prediction accuracy as conventional predictors. Furthermore, the combination with the short-interval hiding technique provides a substantial IPC improvement of up to 10.1%, and the resulting IPC matches that of an ideal branch predictor, regardless of predictor configuration, such as clock frequency, delay model, and PHT size.
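
A sketch of the indexing change at the heart of ESP, using a generic two-bit-counter table as a stand-in for the paper's predictor: keying the lookup by the basic block start address lets the prediction begin as soon as the block is entered, so the predictor's latency overlaps the instructions that precede the branch.

```
#include <cstddef>
#include <cstdint>
#include <vector>

struct ESPredictor {
    std::vector<uint8_t> pht;  // 2-bit saturating counters
    explicit ESPredictor(std::size_t index_bits)
        : pht(std::size_t{1} << index_bits, 1) {}  // init: weakly not-taken

    // Keyed by BB_SA rather than the branch PC: the lookup starts at block
    // entry, so the result is ready by the time fetch reaches the branch.
    bool predict(uint64_t bb_sa) const {
        return pht[bb_sa % pht.size()] >= 2;       // counter >= 2 means "taken"
    }
    void update(uint64_t bb_sa, bool taken) {
        uint8_t& c = pht[bb_sa % pht.size()];
        if (taken && c < 3) ++c;
        if (!taken && c > 0) --c;
    }
};
```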

A Novel Cooperative Warp and Thread Block Scheduling Technique for Improving the GPGPU Resource Utilization

  • Thuan, Do Cong; Choi, Yong; Kim, Jong Myon; Kim, Cheol Hong
    • KIPS Transactions on Computer and Communication Systems / v.6 no.5 / pp.219-230 / 2017
  • General-purpose graphics processing units (GPGPUs) are built on a massively parallel architecture and apply multithreading technology to exploit parallelism. Using programming models such as CUDA and OpenCL, GPGPUs excel at exploiting the plentiful thread-level parallelism exposed by parallel applications. Unfortunately, modern GPGPUs cannot efficiently utilize their available hardware resources for many general-purpose applications. One of the primary reasons is the inefficiency of existing warp/thread block schedulers in hiding long-latency instructions, which forfeits opportunities to improve performance. This paper studies the effect of the hardware thread scheduling policy on GPGPU performance. We propose a novel warp scheduling policy that alleviates the drawbacks of the traditional round-robin policy. The proposed warp scheduler first classifies the warps of a thread block into two groups, warps with long latency and warps with short latency, and then schedules the long-latency warps before the short-latency ones. Furthermore, to support the proposed warp scheduler, we also propose a supplemental technique that dynamically reduces the number of streaming multiprocessors to which thread blocks are assigned when a high degree of contention is encountered at the memory and interconnection network. In experiments on a 15-streaming-multiprocessor GPGPU platform, the proposed warp scheduling policy provides an average IPC improvement of 7.5% over the baseline round-robin policy. This paper also shows that GPGPU performance improves by approximately 8.9% on average when the two proposed techniques are combined.
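
A compact sketch of the two-group issue ordering described above; the per-warp classification flag is illustrative, standing in for what real hardware would derive from decoded instruction types and scoreboard state.

```
#include <algorithm>
#include <vector>

struct WarpInfo {
    int id;
    bool long_latency;  // e.g., next instruction is a global memory access
};

// Issue long-latency warps first so their latency starts being hidden as
// early as possible; stable_partition keeps round-robin order inside each group.
std::vector<int> issue_order(std::vector<WarpInfo> warps) {
    std::stable_partition(warps.begin(), warps.end(),
                          [](const WarpInfo& w) { return w.long_latency; });
    std::vector<int> order;
    order.reserve(warps.size());
    for (const WarpInfo& w : warps) order.push_back(w.id);
    return order;       // long-latency warps first, then short-latency warps
}
```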