• Title/Summary/Keyword: L1 data cache

Search Result 22, Processing Time 0.022 seconds

Variable latency L1 data cache architecture design in multi-core processor under process variation

  • Kong, Joonho
    • Journal of the Korea Society of Computer and Information
    • /
    • v.20 no.9
    • /
    • pp.1-10
    • /
    • 2015
  • In this paper, we propose a new variable latency L1 data cache architecture for multi-core processors. Our proposed architecture extends the traditional variable latency cache to be geared toward the multi-core processors. We added a specialized data structure for recording the latency of the L1 data cache. Depending on the added latency to the L1 data cache, the value stored to the data structure is determined. It also tracks the remaining cycles of the L1 data cache which notifies data arrival to the reservation station in the core. As in the variable latency cache of the single-core architecture, our proposed architecture flexibly extends the cache access cycles considering process variation. The proposed cache architecture can reduce yield losses incurred by L1 cache access time failures to nearly 0%. Moreover, we quantitatively evaluate performance, power, energy consumption, power-delay product, and energy-delay product when increasing the number of cache access cycles.

A Locality-Aware Write Filter Cache for Energy Reduction of STTRAM-Based L1 Data Cache

  • Kong, Joonho
    • JSTS:Journal of Semiconductor Technology and Science
    • /
    • v.16 no.1
    • /
    • pp.80-90
    • /
    • 2016
  • Thanks to superior leakage energy efficiency compared to SRAM cells, STTRAM cells are considered as a promising alternative for a memory element in on-chip caches. However, the main disadvantage of STTRAM cells is high write energy and latency. In this paper, we propose a low-cost write filter (WF) cache which resides between the load/store queue and STTRAM-based L1 data cache. To maximize efficiency of the WF cache, the line allocation and access policies are optimized for reducing energy consumption of STTRAM-based L1 data cache. By efficiently filtering the write operations in the STTRAM-based L1 data cache, our proposed WF cache reduces energy consumption of the STTRAM-based L1 data cache by up to 43.0% compared to the case without the WF cache. In addition, thanks to the fast hit latency of the WF cache, it slightly improves performance by 0.2%.

Energy-Performance Efficient 2-Level Data Cache Architecture for Embedded System (내장형 시스템을 위한 에너지-성능 측면에서 효율적인 2-레벨 데이터 캐쉬 구조의 설계)

  • Lee, Jong-Min;Kim, Soon-Tae
    • Journal of KIISE:Computer Systems and Theory
    • /
    • v.37 no.5
    • /
    • pp.292-303
    • /
    • 2010
  • On-chip cache memories play an important role in both performance and energy consumption points of view in resource-constrained embedded systems by filtering many off-chip memory accesses. We propose a 2-level data cache architecture with a low energy-delay product tailored for the embedded systems. The L1 data cache is small and direct-mapped, and employs a write-through policy. In contrast, the L2 data cache is set-associative and adopts a write-back policy. Consequently, the L1 data cache is accessed in one cycle and is able to provide high cache bandwidth while the L2 data cache is effective in reducing global miss rate. To reduce the penalty of high miss rate caused by the small L1 cache and power consumption of address generation, we propose an ECP(Early Cache hit Predictor) scheme. The ECP predicts if the L1 cache has the requested data using both fast address generation and L1 cache hit prediction. To reduce high energy cost of accessing the L2 data cache due to heavy write-through traffic from the write buffer laid between the two cache levels, we propose a one-way write scheme. From our simulation-based experiments using a cycle-accurate simulator and embedded benchmarks, the proposed 2-level data cache architecture shows average 3.6% and 50% improvements in overall system performance and the data cache energy consumption.

Enhancing GPU Performance by Efficient Hardware-Based and Hybrid L1 Data Cache Bypassing

  • Huangfu, Yijie;Zhang, Wei
    • Journal of Computing Science and Engineering
    • /
    • v.11 no.2
    • /
    • pp.69-77
    • /
    • 2017
  • Recent GPUs have adopted cache memory to benefit general-purpose GPU (GPGPU) programs. However, unlike CPU programs, GPGPU programs typically have considerably less temporal/spatial locality. Moreover, the L1 data cache is used by many threads that access a data size typically considerably larger than the L1 cache, making it critical to bypass L1 data cache intelligently to enhance GPU cache performance. In this paper, we examine GPU cache access behavior and propose a simple hardware-based GPU cache bypassing method that can be applied to GPU applications without recompiling programs. Moreover, we introduce a hybrid method that integrates static profiling information and hardware-based bypassing to further enhance performance. Our experimental results reveal that hardware-based cache bypassing can boost performance for most benchmarks, and the hybrid method can achieve performance comparable to state-of-the-art compiler-based bypassing with considerably less profiling cost.

New Two-Level L1 Data Cache Bypassing Technique for High Performance GPUs

  • Kim, Gwang Bok;Kim, Cheol Hong
    • Journal of Information Processing Systems
    • /
    • v.17 no.1
    • /
    • pp.51-62
    • /
    • 2021
  • On-chip caches of graphics processing units (GPUs) have contributed to improved GPU performance by reducing long memory access latency. However, cache efficiency remains low despite the facts that recent GPUs have considerably mitigated the bottleneck problem of L1 data cache. Although the cache miss rate is a reasonable metric for cache efficiency, it is not necessarily proportional to GPU performance. In this study, we introduce a second key determinant to overcome the problem of predicting the performance gains from L1 data cache based on the assumption that miss rate only is not accurate. The proposed technique estimates the benefits of the cache by measuring the balance between cache efficiency and throughput. The throughput of the cache is predicted based on the warp occupancy information in the warp pool. Then, the warp occupancy is used for a second bypass phase when workloads show an ambiguous miss rate. In our proposed architecture, the L1 data cache is turned off for a long period when the warp occupancy is not high. Our two-level bypassing technique can be applied to recent GPU models and improves the performance by 6% on average compared to the architecture without bypassing. Moreover, it outperforms the conventional bottleneck-based bypassing techniques.

Designing a low-power L1 cache system using aggressive data of frequent reference patterns

  • Jung, Bo-Sung;Lee, Jung-Hoon
    • Journal of the Korea Society of Computer and Information
    • /
    • v.27 no.7
    • /
    • pp.9-16
    • /
    • 2022
  • Today, with the advent of the 4th industrial revolution, IoT (Internet of Things) systems are advancing rapidly. For this reason, a various application with high-performance and large-capacity are emerging. Therefore, there is a need for low-power and high-performance memory for computing systems with these applications. In this paper, we propose an effective structure for the L1 cache memory, which consumes the most energy in the computing system. The proposed cache system is largely composed of two parts, the L1 main cache and the buffer cache. The main cache is 2 banks, and each bank consists of a 2-way set association. When the L1 cache hits, the data is copied into buffer cache according to the proposed algorithm. According to simulation, the proposed L1 cache system improved the performance of energy delay products by about 65% compared to the existing 4-way set associative cache memory.

Workload Characteristics-based L1 Data Cache Switching-off Mechanism for GPUs

  • Do, Thuan Cong;Kim, Gwang Bok;Kim, Cheol Hong
    • Journal of the Korea Society of Computer and Information
    • /
    • v.23 no.10
    • /
    • pp.1-9
    • /
    • 2018
  • Modern graphics processing units (GPUs) have become one of the most attractive platforms in exploiting high thread level parallelism with the support of new programming tools such as CUDA and OpenCL. Recent GPUs has applied cache hierarchy to support irregular memory access patterns; however, L1 data cache (L1D) exhibits poor efficiency in the GPU. This paper shows that the L1D does not always positively affect the applications in terms of performance and energy efficiency for the GPU. The performance of the GPU is even harmed by using the L1D for lots of applications. Our proposed technique exploits the characteristics of the currently-executed applications to predict the performance impact of the L1D on the GPU and then decides whether to continuously use the cache for the application or not. Our experimental results show that the proposed technique improves the GPU performance by 9.4% and saves up to 52.1% of the power consumption in the L1D.

Energy-efficient Set-associative Cache Using Bi-mode Way-selector (에너지 효율이 높은 이중웨이선택형 연관사상캐시)

  • Lee, Sungjae;Kang, Jinku;Lee, Juho;Youn, Jiyong;Lee, Inhwan
    • KIPS Transactions on Computer and Communication Systems
    • /
    • v.1 no.1
    • /
    • pp.1-10
    • /
    • 2012
  • The way-lookup cache and the way-tracking cache are considered to be the most energy-efficient when used for level 1 and level 2 caches, respectively. This paper proposes an energy-efficient set-associative cache using the bi-mode way-selector that combines the way selecting techniques of the way-tracking cache and the way-lookup cache. The simulation results using an Alpha 21264-based system show that the bi-mode way-selecting L1 instruction cache consumes 27.57% of the energy consumed by the conventional set-associative cache and that it is as energy-efficient as the way-lookup cache when used for L1 instruction cache. The bi-mode way-selecting L1 data cache consumes 28.42% of the energy consumed by the conventional set-associative cache, which means that it is more energy-efficient than the way-lookup cache by 15.54% when used for L1 data cache. The bi-mode way-selecting L2 cache consumes 15.41% of the energy consumed by the conventional set-associative cache, which means that it is more energy-efficient than the way-tracking cache by 16.16% when used for unified L2 cache. These results show that the proposed cache can provide the best level of energy-efficiency regardless of the cache level.

Impacts of multiple cache block sizes on system performance (다양한 cache block크기에 의한 시스템의 성능 변화)

  • 이성환;김준성
    • Proceedings of the IEEK Conference
    • /
    • 2003.07d
    • /
    • pp.1347-1350
    • /
    • 2003
  • 본 논문에서는 instruction과 data cache로 나누어지는 L1 cache를 가진 시스템에서 instruction과 data cache 각각의 block 크기 변화가 전체 시스템의 성능에 미치는 영향을 고찰하였다. 이를 위하여 SPEC CPU 벤치마크 프로그램을 입력으로 하는 SimpleScalar를 이용한 시뮬레이션을 수행하였다. 본 연구를 통해서, instruction과 data 각각의 특성에 맞는 cache block 크기를 사용하는 것이 일률적인 cache block 크기를 사용하는 것에 비하여 전체 시스템의 성능을 더욱 향상시켜 준다는 것을 보여준다.

  • PDF

Cache memory system for high performance CPU with 4GHz (4Ghz 고성능 CPU 위한 캐시 메모리 시스템)

  • Jung, Bo-Sung;Lee, Jung-Hoon
    • Journal of the Korea Society of Computer and Information
    • /
    • v.18 no.2
    • /
    • pp.1-8
    • /
    • 2013
  • TIn this paper, we propose a high performance L1 cache structure on the high clock CPU of 4GHz. The proposed cache memory consists of three parts, i.e., a direct-mapped cache to support fast access time, a two-way set associative buffer to exploit temporal locality, and a buffer-select table. The most recently accessed data is stored in the direct-mapped cache. If a data has a high probability of a repeated reference, when the data is replaced from the direct-mapped cache, the data is selectively stored into the two-way set associative buffer. For the high performance and low power consumption, we propose an one way among two ways set associative buffer is selectively accessed based on the buffer-select table(BST). According to simulation results, Energy $^*$ Delay product can improve about 45%, 70% and 75% compared with a direct mapped cache, a four-way set associative cache, and a victim cache with two times more space respectively.