• Title/Summary/Keyword: L1 캐시 메모리

Search Result 12, Processing Time 0.022 seconds

Designing a low-power L1 cache system using aggressive data of frequent reference patterns

  • Jung, Bo-Sung;Lee, Jung-Hoon
    • Journal of the Korea Society of Computer and Information
    • /
    • v.27 no.7
    • /
    • pp.9-16
    • /
    • 2022
  • Today, with the advent of the 4th industrial revolution, IoT (Internet of Things) systems are advancing rapidly. For this reason, a various application with high-performance and large-capacity are emerging. Therefore, there is a need for low-power and high-performance memory for computing systems with these applications. In this paper, we propose an effective structure for the L1 cache memory, which consumes the most energy in the computing system. The proposed cache system is largely composed of two parts, the L1 main cache and the buffer cache. The main cache is 2 banks, and each bank consists of a 2-way set association. When the L1 cache hits, the data is copied into buffer cache according to the proposed algorithm. According to simulation, the proposed L1 cache system improved the performance of energy delay products by about 65% compared to the existing 4-way set associative cache memory.

Performance of the Finite Difference Method Using Cache and Shared Memory for Massively Parallel Systems (대규모 병렬 시스템에서 캐시와 공유메모리를 이용한 유한 차분법 성능)

  • Kim, Hyun Kyu;Lee, Hyo Jong
    • Journal of the Institute of Electronics and Information Engineers
    • /
    • v.50 no.4
    • /
    • pp.108-116
    • /
    • 2013
  • Many algorithms have been introduced to improve performance by using massively parallel systems, which consist of several hundreds of processors. A typical example is a GPU system of many processors which uses shared memory. In the case of image filtering algorithms, which make references to neighboring points, the shared memory helps improve performance by frequently accessing adjacent pixels. However, using shared memory requires rewriting the existing codes and consequently results in complexity of the codes. Recent GPU systems support both L1 and L2 cache along with shared memory. Since the L1 cache memory is located in the same area as the shared memory, the improvement of performance is predictable by using the cache memory. In this paper, the performance of cache and shared memory were compared. In conclusion, the performance of cache-based algorithm is very similar to the one of shared memory. The complexity of the code appearing in a shared memory system, however, is resolved with the cache-based algorithm.

Cache memory system for high performance CPU with 4GHz (4Ghz 고성능 CPU 위한 캐시 메모리 시스템)

  • Jung, Bo-Sung;Lee, Jung-Hoon
    • Journal of the Korea Society of Computer and Information
    • /
    • v.18 no.2
    • /
    • pp.1-8
    • /
    • 2013
  • TIn this paper, we propose a high performance L1 cache structure on the high clock CPU of 4GHz. The proposed cache memory consists of three parts, i.e., a direct-mapped cache to support fast access time, a two-way set associative buffer to exploit temporal locality, and a buffer-select table. The most recently accessed data is stored in the direct-mapped cache. If a data has a high probability of a repeated reference, when the data is replaced from the direct-mapped cache, the data is selectively stored into the two-way set associative buffer. For the high performance and low power consumption, we propose an one way among two ways set associative buffer is selectively accessed based on the buffer-select table(BST). According to simulation results, Energy $^*$ Delay product can improve about 45%, 70% and 75% compared with a direct mapped cache, a four-way set associative cache, and a victim cache with two times more space respectively.

Cache Sensitive T-tree Index Structure (캐시를 고려한 T-트리 인덱스 구조)

  • Lee Ig-hoon;Kim Hyun Chul;Hur Jae Yung;Lee Snag-goo;Shim JunHo;Chang Juho
    • Journal of KIISE:Databases
    • /
    • v.32 no.1
    • /
    • pp.12-23
    • /
    • 2005
  • In the past decade, advances in speed of commodity CPUs have iu out-paced advances in memory latency Main-memory access is therefore increasingly a performance bottleneck for many computer applications, including database systems. To reduce memory access latency, cache memory incorporated in the memory subsystem. but cache memories can reduce the memory latency only when the requested data is found in the cache. This mainly depends on the memory access pattern of the application. At this point, previous research has shown that B+ trees perform much faster than T-trees because B+ trees are more cache conscious than T-trees, and also proposed 'Cache Sensitive B+trees' (CSB. trees) that are more cache conscious than B+trees. The goal of this paper is to make T-trees be cache conscious as CSB-trees. We propose a new index structure called a 'Cache Sensitive T-trees (CST-trees)'. We implemented CST-trees and compared performance of CST-trees with performance of other index structures.

Effect of Microkernel Structure on Cache Memory Performance (마이크로커널 구조가 캐시 메모리의 성능에 미치는 영향)

  • Chang, Moon-Seok;Koh, Kern
    • Journal of KIISE:Computer Systems and Theory
    • /
    • v.27 no.1
    • /
    • pp.68-80
    • /
    • 2000
  • The modern software technology toward modularization has changed the cache accessing behavior dramatically. Many modern operating systems are also departing from the past monolithic structure toward the highly modularized structure referred to as microkernel. Microkernel-based operating systems are more portable and extensible, but are likely to have worse performance. This paper quantitatively analyzes the effect of microkernel structure on cache memory to identify the primary factor for its performance degradation. Through the experiment performed on a Intel Pentium Pro processor platform, we found that the microkernel structure suffers from remarkably higher misses for L1, L2 cache and TLB than the monolithic one does. We also found that the performance of a microkernel is more dependent on the efficiency of cache memory than IPC. Finally, we found that these results come from the effect of frequent context switches mainly caused by the structural feature of a microkernel.

  • PDF

Efficient Cache Architecture for Transactional Memory (트랜잭셔널 메모리를 위한 효율적인 캐시 구조)

  • Choi, Dong-Min;Kim, Seung-Hun;Ro, Won-Woo
    • Journal of the Institute of Electronics Engineers of Korea CI
    • /
    • v.48 no.4
    • /
    • pp.1-8
    • /
    • 2011
  • Traditional transactional memory systems are no longer able to guarantee the performance of diverse applications with overflowed transactions since there is the drawback that tracking the data for logging is difficult. Especially, this mechanism has a disadvantage of increasing communication delay for sustaining the state which is required to detect the conflict on the overflowed transactions from the first level cache in the transactional memory systems. To address this point, we have focused on the cache architecture of the systems to reduce the overhead caused by overflows and cache misses. In this paper, we present Supportive Cache which reduces additional overhead during transactions. Supportive Cache performs a parallel look-up with L1 private cache and uses the same replacement policy as L1 private cache. We evaluate the performance of the proposed design by comparing LogTM-SE with and without Supportive Cache. The simulation results show that our system improves the performance by 37% on average, compared to the original LogTM-SE which uses the same hardware resource.

An L1 Cache Prefetching Scheme using Excessively Aggressive Prefetchering and a Small Direct-mapped Filtering Cache (공격적인 선인출 및 직접 사상 필터링을 이용한 L1 캐시 선인출 기법)

  • Chon, Young-Suk
    • Journal of KIISE:Computer Systems and Theory
    • /
    • v.33 no.11
    • /
    • pp.836-852
    • /
    • 2006
  • This paper proposes an L1 cache prefetch scheme using an excessively aggressive hardware prefetcher and a hardware prefetch filter having a small direct-mapped filtering cache. A quantitative analysis method has been introduced and applied to analyze nonideal effects of aggressive cache prefetching. From those analysis results, the structure and algorithm of a prefetch filter has been derived and simulated, and the overall system performance has been measured using a cycle-by-cycle cache simulator. Experimental results show that the proposed scheme improves the overall system performance by 18% on the average over several benchmarks

Compact Field Remapping for Dynamically Allocated Structures (동적으로 할당된 구조체를 위한 압축된 필드 재배치)

  • Kim, Jeong-Eun;Han, Hwan-Soo
    • Journal of KIISE:Software and Applications
    • /
    • v.32 no.10
    • /
    • pp.1003-1012
    • /
    • 2005
  • The most significant difference of embedded systems from general purpose systems is that embedded systems are allowed to use only limited resources including battery and memory. Especially, the number of applications increases which deal with multimedia data. In those systems with high data computations, the delay of memory access is one of the major bottlenecks hurting the system performance. As a result, many researchers have investigated various techniques to reduce the memory access cost. Most programs generally have locality in memory references. Temporal locality of references means that a resource accessed at one point will be used again in the near future. Spatial locality of references is that likelihood of using a resource gets higher if resources near it were just accessed. The latest embedded processors usually adapt cache memory to exploit these two types of localities. Processors access faster cache memory than off-chip memory, reducing the latency. In this paper we will propose the enhanced dynamic allocation technique for structure-type data in order to eliminate unused memory space and to reduce both the cache miss rate and the application execution time. The proposed approach aggregates fields from multiple records dynamically allocated and consecutively remaps them on the memory space. Experiments on Olden benchmarks show $13.9\%$ L1 cache miss rate drop and $15.9\%$ L2 cache miss drop on average, compared to the previously proposed techniques. We also find execution time reduced by $10.9\%$ on average, compared to the previous work.

A Study on Design and Cache Replacement Policy for Cascaded Cache Based on Non-Volatile Memories (비휘발성 메모리 시스템을 위한 저전력 연쇄 캐시 구조 및 최적화된 캐시 교체 정책에 대한 연구)

  • Juhee Choi
    • Journal of the Semiconductor & Display Technology
    • /
    • v.22 no.3
    • /
    • pp.106-111
    • /
    • 2023
  • The importance of load-to-use latency has been highlighted as state-of-the-art computing cores adopt deep pipelines and high clock frequencies. The cascaded cache was recently proposed to reduce the access cycle of the L1 cache by utilizing differences in latencies among banks of the cache structure. However, this study assumes the cache is comprised of SRAM, making it unsuitable for direct application to non-volatile memory-based systems. This paper proposes a novel mechanism and structure for lowering dynamic energy consumption. It inserts monitoring logic to keep track of swap operations and write counts. If the ratio of swap operations to total write counts surpasses a set threshold, the cache controller skips the swap of cache blocks, which leads to reducing write operations. To validate this approach, experiments are conducted on the non-volatile memory-based cascaded cache. The results show a reduction in write operations by an average of 16.7% with a negligible increase in latencies.

  • PDF

Instructions and Data Prefetch Mechanism using Displacement History Buffer (변위 히스토리 버퍼를 이용한 명령어 및 데이터 프리페치 기법)

  • Jeong, Yong Su;Kim, JinHyuk;Cho, Tae Hwan;Choi, SangBang
    • Journal of the Institute of Electronics and Information Engineers
    • /
    • v.52 no.10
    • /
    • pp.82-94
    • /
    • 2015
  • In this paper, we propose hardware prefetch mechanism with an efficient cache replacement policy by giving priority to the trigger block in which a spatial region and producing a spatial region by using the displacement field. It could be taken into account the sequence of the program since a history is based on the trigger block of history record, and it could be quickly prefetching the instructions or data address by adding a stored value to the trigger address and displacement field since a history is stored as a displacement value. Also, we proposed a method of replacing at random by the cache replacement policy from the low priority block when the cache area is full after giving priority to the trigger block. We analyzed using the memory simulator program gem5 and PARSEC benchmark to assess the performance of the hardware prefetcher. As a result, compared to the existing hardware prefecture to generate the spatial region using a bit vector, L1 data cache miss rate was reduced about 44.5% on average and an average of 26.1% of L1 instruction misses occur. In addition, IPC (Instruction Per Cycle) showed an improvement of about 23.7% on average.