• Title/Summary/Keyword: L1 data cache

Design of Cache Memory System for Next Generation CPU (차세대 CPU를 위한 캐시 메모리 시스템 설계)

  • Jo, Ok-Rae;Lee, Jung-Hoon
    • IEMEK Journal of Embedded Systems and Applications
    • /
    • v.11 no.6
    • /
    • pp.353-359
    • /
    • 2016
  • In this paper, we propose a high-performance L1 cache structure for high-clock-rate CPUs. The proposed cache memory consists of three parts: a direct-mapped cache to support fast access time, a two-way set-associative buffer to reduce the miss ratio, and a way-select table. The most recently accessed data is stored in the direct-mapped cache. When data with a high probability of repeated reference is replaced from the direct-mapped cache, it is stored in the two-way set-associative buffer. For high performance together with fast access time, we propose that only one of the buffer's two ways be selectively accessed, based on the way-select table (WST). According to simulation results, access time can be reduced by about 7% and 40% compared with a direct-mapped cache and an Intel i7-6700 with twice the capacity, respectively.
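
The lookup path can be sketched as follows. This is an illustrative model, not the paper's design: the set counts are assumed, and where the paper accesses only the single WST-predicted way, this sketch falls back to probing the second way on a misprediction so the model stays functionally complete.

```c
#include <stdbool.h>
#include <stdint.h>

#define DM_SETS  256   /* direct-mapped cache sets (illustrative)      */
#define BUF_SETS  64   /* buffer sets, two ways each (illustrative)    */

typedef struct { uint32_t tag; bool valid; } Line;

static Line    dm[DM_SETS];       /* most recently accessed blocks     */
static Line    buf[BUF_SETS][2];  /* two-way set-associative buffer    */
static uint8_t wst[BUF_SETS];     /* way-select table: predicted way   */

/* Insert a block evicted from the DM cache into the buffer. */
static void insert_buffer(uint32_t block)
{
    uint32_t s = block % BUF_SETS;
    int w = wst[s] ^ 1;                            /* evict the cold way  */
    buf[s][w] = (Line){ block / BUF_SETS, true };
    wst[s] = (uint8_t)w;                           /* now the hot way     */
}

/* One load of a block address; returns true on a hit anywhere. */
bool access_block(uint32_t block)
{
    uint32_t di = block % DM_SETS, dtag = block / DM_SETS;
    if (dm[di].valid && dm[di].tag == dtag)
        return true;                               /* fast DM hit         */

    uint32_t bi = block % BUF_SETS, btag = block / BUF_SETS;
    bool hit = false;
    for (int probe = 0; probe < 2 && !hit; probe++) {
        int w = wst[bi] ^ probe;                   /* predicted way first */
        if (buf[bi][w].valid && buf[bi][w].tag == btag) {
            wst[bi] = (uint8_t)w;                  /* train the WST       */
            buf[bi][w].valid = false;              /* promote to DM cache */
            hit = true;
        }
    }
    /* The block lands in the DM cache; a valid DM victim is assumed
     * to be re-referenced soon, so it moves into the buffer instead
     * of being dropped. */
    if (dm[di].valid)
        insert_buffer(dm[di].tag * DM_SETS + di);
    dm[di] = (Line){ dtag, true };
    return hit;
}
```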

Efficient Cache Architecture for Transactional Memory (트랜잭셔널 메모리를 위한 효율적인 캐시 구조)

  • Choi, Dong-Min;Kim, Seung-Hun;Ro, Won-Woo
    • Journal of the Institute of Electronics Engineers of Korea CI
    • /
    • v.48 no.4
    • /
    • pp.1-8
    • /
    • 2011
  • Traditional transactional memory systems can no longer guarantee the performance of diverse applications with overflowed transactions, since tracking the data for logging becomes difficult. In particular, this mechanism has the disadvantage of increasing communication delay to sustain the state required to detect conflicts on overflowed transactions from the first-level cache of the transactional memory system. To address this point, we have focused on the cache architecture of these systems to reduce the overhead caused by overflows and cache misses. In this paper, we present the Supportive Cache, which reduces the additional overhead incurred during transactions. The Supportive Cache performs look-ups in parallel with the L1 private cache and uses the same replacement policy as the L1 private cache. We evaluate the performance of the proposed design by comparing LogTM-SE with and without the Supportive Cache. The simulation results show that our system improves performance by 37% on average, compared to the original LogTM-SE using the same hardware resources.
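
A minimal sketch of the parallel look-up, with assumed geometry (128 sets, two ways) and the Supportive Cache modeled as a second array of identical shape; in hardware both arrays are probed in the same cycle, which the sequential code below only approximates.

```c
#include <stdbool.h>
#include <stdint.h>

#define SETS 128
#define WAYS   2

typedef struct { uint32_t tag; bool valid; } Way;

static Way l1[SETS][WAYS];   /* L1 private cache                      */
static Way sc[SETS][WAYS];   /* Supportive Cache: same geometry and   */
                             /* same replacement policy as the L1     */

static bool lookup(Way c[][WAYS], uint32_t set, uint32_t tag)
{
    for (int w = 0; w < WAYS; w++)
        if (c[set][w].valid && c[set][w].tag == tag)
            return true;
    return false;
}

/* Both structures are probed for the same set/tag; `|` (not `||`)
 * forces both probes to happen, mirroring the parallel look-up. */
bool parallel_lookup(uint32_t block)
{
    uint32_t set = block % SETS, tag = block / SETS;
    return lookup(l1, set, tag) | lookup(sc, set, tag);
}
```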

Performance Analysis of Cache and Internal Memory of a High Performance DSP for an Optimal Implementation of Motion Picture Encoder (고성능 DSP에서 동영상 인코더의 최적화 구현을 위한 캐쉬 및 내부 메모리 성능 분석)

  • Lim, Se-Hun;Chung, Sun-Tae
    • The Journal of the Korea Contents Association
    • /
    • v.8 no.5
    • /
    • pp.72-81
    • /
    • 2008
  • High-performance DSPs usually support both cache and internal memory. For an optimal implementation of a multimedia stream application on such a DSP, one needs to utilize the cache and internal memory efficiently. In this paper, we analyze the cache performance and the internal memory configuration and placement needed to achieve an optimal implementation of multimedia stream applications, such as a motion picture encoder, on a high-performance DSP of the TMS320C6000 series, and we propose strategies to improve performance through cache and internal memory placement. The results of our analysis and experiments verify that a 2-way L2 cache configuration, with the remaining memory configured as internal memory, shows relatively good performance. We also show that the L1P cache hit rate is enhanced when frequently called routines, and routines having caller-callee relationships with them, are placed contiguously in the internal memory, and that the L1D cache hit rate is enhanced by a simple change of the data size. These results are expected to contribute to optimal implementations of multimedia stream applications on high-performance DSPs.
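
The two placement ideas can be illustrated with compiler directives. The routine names, section name, and line size below are assumptions; the `#pragma CODE_SECTION` syntax follows TI's C6000 compiler, which the paper targets.

```c
/* Group a hot caller and its callee into one contiguous code section
 * so they occupy adjacent L1P lines (TI C6000 pragma; names assumed). */
#pragma CODE_SECTION(encode_mb, ".hot_code")
#pragma CODE_SECTION(quantize,  ".hot_code")

#define L1D_LINE 64   /* assumed L1D line size in bytes */

/* Aligning a hot block to an L1D line is one example of the "simple
 * change of the data size" credited for the L1D hit-rate gain. */
typedef struct {
    _Alignas(L1D_LINE) short coeff[60];
} Block;

void quantize(Block *b)  { (void)b; }       /* callee, placed adjacently */
void encode_mb(Block *b) { quantize(b); }   /* caller, same section      */
```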

Warp-Based Load/Store Reordering to Improve GPU Time Predictability

  • Huangfu, Yijie;Zhang, Wei
    • Journal of Computing Science and Engineering
    • /
    • v.11 no.2
    • /
    • pp.58-68
    • /
    • 2017
  • While graphics processing units (GPUs) can be used to improve the performance of real-time embedded applications that require high throughput, it is challenging to estimate the worst-case execution time (WCET) of GPU programs, because modern GPUs are designed to improve average-case performance rather than time predictability. In this paper, a reordering framework is proposed to regulate access to the GPU data cache, which helps to improve the accuracy of the estimated GPU L1 data cache miss rate with low performance overhead. With the improved cache miss rate estimation, tighter WCET estimates can also be achieved for GPU programs.
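
A sketch of the reordering step, under the assumption that requests buffered in one scheduling window can be ordered by warp ID before reaching the L1 data cache; the struct fields and window interface are illustrative, not the paper's microarchitecture.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

typedef struct { int warp_id; uint64_t addr; } MemReq;

/* Order requests by warp ID, then by address within a warp. */
static int by_warp(const void *a, const void *b)
{
    const MemReq *x = a, *y = b;
    if (x->warp_id != y->warp_id)
        return (x->warp_id > y->warp_id) - (x->warp_id < y->warp_id);
    return (x->addr > y->addr) - (x->addr < y->addr);
}

/* Reorder the requests buffered in one scheduling window before they
 * reach the L1 data cache. Hardware would use per-warp queues rather
 * than a sort, but the resulting deterministic issue order is the
 * property that makes the miss pattern analyzable for WCET. */
void reorder_window(MemReq *reqs, size_t n)
{
    qsort(reqs, n, sizeof *reqs, by_warp);
}
```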

Analysis on the GPU Performance according to Hierarchical Memory Organization (계층적 메모리 구성에 따른 GPU 성능 분석)

  • Choi, Hongjun;Kim, Jongmyon;Kim, Cheolhong
    • The Journal of the Korea Contents Association
    • /
    • v.14 no.3
    • /
    • pp.22-32
    • /
    • 2014
  • Recently, GPGPU has been widely used for general-purpose processing as well as graphics processing by providing hardware optimized for parallel execution. The memory system has a large effect on the performance of parallel processing units such as the GPU. In the GPU, a hierarchical memory architecture is implemented for high memory bandwidth, and both memory address coalescing and memory request merging techniques are widely used. This paper analyzes GPU performance across various memory organizations. According to our simulation results, GPU performance improves by 15.5%, 21.5%, 25.5%, and 30.9% when adding an 8KB, 16KB, 32KB, or 64KB L1 cache, respectively, compared to the case without an L1 cache. However, the experiments also show that some benchmarks lose performance, since memory transactions increase due to data dependency. Moreover, average memory access latency increases with the depth of the cache hierarchy when cache misses occur frequently.
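
The latency observation follows directly from the average memory access time (AMAT) recurrence; the sketch below uses illustrative latencies and hit rates, not the paper's measured values.

```c
#include <stdio.h>

/* AMAT for a two-level hierarchy:
 * AMAT = L1_hit_time + L1_miss_rate * (L2_hit_time + L2_miss_rate * DRAM) */
static double amat(double l1_hit, double l1_miss_rate,
                   double l2_hit, double l2_miss_rate, double dram)
{
    return l1_hit + l1_miss_rate * (l2_hit + l2_miss_rate * dram);
}

int main(void)
{
    /* a useful L1 (70% hits) vs. an L1 that mostly misses (10% hits) */
    printf("useful L1 : %.1f cycles\n", amat(4, 0.30, 20, 0.40, 200));
    printf("thrashing : %.1f cycles\n", amat(4, 0.90, 20, 0.40, 200));
    /* when misses dominate, each level only adds probe latency before
     * DRAM, matching the observation for data-dependent benchmarks */
    return 0;
}
```

With these illustrative numbers the useful L1 yields 34.0 cycles while the thrashing one yields 94.0, showing why a deeper hierarchy hurts once misses dominate.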

Analysis of GPGPU Performance by dedicating L2 Cache for Texture Data (텍스쳐 데이터를 위한 2차 캐쉬 구조를 가지는 그래픽 처리 장치의 성능 분석)

  • Kim, Gwang Bok;Kim, Cheol Hong
    • Proceedings of the Korean Society of Computer Information Conference
    • /
    • 2017.01a
    • /
    • pp.143-144
    • /
    • 2017
  • Recently, graphics processing units have adopted several memory hierarchy levels to reduce accesses to DRAM. Unlike the L1 memories, which are accessed separately according to the type of the requested data, the L2 cache of a GPGPU is accessible to all data types before the long-latency DRAM is accessed. In this paper, we analyze how performance changes when the L2 cache, which normally accepts and stores every data type specified by the application, is restricted to texture data only. For this experiment, the architecture is modified so that data types other than texture data bypass the L2 cache and access DRAM directly. Our analysis shows that allowing only texture data degrades performance on most benchmarks, with an average decrease of 5.58% compared to the baseline architecture. Conversely, for needle, an application with a low L2 hit rate in our experimental environment, bypassing unnecessary L2 accesses leads to an overall performance improvement.
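
The evaluated routing can be summarized in a few lines; the request-type enum and the stubbed L2/DRAM interfaces are assumptions for illustration.

```c
#include <stdbool.h>
#include <stdint.h>

typedef enum { REQ_TEXTURE, REQ_GLOBAL, REQ_CONST, REQ_LOCAL } ReqType;

/* stubs standing in for the real L2 and DRAM models */
static bool l2_lookup(uint64_t addr)   { (void)addr; return false; }
static void dram_access(uint64_t addr) { (void)addr; }

/* Only texture requests may allocate in L2; every other data type
 * bypasses it and goes straight to the long-latency DRAM path. */
void route_request(ReqType type, uint64_t addr)
{
    if (type == REQ_TEXTURE) {
        if (!l2_lookup(addr))          /* L2 miss: fill from DRAM */
            dram_access(addr);
    } else {
        dram_access(addr);             /* bypass L2 entirely      */
    }
}
```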

Instructions and Data Prefetch Mechanism using Displacement History Buffer (변위 히스토리 버퍼를 이용한 명령어 및 데이터 프리페치 기법)

  • Jeong, Yong Su;Kim, JinHyuk;Cho, Tae Hwan;Choi, SangBang
    • Journal of the Institute of Electronics and Information Engineers
    • /
    • v.52 no.10
    • /
    • pp.82-94
    • /
    • 2015
  • In this paper, we propose a hardware prefetch mechanism with an efficient cache replacement policy: priority is given to the trigger block of a spatial region, and the spatial region is generated using the displacement field. The program's access sequence can be taken into account because the history is recorded per trigger block, and instruction or data addresses can be prefetched quickly by adding the stored displacement values to the trigger address, since the history is kept as displacements. We also propose a replacement policy that, after giving priority to the trigger block, randomly replaces one of the low-priority blocks when the cache area is full. We used the gem5 memory simulator and the PARSEC benchmarks to assess the performance of the hardware prefetcher. As a result, compared to an existing hardware prefetcher that generates the spatial region using a bit vector, the L1 data cache miss rate was reduced by about 44.5% on average and the L1 instruction cache miss rate by about 26.1%. In addition, IPC (instructions per cycle) improved by about 23.7% on average.
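
A hedged sketch of the prefetch-generation step: each history entry pairs a trigger block with the displacements observed in its spatial region, and prefetch addresses are formed by adding each stored displacement to the trigger address. The table geometry and entry format are assumptions, not the paper's exact design.

```c
#include <stdint.h>
#include <stdio.h>

#define ENTRIES 64    /* history entries (assumed)                  */
#define DISPS    4    /* displacements tracked per trigger (assumed) */

typedef struct {
    uint64_t trigger;         /* trigger block address      */
    int32_t  disp[DISPS];     /* signed displacement field  */
    int      ndisp;
} HistEntry;

static HistEntry dhb[ENTRIES];    /* displacement history buffer */

static void issue_prefetch(uint64_t block)
{
    printf("prefetch block %#llx\n", (unsigned long long)block);
}

/* Called when a trigger block is accessed again: the stored
 * displacements are added to the trigger address, so the whole
 * spatial region is covered with simple additions. */
void on_trigger(uint64_t trigger_block)
{
    HistEntry *e = &dhb[trigger_block % ENTRIES];
    if (e->trigger != trigger_block)
        return;                            /* no history recorded yet */
    for (int i = 0; i < e->ndisp; i++)
        issue_prefetch(trigger_block + e->disp[i]);
}
```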

Cache Sensitive T-tree Index Structure (캐시를 고려한 T-트리 인덱스 구조)

  • Lee Ig-hoon;Kim Hyun Chul;Hur Jae Yung;Lee Sang-goo;Shim JunHo;Chang Juho
    • Journal of KIISE:Databases
    • /
    • v.32 no.1
    • /
    • pp.12-23
    • /
    • 2005
  • In the past decade, advances in the speed of commodity CPUs have far out-paced advances in memory latency. Main-memory access is therefore increasingly a performance bottleneck for many computer applications, including database systems. To reduce memory access latency, cache memory is incorporated into the memory subsystem, but cache memories can reduce memory latency only when the requested data is found in the cache; this mainly depends on the memory access pattern of the application. Previous research has shown that B+-trees perform much faster than T-trees because B+-trees are more cache conscious than T-trees, and it also proposed 'Cache Sensitive B+-trees' (CSB+-trees), which are more cache conscious than B+-trees. The goal of this paper is to make T-trees as cache conscious as CSB+-trees. We propose a new index structure called 'Cache Sensitive T-trees' (CST-trees). We implemented CST-trees and compared their performance with that of other index structures.
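
Carrying the CSB+-tree idea over to T-trees suggests a node layout like the one below, in which the children of a node are allocated contiguously so one group pointer replaces per-child pointers; the fan-out and field widths are illustrative, not the paper's exact layout.

```c
#include <stdint.h>

#define NODE_KEYS 12   /* sized so a node fits one 64-byte line (assumed) */

typedef struct CSTNode {
    int32_t  keys[NODE_KEYS];     /* sorted keys of this T-node        */
    uint16_t nkeys;
    uint16_t nchildren;
    struct CSTNode *child_group;  /* children allocated contiguously   */
} CSTNode;

/* The i-th child is reached by arithmetic on the group base, so no
 * per-child pointer competes with keys for cache-line space. */
static CSTNode *child(const CSTNode *n, int i)
{
    return &n->child_group[i];
}
```

Packing keys densely per line is exactly what makes an index "cache conscious": each cache miss brings in a full line of useful keys instead of mostly pointers.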

A new warp scheduling technique for improving the performance of GPUs by utilizing MSHR information (GPU 성능 향상을 위한 MSHR 정보 기반 워프 스케줄링 기법)

  • Kim, Gwang Bok;Kim, Jong Myon;Kim, Cheol Hong
    • The Journal of Korean Institute of Next Generation Computing
    • /
    • v.13 no.3
    • /
    • pp.72-83
    • /
    • 2017
  • GPUs can provide high throughput and hide latency by executing many warps in parallel. The MSHRs (Miss Status Holding Registers) of the L1 data cache track cache miss requests until the required data is serviced from lower-level memory. In recent GPUs, excessive requests for cache resources cause underutilization of GPU resources due to cache resource reservation failures. In this paper, we propose a new warp scheduling technique to reduce stall cycles under MSHR resource shortage. The cache miss rate of each warp is predicted based on the observation that each warp shows a similar cache miss rate over a long period. Warps showing low miss rates, and computation-intensive warps, are given high issue priority when the MSHR is full. Our proposal improves GPU performance by utilizing cache resources more efficiently, based on cache miss rate prediction and monitoring of the MSHR entries. According to our experimental results, reservation fail cycles are reduced by 25.7% and IPC is increased by 6.2% with the proposed scheduling technique, compared to a loose round-robin scheduler.
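
A sketch of the scheduling rule under stated assumptions: per-warp miss counters approximate the predicted miss rate, and low-miss-rate (or load-free, i.e., computation-intensive) warps win issue priority only while the MSHR is saturated; otherwise the baseline loose round-robin order applies.

```c
#include <stdbool.h>
#include <stdint.h>

#define WARPS 48

typedef struct { uint32_t accesses, misses; } WarpStat;

static WarpStat stat[WARPS];   /* sampled over the current window */

static bool mshr_full(void) { return false; }  /* stub MSHR model */

/* Pick the next warp among the ready ones. */
int select_warp(const bool ready[WARPS], int rr_ptr)
{
    if (!mshr_full()) {
        for (int i = 0; i < WARPS; i++) {
            int w = (rr_ptr + i) % WARPS;
            if (ready[w]) return w;        /* loose round-robin */
        }
        return -1;
    }
    /* MSHR saturated: lowest predicted miss rate wins; warps with
     * no tracked loads count as computation-intensive (rate 0).  */
    int best = -1;
    double best_rate = 2.0;                /* above any real rate */
    for (int w = 0; w < WARPS; w++) {
        if (!ready[w]) continue;
        double rate = stat[w].accesses
                    ? (double)stat[w].misses / stat[w].accesses
                    : 0.0;
        if (rate < best_rate) { best_rate = rate; best = w; }
    }
    return best;
}
```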

Compact Field Remapping for Dynamically Allocated Structures (동적으로 할당된 구조체를 위한 압축된 필드 재배치)

  • Kim, Jeong-Eun;Han, Hwan-Soo
    • Journal of KIISE:Software and Applications
    • /
    • v.32 no.10
    • /
    • pp.1003-1012
    • /
    • 2005
  • The most significant difference between embedded systems and general-purpose systems is that embedded systems may use only limited resources, including battery and memory. In particular, the number of applications dealing with multimedia data is increasing, and in such computation-heavy systems the delay of memory access is one of the major bottlenecks hurting system performance. As a result, many researchers have investigated techniques to reduce the memory access cost. Most programs exhibit locality in their memory references: temporal locality means that a resource accessed at one point will be used again in the near future, while spatial locality means that the likelihood of using a resource is higher if resources near it have just been accessed. The latest embedded processors usually adopt cache memory to exploit these two types of locality, since processors access cache memory faster than off-chip memory, reducing latency. In this paper, we propose an enhanced dynamic allocation technique for structure-type data that eliminates unused memory space and reduces both the cache miss rate and the application execution time. The proposed approach aggregates fields from multiple dynamically allocated records and remaps them consecutively onto the memory space. Experiments on the Olden benchmarks show a 13.9% L1 cache miss rate drop and a 15.9% L2 cache miss rate drop on average, compared to previously proposed techniques. We also find execution time reduced by 10.9% on average, compared to the previous work.
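
The remapping can be sketched as pooling the hot fields of many records into dense parallel arrays; the record layout, pool size, and the choice of "hot" fields are assumptions for illustration, not the paper's analysis-driven selection.

```c
#include <stdlib.h>

#define POOL 1024

/* original layout (for contrast): hot link fields mixed with payload */
struct Node { int key; int next; char payload[48]; };

/* remapped layout: hot fields of many records packed consecutively */
static int   key_pool[POOL];      /* key_pool[i]  = record i's key  */
static int   next_pool[POOL];     /* next_pool[i] = record i's link */
static char (*payload_pool)[48];  /* cold fields kept apart         */
static int   nalloc;

/* Allocate a record; returns its index instead of a pointer. */
int alloc_node(int key)
{
    if (!payload_pool)            /* cold fields allocated lazily   */
        payload_pool = malloc(sizeof *payload_pool * POOL);
    int i = nalloc++;
    key_pool[i]  = key;
    next_pool[i] = -1;            /* -1 terminates a list           */
    return i;
}

/* A list search now walks two dense arrays: each 64-byte line holds
 * 16 keys or links, instead of at most one hot field per record.   */
int find(int head, int key)
{
    for (int i = head; i != -1; i = next_pool[i])
        if (key_pool[i] == key)
            return i;
    return -1;
}
```

The design point is that traversal code touching only the hot fields no longer drags the 48-byte payload through the cache, which is where the reported miss-rate reductions come from.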