• Title/Summary/Keyword: L2-Cache

Performance of the Finite Difference Method Using Cache and Shared Memory for Massively Parallel Systems (대규모 병렬 시스템에서 캐시와 공유메모리를 이용한 유한 차분법 성능)

  • Kim, Hyun Kyu;Lee, Hyo Jong
    • Journal of the Institute of Electronics and Information Engineers
    • /
    • v.50 no.4
    • /
    • pp.108-116
    • /
    • 2013
  • Many algorithms have been introduced to improve performance on massively parallel systems, which consist of several hundred processors. A typical example is a GPU system of many processors that uses shared memory. For image filtering algorithms, which reference neighboring points, shared memory helps improve performance because adjacent pixels are accessed frequently. However, using shared memory requires rewriting existing code and consequently makes the code more complex. Recent GPU systems support both L1 and L2 caches along with shared memory. Since the L1 cache is located in the same area as the shared memory, a performance improvement from using the cache is predictable. In this paper, the performance of cache and shared memory is compared. In conclusion, the performance of the cache-based algorithm is very similar to that of the shared-memory algorithm, while the code complexity of the shared-memory version is avoided.
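
The complexity trade-off the abstract describes can be made concrete with a small sketch. The C++ below is illustrative only, not the authors' code: the first filter indexes the source image directly and leaves reuse of neighboring pixels to the hardware cache, while the second stages each tile in an explicit buffer, standing in for the halo loads and index remapping that a shared-memory GPU kernel requires.

```cpp
#include <algorithm>
#include <vector>

// Cache-based 3x3 box filter: straightforward indexing; reuse of
// neighboring pixels is left to the hardware (L1/L2) cache.
void boxFilterCache(const std::vector<float>& src, std::vector<float>& dst,
                    int w, int h) {
    for (int y = 1; y < h - 1; ++y)
        for (int x = 1; x < w - 1; ++x) {
            float s = 0.0f;
            for (int dy = -1; dy <= 1; ++dy)
                for (int dx = -1; dx <= 1; ++dx)
                    s += src[(y + dy) * w + (x + dx)];
            dst[y * w + x] = s / 9.0f;
        }
}

// Tiled variant: neighbors are staged in an explicit buffer first,
// mimicking the halo handling a shared-memory implementation needs.
void boxFilterTiled(const std::vector<float>& src, std::vector<float>& dst,
                    int w, int h, int tile) {
    std::vector<float> buf((tile + 2) * (tile + 2)); // tile plus 1-pixel halo
    for (int ty = 1; ty < h - 1; ty += tile)
        for (int tx = 1; tx < w - 1; tx += tile) {
            int bh = std::min(tile, h - 1 - ty);
            int bw = std::min(tile, w - 1 - tx);
            for (int y = -1; y <= bh; ++y)           // stage tile + halo
                for (int x = -1; x <= bw; ++x)
                    buf[(y + 1) * (tile + 2) + (x + 1)] =
                        src[(ty + y) * w + (tx + x)];
            for (int y = 0; y < bh; ++y)             // filter from the buffer
                for (int x = 0; x < bw; ++x) {
                    float s = 0.0f;
                    for (int dy = -1; dy <= 1; ++dy)
                        for (int dx = -1; dx <= 1; ++dx)
                            s += buf[(y + dy + 1) * (tile + 2) + (x + dx + 1)];
                    dst[(ty + y) * w + (tx + x)] = s / 9.0f;
                }
        }
}
```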

A Level One Cache Organization for Chip-Size Limited Single Processor (칩의 크기가 제한된 단일칩 프로세서를 위한 레벨 1 캐시구조)

  • Ju YoungKwan;Kim Sukil
    • The KIPS Transactions:PartA
    • /
    • v.12A no.2 s.92
    • /
    • pp.127-136
    • /
    • 2005
  • This paper measures, by simulation, the proper ratio between the size of the demand-fetch cache $L_1$ and that of the prefetch cache $L_P$ when their combined size is fixed, for organizing the space-limited level 1 cache of a single-chip microprocessor. The analysis showed that when the combined size of $L_1$ and $L_P$ is 16 KB, the best-performing level 1 cache organization allocates 4 KB to $L_P$ and employs OBL as the prefetch technique and FIFO as the cache replacement policy. It also showed that when the combined size exceeds 32 KB, dynamic filtering is the more advantageous prefetch technique for $L_P$, and that the best performance is obtained by splitting the level 1 cache into a 28 KB $L_1$ and a 4 KB $L_P$ when 32 KB of space is available, and into a 48 KB $L_1$ and a 16 KB $L_P$ when 64 KB is available.
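
A rough model helps make the terms concrete. The sketch below is a guess at the mechanics, not the paper's simulator: the demand-fetch cache $L_1$ is probed first, the small prefetch cache $L_P$ second, and each access triggers one-block-lookahead (OBL) prefetching into $L_P$ with FIFO replacement; both structures are modeled as fully associative for brevity.

```cpp
#include <cstddef>
#include <cstdint>
#include <deque>
#include <unordered_set>

class SplitL1 {
public:
    SplitL1(std::size_t l1Blocks, std::size_t lpBlocks)
        : l1Cap_(l1Blocks), lpCap_(lpBlocks) {}

    bool access(std::uint64_t block) {
        if (l1_.count(block)) return true;   // demand-cache hit
        bool hit = false;
        if (lpSet_.count(block)) {           // prefetch-cache hit:
            hit = true;                      // promote block to L1
            eraseLp(block);
        }
        insertL1(block);
        prefetch(block + 1);                 // OBL: next block into LP
        return hit;
    }

private:
    void insertL1(std::uint64_t b) {
        if (l1_.size() >= l1Cap_) {          // FIFO victim for simplicity
            l1_.erase(l1Fifo_.front());
            l1Fifo_.pop_front();
        }
        l1_.insert(b); l1Fifo_.push_back(b);
    }
    void prefetch(std::uint64_t b) {
        if (l1_.count(b) || lpSet_.count(b)) return;
        if (lpSet_.size() >= lpCap_)         // FIFO replacement in LP
            eraseLp(lpFifo_.front());
        lpSet_.insert(b); lpFifo_.push_back(b);
    }
    void eraseLp(std::uint64_t b) {
        lpSet_.erase(b);
        for (auto it = lpFifo_.begin(); it != lpFifo_.end(); ++it)
            if (*it == b) { lpFifo_.erase(it); break; }
    }
    std::size_t l1Cap_, lpCap_;
    std::unordered_set<std::uint64_t> l1_, lpSet_;
    std::deque<std::uint64_t> l1Fifo_, lpFifo_;
};
```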

Performance Analysis of Cache and Internal Memory of a High Performance DSP for an Optimal Implementation of Motion Picture Encoder (고성능 DSP에서 동영상 인코더의 최적화 구현을 위한 캐쉬 및 내부 메모리 성능 분석)

  • Lim, Se-Hun;Chung, Sun-Tae
    • The Journal of the Korea Contents Association
    • /
    • v.8 no.5
    • /
    • pp.72-81
    • /
    • 2008
  • High performance DSPs usually support both cache and internal memory. For an optimal implementation of a multimedia stream application on such a DSP, the cache and internal memory must be used efficiently. In this paper, we analyze the cache performance and the internal memory configuration and placement needed for an optimal implementation of multimedia stream applications, such as a motion picture encoder, on a high performance DSP of the TMS320C6000 series, and we propose strategies for improving cache and internal memory placement. The results of the analysis and experiments verify that a 2-way L2 cache configuration, with the remaining memory configured as internal memory, shows relatively good performance. They also show that the L1P cache hit rate improves when frequently called routines, and the routines having caller-callee relationships with them, are placed contiguously in internal memory, and that the L1D cache hit rate improves with a simple change of the data size. These results are expected to contribute to the optimal implementation of multimedia stream applications on high performance DSPs.
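
The finding that the L1D hit rate improves with "a simple change of the data size" usually refers to padding buffers so that hot arrays stop mapping to the same cache sets. The C++ below illustrates that reading under assumed line and buffer sizes; it is not taken from the paper.

```cpp
#include <cstddef>

// If two buffers sit exactly a multiple of the L1D size apart, their
// corresponding elements map to the same cache sets and evict each other.
// Padding one buffer by a cache line shifts its mapping.
constexpr std::size_t kLine  = 64;    // assumed L1D line size (bytes)
constexpr std::size_t kFrame = 4096;  // assumed frame tile size (bytes)

struct EncoderBuffers {
    unsigned char cur[kFrame];        // current frame tile
    unsigned char pad[kLine];         // padding: breaks set conflicts
    unsigned char ref[kFrame];        // reference frame tile
};

// With the padding, cur[i] and ref[i] no longer contend for the same set,
// so a loop touching both streams suffers fewer conflict misses.
// Caller must ensure n <= kFrame.
long sad(const EncoderBuffers& b, std::size_t n) {
    long acc = 0;
    for (std::size_t i = 0; i < n; ++i)
        acc += b.cur[i] > b.ref[i] ? b.cur[i] - b.ref[i]
                                   : b.ref[i] - b.cur[i];
    return acc;
}
```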

An Interference Matrix Based Approach to Bounding Worst-Case Inter-Thread Cache Interferences and WCET for Multi-Core Processors

  • Yan, Jun;Zhang, Wei
    • Journal of Computing Science and Engineering
    • /
    • v.5 no.2
    • /
    • pp.131-140
    • /
    • 2011
  • Different cores typically share the last-level cache in a multi-core processor, so threads running on different cores may interfere with each other. Therefore, a multi-core worst-case execution time (WCET) analyzer must be able to safely and accurately estimate the worst-case inter-thread cache interference. This is not supported by current WCET analysis techniques, which mainly focus on single-thread analysis. This paper presents a novel approach to analyzing the worst-case cache interference and bounding the WCET for threads running on multi-core processors with shared L2 instruction caches. We propose an interference matrix to model inter-thread interference, from which the worst-case inter-thread cache interference can be calculated. Our experiments indicate that the proposed approach can give a worst-case bound with less than 1% overestimation, as in the benchmark fib-call, and an average overestimation of 16.4%, for threads running on a dual-core processor with a shared L2 cache. Our approach dramatically improves the accuracy of the WCET estimate, by 20.0% on average, compared to prior work.
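
The role of the interference matrix can be sketched as follows. This is a simplified reconstruction, not the authors' analysis: entry M[i][j] is assumed to bound how many conflicting shared-L2 accesses thread j can inject into the sets used by thread i, and each induced extra miss adds one miss penalty to thread i's WCET bound.

```cpp
#include <cstddef>
#include <vector>

// Bound thread i's WCET on a multi-core with a shared L2: start from the
// WCET computed as if the L2 were private, then add the worst-case extra
// misses implied by the interference matrix, each costing one L2 miss
// penalty. All names and the matrix semantics are illustrative.
long boundWcet(long wcetIsolated,                       // WCET with private L2
               int i,                                   // thread under analysis
               const std::vector<std::vector<long>>& M, // interference matrix
               long missPenalty) {                      // cycles per extra miss
    long extraMisses = 0;
    for (std::size_t j = 0; j < M[i].size(); ++j)
        if (static_cast<int>(j) != i)
            extraMisses += M[i][j];   // worst-case conflicts injected by j
    return wcetIsolated + extraMisses * missPenalty;
}
```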

Study of Cache Performance on GPGPU

  • Choi, Kyu Hyun;Kim, Seon Wook
    • IEIE Transactions on Smart Processing and Computing
    • /
    • v.4 no.2
    • /
    • pp.78-82
    • /
    • 2015
  • General-purpose graphics processing units (GPGPUs) provide tremendous computational and processing power. Despite its latency-hiding mechanism, a GPU architecture requires high memory bandwidth and low latency between the computational units and the memory system. For this reason, the current GPU architecture has a private L1 cache in each core and a shared L2 cache to increase performance by reducing memory latency. In some cases, however, this CPU-like cache design is not suitable for GPGPUs. In this paper, we analyze cache performance in detail in relation to GPGPU application characteristics, and we suggest technical alternatives for the GPGPU architecture as future work.

Cache memory system for high performance CPU with 4GHz (4Ghz 고성능 CPU 위한 캐시 메모리 시스템)

  • Jung, Bo-Sung;Lee, Jung-Hoon
    • Journal of the Korea Society of Computer and Information
    • /
    • v.18 no.2
    • /
    • pp.1-8
    • /
    • 2013
  • In this paper, we propose a high performance L1 cache structure for a high-clock CPU running at 4GHz. The proposed cache memory consists of three parts: a direct-mapped cache to support fast access time, a two-way set associative buffer to exploit temporal locality, and a buffer-select table (BST). The most recently accessed data is stored in the direct-mapped cache. When data is replaced from the direct-mapped cache, it is selectively stored into the two-way set associative buffer if it has a high probability of being referenced again. For high performance and low power consumption, we propose that only one of the two ways of the set associative buffer be accessed, selected by the buffer-select table. According to simulation results, the energy*delay product improves by about 45%, 70%, and 75% compared with a direct-mapped cache, a four-way set associative cache, and a victim cache with twice the space, respectively.
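
The three-part structure can be sketched as below; the reuse heuristic, field widths, and way-rotation policy are illustrative guesses rather than the paper's exact design. The key point is that the BST names the single buffer way worth probing, which is how the design avoids reading both ways of the two-way buffer on every access.

```cpp
#include <cstddef>
#include <cstdint>
#include <utility>
#include <vector>

struct Line { std::uint64_t tag = 0; bool valid = false; bool reused = false; };

class SelectiveL1 {
public:
    explicit SelectiveL1(std::size_t sets)
        : dm_(sets), buf0_(sets), buf1_(sets), bst_(sets, 0) {}

    bool access(std::uint64_t block) {
        std::size_t s = block % dm_.size();
        std::uint64_t tag = block / dm_.size();
        if (dm_[s].valid && dm_[s].tag == tag) {      // direct-mapped hit
            dm_[s].reused = true;
            return true;
        }
        Line& way = bst_[s] ? buf1_[s] : buf0_[s];    // BST picks one way
        if (way.valid && way.tag == tag) {            // buffer hit: swap back
            std::swap(way, dm_[s]);
            return true;
        }
        if (dm_[s].valid && dm_[s].reused) {          // selective buffering:
            bst_[s] ^= 1;                             // rotate the chosen way
            (bst_[s] ? buf1_[s] : buf0_[s]) = dm_[s]; // keep reused victims
        }
        dm_[s] = Line{tag, true, false};              // miss: fill from L2
        return false;
    }

private:
    std::vector<Line> dm_, buf0_, buf1_;
    std::vector<std::uint8_t> bst_;                   // per-set way selector
};
```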

Analysis of GPGPU Performance by dedicating L2 Cache for Texture Data (텍스쳐 데이터를 위한 2차 캐쉬 구조를 가지는 그래픽 처리 장치의 성능 분석)

  • Kim, Gwang Bok;Kim, Cheol Hong
    • Proceedings of the Korean Society of Computer Information Conference
    • /
    • 2017.01a
    • /
    • pp.143-144
    • /
    • 2017
  • Recent graphics processing units employ several levels of memory hierarchy to reduce accesses to DRAM. Unlike the L1 memories, which are accessed separately according to the type of the requested data, the L2 cache of a GPGPU is accessible to all data types before the long-latency DRAM is reached. In this paper, we analyze how performance changes when the L2 cache, which normally admits every data type an application specifies, is restricted to texture data only. For this experiment, the architecture is modified so that all data types other than texture data bypass the L2 cache and access DRAM directly. The analysis shows that admitting only texture data degrades performance on most benchmarks, with an average 5.58% decrease compared to the baseline architecture. Conversely, needle, an application with a low L2 hit rate in our experimental environment, gains overall performance by bypassing unnecessary L2 accesses.
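
The modified routing of the experiment can be sketched in a few lines. The following is a hypothetical model, not the simulator configuration used in the paper: only texture requests probe the L2, all other types go straight to DRAM, and the L2 is reduced to an unbounded resident set purely to keep the sketch self-contained.

```cpp
#include <cstdint>
#include <unordered_set>

enum class DataType { Texture, Global, Constant, Instruction };

struct TextureOnlyL2 {
    std::unordered_set<std::uint64_t> resident; // blocks currently in L2
    long l2Hits = 0, dramAccesses = 0;

    void access(std::uint64_t block, DataType type) {
        if (type != DataType::Texture) {        // non-texture: bypass L2
            ++dramAccesses;
            return;
        }
        if (resident.count(block)) {            // texture L2 hit
            ++l2Hits;
            return;
        }
        ++dramAccesses;                         // texture L2 miss
        resident.insert(block);                 // fill L2 on the way back
    }
};
```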

Reuse Information based Thrashing Resistant Cache Management Scheme

  • Sim, Gyu Yeon;Kim, Cheol Hong
    • Journal of the Korea Society of Computer and Information
    • /
    • v.22 no.3
    • /
    • pp.9-16
    • /
    • 2017
  • In recent computing systems, the LRU replacement policy has been widely used because it is simple to implement and applicable to most programs. However, if the working set of a program is bigger than the actual cache size, the LRU replacement policy may cause the thrashing problem, in which cache blocks are constantly replaced before they can be re-referenced. This paper proposes a new cache management scheme to solve the thrashing problem in the second-level cache. The proposed scheme measures the per-set reuse frequency using an EAF structure to find thrashing sets. When a cache miss occurs, the scheme tests whether the address of the missed block is stored in the EAF. If it is, a recently evicted block is being re-requested, so its reuse frequency is predicted to be high; in this case, the counter of the corresponding set is increased. When the counter value exceeds a threshold, the corresponding set is assumed to show high reuse frequency. The proposed scheme assigns sets with high reuse frequency to an additional small cache so that their blocks stay cached longer. Our experimental results show that the proposed scheme improves IPC by 3.81% on average.
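
A minimal sketch of the reuse-detection step follows, treating the EAF as an evicted-address filter with FIFO turnover; the capacity and threshold are assumed values, since the abstract does not give the paper's parameters.

```cpp
#include <cstddef>
#include <cstdint>
#include <deque>
#include <unordered_set>
#include <vector>

class ReuseDetector {
public:
    ReuseDetector(std::size_t sets, std::size_t eafCap, int threshold)
        : counters_(sets, 0), eafCap_(eafCap), threshold_(threshold) {}

    // Called on an L2 miss in the given set: a hit in the EAF means a
    // recently evicted block is being re-requested, i.e. high reuse.
    void onMiss(std::size_t set, std::uint64_t block) {
        if (eafSet_.count(block)) {
            eafSet_.erase(block);
            ++counters_[set];
        }
    }
    // Called whenever a block is evicted from the L2.
    void onEvict(std::uint64_t block) {
        if (eafFifo_.size() >= eafCap_) {   // bounded filter, FIFO turnover
            eafSet_.erase(eafFifo_.front());
            eafFifo_.pop_front();
        }
        eafSet_.insert(block);
        eafFifo_.push_back(block);
    }
    // Sets above the threshold would be backed by the additional small cache.
    bool isThrashing(std::size_t set) const {
        return counters_[set] > threshold_;
    }

private:
    std::vector<int> counters_;
    std::unordered_set<std::uint64_t> eafSet_;
    std::deque<std::uint64_t> eafFifo_;
    std::size_t eafCap_;
    int threshold_;
};
```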

Cache Management using an Adaptive Parity Group Configuration in RAID 5 Controller (적응형 패리티 그룹 구성을 이용한 RAID 5 제어기에서의 캐시 운영)

  • Huh, Jung-Ho;Song, Ja-Young;Chang, Tae-Mu
    • The KIPS Transactions:PartA
    • /
    • v.10A no.2
    • /
    • pp.83-92
    • /
    • 2003
  • RAID 5 is a widely used technique for constructing disk systems of high reliability and performance. This paper proposes APGOC (Adaptive Parity Group On Cache), a cache organization that addresses the "small write" problem of RAID 5, especially in OLTP (On-Line Transaction Processing) environments. In our approach, when a user process requests a file from the kernel, information on the file's read/write characteristics is added to the file data structure of the file system. With this information, the data and parity caches can be managed interchangeably through parity fetching. This enhances cache utilization and improves the disk request response time. Our method is analyzed and evaluated through simulation. Compared with previous work, we observed a performance enhancement of about 6~13%.
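
The abstract leaves the mechanism at a high level, so the following C++ is a loose, hypothetical rendering of the idea rather than the paper's design: the kernel-supplied read/write hint decides whether the controller cache stages the parity block alongside the data block ("parity fetching"), so a later small write can update parity without extra disk reads.

```cpp
#include <cstdint>

// Hypothetical hint encoding; the paper attaches read/write characteristics
// to the file system's file data structure.
enum class AccessHint { ReadMostly, WriteMostly };

struct StripeBlock { std::uint64_t stripe; bool isParity; };

// Decide what to stage in the controller cache for a request on `stripe`.
// Write-heavy files stage parity too, trading cache space for a faster
// read-modify-write; read-heavy files keep the cache full of data blocks.
inline int blocksToStage(AccessHint hint, std::uint64_t stripe,
                         StripeBlock out[2]) {
    out[0] = {stripe, false};            // the data block itself
    if (hint == AccessHint::WriteMostly) {
        out[1] = {stripe, true};         // parity fetched for small writes
        return 2;
    }
    return 1;
}
```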

Multicore Real-Time Scheduling to Reduce Inter-Thread Cache Interferences

  • Ding, Yiqiang;Zhang, Wei
    • Journal of Computing Science and Engineering
    • /
    • v.7 no.1
    • /
    • pp.67-80
    • /
    • 2013
  • The worst-case execution time (WCET) of each real-time task in multicore processors with shared caches can be significantly affected by inter-thread cache interferences. The worst-case inter-thread cache interferences are dependent on how tasks are scheduled to run on different cores. Therefore, there is a circular dependence between real-time task scheduling, the worst-case inter-thread cache interferences, and WCET in multicore processors, which is not the case for single-core processors. To address this challenging problem, we present an offline real-time scheduling approach for multicore processors by considering the worst-case inter-thread interferences on shared L2 caches. Our scheduling approach uses a greedy heuristic to generate safe schedules while minimizing the worst-case inter-thread shared L2 cache interferences and WCET. The experimental results demonstrate that the proposed approach can reduce the utilization of the resulting schedule by about 12% on average compared to the cyclic multicore scheduling approaches in our theoretical model. Our evaluation indicates that the enhanced scheduling approach is more likely to generate feasible and safe schedules with stricter timing constraints in multicore real-time systems.
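
A greedy placement in the spirit of the paper's heuristic can be sketched as below; the details are assumptions, not the published algorithm. M[a][b] is taken as a pairwise bound on shared-L2 interference between tasks a and b when they run on different cores, and each task is placed on the core that minimizes the worst-case interference it adds against tasks already placed elsewhere (ignoring, for brevity, which tasks actually overlap in time).

```cpp
#include <cstddef>
#include <vector>

std::vector<int> greedySchedule(std::size_t tasks, std::size_t cores,
                                const std::vector<std::vector<long>>& M) {
    std::vector<int> coreOf(tasks, -1);
    for (std::size_t t = 0; t < tasks; ++t) {
        long bestCost = -1;
        int bestCore = 0;
        for (std::size_t c = 0; c < cores; ++c) {
            long cost = 0;            // interference with tasks on other cores
            for (std::size_t u = 0; u < t; ++u)
                if (coreOf[u] != static_cast<int>(c))
                    cost += M[t][u] + M[u][t];
            if (bestCost < 0 || cost < bestCost) {
                bestCost = cost;
                bestCore = static_cast<int>(c);
            }
        }
        coreOf[t] = bestCore;         // place t where it conflicts least
    }
    return coreOf;
}
```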