Search | Korea Science

Study of Cache Performance on GPGPU

Choi, Kyu Hyun;Kim, Seon Wook
- IEIE Transactions on Smart Processing and Computing
- /
- v.4 no.2
- /
- pp.78-82
- /
- 2015
General-purpose graphics processing units (GPGPUs) provide tremendous computational and processing power. Despite the latency hiding mechanism, a GPU architecture requires high memory bandwidth and lower latency between computational units and the memory system. For this reason, the current GPU architecture has private L1 caches in each core and a shared L2 cache to increase performance by reducing memory latency. But in some cases, this CPU-like cache design is not suitable for GPGPUs. In this paper, we analyze detailed cache performance related to GPGPU application characteristics, and suggest technical alternatives for the GPGPU architecture as future work.
https://doi.org/10.5573/IEIESPC.2015.4.2.078 인용 PDF KSCI

Cache Sensitive T-tree Main Memory Index for Range Query Search (범위질의 검색을 위한 캐시적응 T-트리 주기억장치 색인구조)

Choi, Sang-Jun;Lee, Jong-Hak
- Journal of Korea Multimedia Society
- /
- v.12 no.10
- /
- pp.1374-1385
- /
- 2009
Recently, advances in speed of the CPU have for out-paced advances in memory speed. Main-memory access is increasingly a performance bottleneck for main-memory database systems. To reduce memory access speed, cache memory have incorporated in the memory subsystem. However cache memories can reduce the memory speed only when the requested data is found in the cache. We propose a new cache sensitive T-tree index structure called as $CST^*$-tree for range query search. The $CST^*$-tree reduces the number of cache miss occurrences by loading the reduced internal nodes that do not have index entries. And it supports the sequential access of index entries for range query by connecting adjacent terminal nodes and internal index nodes. For performance evaluation, we have developed a cost model, and compared our $CST^*$-tree with existing CST-tree, that is the conventional cache sensitive T-tree, and $T^*$-tree, that is conventional the range query search T -tree, by using the cost model. The results indicate that cache miss occurrence of $CST^*$-tree is decreased by 20~30% over that of CST-tree in a single value search, and it is decreased by 10~20% over that of $T^*$-tree in a range query search.
PDF

Performance Analysis of n-way Associative Cache and Fully Associative Cache (n-way Set Associative Cache와 Fully Associative Cache성능 분석)

Jo, Yong-Hun;Kim, Jeong-Seon
- The Transactions of the Korea Information Processing Society
- /
- v.4 no.3
- /
- pp.802-810
- /
- 1997
In this paper, the performance of direce mapping caches, 2_, 4_, 8_, .., 4096_way way set associative caches, and fully assiciative caches are analyized by trace simulation for verivying their effectiveness.In general, it is well known that as n, the number of main memory lines to be stored into one cache line number in direct mapping cache, increases, the performance of the cache memory should get higher linearly.According to our analysis, however, it is not true on all the cache organizations.It is shown that as n increases, miss ratios get lower only when the small cache(less than 256K) using large line size is used.It is also shown that fully associative mapping achieves high performance only when small size cache using large line size ia used.
PDF

A Cache-Conscious Compression Index Based on the Level of Compression Locality (압축 지역성 수준에 기반한 캐쉬 인식 압축 색인)

Kim, Won-Sik;Yoo, Jae-Jun;Lee, Jin-Soo;Han, Wook-Shin
- Journal of Korea Multimedia Society
- /
- v.13 no.7
- /
- pp.1023-1043
- /
- 2010
As main memory get cheaper, it becomes increasingly affordable to load entire index of DBMS and to access the index. Since speed gap between CPU and main memory is growing bigger, many researches to reduce a cost of main memory access are under the progress. As one of those, cache conscious trees can reduce the cost of main memory access. Since cache conscious trees reduce the number of cache miss by compressing data in node, cache conscious trees can reduce the cost of main memory. Existing cache conscious trees use only fixed one compression technique without consideration of properties of data in node. First, this paper proposes the DC-tree that uses various compression techniques and change data layout in a node according to properties of data in order to reduce cache miss. Second, this paper proposes the level of compression locality that describes properties of data in node by formula. Third, this paper proposes Forced Partial Decomposition (FPD) that reduces the nutter of cache miss. DC-trees outperform 1.7X than B+-tree, 1.5X than simple prefix B+-tree, and 1.3X than pkB-tree, in terms of the number of cache misses. Since proposed DC-trees can be adopted in commercial main memory database system, we believe that DC-trees are practical result.
PDF KSCI

Design and Implementation of an In-Memory File System Cache with Selective Compression (대용량 파일시스템을 위한 선택적 압축을 지원하는 인-메모리 캐시의 설계와 구현)

Choe, Hyeongwon;Seo, Euiseong
- Journal of KIISE
- /
- v.44 no.7
- /
- pp.658-667
- /
- 2017
The demand for large-scale storage systems has continued to grow due to the emergence of multimedia, social-network, and big-data services. In order to improve the response time and reduce the load of such large-scale storage systems, DRAM-based in-memory cache systems are becoming popular. However, the high cost of DRAM severely restricts their capacity. While the method of compressing cache entries has been proposed to deal with the capacity limitation issue, compression and decompression, which are technically difficult to parallelize, induce significant processing overhead and in turn retard the response time. A selective compression scheme is proposed in this paper for in-memory file system caches that rapidly estimates the compression ratio of incoming cache entries with their Shannon entropies and compresses cache entries with low compression ratio. In addition, a description is provided of the design and implementation of an in-kernel in-memory file system cache with the proposed selective compression scheme. The evaluation showed that the proposed scheme reduced the execution time of benchmarks by approximately 18% in comparison to the conventional non-compressing in-memory cache scheme. It also provided a cache hit ratio similar to the all-compressing counterpart and reduced 7.5% of the execution time by reducing the compression overhead. In addition, it was shown that the selective compression scheme can reduce the CPU time used for compression by 28% compared to the case of the all-compressing scheme.
https://doi.org/10.5626/JOK.2017.44.7.658 인용 KSCI

An On-chip Cache and Main Memory Compression System Optimized by Considering the Compression rate Distribution of Compressed Blocks (압축블록의 압축률 분포를 고려해 설계한 내장캐시 및 주 메모리 압축시스템)

Yim, Keun-Soo;Lee, Jang-Soo;Hong, In-Pyo;Kim, Ji-Hong;Kim, Shin-Dug;Lee, Yong-Surk;Koh, Kern
- Journal of KIISE:Computer Systems and Theory
- /
- v.31 no.1_2
- /
- pp.125-134
- /
- 2004
Recently, an on-chip compressed cache system was presented to alleviate the processor-memory Performance gap by reducing on-chip cache miss rate and expanding memory bandwidth. This research Presents an extended on-chip compressed cache system which also significantly expands main memory capacity. Several techniques are attempted to expand main memory capacity, on-chip cache capacity, and memory bandwidth as well as reduce decompression time and metadata size. To evaluate the performance of our proposed system over existing systems, we use execution-driven simulation method by modifying a superscalar microprocessor simulator. Our experimental methodology has higher accuracy than previous trace-driven simulation method. The simulation results show that our proposed system reduces execution time by 4-23% compared with conventional memory system without considering the benefits obtained from main memory expansion. The expansion rates of data and code areas of main memory are 57-120% and 27-36%, respectively.
PDF KSCI

Analysis on the GPU Performance according to Hierarchical Memory Organization (계층적 메모리 구성에 따른 GPU 성능 분석)

Choi, Hongjun;Kim, Jongmyon;Kim, Cheolhong
- The Journal of the Korea Contents Association
- /
- v.14 no.3
- /
- pp.22-32
- /
- 2014
Recently, GPGPU has been widely used for general-purpose processing as well as graphics processing by providing optimized hardware for parallel processing. Memory system has big effects on the performance of parallel processing units such as GPU. In the GPU, hierarchical memory architecture is implemented for high memory bandwidth. Moreover, both memory address coalescing and memory request merging techniques are widely used. This paper analyzes the GPU performance according to various memory organizations. According to our simulation results, GPU performance improves by 15.5%, 21.5%, 25.5%, 30.9% as adding 8KB L1, 16KB L1, 32KB L1, 64KB L1 cache, respectively, compared to case without L1 cache. However, experimental results show that some benchmarks decrease performance since memory transaction increases due to data dependency. Moreover, average memory access latency is increased as the depth of hierarchical cache level increases when cache miss occurs significantly.
https://doi.org/10.5392/JKCA.2014.14.03.022 인용 PDF KSCI

Implementation of parallel blocked LU decomposition program for utilizing cache memory on GP-GPUs (GP-GPU의 캐시메모리를 활용하기 위한 병렬 블록 LU 분해 프로그램의 구현)

Kim, Youngtae;Kim, Doo-Han;Yu, Myoung-Han
- Journal of Internet Computing and Services
- /
- v.14 no.6
- /
- pp.41-47
- /
- 2013
GP-GPUs are general purposed GPUs for numerical computation based on multiple threads which are originally for graphic processing. GP-GPUs provide cache memory in a form of shared memory which user programs can access directly, unlikely typical cache memory. In this research, we implemented the parallel block LU decomposition program to utilize cache memory in GP-GPUs. The parallel blocked LU decomposition program designed with Nvidia CUDA C run 7~8 times faster than nun-blocked LU decomposition program in the same GP-GPU computation environment.
https://doi.org/10.7472/jksii.2013.14.6.41 인용 PDF KSCI

Design of an Asynchronous Instruction Cache based on a Mixed Delay Model (혼합 지연 모델에 기반한 비동기 명령어 캐시 설계)

Jeon, Kwang-Bae;Kim, Seok-Man;Lee, Je-Hoon;Oh, Myeong-Hoon;Cho, Kyoung-Rok
- The Journal of the Korea Contents Association
- /
- v.10 no.3
- /
- pp.64-71
- /
- 2010
Recently, to achieve high performance of the processor, the cache is splits physically into two parts, one for instruction and one for data. This paper proposes an architecture of asynchronous instruction cache based on mixed-delay model that are DI(delay-insensitive) model for cache hit and Bundled delay model for cache miss. We synthesized the instruction cache at gate-level and constructed a test platform with 32-bit embedded processor EISC to evaluate performance. The cache communicates with the main memory and CPU using 4-phase hand-shake protocol. It has a 8-KB, 4-way set associative memory that employs Pseudo-LRU replacement algorithm. As the results, the designed cache shows 99% cache hit ratio and reduced latency to 68% tested on the platform with MI bench mark programs.
https://doi.org/10.5392/JKCA.2010.10.3.064 인용 PDF KSCI

Cache memory system for high performance CPU with 4GHz (4Ghz 고성능 CPU 위한 캐시 메모리 시스템)

Jung, Bo-Sung;Lee, Jung-Hoon
- Journal of the Korea Society of Computer and Information
- /
- v.18 no.2
- /
- pp.1-8
- /
- 2013
TIn this paper, we propose a high performance L1 cache structure on the high clock CPU of 4GHz. The proposed cache memory consists of three parts, i.e., a direct-mapped cache to support fast access time, a two-way set associative buffer to exploit temporal locality, and a buffer-select table. The most recently accessed data is stored in the direct-mapped cache. If a data has a high probability of a repeated reference, when the data is replaced from the direct-mapped cache, the data is selectively stored into the two-way set associative buffer. For the high performance and low power consumption, we propose an one way among two ways set associative buffer is selectively accessed based on the buffer-select table(BST). According to simulation results, Energy $^*$ Delay product can improve about 45%, 70% and 75% compared with a direct mapped cache, a four-way set associative cache, and a victim cache with two times more space respectively.
https://doi.org/10.9708/jksci.2013.18.2.001 인용 PDF KSCI

Search Result 412, Processing Time 0.024 seconds

이메일무단수집거부

이용약관

제 1 장 총칙

제 2 장 이용계약의 체결

제 3 장 계약 당사자의 의무

제 4 장 서비스의 이용

제 5 장 계약 해지 및 이용 제한

제 6 장 손해배상 및 기타사항

Detail Search

Image Search (β)