• Title/Summary/Keyword: Cache-miss

Search Result 99, Processing Time 0.024 seconds

A New Flash Memory Package Structure with Intelligent Buffer System and Performance Evaluation (버퍼 시스템을 내장한 새로운 플래쉬 메모리 패키지 구조 및 성능 평가)

  • Lee Jung-Hoon;Kim Shin-Dug
    • Journal of KIISE:Computer Systems and Theory
    • /
    • v.32 no.2
    • /
    • pp.75-84
    • /
    • 2005
  • This research is to design a high performance NAND-type flash memory package with a smart buffer cache that enhances the exploitation of spatial and temporal locality. The proposed buffer structure in a NAND flash memory package, called as a smart buffer cache, consists of three parts, i.e., a fully-associative victim buffer with a small block size, a fully-associative spatial buffer with a large block size, and a dynamic fetching unit. This new NAND-type flash memory package can achieve dramatically high performance and low power consumption comparing with any conventional NAND-type flash memory. Our results show that the NAND flash memory package with a smart buffer cache can reduce the miss ratio by around 70% and the average memory access time by around 67%, over the conventional NAND flash memory configuration. Also, the average miss ratio and average memory access time of the package module with smart buffer for a given buffer space (e.g., 3KB) can achieve better performance than package modules with a conventional direct-mapped buffer with eight times(e.g., 32KB) as much space and a fully-associative configuration with twice as much space(e.g., 8KB)

Empirical Modeling for Cache Miss Rates in Multiprocessors (다중 프로세서에서의 캐시접근 실패율을 위한 경험적 모델링)

  • Lee, Kang-Woo;Yang, Gi-Joo;Park, Choon-Shik
    • Journal of KIISE:Computer Systems and Theory
    • /
    • v.33 no.1_2
    • /
    • pp.15-34
    • /
    • 2006
  • This paper introduces an empirical modeling technique. This technique uses a set of sample results which are collected from a few small scale simulations. Empirical models are developed by applying a couple of statistical estimation techniques to these samples. We built two types of models for cache miss rates in Symmetric Multiprocessor systems. One is for the changes of input data set size while the specification of target system is fixed. The other is for the changes of the number of processors in target system while the input data set size is fixed. To develop accurate models, we built individual model for every kind of cache misses for each shared data structure in a program. The final model is then obtained by integrating them. Besides, combined use of Least Mean Squares and Robust Estimations enhances the quality of models by minimizing the distortion due to outliers. Empirical modeling technique produces extremely accurate models without analysis on sample data. In addition, since only snail scale simulations are necessary, once a set of samples can be collected, empirical method can be adopted in any research areas. In 17 cases among 24 trials, empirical models present extremely low prediction errors below $1\%$. In the remaining cases, the accuracy is excellent, as well. The models sustain high quality even when the behavioral characteristics of programs are irregular and the number of samples are barely enough.

Implementation and Performance Analysis of Efficient Packet Processing Method For DPI (Deep Packet Inspection) System using Dual-Processors (듀얼 프로세서 기반 DPI (Deep Packet Inspection) 엔진을 위한 효율적 패킷 프로세싱 방안 구현 및 성능 분석)

  • Yang, Joon-Ho;Han, Seung-Jae
    • The KIPS Transactions:PartC
    • /
    • v.16C no.4
    • /
    • pp.417-422
    • /
    • 2009
  • Implementation of DPI(Deep Packet Inspection) system on a general purpose multiprocessor platform is an attractive option from the implementation cost point of view, since it does not require high-cost customized hardware. Load balancing has been considered as a primary means to achieve high performance in multi processor systems. We claim, however, that in case of DPI system design simply balancing the load of each processor does not necessarily yield the highest system performance. Instead, we propose a method in which tasks are allocated to processors based on their functions. We implemented the proposed method in dual processor Linux system and compare its performance with the existing load balancing methods. Under the proposed method, one processor is dedicated to deal with interrupt handling and generic packet processing, while another processor is dedicated to DPI processing. According to experimental results, the proposed scheme outperforms the existing schemes by 60%, mainly because of the reduction of cache miss and spin lock occurrences.

Design of a High-Performance Mobile GPGPU with SIMT Architecture based on a Small-size Warp Scheduler (작은 크기의 Warp 스케쥴러 기반 SIMT구조 고성능 모바일 GPGPU 설계)

  • Lee, Kwang-Yeob
    • Journal of IKEEE
    • /
    • v.25 no.3
    • /
    • pp.479-484
    • /
    • 2021
  • This paper proposed and designed a structure to achieve high performance with a small number of cores in GPGPU with SIMT structure. GPGPU for application to mobile devices requires a structure to increase performance compared to power consumption. In order to reduce power consumption, the number of cores decreased, but to improve performance, the size of the warp scheduler for managing threads was set to 4, which was greatly reduced than 32 of general GPGPU. Reducing warp size can reduce the number of idle cycles in pipelines and efficiently apply memory latency to reduce miss penalty when accessing cache memory. The designed GPGPU measured computational performance using a test program that includes floating point operations and measured power consumption through a 28nm CMOS process to obtain 104.5GFlops/Watt as a performance per power. The results of this paper showed about four times better performance per power compared to Tegra K1 of Nvidia

General Web Cache Implementation Using NIO (NIO를 이용한 범용 웹 캐시 구현)

  • Lee, Chul-Hui;Shin, Yong-Hyeon
    • Journal of Advanced Navigation Technology
    • /
    • v.20 no.1
    • /
    • pp.79-85
    • /
    • 2016
  • Network traffic is increased rapidly, due to mobile and social network, such as smartphones and facebook, in recent web environment. In this paper, we improved web response time of existing system using direct buffer of NIO and DMA. This solved the disadvantage of JAVA, such as CPU performance reduction due to the blocking of I/O, garbage collection of buffer. Key values circulated many data due to priority change put on a hash map operated easily and apply a priority modification algorithm. Large response data is separated and stored at a fast direct buffer and improved performance. This paper showed that the proposed method using NIO was much improved performance, in many test situations of cache hit and cache miss.

Analysis on the GPU Performance according to Hierarchical Memory Organization (계층적 메모리 구성에 따른 GPU 성능 분석)

  • Choi, Hongjun;Kim, Jongmyon;Kim, Cheolhong
    • The Journal of the Korea Contents Association
    • /
    • v.14 no.3
    • /
    • pp.22-32
    • /
    • 2014
  • Recently, GPGPU has been widely used for general-purpose processing as well as graphics processing by providing optimized hardware for parallel processing. Memory system has big effects on the performance of parallel processing units such as GPU. In the GPU, hierarchical memory architecture is implemented for high memory bandwidth. Moreover, both memory address coalescing and memory request merging techniques are widely used. This paper analyzes the GPU performance according to various memory organizations. According to our simulation results, GPU performance improves by 15.5%, 21.5%, 25.5%, 30.9% as adding 8KB L1, 16KB L1, 32KB L1, 64KB L1 cache, respectively, compared to case without L1 cache. However, experimental results show that some benchmarks decrease performance since memory transaction increases due to data dependency. Moreover, average memory access latency is increased as the depth of hierarchical cache level increases when cache miss occurs significantly.

An Address Translation Technique Large NAND Flash Memory using Page Level Mapping (페이지 단위 매핑 기반 대용량 NAND플래시를 위한 주소변환기법)

  • Seo, Hyun-Min;Kwon, Oh-Hoon;Park, Jun-Seok;Koh, Kern
    • Journal of KIISE:Computing Practices and Letters
    • /
    • v.16 no.3
    • /
    • pp.371-375
    • /
    • 2010
  • SSD is a storage medium based on NAND Flash memory. Because of its short latency, low power consumption, and resistance to shock, it's not only used in PC but also in server computers. Most SSDs use FTL to overcome the erase-before-overwrite characteristic of NAND flash. There are several types of FTL, but page mapped FTL shows better performance than others. But its usefulness is limited because of its large memory footprint for the mapping table. For example, 64MB memory space is required only for the mapping table for a 64GB MLC SSD. In this paper, we propose a novel caching scheme for the mapping table. By using the mapping-table-meta-data we construct a fully associative cache, and translate the address within O(1) time. The simulation results show more than 80 hit ratio with 32KB cache and 90% with 512KB cache. The overall memory footprint was only 1.9% of 64MB. The time overhead of cache miss was measured lower than 2% for most workload.

Cache and Pipeline Architecture Improvement and Low Power Design of Embedded Processor (임베디드 프로세서의 캐시와 파이프라인 구조개선 및 저전력 설계)

  • Jung, Hong-Kyun;Ryoo, Kwang-Ki
    • Proceedings of the Korean Institute of Information and Commucation Sciences Conference
    • /
    • 2008.10a
    • /
    • pp.289-292
    • /
    • 2008
  • This paper presents a branch prediction algorithm and a 4-way set-associative cache for performance improvement of OpenRISC processor and a clock gating algorithm using ODC (Observability Don't Care) operation for a low-power processor. The branch prediction algorithm has a structure using BTB(Branch Target Buffer) and 4-way set associative cache has lower miss rate than direct-mapped cache. The clock gating algorithm reduces dynamic power consumption. As a result of estimation of performance and dynamic power, the performance of the OpenRISC processor using the proposed algorithm is improved about 8.9% and dynamic power of the processor using samsung $0.18{\mu}m$ technology library is reduced by 13.9%.

  • PDF

Performance and Power Consumption Improvement of Embedded RISC Core (임베디드 RISC 코어의 성능 및 전력 개선)

  • Jung, Hong-Kyun;Ryoo, Kwang-Ki
    • Journal of the Korea Institute of Information and Communication Engineering
    • /
    • v.14 no.2
    • /
    • pp.453-461
    • /
    • 2010
  • This paper presents a branch prediction algorithm and a 4-way set-associative cache for performance improvement of embedded RISC core and a clock-gating algorithm using ODC (Observability Don't Care) operation to improve the power consumption of the core. The branch prediction algorithm has a structure using BTB(Branch Target Buffer) and 4-way set associative cache has lower miss rate than direct-mapped cache. Pseudo-LRU Policy, which is one of the Line Replacement Policies, is used for decreasing the number of bits that store LRU value. The clock gating algorithm reduces dynamic power consumption. As a result of estimation of performance and dynamic power, the performance of the OpenRISC core applied the proposed architecture is improved about 29% and dynamic power of the core using Chartered $0.18{\mu}m$ technology library is reduced by 16%.

Making Cache-Conscious CCMR-trees for Main Memory Indexing (주기억 데이타베이스 인덱싱을 위한 CCMR-트리)

  • 윤석우;김경창
    • Journal of KIISE:Databases
    • /
    • v.30 no.6
    • /
    • pp.651-665
    • /
    • 2003
  • To reduce cache misses emerges as the most important issue in today's situation of main memory databases, in which CPU speeds have been increasing at 60% per year, and memory speeds at 10% per year. Recent researches have demonstrated that cache-conscious index structure such as the CR-tree outperforms the R-tree variants. Its search performance can be poor than the original R-tree, however, since it uses a lossy compression scheme. In this paper, we propose alternatively a cache-conscious version of the R-tree, which we call MR-tree. The MR-tree propagates node splits upward only if one of the internal nodes on the insertion path has empty room. Thus, the internal nodes of the MR-tree are almost 100% full. In case there is no empty room on the insertion path, a newly-created leaf simply becomes a child of the split leaf. The height of the MR-tree increases according to the sequence of inserting objects. Thus, the HeightBalance algorithm is executed when unbalanced heights of child nodes are detected. Additionally, we also propose the CCMR-tree in order to build a more cache-conscious MR-tree. Our experimental and analytical study shows that the two-dimensional MR-tree performs search up to 2.4times faster than the ordinary R-tree while maintaining slightly better update performance and using similar memory space.