• Title/Summary/Keyword: Cache-miss

Low-Power 2-level Cache Architectures for Embedded System (내장형 시스템을 위한 저전력 2-레벨 캐쉬 메모리의 설계)

  • Jong-Min Lee;Soon-Tae Kim;Kyung-Ah Kim;Su-Ho Park;Yong-Ho Kim
    • Proceedings of the Korea Information Processing Society Conference / 2008.11a / pp.806-809 / 2008
  • On-chip caches play an important role in reducing accesses to off-chip memory. In this work, we propose a two-level cache architecture tailored to embedded systems. The level-1 (L1) cache is small, direct-mapped, and write-through. In contrast, the level-2 (L2) cache uses a conventional cache size, set associativity, and a write-back policy. As a result, the L1 cache can be accessed within a single cycle, and the L2 cache is effective at lowering the global miss rate. To reduce the energy consumed by the frequent L2 accesses caused by the write-through policy between the two cache levels, we propose a One-way access scheme. On average, the proposed two-level cache architecture achieves gains of 26% in performance, 43% in energy consumption, and 77% in energy-delay product.
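
The L1/L2 organization described in this abstract can be illustrated with a minimal trace-driven sketch. It is not the authors' simulator and does not model the One-way access scheme. The line size, the cache geometries, the LRU replacement in L2, the write-allocate behavior of L1, and all names (`DirectMappedL1`, `SetAssocL2`, `simulate`) are assumptions made for illustration.

```python
from collections import OrderedDict

LINE = 32  # bytes per cache line (assumed value)


class DirectMappedL1:
    """Small direct-mapped, write-through L1: one tag per index."""

    def __init__(self, num_lines):
        self.num_lines = num_lines
        self.tags = [None] * num_lines

    def lookup(self, block):
        """Return True on a hit; on a miss, install the block (assumed write-allocate)."""
        idx = block % self.num_lines
        hit = self.tags[idx] == block
        self.tags[idx] = block
        return hit


class SetAssocL2:
    """Set-associative, write-back L2 with LRU replacement per set (LRU is assumed)."""

    def __init__(self, num_sets, ways):
        self.num_sets, self.ways = num_sets, ways
        self.sets = [OrderedDict() for _ in range(num_sets)]  # block -> dirty flag
        self.writebacks = 0

    def access(self, block, is_write):
        s = self.sets[block % self.num_sets]
        hit = block in s
        dirty = s.pop(block, False) or is_write
        if not hit and len(s) >= self.ways:
            _, victim_dirty = s.popitem(last=False)  # evict least recently used
            if victim_dirty:
                self.writebacks += 1  # write-back: only dirty victims go to memory
        s[block] = dirty  # re-insert as most recently used
        return hit


def simulate(trace, l1, l2):
    """trace: iterable of (byte_address, is_write) pairs."""
    stats = {"l1_miss": 0, "l2_access": 0, "l2_miss": 0}
    for addr, is_write in trace:
        block = addr // LINE
        l1_hit = l1.lookup(block)
        if not l1_hit:
            stats["l1_miss"] += 1
        # Write-through: every write reaches L2; reads reach L2 only on an L1 miss.
        if is_write or not l1_hit:
            stats["l2_access"] += 1
            if not l2.access(block, is_write):
                stats["l2_miss"] += 1
    return stats
```

In this sketch the global miss rate is `l2_miss` divided by the total number of accesses in the trace. Every write adds to `l2_access` even when L1 hits, which is the write-through traffic the abstract sets out to make cheaper.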

Design and Implementation of a Main-Memory Database System for Real-time Mobile GIS Application (실시간 모바일 GIS 응용 구축을 위한 주기억장치 데이터베이스 시스템 설계 및 구현)

  • Kang, Eun-Ho;Yun, Suk-Woo;Kim, Kyung-Chang
    • The KIPS Transactions:PartD / v.11D no.1 / pp.11-22 / 2004
  • As random access memory chips get cheaper, it becomes affordable to build main memory-based database systems. Consequently, reducing cache misses emerges as the most important issue in current main memory databases, because CPU speeds have been increasing at 60% per year, compared to memory speeds at 10% per year. In this paper, we design and implement a main-memory database system for real-time mobile GIS. Our system is composed of five modules: the interface manager provides the interface for PDA users; the memory data manager controls spatial and non-spatial data in main memory using virtual memory techniques; the query manager processes spatial and non-spatial queries; the index manager manages the MR-tree index for spatial data and the T-tree index for non-spatial data; and the GIS server interface provides the interface with disk-based GIS. The proposed MR-tree propagates node splits upward only if one of the internal nodes on the insertion path has empty space. Thus, the internal nodes of the MR-tree are almost 100% full. Our experimental study shows that the two-dimensional MR-tree performs searches up to 2.4 times faster than the ordinary R-tree. To use virtual memory techniques, the memory data manager uses page tables for spatial data, non-spatial data, the T-tree, and the MR-tree. It also uses indirect addressing techniques for fast reloading from disk.
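
To see why the two growth rates quoted in this abstract make cache misses increasingly costly, the following small sketch compounds them. The function name `speed_gap` is hypothetical, and constant annual growth is a simplifying assumption.

```python
def speed_gap(years, cpu_growth=0.60, mem_growth=0.10):
    """Relative CPU-to-memory speed gap after `years` of constant compound growth."""
    return ((1 + cpu_growth) / (1 + mem_growth)) ** years


# Print how the gap widens over a decade.
for n in (1, 5, 10):
    print(n, round(speed_gap(n), 1))
```

Under these assumptions the gap grows by a factor of about 1.45 each year, which compounds to roughly 42x after ten years.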

Block Replacement Scheme based on Reuse Interval for Hybrid SSD System (Hybrid SSD 시스템을 위한 재사용 간격 기반 블록 교체 기법)

  • Yoo, Sanghyun;Kim, Kyung Tae;Youn, Hee Yong
    • Journal of Internet Computing and Services / v.16 no.5 / pp.19-27 / 2015
  • Owing to its fast read/write operations and low power consumption, the SSD (Solid State Drive) is now widely adopted as the storage device in smartphones, laptop computers, servers, etc. However, shortcomings of the SSD, such as its limited number of write operations and asymmetric read/write operations, shorten its lifespan. Therefore, the block replacement policy of an SSD used as a cache for an HDD is very important. Existing solutions for improving the lifespan of the SSD, including the LARC scheme, typically employ the LRU algorithm to manage the SSD blocks, which may increase the miss rate in the SSD because frequently used blocks are replaced instead of rarely used ones. In this paper, we propose a novel block replacement scheme that considers the block reuse interval to effectively handle various data read/write patterns. The proposed scheme replaces blocks in the SSD based on recency, determined by reuse interval and age, along with hit ratio. Computer simulation using workload trace files reveals that the proposed scheme consistently improves the performance and lifespan of the SSD by increasing the hit ratio and decreasing the number of write operations compared to existing schemes, including LARC.
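
As a rough illustration of the two ingredients this abstract mentions, the sketch below measures the hit ratio of a plain LRU cache (the baseline policy) and extracts per-block reuse intervals from a trace. It is not the proposed replacement scheme, whose exact rule the abstract does not give. The function names and the trace format (a list of block identifiers) are assumptions.

```python
from collections import OrderedDict


def lru_hit_ratio(trace, capacity):
    """Hit ratio of an LRU-managed cache holding `capacity` blocks."""
    cache = OrderedDict()
    hits = 0
    for block in trace:
        if block in cache:
            hits += 1
            cache.move_to_end(block)       # mark as most recently used
        else:
            if len(cache) >= capacity:
                cache.popitem(last=False)  # evict the least recently used block
            cache[block] = True
    return hits / len(trace) if trace else 0.0


def reuse_intervals(trace):
    """Map each reused block to the gaps (in accesses) between consecutive uses."""
    last_seen, gaps = {}, {}
    for t, block in enumerate(trace):
        if block in last_seen:
            gaps.setdefault(block, []).append(t - last_seen[block])
        last_seen[block] = t
    return gaps
```

On the trace `a b a c d a b` with a capacity of two blocks, LRU evicts `a` just before it is reused. A policy that consulted the short reuse intervals of `a` could plausibly have kept it.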

Ultra-low-power DSP for Audio Signal Processing (오디오 신호 처리를 위한 초저전력 DSP 프로세서)

  • Kwon, Kiseok;Ahn, Minwook;Jo, Seokhwan;Lee, Yeonbok;Lee, Seungwon;Park, Young-Hwan;Kim, Sukjin;Kim, Do-Hyung;Kim, Jaehyun
    • Proceedings of the Korean Society of Broadcast Engineers Conference / 2014.06a / pp.157-159 / 2014
  • In this paper, we introduce SlimSRP, an ultra-low-power digital signal processor (DSP) solution for mobile audio and voice applications. So far, application processors (APs) have taken charge of all the tasks in mobile devices. However, they suffer from short battery life when dealing with complex usage scenarios, such as an always-on voice trigger with continuous audio playback. Based on extensive analysis of audio and voice application characteristics, SlimSRP is designed to relieve the performance and power burden of APs. It employs a three-issue VLIW architecture, and its major low-power and high-performance techniques include: (1) an optimized register-file architecture friendly to constant generation, (2) a powerful instruction set that reduces the number of register file accesses, and (3) a unique instruction compression scheme that saves memory and reduces cache misses. An implementation of SlimSRP runs at up to 200MHz, and the logic occupies 95K NAND2 gates in the Samsung 28LPP process. The experimental results demonstrate that an MP3 decoder application with a 128kbps 44.1kHz input can run at 5.1MHz and that the logic consumes only 22uW/MHz.

Data Communication Prediction Model in Multiprocessors based on Robust Estimation (로버스트 추정을 이용한 다중 프로세서에서의 데이터 통신 예측 모델)

  • Jun Janghwan;Lee Kangwoo
    • The KIPS Transactions:PartA / v.12A no.3 s.93 / pp.243-252 / 2005
  • This paper introduces a novel modeling technique for building data communication prediction models in multiprocessors, using Least-Squares and Robust Estimation methods. A set of sample communication rates is collected by feeding a few small input data sets to workload programs. By applying estimation methods to these samples, we can build analytic models that precisely estimate communication rates for huge input data sets. The primary advantage is that, since the models depend only on data set size and not on the specifications of target systems or workloads, they can be applied to various systems and applications. In addition, because the algorithmic behavioral characteristics of workloads are reflected in the models, they can also model diverse other performance metrics. In this paper, we build models for cache miss rates, which are the main cause of data communication in shared memory multiprocessor systems. The results show excellent prediction error rates: below $1\%$ for five cases out of 12, and about $3\%$ for the remaining cases.
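
The contrast between least-squares and robust fitting can be sketched on a straight-line model. The abstract does not state the model form or which robust estimator the authors use. Here the Theil-Sen estimator (median of pairwise slopes) stands in as one well-known robust method, and the sample data and function names are made up for illustration.

```python
from itertools import combinations
from statistics import mean, median


def least_squares(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    mx, my = mean(xs), mean(ys)
    slope = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
             / sum((x - mx) ** 2 for x in xs))
    return slope, my - slope * mx


def theil_sen(xs, ys):
    """Robust fit: median of all pairwise slopes, then median intercept."""
    slopes = [(y2 - y1) / (x2 - x1)
              for (x1, y1), (x2, y2) in combinations(zip(xs, ys), 2)
              if x2 != x1]
    slope = median(slopes)
    intercept = median(y - slope * x for x, y in zip(xs, ys))
    return slope, intercept


# Hypothetical samples: the last measurement is an outlier.
sizes = [1, 2, 3, 4, 5]
rates = [2, 4, 6, 8, 100]
print(least_squares(sizes, rates))  # slope pulled far from 2 by the outlier
print(theil_sen(sizes, rates))      # slope stays at 2
```

A single bad sample drags the least-squares line away from the trend, while the median-based fit ignores it. That is the usual motivation for robust estimation when only a few small-input samples are available.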

An Efficient Computation of Matrix Triple Products (삼중 행렬 곱셈의 효율적 연산)

  • Im, Eun-Jin
    • Journal of the Korea Society of Computer and Information / v.11 no.3 / pp.141-149 / 2006
  • In this paper, we introduce an improved algorithm for computing the matrix triple product that commonly arises in primal-dual optimization methods. In computing $P=AHA^{t}$, we devise a single-pass algorithm that exploits the block diagonal structure of the matrix $H$. This one-phase scheme requires fewer floating point operations and roughly half the memory of the generic two-phase algorithm, in which the product is computed in two steps, first $Q=HA^{t}$ and then $P=AQ$. The one-phase scheme achieved a speed-up of 2.04 over the two-phase scheme on an Intel Itanium II platform. The performance improvement was also evaluated through performance modeling based on memory latency and modeled cache miss rates. Our research has an impact on performance tuning of complex sparse matrix operations, whereas most previous work focused on performance tuning of basic operations.
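
The idea of exploiting a block diagonal $H$ can be sketched with dense lists. If $A$ is split column-wise into slices $A_k$ matching the diagonal blocks $H_k$, then $P=\sum_k A_k H_k A_k^{t}$, so $P$ can be accumulated block by block without forming the full $Q$. This is a simplified illustration, not the paper's algorithm, which targets sparse matrices. All function names are hypothetical.

```python
def matmul(X, Y):
    """Dense matrix product of two list-of-lists matrices."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*Y)] for row in X]


def transpose(X):
    return [list(r) for r in zip(*X)]


def block_diag(blocks):
    """Assemble the full block diagonal matrix H from its square blocks."""
    n = sum(len(b) for b in blocks)
    H = [[0] * n for _ in range(n)]
    off = 0
    for b in blocks:
        for i, row in enumerate(b):
            for j, v in enumerate(row):
                H[off + i][off + j] = v
        off += len(b)
    return H


def two_phase(A, blocks):
    """Generic scheme: form Q = H A^t in full, then P = A Q."""
    Q = matmul(block_diag(blocks), transpose(A))
    return matmul(A, Q)


def one_phase(A, blocks):
    """Single pass over the diagonal blocks, accumulating P = sum_k A_k H_k A_k^t."""
    m = len(A)
    P = [[0] * m for _ in range(m)]
    off = 0
    for Hk in blocks:
        b = len(Hk)
        Ak = [row[off:off + b] for row in A]  # m x b column slice of A
        Qk = matmul(Hk, transpose(Ak))        # small b x m temporary
        Pk = matmul(Ak, Qk)
        for i in range(m):
            for j in range(m):
                P[i][j] += Pk[i][j]
        off += b
    return P
```

The single-pass version only ever holds one small temporary `Qk` and never multiplies by the zero entries of $H$, which gives a feel for where the savings in memory and floating point operations come from.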

A Development of Fusion Processor Architecture for Efficient Main Memory Access in CPU-GPU Environment (CPU-GPU환경에서 효율적인 메인메모리 접근을 위한 융합 프로세서 구조 개발)

  • Park, Hyun-Moon;Kwon, Jin-San;Hwang, Tae-Ho;Kim, Dong-Sun
    • The Journal of the Korea institute of electronic communication sciences / v.11 no.2 / pp.151-158 / 2016
  • The Heterogeneous System Architecture (HSA) resolves an old problem with existing CPU and GPU architectures by allowing both units to directly access each other's memory pools via unified virtual memory. In a physically realized system, however, frequent data exchanges between CPU and GPU for a virtual memory block result in bottlenecks and coherence request overheads. In this paper, we propose a Fusion Processor Architecture for efficient access to main memory from both CPU and GPU. It consists of a Job Manager, a Re-mapper, and a Pre-fetcher that control, organize, and distribute workloads and working areas for the GPU cores. These components help reduce memory exchanges between the two processors and improve overall efficiency by eliminating faulty page table requests. To verify the proposed architecture, we develop an emulator based on QEMU and compare several architectures such as CUDA (Compute Unified Device Architecture), OpenMP, and OpenCL. As a result, the proposed fusion processor architecture is 198% faster than the others by removing unnecessary memory copies and cache-miss overheads.

Flash memory system with spatial smart buffer for the substitution of a hard-disk (하드디스크 대용을 위한 공간적 스마트 버퍼 플래시 메모리 시스템)

  • Jung, Bo-Sung;Jung, Jung-Hoon
    • Journal of the Korea Society of Computer and Information / v.14 no.3 / pp.41-49 / 2009
  • Flash memory has become increasingly important and in demand as a storage medium because of its low power consumption, low price, and large capacity. This research designs a high-performance flash memory structure that can substitute for a hard disk by dynamically prefetching data with aggressive spatial locality from a spatial smart buffer system. The proposed buffer system in a NAND flash memory consists of three parts: a fully associative victim buffer for temporal locality, a fully associative spatial buffer for spatial locality, and a dynamic fetching unit. We propose a new dynamic prefetching algorithm for aggressive spatial locality. That is, to use flash memory in place of the hard disk, the proposed flash system achieves better performance by overcoming many drawbacks of flash memory through the new structure and the new algorithm. According to the simulation results, compared with the smart buffer system, the average miss ratio is reduced by about 26% for Mediabench applications. The average memory access times are improved by about 35% for Mediabench applications and by over 30% for Spec2000 applications.

An Application-Specific and Adaptive Power Management Technique for Portable Systems (휴대장치를 위한 응용프로그램 특성에 따른 적응형 전력관리 기법)

  • Egger, Bernhard;Lee, Jae-Jin;Shin, Heon-Shik
    • Journal of KIISE:Computer Systems and Theory / v.34 no.8 / pp.367-376 / 2007
  • In this paper, we introduce an application-specific and adaptive power management technique for portable systems that support dynamic voltage scaling (DVS). We exploit both the idle time of multitasking systems running soft real-time tasks and memory- or CPU-bound code regions. Detailed power and execution time profiles guide an adaptive power manager (APM) that is linked to the operating system. A post-pass optimizer marks candidate regions for DVS by inserting calls to the APM. At runtime, the APM monitors the CPU's performance counters to dynamically determine the affinity of each marked region. For each region, the APM computes the optimal voltage and frequency setting in terms of energy consumption and switches the CPU to that setting during the execution of the region. Idle time is exploited by monitoring system idle time and switching to the most energy-efficient setting that does not prolong execution. We show that our method is most effective for periodic workloads such as video or audio decoding. We have implemented our method in a multitasking operating system (Microsoft Windows CE) running on an Intel XScale processor. We achieved total system power savings of up to 9% over the standard power management policy that puts the CPU in a low-power mode during idle periods.
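
The per-region choice of a voltage and frequency setting can be sketched with a toy energy model. Everything here is an assumption rather than the authors' method: the model (dynamic energy proportional to V^2 per CPU cycle plus a constant static power), the slowdown bound, the parameter values, and the name `pick_setting`. Memory stall time is assumed not to scale with CPU frequency.

```python
def pick_setting(settings, cpu_cycles, mem_seconds,
                 max_slowdown=1.05, static_watts=0.1, cap=1e-9):
    """Pick the (freq_hz, volts) pair with the lowest modeled energy for one region.

    settings: list of (freq_hz, volts) pairs (hypothetical operating points).
    cpu_cycles: cycles of CPU work in the region.
    mem_seconds: memory stall time, assumed independent of CPU frequency.
    """
    def run_time(freq):
        return cpu_cycles / freq + mem_seconds

    t_fast = run_time(max(f for f, _ in settings))
    best = None
    for freq, volts in settings:
        t = run_time(freq)
        if t > max_slowdown * t_fast:
            continue  # setting would slow the region down too much
        energy = cap * volts * volts * cpu_cycles + static_watts * t
        if best is None or energy < best[0]:
            best = (energy, freq, volts)
    return best[1], best[2]


# Hypothetical operating points.
points = [(400e6, 1.3), (200e6, 1.0), (100e6, 0.85)]
print(pick_setting(points, cpu_cycles=4e8, mem_seconds=0.0))  # CPU-bound region
print(pick_setting(points, cpu_cycles=4e6, mem_seconds=1.0))  # memory-bound region
```

With these made-up numbers, the CPU-bound region keeps the fastest setting because any slower one violates the slowdown bound. The memory-bound region can drop to a lower frequency and voltage at almost no cost in execution time.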