• Title/Summary/Keyword: cache performance


Design and Performance Analysis of High Performance Processor-Memory Integrated Architectures (고성능 프로세서-메모리 혼합 구조의 설계 및 성능 분석)

  • Kim, Young-Sik;Kim, Shin-Dug;Han, Tack-Don
    • The Transactions of the Korea Information Processing Society / v.5 no.10 / pp.2686-2703 / 1998
  • The widening performance gap between processor and memory has led to the emergence of a promising architecture: processor-memory (P-M) integration. In this paper, various design issues for P-M integration are studied. First, an analytical model of the DRAM access time is constructed that considers both the bank conflict ratio and the DRAM page hit ratio. Using the proposed model, the sources of performance improvement and the performance bottlenecks in on-chip DRAM architecture design are identified. This paper proposes a new architecture, called the delayed precharge bank architecture, which improves memory system performance by increasing the DRAM page hit ratio. This paper also adapts an efficient bank interleaving mechanism to the proposed architecture. Execution-driven simulation verifies that this architecture performs better than both the hierarchical multi-bank architecture and the conventional bank architecture. Eight SPEC95 benchmarks are used for the simulation, varying the cache architecture parameters, the number of DRAM banks, and the delayed time quantum.
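The abstract's analytical model combines the DRAM page hit ratio and the bank conflict ratio into an access time, but the paper's formula is not reproduced here. The sketch below is a hypothetical first-order model under loudly stated assumptions: a page hit pays only the column access (`t_cas`), a page miss pays precharge + row activation + column access, and a bank conflict adds roughly one page-miss latency of waiting. All names and the conflict penalty are illustrative.

```python
def expected_access_time(page_hit_ratio: float, bank_conflict_ratio: float,
                         t_cas: float, t_ras: float, t_precharge: float) -> float:
    """Hypothetical average DRAM access time (same unit as the t_* inputs).

    Assumptions (not taken from the paper):
      - page hit: the row is already open, so only the column access t_cas is paid
      - page miss: precharge + row activation + column access
      - bank conflict: the request waits about one page-miss latency for the busy bank
    """
    if not (0.0 <= page_hit_ratio <= 1.0 and 0.0 <= bank_conflict_ratio <= 1.0):
        raise ValueError("ratios must lie in [0, 1]")
    t_hit = t_cas
    t_miss = t_precharge + t_ras + t_cas
    base = page_hit_ratio * t_hit + (1.0 - page_hit_ratio) * t_miss
    return base + bank_conflict_ratio * t_miss


if __name__ == "__main__":
    # A higher page hit ratio (what delaying precharge aims for) lowers the average.
    for h in (0.3, 0.5, 0.7):
        print(h, expected_access_time(h, 0.1, t_cas=2, t_ras=3, t_precharge=3))
```

In this toy model, every point of page hit ratio gained saves `t_precharge + t_ras` per access, which is the lever the delayed precharge idea targets.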


Improving Log-Structured File System Performance by Utilizing Non-Volatile Memory (비휘발성 메모리를 이용한 로그 구조 파일 시스템의 성능 향상)

  • Kang, Yang-Wook;Choi, Jong-Moo;Lee, Dong-Hee;Noh, Sam-H.
    • Journal of KIISE:Computing Practices and Letters / v.14 no.5 / pp.537-541 / 2008
  • The Log-Structured File System (LFS) is a disk-based file system optimized for write performance. LFS keeps dirty data in memory as long as possible and then flushes all of it to disk sequentially at once. In a real system, however, dirty data held in memory must be flushed to disk to maintain file system consistency, even when more memory is still available. These synchronizations increase the cleaner overhead of LFS and cause LFS to write more metadata to disk. In this paper, we use non-volatile RAM (NV-RAM) and modify LFS and the virtual memory subsystem so that LFS can gather enough dirty data in memory and issue fewer small disk writes. As a result, we improve the performance of LFS by about 2.5 times compared with the original LFS.
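To make the small-write problem concrete, here is a hypothetical counting model (not the paper's implementation): without NV-RAM, each consistency sync forces the partially filled buffer to disk as a small write; with NV-RAM, buffered data is assumed already durable, so syncs are absorbed and only full segments are written. The event format and `segment_size` are illustrative assumptions.

```python
from typing import Iterable, Tuple


def count_disk_writes(events: Iterable[Tuple], segment_size: int, nvram: bool) -> int:
    """Count disk writes for a stream of ('write', nbytes) and ('sync',) events.

    Hypothetical model: without NV-RAM a sync forces buffered data out as a small
    partial-segment write; with NV-RAM syncs are absorbed and only full segments
    are written sequentially. Data still buffered at the end is not counted.
    """
    buffered = 0
    writes = 0
    for event in events:
        if event[0] == "write":
            buffered += event[1]
            while buffered >= segment_size:  # one sequential write per full segment
                writes += 1
                buffered -= segment_size
        elif event[0] == "sync" and not nvram and buffered > 0:
            writes += 1  # small write forced for consistency
            buffered = 0
    return writes


if __name__ == "__main__":
    trace = [("write", 4), ("sync",)] * 4
    print(count_disk_writes(trace, segment_size=16, nvram=False))  # 4 small writes
    print(count_disk_writes(trace, segment_size=16, nvram=True))   # 1 full segment
```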

Development of Communication Module for a Mobile Integrated SNS Gateway (모바일 통합 SNS 게이트웨이 통신 모듈 개발)

  • Lee, Shinho;Kwon, Dongwoo;Kim, Hyeonwoo;Ju, Hongtaek
    • The Journal of Korean Institute of Communications and Information Sciences / v.39B no.2 / pp.75-85 / 2014
  • Recently, mobile SNS traffic has increased tremendously due to the spread of smart devices such as smartphones and tablets. In this paper, a mobile integrated SNS gateway is proposed to cope with massive SNS traffic. Most mobile SNS applications update their information through individual connections to the corresponding servers. The proposed gateway integrates these applications in order to reduce the SNS traffic caused by continuous data requests and to improve mobile communication performance. The key elements of the mobile integrated SNS gateway are synchronization, caching, and integrated authentication. The proposed protocol and gateway system were implemented on a testbed deployed on a real network to evaluate the performance of the proposed gateway. Finally, we present the caching performance of the gateway system implementation.
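The abstract names caching as a key gateway element without describing the policy. As an assumption, a time-to-live (TTL) cache is one simple way a gateway could answer repeated update requests locally. The `TTLCache` class, its `fetch` callback, and the injectable clock are all hypothetical.

```python
import time
from typing import Any, Callable, Dict, Hashable, Tuple


class TTLCache:
    """Hypothetical gateway-side cache: repeated requests are answered locally
    until an entry is older than ttl_seconds, then re-fetched from upstream."""

    def __init__(self, ttl_seconds: float, clock: Callable[[], float] = time.monotonic):
        self.ttl = ttl_seconds
        self.clock = clock
        self.entries: Dict[Hashable, Tuple[Any, float]] = {}
        self.upstream_requests = 0

    def get(self, key: Hashable, fetch: Callable[[Hashable], Any]) -> Any:
        now = self.clock()
        cached = self.entries.get(key)
        if cached is not None and now - cached[1] < self.ttl:
            return cached[0]  # fresh hit: no traffic to the SNS server
        value = fetch(key)    # miss or stale entry: one upstream request
        self.upstream_requests += 1
        self.entries[key] = (value, now)
        return value


if __name__ == "__main__":
    now = [0.0]
    cache = TTLCache(ttl_seconds=60, clock=lambda: now[0])
    fetch = lambda key: f"timeline for {key}"
    cache.get("alice", fetch)
    cache.get("alice", fetch)   # served from cache
    now[0] = 61.0
    cache.get("alice", fetch)   # expired, fetched again
    print(cache.upstream_requests)  # 2
```

The TTL trades freshness for traffic: a longer TTL removes more upstream requests but serves older data.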

Improving Flash Translation Layer for Hybrid Flash-Disk Storage through Sequential Pattern Mining based 2-Level Prefetching Technique (하이브리드 플래시-디스크 저장장치용 Flash Translation Layer의 성능 개선을 위한 순차패턴 마이닝 기반 2단계 프리패칭 기법)

  • Chang, Jae-Young;Yoon, Un-Keum;Kim, Han-Joon
    • The Journal of Society for e-Business Studies / v.15 no.4 / pp.101-121 / 2010
  • This paper presents an intelligent prefetching technique that significantly improves the performance of hybrid flash-disk storage, a combination of flash memory and a hard disk. Because the flash memory embedded in a hybrid device is much faster than the hard disk for I/O operations, it can be used as a 'cache' space to improve system performance. The basic prefetching strategy uses sequential pattern mining to extract object access patterns from historical access sequences. We use two techniques to enhance the performance of hybrid storage with prefetching. The first modifies the FAST algorithm used for flash memory mapping. The second extends the prefetching unit to the block level as well as the file level, so that flash memory space is used effectively. To evaluate the proposed technique, we perform experiments using synthetic data and real UCC data, and demonstrate the usefulness of our technique.
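The core idea, learning "after X, Y usually follows" from access history and staging Y into flash ahead of demand, can be sketched with simple successor counting. This is a deliberately simplified stand-in, not full sequential pattern mining and not the paper's algorithm; `min_support` and `lookahead` are hypothetical parameters.

```python
from collections import Counter, defaultdict
from typing import Dict, Hashable, List, Sequence


def mine_successor_rules(histories: Sequence[Sequence[Hashable]],
                         min_support: int) -> Dict[Hashable, Hashable]:
    """Learn rules X -> Y from past access sequences. A rule is kept only if
    Y directly followed X at least min_support times."""
    counts: Dict[Hashable, Counter] = defaultdict(Counter)
    for seq in histories:
        for current, nxt in zip(seq, seq[1:]):
            counts[current][nxt] += 1
    rules = {}
    for current, successors in counts.items():
        nxt, support = successors.most_common(1)[0]
        if support >= min_support:
            rules[current] = nxt
    return rules


def prefetch_candidates(rules: Dict[Hashable, Hashable], obj: Hashable,
                        lookahead: int = 2) -> List[Hashable]:
    """Follow the rule chain from obj; return up to `lookahead` objects to prefetch."""
    out: List[Hashable] = []
    while obj in rules and len(out) < lookahead:
        obj = rules[obj]
        if obj in out:  # stop on cycles
            break
        out.append(obj)
    return out


if __name__ == "__main__":
    history = [["a", "b", "c"], ["a", "b", "d"], ["a", "b", "c"]]
    rules = mine_successor_rules(history, min_support=2)
    print(rules)                            # {'a': 'b', 'b': 'c'}
    print(prefetch_candidates(rules, "a"))  # ['b', 'c']
```

The same routine could be run on file IDs or block numbers; which granularity to use is the kind of choice the abstract's block-level/file-level extension addresses.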

External Merge Sorting in Tajo with Variable Server Configuration (매개변수 환경설정에 따른 타조의 외부합병정렬 성능 연구)

  • Lee, Jongbaeg;Kang, Woon-hak;Lee, Sang-won
    • Journal of KIISE / v.43 no.7 / pp.820-826 / 2016
  • There is a growing demand for big data processing, which extracts valuable information from large amounts of data. The Hadoop system employs the MapReduce framework to process big data. However, MapReduce has limitations such as inflexible and slow data processing. To overcome these drawbacks, SQL query processing techniques known as SQL-on-Hadoop were developed. Apache Tajo, one of the SQL-on-Hadoop systems, was developed by a Korean development group. External merge sort is one of the most heavily used algorithms in Tajo's query processing. Its performance in Tajo is influenced by two parameters: the sort buffer size and the fanout. In this paper, we analyzed the performance of external merge sort in Tajo with various sort buffer sizes and fanouts. We found two major causes of performance differences: CPU cache misses, which increase as the sort buffer size grows, and the number of merge passes, which is determined by the fanout.
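The link between fanout and merge passes follows from the textbook external merge sort model. Tajo's internal details are not given in the abstract, so the sketch below shows only the generic arithmetic. Run generation produces ceil(data / buffer) sorted runs, and each pass merges up to `fanout` runs into one. The CPU-cache effect of large buffers is not modeled.

```python
def merge_passes(data_size: int, sort_buffer_size: int, fanout: int) -> int:
    """Number of merge passes in a textbook external merge sort.

    Run generation sorts one sort buffer's worth of data at a time, giving
    ceil(data_size / sort_buffer_size) runs; each merge pass combines up to
    `fanout` runs into one. Sizes must use the same unit (e.g. MB).
    """
    if sort_buffer_size <= 0 or fanout < 2:
        raise ValueError("sort_buffer_size must be positive and fanout at least 2")
    runs = -(-data_size // sort_buffer_size)  # integer ceiling division
    passes = 0
    while runs > 1:
        runs = -(-runs // fanout)
        passes += 1
    return passes


if __name__ == "__main__":
    # 100 GB of input with a 1 GB sort buffer -> 100 initial runs
    for fanout in (4, 10, 100):
        print(fanout, merge_passes(100 * 1024, 1024, fanout))  # 4 -> 4, 10 -> 2, 100 -> 1
```

Each extra pass rereads and rewrites the whole data set, so dropping from 4 passes to 1 removes three full I/O sweeps.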

An Investigation of the Performance of the Colored Gauss-Seidel Solver on CPU and GPU (Coloring이 적용된 Gauss-Seidel 해법을 통한 CPU와 GPU의 연산 효율에 관한 연구)

  • Yoon, Jong Seon;Jeon, Byoung Jin;Choi, Hyoung Gwon
    • Transactions of the Korean Society of Mechanical Engineers B / v.41 no.2 / pp.117-124 / 2017
  • The performance of the colored Gauss-Seidel solver on CPU and GPU was investigated for two- and three-dimensional heat conduction problems using different mesh sizes. The heat conduction equation was discretized by the finite difference method and the finite element method. The CPU performed well for small problems, but its performance deteriorated for large problems once the total memory required for the computation exceeded the cache memory. In contrast, the GPU performed better as the mesh size increased, owing to latency hiding. Further, GPU computation with the colored Gauss-Seidel solver was approximately 7 times faster than computation on a single CPU. Furthermore, when computing in parallel on the GPU, the colored Gauss-Seidel solver was found to be approximately twice as fast as the Jacobi solver.
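Why coloring makes Gauss-Seidel parallel is easiest to see with the classic two-color (red-black) scheme on a 2-D finite-difference grid. The sketch is a sequential, pure-Python illustration for steady heat conduction (the Laplace equation) with fixed boundary temperatures. It does not reproduce the paper's FEM setup or GPU code, and the grid and boundary values in the usage are made-up examples.

```python
from typing import List

Grid = List[List[float]]


def red_black_gauss_seidel(u: Grid, sweeps: int) -> Grid:
    """Solve the 2-D steady heat (Laplace) equation in place on a uniform grid.

    Boundary rows/columns of u hold fixed (Dirichlet) temperatures. Interior points
    are colored like a checkerboard by (i + j) % 2. With the 5-point stencil every
    neighbor of a red point is black and vice versa, so all points of one color can
    be updated independently (in parallel) before moving to the other color.
    """
    rows, cols = len(u), len(u[0])
    for _ in range(sweeps):
        for color in (0, 1):
            for i in range(1, rows - 1):
                for j in range(1, cols - 1):
                    if (i + j) % 2 == color:
                        u[i][j] = 0.25 * (u[i - 1][j] + u[i + 1][j]
                                          + u[i][j - 1] + u[i][j + 1])
    return u


if __name__ == "__main__":
    n = 5
    grid = [[0.0] * n for _ in range(n)]
    grid[0] = [1.0] * n  # top edge held at 1, other edges at 0
    red_black_gauss_seidel(grid, sweeps=200)
    print(round(grid[2][2], 6))  # 0.25 by symmetry
```

On a GPU, each inner double loop over one color would become a single kernel launch with one thread per point. Plain Gauss-Seidel cannot be split this way because each update depends on the one just before it.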

Optimizing LRU Lock Management in the Linux Kernel for Improving Parallel Write Throughout in Many-Core CPU Systems (매니코어 CPU 시스템의 병렬 쓰기 성능 향상을 위한 리눅스 커널의 LRU 관리 최적화 기법)

  • Eun-Kyu Byun;Gibeom Gu;Kwang-Jin Oh;Jiwoo Bang
    • KIPS Transactions on Computer and Communication Systems / v.12 no.7 / pp.209-216 / 2023
  • Modern HPC systems are equipped with many-core CPUs with dozens of cores. When parallel I/O is performed on such a system, scalability is limited by the LRU lock management policy of the Linux kernel. This study proposes FinerLRU to solve this problem. FinerLRU improves the parallel write performance of file systems that use the buffer cache through finer-grained lock management, increasing the number of LRU locks up to the number of cores. The proposed method was implemented in Linux 5.18.11, and its performance was measured on two types of CPUs with different characteristics, Intel Ice Lake Xeon and Intel Knights Landing. A performance improvement of about two times was obtained on both systems.
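The general technique behind finer-grained LRU locking is lock striping: several independent LRU lists, each with its own lock, replace one list under one global lock. The sketch below illustrates only that structure. It is not FinerLRU or kernel code, and choosing the stripe by key hash is an assumption (the abstract only says the number of locks grows up to the number of cores). Because of Python's GIL, it shows the locking layout, not a speedup.

```python
import threading
from collections import OrderedDict
from typing import Hashable, List, Optional


class StripedLRU:
    """Hypothetical lock-striped LRU: `num_stripes` independent LRU lists, each
    guarded by its own lock, so threads touching different stripes don't contend."""

    def __init__(self, num_stripes: int, capacity_per_stripe: int):
        self.lists: List[OrderedDict] = [OrderedDict() for _ in range(num_stripes)]
        self.locks = [threading.Lock() for _ in range(num_stripes)]
        self.capacity = capacity_per_stripe

    def _stripe(self, key: Hashable) -> int:
        return hash(key) % len(self.lists)  # assumption: stripe chosen by key hash

    def access(self, key: Hashable) -> Optional[Hashable]:
        """Mark key most recently used; return the key evicted from its stripe, if any."""
        i = self._stripe(key)
        with self.locks[i]:  # only this stripe is locked
            lru = self.lists[i]
            if key in lru:
                lru.move_to_end(key)
                return None
            lru[key] = True
            if len(lru) > self.capacity:
                evicted, _ = lru.popitem(last=False)  # least recently used in this stripe
                return evicted
            return None


if __name__ == "__main__":
    cache = StripedLRU(num_stripes=4, capacity_per_stripe=100)

    def writer(start: int) -> None:
        for page in range(start, start + 50):
            cache.access(page)

    threads = [threading.Thread(target=writer, args=(k * 50,)) for k in range(4)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    print(sum(len(lru) for lru in cache.lists))  # 200
```

The cost of striping is that eviction is only LRU within a stripe, an approximation of one global LRU order.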

A Performance Study on CPU-GPU Data Transfers of Unified Memory Device (통합메모리 장치에서 CPU-GPU 데이터 전송성능 연구)

  • Kwon, Oh-Kyoung;Gu, Gibeom
    • KIPS Transactions on Computer and Communication Systems / v.11 no.5 / pp.133-138 / 2022
  • Recently, as GPU performance has improved in HPC and artificial intelligence, GPUs have come into wider use, but GPU programming is still a major obstacle to productivity. In particular, because host memory and GPU memory are difficult to manage separately, research on both convenience and performance is being actively conducted, and various CPU-GPU memory transfer programming methods have been proposed. Meanwhile, many SoC (System on a Chip) products, such as the Apple M1 and NVIDIA Tegra, that bundle the CPU, GPU, and integrated memory into one large silicon package have recently emerged. In this study, we investigate the performance of CPU-GPU data transfers on such integrated-memory devices, which behave differently from the conventional environment in which the CPU's host memory and the GPU memory are separate. We compare performance across CPU-GPU data transfer methods on NVIDIA SoC chips, which are integrated-memory devices, and on NVIDIA SXM-based V100 GPU devices. As the experimental workload, we used a two-dimensional matrix transposition example, which is frequently used in HPC applications. We analyzed the following performance factors: the difference in GPU kernel performance according to the CPU-GPU memory transfer method on each GPU device, the difference in transfer performance between page-locked (pinned) memory and pageable memory, overall performance, and performance by workload size. This experiment confirmed that the NVIDIA Xavier can maximize the benefits of the integrated memory in the SoC chip by supporting I/O cache coherence.

A Backup-Cache for Leakage-Energy-Reduction and High Performance System (누수에너지 절약과 시스템 성능 향상을 위한 백업 캐시 제안)

  • Choi ByeongChang;Woo JangBok;Suh Hyo-Joong
    • Proceedings of the Korean Information Science Society Conference / 2005.11a / pp.874-876 / 2005
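  • In embedded systems, cache memory not only strongly affects system performance but also accounts for about 50% of total energy consumption, so cache performance and energy consumption are major concerns. As process technology scales down, leakage current accounts for a growing share of cache energy consumption, and various studies aim to reduce static energy consumption. Energy saving and performance improvement are in a trade-off relationship, so it is difficult to achieve both goals at once. In this paper, to improve performance, we use a direct-mapped cache, which has the fastest access time among cache organizations. We propose a backup cache system that uses a fully associative cache to compensate for the drawbacks of the direct-mapped cache. To improve system performance while saving leakage energy in the backup cache, the direct-mapped cache and the fully associative cache are built from SRAM with different threshold voltages. The direct-mapped cache uses low-threshold-voltage SRAM for high performance. The fully associative cache uses high-threshold-voltage SRAM, which is relatively slower than the direct-mapped cache but leaks less energy, and it complements the direct-mapped cache.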
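How a small fully associative backup recovers conflict misses of a direct-mapped cache can be shown with a trace-driven hit counter. The replacement policy here is assumed to be victim-cache style (blocks evicted from the direct-mapped cache move into an LRU backup, and a backup hit swaps the block back). The paper's exact policy may differ, and the threshold-voltage and energy aspects are not modeled.

```python
from collections import OrderedDict
from typing import Iterable, Tuple


def simulate_backup_cache(block_addresses: Iterable[int], dm_lines: int,
                          backup_lines: int) -> Tuple[int, int, int]:
    """Return (direct_mapped_hits, backup_hits, misses) for a block address trace.

    Assumed victim-cache-style policy: a block evicted from the direct-mapped
    cache moves into a small fully associative LRU backup cache; on a backup hit
    the block returns to the direct-mapped cache and the displaced block takes
    its place in the backup.
    """
    dm = [None] * dm_lines
    backup: OrderedDict = OrderedDict()
    dm_hits = backup_hits = misses = 0
    for block in block_addresses:
        idx = block % dm_lines  # direct-mapped: exactly one possible slot
        if dm[idx] == block:
            dm_hits += 1
            continue
        if block in backup:
            backup_hits += 1
            del backup[block]
        else:
            misses += 1
        victim, dm[idx] = dm[idx], block
        if victim is not None and backup_lines > 0:
            backup[victim] = None
            if len(backup) > backup_lines:
                backup.popitem(last=False)  # evict the oldest backup entry
    return dm_hits, backup_hits, misses


if __name__ == "__main__":
    trace = [0, 4] * 3  # blocks 0 and 4 collide in a 4-line direct-mapped cache
    print(simulate_backup_cache(trace, dm_lines=4, backup_lines=0))  # (0, 0, 6)
    print(simulate_backup_cache(trace, dm_lines=4, backup_lines=2))  # (0, 4, 2)
```

Under this policy most accesses still go to the fast direct-mapped array. The slower backup handles only conflict cases, which is why building it from slower, low-leakage SRAM costs little performance.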


Efficient Service Discovery Scheme based on Clustering for Ubiquitous Computing Environments (유비쿼터스 컴퓨팅 환경에서 클러스터링 기반 효율적인 서비스 디스커버리 기법)

  • Kang, Eun-Young
    • The Journal of the Institute of Internet, Broadcasting and Communication / v.9 no.2 / pp.123-128 / 2009
  • In ubiquitous computing environments, service discovery, which searches for available services, is an important issue. In this paper, we propose an efficient service discovery scheme that combines a node-ID-based clustering service discovery scheme with a P2P caching-based information spreading scheme. To find services quickly, the proposed scheme stores key information in neighbors' local caches and uses that information to search for services. It uses neither a central lookup server nor flooding. Simulation shows that the proposed scheme improves response time and network load compared with other methods.
