• Title/Summary/Keyword: Code cache

Search results: 30 (processing time: 0.022 seconds)

Expanding Code Caches for Embedded Java Systems using Client Ahead-Of-Time Compilation (내장형 자바 시스템을 위한 클라이언트 선행 컴파일 기법을 이용한 코드 캐시 확장)

  • Hong, Sung-Hyun; Kim, Jin-Chul; Shin, Jin-Woo; Kwon, Jin-Woo; Lee, Joo-Hwan; Moon, Soo-Mook
    • Journal of KIISE: Computing Practices and Letters, v.16 no.8, pp.868-872, 2010
  • Many embedded Java systems are equipped with limited memory, which constrains the code cache size available for Java just-in-time compilation and thus affects Java performance. This paper proposes expanding the limited code cache when it is full by saving the machine code of some methods in the code cache to a file in permanent storage and reloading it into the code cache when those methods are invoked again. In effect, this applies client ahead-of-time compilation at run time to enlarge the code cache. Our experimental results indicate that the proposed execution method can improve performance by as much as 1.6 times over the conventional method when the code cache size is halved.
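
The save-and-reload mechanism this abstract describes can be pictured with a short C sketch. This is a minimal illustration under stated assumptions, not the paper's implementation: the function names, the flat-file format, and the omission of relocation patching are all assumptions made here.

```c
/* Illustrative sketch: when the code cache is full, persist the machine
 * code of an evicted method to the file system, and map it back as
 * executable memory when the method is invoked again. */
#include <fcntl.h>
#include <stdio.h>
#include <sys/mman.h>
#include <unistd.h>

/* Evict: write one method's compiled code out to permanent storage. */
int evict_method(const char *path, const void *code, size_t len) {
    FILE *f = fopen(path, "wb");
    if (!f) return -1;
    size_t n = fwrite(code, 1, len, f);
    fclose(f);
    return n == len ? 0 : -1;
}

/* Reload: map the saved code back on re-invocation. A real JIT would
 * also patch relocations and flush the instruction cache; omitted. */
void *reload_method(const char *path, size_t len) {
    int fd = open(path, O_RDONLY);
    if (fd < 0) return NULL;
    void *p = mmap(NULL, len, PROT_READ | PROT_EXEC, MAP_PRIVATE, fd, 0);
    close(fd);
    return p == MAP_FAILED ? NULL : p;
}
```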

Acceleration of LU-SGS Code on Latest Microprocessors Considering the Increase of Level 2 Cache Hit-Rate (최신 마이크로프로세서에서 2차 캐쉬 적중률 증가를 고려한 LU-SGS 코드의 가속)

  • Choi, J.Y.; Oh, Se-Jong
    • Journal of the Korean Society for Aeronautical & Space Sciences, v.30 no.7, pp.68-80, 2002
  • An approach for composing a performance-optimized computational code is suggested for the latest microprocessors. The concept of the code optimization, referred to here as localization, is to maximize the utilization of the level-2 cache, which is common to all the latest computer systems, and to minimize access to system main memory. In this study, the localized optimization of an LU-SGS (Lower-Upper Symmetric Gauss-Seidel) code for the solution of fluid dynamic equations was carried out at three different levels and tested on several microprocessor architectures in wide use today. The localized optimization showed a remarkable performance gain, yielding a solution up to 7.35 times faster than the baseline algorithm, depending on the system, while producing exactly the same solution on the same computer system.
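
The abstract does not spell out "localization" in code, but the general technique it points at, blocking a sweep so that each tile of data stays resident in the level-2 cache, can be sketched as follows. The sweep, the array size N, and the block edge B are illustrative assumptions, not the authors' LU-SGS code.

```c
#include <stddef.h>

#define N 2048
#define B 64    /* block edge; would be tuned so a tile fits in L2 */

/* Blocked Gauss-Seidel-style sweep: all work on one B x B tile finishes
 * before the next tile is touched, so each tile is pulled from main
 * memory once instead of once per full-row traversal. */
void sweep_blocked(double a[N][N]) {
    for (size_t ii = 1; ii < N; ii += B)
        for (size_t jj = 1; jj < N; jj += B)
            for (size_t i = ii; i < ii + B && i < N; i++)
                for (size_t j = jj; j < jj + B && j < N; j++)
                    /* update from already-swept west/north neighbours */
                    a[i][j] = 0.5 * (a[i - 1][j] + a[i][j - 1]);
}
```

Because each point depends only on its west and north neighbours, the blocked traversal visits points in a valid order, so the blocked and unblocked sweeps produce exactly the same solution, which is the property the abstract emphasizes.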

Study on the Performance Evaluation and Analysis of Mobile Cache Memory

  • Lee, Sangmin; Kim, Jongwan; Kim, Ji Young; Oh, Dukshin
    • Journal of the Korea Society of Computer and Information, v.25 no.6, pp.99-107, 2020
  • In this paper, we analyze the characteristics of the mobile cache, which is used to improve data access speed when executing applications on mobile devices, and verify the importance of the mobile cache through a cache data access experiment. The mobile device market has grown at a fast pace over the past decade; however, battery limitations as well as size and price considerations restrict the use of fast hardware. Their performance is therefore supplemented with a memory buffer structure such as cache memory. The analysis focuses mainly on cache size, the hierarchical structure of the cache, the cache replacement policy, and the effect these features have on mobile performance. For the experimental data, we applied a data set from a microprocessor system study originally used to test cache performance. In the experimental results, the average data access speed on a mobile device was up to 10 times faster with cache memory than without it. Accordingly, cache memory helps improve the performance of a mobile device with otherwise identical specifications.
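
The kind of experiment summarized here is easy to reproduce in miniature. The toy simulator below models a single direct-mapped cache and reports the hit rate for a synthetic access trace; the cache geometry and the trace are assumptions for illustration, not the data set used in the paper.

```c
#include <stdio.h>

#define LINES 64        /* cache lines in the toy cache */
#define LINE_BYTES 32   /* bytes per line */

static long tags[LINES];

int main(void) {
    for (int i = 0; i < LINES; i++) tags[i] = -1;
    long hits = 0, total = 0;
    /* Synthetic trace: two sequential passes over an 8 KiB array. */
    for (int pass = 0; pass < 2; pass++)
        for (long addr = 0; addr < 8192; addr += 4, total++) {
            long line = addr / LINE_BYTES;
            long set  = line % LINES;   /* direct-mapped: one way per set */
            long tag  = line / LINES;
            if (tags[set] == tag) hits++;   /* served from cache */
            else tags[set] = tag;           /* miss: fill the line */
        }
    printf("hit rate: %.1f%%\n", 100.0 * hits / (double)total);
    return 0;
}
```

Every miss here stands for a slow trip to main memory, so the measured hit rate translates directly into the average-access-speed differences the paper reports.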

Study of a Low-power Error Correction Circuit for Image Processing (L2 캐시 저 전력 영상 처리를 위한 오류 정정 회로 연구)

  • Lee, Sang-Jun; Park, Jong-Su; Jeon, Ho-Yun; Lee, Yong-Surk
    • The Journal of Korean Institute of Communications and Information Sciences, v.33 no.10C, pp.798-804, 2008
  • This paper proposes a low-power circuit for detecting and correcting L2 cache errors when a microprocessor processes image data. SimpleScalar-ARM is used to analyze the data read from and written to the microprocessor's L2 cache during image processing, in terms of access frequency and the variation of each of the 32 bits. The circuit is implemented from an H-matrix that achieves low power consumption by identifying bits with small and large amounts of variation and grouping bits with similar variation. Simulation is performed using HSPICE to compare the power consumption of the proposed circuit with the odd-weight-column code used in conventional microprocessors. The experimental results indicate that the proposed circuit reduces power consumption by 17% compared to the odd-weight-column code.
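
For readers unfamiliar with H-matrix-based checkers: each check bit is the parity of the data bits selected by one row of the H-matrix, so choosing the matrix is choosing which parity trees toggle. The sketch below uses the textbook (12,8) Hamming column assignment for one data byte, purely as an illustration; the paper's low-power H-matrix assigns columns from bit-variation statistics, which is not reproduced here.

```c
#include <stdint.h>
#include <stdio.h>

/* h[i] is row i of the H-matrix: the mask of data bits covered by check
 * bit i. These are the classic (12,8) Hamming masks, not the paper's. */
static const uint8_t h[4] = { 0x5B, 0x6D, 0x8E, 0xF0 };

static uint8_t parity8(uint8_t x) {
    x ^= x >> 4; x ^= x >> 2; x ^= x >> 1;
    return x & 1;
}

/* Each check bit is one XOR tree over its row's data bits; in hardware
 * those XOR trees are where the switching power is spent, which is why
 * the column assignment matters for power. */
uint8_t check_bits(uint8_t data) {
    uint8_t c = 0;
    for (int i = 0; i < 4; i++)
        c |= (uint8_t)(parity8(data & h[i]) << i);
    return c;
}

int main(void) {
    printf("check bits for 0xA7: 0x%X\n", check_bits(0xA7));
    return 0;
}
```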

PERFORMANCE ANALYSIS OF THE PARALLEL CUPID CODE IN DISTRIBUTED MEMORY SYSTEM BASED ETHERNET AND INFINIBAND NETWORK (이더넷과 인피니밴드 네트워크 기반의 분산 메모리 시스템에서 병렬성능 분석)

  • Jeon, B.J.; Choi, H.G.
    • Journal of Computational Fluids Engineering, v.19 no.2, pp.24-29, 2014
  • In this study, the parallel performance of the CUPID code has been investigated on both Ethernet- and Infiniband-based distributed memory systems to examine the effects of cache memory and network speed. The bi-conjugate gradient solver of the CUPID code was parallelized using a domain decomposition method and the message passing interface (MPI). It is shown that the parallel performance of the Ethernet-based system is worse than that of the Infiniband-based system due to the slower network and smaller cache memory. It is also found that the parallel performance of each system deteriorates for small problems due to communication overhead, with the Infiniband-based system again ahead thanks to its much faster network. For large problems, the parallel performance depends less on the network.
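
The domain-decomposition-plus-MPI structure mentioned in the abstract follows a standard pattern: each rank owns a slab of the grid and swaps one layer of halo cells with its neighbors every iteration. The sketch below is that generic pattern, not the CUPID solver itself; the slab size and tags are placeholders. The per-message latency visible here is the communication overhead that, as the abstract notes, dominates for small problems, especially over Ethernet.

```c
#include <mpi.h>
#include <stdio.h>

#define NLOC 1000               /* interior cells owned by this rank */

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    double u[NLOC + 2] = {0};   /* +2 halo cells at each slab boundary */
    int left = rank - 1, right = rank + 1;

    /* Exchange halo layers with both neighbours; MPI_Sendrecv keeps the
     * chain deadlock-free. Each call pays one network round trip. */
    if (left >= 0)
        MPI_Sendrecv(&u[1], 1, MPI_DOUBLE, left, 0,
                     &u[0], 1, MPI_DOUBLE, left, 0,
                     MPI_COMM_WORLD, MPI_STATUS_IGNORE);
    if (right < size)
        MPI_Sendrecv(&u[NLOC], 1, MPI_DOUBLE, right, 0,
                     &u[NLOC + 1], 1, MPI_DOUBLE, right, 0,
                     MPI_COMM_WORLD, MPI_STATUS_IGNORE);

    if (rank == 0) printf("halo exchange done on %d ranks\n", size);
    MPI_Finalize();
    return 0;
}
```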

Performance of the Finite Difference Method Using Cache and Shared Memory for Massively Parallel Systems (대규모 병렬 시스템에서 캐시와 공유메모리를 이용한 유한 차분법 성능)

  • Kim, Hyun Kyu; Lee, Hyo Jong
    • Journal of the Institute of Electronics and Information Engineers, v.50 no.4, pp.108-116, 2013
  • Many algorithms have been introduced to improve performance on massively parallel systems, which consist of several hundred processors. A typical example is a GPU system with many processors that use shared memory. In the case of image filtering algorithms, which reference neighboring points, shared memory helps improve performance because adjacent pixels are accessed frequently. However, using shared memory requires rewriting existing code and consequently makes it more complex. Recent GPU systems provide both L1 and L2 caches along with shared memory. Since the L1 cache occupies the same on-chip area as the shared memory, a comparable performance improvement can be expected from using the cache. In this paper, the performance of cache and shared memory is compared. In conclusion, the performance of the cache-based algorithm is very similar to that of the shared-memory version, while the code complexity introduced by shared memory is avoided in the cache-based algorithm.
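
The paper's comparison is GPU-specific, but the code-complexity point carries over to a plain C analogue. Below, the first filter simply re-reads neighboring elements and lets the hardware cache absorb the repeated loads; the second stages each tile into a scratch buffer first, in the style of shared-memory code. Both are illustrative sketches with arbitrary sizes, not the paper's kernels.

```c
#include <stddef.h>

#define TILE 128

/* Cache-relying version: a[i-1], a[i], a[i+1] are re-read across
 * iterations and served by the hardware cache; no staging needed. */
void blur_cache(const float *a, float *out, size_t n) {
    for (size_t i = 1; i + 1 < n; i++)
        out[i] = (a[i - 1] + a[i] + a[i + 1]) / 3.0f;
}

/* Staged version, mimicking the shared-memory style: copy each tile
 * (plus a one-element halo) into a scratch buffer, then filter from it.
 * Same result, visibly more code. */
void blur_staged(const float *a, float *out, size_t n) {
    float tile[TILE + 2];
    for (size_t t = 1; t + 1 < n; t += TILE) {
        size_t len = (t + TILE + 1 <= n) ? TILE + 2 : n - t + 1;
        for (size_t k = 0; k < len; k++)
            tile[k] = a[t - 1 + k];
        for (size_t k = 1; k + 1 < len; k++)
            out[t + k - 1] = (tile[k - 1] + tile[k] + tile[k + 1]) / 3.0f;
    }
}
```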

Multicore-Aware Code Co-Positioning to Reduce WCET on Dual-Core Processors with Shared Instruction Caches

  • Ding, Yiqiang; Zhang, Wei
    • Journal of Computing Science and Engineering, v.6 no.1, pp.12-25, 2012
  • For real-time systems it is important to obtain an accurate worst-case execution time (WCET). Furthermore, improving the WCET of applications that run on multicore processors is both significant and challenging, as the WCET can be largely affected by possible inter-core interference in shared resources such as the shared L2 cache. To address this problem, we propose an approach that adopts code positioning to reduce the inter-core L2 cache interference between the real-time threads running on a multicore processor, using several strategies. The worst-case-oriented strategy is designed to decrease the largest WCET among the threads as much as possible. The other two strategies aim at reducing the WCET of each thread by an almost equal percentage or amount. Our experiments indicate that the proposed multicore-aware code positioning approaches not only improve the worst-case performance of the real-time threads but also offer good tradeoffs between efficiency and fairness for threads running on multicore platforms.
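
Why code position affects a shared L2 cache can be seen from the set-index arithmetic alone: two hot loops on different cores interfere only if their addresses map to the same cache sets. The snippet below, with made-up cache geometry and addresses, shows such a conflict; a code-positioning pass would relocate one function so the set indices diverge.

```c
#include <stdint.h>
#include <stdio.h>

#define LINE 64     /* bytes per cache line (assumed) */
#define SETS 4096   /* sets in the shared L2 (assumed) */

/* Which L2 set an address falls into. */
static unsigned l2_set(uintptr_t addr) {
    return (unsigned)((addr / LINE) % SETS);
}

int main(void) {
    uintptr_t thread_a_code = 0x400000;   /* hot loop on core 0 */
    uintptr_t thread_b_code = 0x440000;   /* hot loop on core 1:
                                             exactly one 256 KiB stride
                                             away, so same set */
    printf("A -> set %u, B -> set %u%s\n",
           l2_set(thread_a_code), l2_set(thread_b_code),
           l2_set(thread_a_code) == l2_set(thread_b_code)
               ? "  (conflict: positioning should separate them)" : "");
    return 0;
}
```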

Optimization of LU-SGS Code for the Acceleration on the Modern Microprocessors

  • Jang, Keun-Jin; Kim, Jong-Kwan; Cho, Deok-Rae; Choi, Jeong-Yeol
    • International Journal of Aeronautical and Space Sciences, v.14 no.2, pp.112-121, 2013
  • An approach for composing a performance-optimized computational code is suggested for the latest microprocessors. The concept of the code optimization, termed localization, is to maximize the utilization of the level-2 cache, which is common to all the latest computer systems, and to minimize access to system main memory. In this study, the localized optimization of the LU-SGS (Lower-Upper Symmetric Gauss-Seidel) code for the solution of fluid dynamic equations was carried out at three different levels and tested on several microprocessor architectures in wide use today. The localized optimization showed a remarkable performance gain, yielding a solution more than two times faster than the baseline algorithm while producing exactly the same solution on the same computer system.
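
A blocked-sweep sketch of the localization idea was given after the earlier LU-SGS entry above; the complementary practical knob is the block size, which should be derived from the level-2 cache capacity. The sizing rule below is an assumption for illustration, including the cache size and the number of arrays touched per sweep, and is not taken from the paper.

```c
#include <math.h>
#include <stdio.h>

int main(void) {
    double l2_bytes = 512.0 * 1024;   /* assumed L2 capacity */
    int arrays = 3;                   /* arrays touched per sweep (assumed) */
    /* Largest square tile of doubles such that one tile of every
     * working array fits in L2 at the same time. */
    int b = (int)sqrt(l2_bytes / (arrays * sizeof(double)));
    printf("suggested block edge: %d\n", b);
    return 0;
}
```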

A Study on an Error Correction Code Circuit for a Level-2 Cache of an Embedded Processor (임베디드 프로세서의 L2 캐쉬를 위한 오류 정정 회로에 관한 연구)

  • Kim, Pan-Ki; Jun, Ho-Yoon; Lee, Yong-Surk
    • Journal of the Institute of Electronics Engineers of Korea SD, v.46 no.1, pp.15-23, 2009
  • Microprocessors, which require correct arithmetic operations, have been the subject of in-depth research in relation to soft errors. Among the components of a microprocessor, memory cells are the most vulnerable to soft errors, and when a soft error strikes a memory cell, processes and operations are greatly affected because the cell may hold information and instructions critical to the entire process or operation. If soft errors go undetected, arithmetic operations and processes produce unexpected outcomes without the user noticing. In architectural design, the tool commonly used to detect and correct soft errors is an error checking and correction (ECC) code; the Itanium and IBM PowerPC G5 microprocessors employ Hamming and Hsiao codes in their level-2 caches. Those designs, however, target large server systems and do not consider power consumption. As operating and threshold voltages shrink with the emergence of high-density, low-power embedded microprocessors, there is an urgent need for low-power ECC circuits. In this study, the input and output data of the level-2 cache were analyzed using SimpleScalar-ARM, and a 32-bit H-matrix for the level-2 cache of an embedded microprocessor is proposed. The proposed H-matrix was implemented in the Cadence schematic editor and compared with a modified Hamming code in terms of power consumption using HSPICE. The MiBench programs and a TSMC 0.18 um process were used for verification.
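
To complement the check-bit sketch given for the earlier ECC paper above, the following shows the decode side for the same illustrative (12,8) Hamming layout: recomputed check bits are XORed with the stored ones, and a non-zero syndrome is matched against the H-matrix columns to locate and flip the failed bit. Again, the masks are a textbook assignment, not the proposed 32-bit H-matrix.

```c
#include <stdint.h>
#include <stdio.h>

static const uint8_t h[4] = { 0x5B, 0x6D, 0x8E, 0xF0 };

static uint8_t parity8(uint8_t x) {
    x ^= x >> 4; x ^= x >> 2; x ^= x >> 1;
    return x & 1;
}

static uint8_t check_bits(uint8_t d) {
    uint8_t c = 0;
    for (int i = 0; i < 4; i++)
        c |= (uint8_t)(parity8(d & h[i]) << i);
    return c;
}

/* Single-error correction: a non-zero syndrome equals the H-matrix
 * column of the flipped data bit, which identifies it uniquely. */
uint8_t correct(uint8_t data, uint8_t stored_checks) {
    uint8_t syn = check_bits(data) ^ stored_checks;
    if (syn == 0) return data;                     /* no error */
    for (int j = 0; j < 8; j++) {
        uint8_t col = 0;
        for (int i = 0; i < 4; i++)
            col |= (uint8_t)(((h[i] >> j) & 1) << i);
        if (col == syn) return data ^ (uint8_t)(1u << j);  /* flip back */
    }
    return data;    /* syndrome points at a check bit: data is intact */
}

int main(void) {
    uint8_t d = 0xA7, c = check_bits(d);
    uint8_t bad = d ^ 0x08;                        /* inject one bit flip */
    printf("corrected: 0x%02X (expect 0xA7)\n", correct(bad, c));
    return 0;
}
```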

Performance Analysis of Caching Instructions on SVLIW Processor and VLIW Processor (SVLIW 프로세서와 VLIW 프로세서의 명령어 캐싱에 따른 성능 분석)

  • Ji, Sung-Hyun; Park, No-Kwang; Kim, Suk-Il
    • Journal of IKEEE, v.1 no.1 s.1, pp.101-110, 1997
  • SVLIW processor architectures resolve resource collisions and data dependencies between instructions while scheduling VLIW instructions at run time. As a result, long NOP words can be removed from the object code produced for the processor. Thus, cache misses occur less often on an SVLIW processor than on a VLIW processor with the same cache size. Less frequent cache misses mean less frequent memory accesses, so the total execution cycles needed to complete an application are shortened compared with the VLIW processor. This advantage eventually compensates for the SVLIW processor's longer instruction pipeline. In this paper, we formulate and compare execution cycle models of the two architectures. Simulation results show that the longer the memory access takes on a cache miss, the shorter the total execution cycles of the SVLIW processor become relative to those of the VLIW processor.
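
The abstract's conclusion can be made concrete with a generic cycle model; the notation below is an assumption for illustration, not the paper's formulation.

```latex
% Assumed notation: I = instructions executed, CPI = base cycles per
% instruction, M = instruction-cache misses, t_{miss} = miss penalty,
% \Delta p = extra per-instruction cost of SVLIW run-time scheduling.
T_{\mathrm{VLIW}}  = I \cdot CPI + M_{\mathrm{VLIW}} \cdot t_{\mathrm{miss}}
\qquad
T_{\mathrm{SVLIW}} = I \cdot (CPI + \Delta p) + M_{\mathrm{SVLIW}} \cdot t_{\mathrm{miss}}
```

Since removing NOP words shrinks the object code, M_SVLIW < M_VLIW; T_SVLIW then drops below T_VLIW whenever t_miss exceeds I·Δp / (M_VLIW − M_SVLIW), which is exactly the "longer memory access on a cache miss" condition stated in the abstract.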
