• 제목/요약/키워드: Memory hierarchy

검색결과 61건 처리시간 0.022초

TP-Sim: 트레이스 기반의 프로세싱 인 메모리 시뮬레이터 (TP-Sim: A Trace-driven Processing-in-Memory Simulator)

  • 김정근
    • 반도체디스플레이기술학회지
    • /
    • 제22권3호
    • /
    • pp.78-83
    • /
    • 2023
  • This paper proposes a lightweight trace-driven Processing-In-Memory (PIM) simulator, TP-Sim. TP-Sim is a General Purpose PIM (GP-PIM) simulator that evaluates various PIM system performance-related metrics. Based on instruction and memory traces extracted from the Intel Pin tool, TP-Sim can replay trace files for multiple models of PIM architectures to compare its performance. To verify the availability of TP-Sim, we estimated three different system configurations on the STREAM benchmark. Compared to the traditional Host CPU-only systems with conventional memory hierarchy, simple GP-PIM architecture achieved better performance; even the Host CPU has the same number of in-order cores. For further study, we also extend TP-Sim as a part of a heterogeneous system simulator that contains CPU, GPGPU, and PIM as its primary and co-processors.

  • PDF

CPU-GPU 메모리 계층을 고려한 고처리율 병렬 KMP 알고리즘 (High Throughput Parallel KMP Algorithm Considering CPU-GPU Memory Hierarchy)

  • 박소은;김대희;이명호;박능수
    • 전기학회논문지
    • /
    • 제67권5호
    • /
    • pp.656-662
    • /
    • 2018
  • Pattern matching algorithm is widely used in many application fields such as bio-informatics, intrusion detection, etc. Among many string matching algorithms, KMP (Knuth-Morris-Pratt) algorithm is commonly used because of its fast execution time when using large texts. However, the processing speed of KMP algorithm is also limited when the text size increases significantly. In this paper, we propose a high throughput parallel KMP algorithm considering CPU-GPU memory hierarchy based on OpenCL in GPGPU (General Purpose computing on Graphic Processing Unit). We focus on the optimization for the allocation of work-times and work-groups, the local memory copy of the pattern data and the failure table, and the overlapping of the data transfer with the string matching operations. The experimental results show that the execution time of the optimized parallel KMP algorithm is about 3.6 times faster than that of the non-optimized parallel KMP algorithm.

Efficient Use of On-chip Memory through Profile-Driven Array Reorganization

  • Cho, Doosan;Youn, Jonghee
    • 대한임베디드공학회논문지
    • /
    • 제6권6호
    • /
    • pp.345-359
    • /
    • 2011
  • In high performance embedded systems, the use of multiple on-chip memories is an essential architectural feature for exploiting inherent parallelism in multimedia applications. This feature allows multiple data accesses to be executed in parallel. However, it remains difficult to effectively exploit of multiple on-chip memories. The successful use of this architecture strongly depends on how to efficiently detect and exploit memory parallelism in target applications. In this paper, we propose a technique based on a linear array access descriptor [1], which is generated from profiled data, to detect and exploit memory parallelism. The proposed technique tackles an array reorganization problem to maximize memory parallelism in multimedia applications. We present preliminary experiments applying the proposed technique onto a representative coarse grained reconfigurable array processor (CGRA) with multimedia kernel codes. Our experimental results demonstrate that our technique optimizes data placement by putting independent data on separate storage. The results exhibit 9.8% higher performance on average compared to the existing method.

프로세싱 인 메모리 시스템에서의 PolyBench 구동에 대한 동작 성능 및 특성 분석과 고찰 (Performance Analysis and Identifying Characteristics of Processing-in-Memory System with Polyhedral Benchmark Suite)

  • 김정근
    • 반도체디스플레이기술학회지
    • /
    • 제22권3호
    • /
    • pp.142-148
    • /
    • 2023
  • In this paper, we identify performance issues in executing compute kernels from PolyBench, which includes compute kernels that are the core computational units of various data-intensive workloads, such as deep learning and data-intensive applications, on Processing-in-Memory (PIM) devices. Therefore, using our in-house simulator, we measured and compared the various performance metrics of workloads based on traditional out-of-order and in-order processors with Processing-in-Memory-based systems. As a result, the PIM-based system improves performance compared to other computing models due to the short-term data reuse characteristic of computational kernels from PolyBench. However, some kernels perform poorly in PIM-based systems without a multi-layer cache hierarchy due to some kernel's long-term data reuse characteristics. Hence, our evaluation and analysis results suggest that further research should consider dynamic and workload pattern adaptive approaches to overcome performance degradation from computational kernels with long-term data reuse characteristics and hidden data locality.

  • PDF

메모리 계층 구조를 사용한 타일 기반 레스터라이져 설계 (A Design of a Tile Based Rasterizer Using Memory Hierarchy Structure)

  • 김도현;곽재창
    • 전기전자학회논문지
    • /
    • 제19권4호
    • /
    • pp.590-595
    • /
    • 2015
  • 본 논문은 타일 기반 레스터라이져에서 연산이 필요하지 않은 하위 계층에 대한 호출을 막아 연산의 효율을 올릴수 있는 계층 구조의 설계를 제안한다. 제안하는 계층 구조는 내외부 판정과 각 하위 계층이 가지는 타일의 최대 좌표값, 최소 좌표값을 이용하여 하위 계층을 3가지 형태로 분류한다. 각 하위 계층이 분류되는 형태에 따라 해당 계층의 연산의 필요 여부를 구분할 수 있으며 연산이 필요하지 않는 하위 계층에 대한 호출을 수행하지 않는 것으로 그래픽 처리과정의 전체 연산량을 줄일 수 있다. 제안하는 구조를 이용하여 하위 계층의 분류를 통해 그래픽 처리의 연산 시간을 줄일 수 있으며 3D 모델을 구성하는 정점의 밀집도가 클수록 높은 효율을 보인다.

Efficient Management of PCM-based Swap Systems with a Small Page Size

  • Park, Yunjoo;Bahn, Hyokyung
    • JSTS:Journal of Semiconductor Technology and Science
    • /
    • 제15권5호
    • /
    • pp.476-484
    • /
    • 2015
  • Due to the recent advances in non-volatile memory technologies such as PCM, a new memory hierarchy of computer systems is expected to appear. In this paper, we explore the performance of PCM-based swap systems and discuss how this system can be managed efficiently. Specifically, we introduce three management techniques. First, we show that the page fault handling time can be reduced by attaching PCM on DIMM slots, thereby eliminating the software stack overhead of block I/O and the context switch time. Second, we show that it is effective to reduce the page size and turn off the read-ahead option under the PCM swap system where the page fault handling time is sufficiently small. Third, we show that the performance is not degraded even with a small DRAM memory under a PCM swap device; this leads to the reduction of DRAM's energy consumption significantly compared to HDD-based swap systems. We expect that the result of this paper will lead to the transition of the legacy swap system structure of "large memory - slow swap" to a new paradigm of "small memory - fast swap."

그래프 프로세싱을 위한 GRU 기반 프리페칭 (Gated Recurrent Unit based Prefetching for Graph Processing)

  • 시바니 자드하브;파만 울라;나정은;윤수경
    • 반도체디스플레이기술학회지
    • /
    • 제22권2호
    • /
    • pp.6-10
    • /
    • 2023
  • High-potential data can be predicted and stored in the cache to prevent cache misses, thus reducing the processor's request and wait times. As a result, the processor can work non-stop, hiding memory latency. By utilizing the temporal/spatial locality of memory access, the prefetcher introduced to improve the performance of these computers predicts the following memory address will be accessed. We propose a prefetcher that applies the GRU model, which is advantageous for handling time series data. Display the currently accessed address in binary and use it as training data to train the Gated Recurrent Unit model based on the difference (delta) between consecutive memory accesses. Finally, using a GRU model with learned memory access patterns, the proposed data prefetcher predicts the memory address to be accessed next. We have compared the model with the multi-layer perceptron, but our prefetcher showed better results than the Multi-Layer Perceptron.

  • PDF

LDF-CLOCK: The Least-Dirty-First CLOCK Replacement Policy for PCM-based Swap Devices

  • Yoo, Seunghoon;Lee, Eunji;Bahn, Hyokyung
    • JSTS:Journal of Semiconductor Technology and Science
    • /
    • 제15권1호
    • /
    • pp.68-76
    • /
    • 2015
  • Phase-change memory (PCM) is a promising technology that is anticipated to be used in the memory hierarchy of future computer systems. However, its access time is relatively slower than DRAM and it has limited endurance cycle. Due to this reason, PCM is being considered as a high-speed storage medium (like swap device) or long-latency memory. In this paper, we adopt PCM as a virtual memory swap device and present a new page replacement policy that considers the characteristics of PCM. Specifically, we aim to reduce the write traffic to PCM by considering the dirtiness of pages when making a replacement decision. The proposed replacement policy tracks the dirtiness of a page at the granularity of a sub-page and replaces the least dirty page among pages not recently used. Experimental results with various workloads show that the proposed policy reduces the amount of data written to PCM by 22.9% on average and up to 73.7% compared to CLOCK. It also extends the lifespan of PCM by 49.0% and reduces the energy consumption of PCM by 3.0% on average.

데이터 망각을 활용한 비휘발성 메모리 기반 파일 캐시 관리 기법 (Forgetting based File Cache Management Scheme for Non-Volatile Memory)

  • 강동우;최종무
    • 정보과학회 논문지
    • /
    • 제42권8호
    • /
    • pp.972-978
    • /
    • 2015
  • 비휘발성 메모리는 바이트 단위 접근과 비휘발성을 지원한다. 이러한 특성들은 비휘발성 메모리를 캐시, 메모리, 디스크와 같은 메모리 계층 구조 가운데 하나의 영역으로 사용을 가능케 한다. 비휘발성 메모리의 흥미로운 특성은 데이터 보존 기간이 실제로는 제한적인 기간을 가지고 있다는 것이다. 게다가 데이터 보존 기간과 쓰기 지연간의 트레이드오프가 존재 한다. 본 논문에서는 이를 활용하여 비휘발성 메모리를 파일 캐시로 사용하는 새로운 관리 기법을 제안한다. 제안하는 기법은 기존의 캐시 관리 기법과는 반대로 짧은 데이터 보존 시간으로 데이터를 저장하고 쓰기 성능을 개선한다. 제안하는 기법은 LRU 대비 평균 접근 지연 시간을 최대 31%, 평균 24.4%로 감소시킴을 보인다.

웹 서버 클러스터에서 사용자 응답시간 개선을 위한 메모리 관리 (Memory Management for Improving User Response Time in Web Server Clusters)

  • 정지영;김성수
    • 한국정보과학회논문지:시스템및이론
    • /
    • 제28권9호
    • /
    • pp.434-441
    • /
    • 2001
  • 클러스터 시스템의 각 노드에 존재하는 메모리들을 효율적으로 관리하기 위하여 네트워크 메로리의 개념이 등장하였으며 빈번하게 디스크를 접근하는 응용분야에서 속도 향상을 위해 사용될 수 있다. 이는 전통적인 메모리 계층(hierarchy) 구조인 메로리와 디스크 사이에 네트워크 메모리를 추가함으로써 얻어진다. 본 논문에서는 웹 서버 클러스터를 대상으로 문서의 접근 유형에 대한 사전 정보를 요구하지 않고 실제적으로 구현 가능하며 다양한 웹 문서 접근 확률 분포 값에 대하여 우수한 사용자 응답시간을 가지는 메로리 관리 기법을 제안하고 다양한 시뮬레이션을 통해 제안된 방식의 우수성을 검증하였다.

  • PDF