• Title/Summary/Keyword: NUMA System

Keeping-ownership Cache Replacement Policies for Remote Access Caches of NUMA System (NUMA 시스템에서 소유권에 근거한 원격 캐시 교체 정책)

  • 신숭현;곽종욱;장성태;전주식
    • Journal of KIISE:Computer Systems and Theory / v.31 no.8 / pp.473-486 / 2004
  • NUMA systems place a remote access cache (RAC) in each local node to reduce the overhead of repeated remote memory accesses. The RAC reduces memory latency and network traffic and thereby improves multiprocessor performance. Several cache replacement policies have been proposed to date, including policies for multiprocessor systems. In this paper, we propose a cache replacement policy based on cache line coherence information: a cache line without ownership is replaced before a cache line with ownership. This avoids the overhead of transferring ownership and decreases memory latency. We also propose the “Keeping-Ownership replacement policy with MRU (KOM)” and the “Keeping-Ownership replacement policy with Reference Bit (KORB)” to reduce the penalty of frequently replacing cache lines that lack ownership. We compare these policies with LRU and Pseudo-LRU (PLRU). Simulation shows that KOM outperforms PLRU by 25% and KORB outperforms PLRU by 13%. Although the hardware cost of KOM is very small, its performance is nearly equal to that of LRU.
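
The core idea of the abstract above, evicting lines without ownership before lines with ownership, can be sketched as follows. This is a minimal illustration and not the paper's KOM or KORB policy: the `CacheLine` fields, the `choose_victim` name, and the use of LRU order to break ties within a group are all assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass
class CacheLine:
    tag: int
    owned: bool      # True if this node holds ownership of the line
    last_used: int   # logical timestamp of the most recent access


def choose_victim(cache_set):
    """Pick the line to evict from one cache set.

    Lines without ownership are preferred as victims. Within the chosen
    group, the least recently used line is taken (an assumption of this
    sketch; the abstract does not specify the tie-break).
    """
    unowned = [line for line in cache_set if not line.owned]
    candidates = unowned if unowned else cache_set
    return min(candidates, key=lambda line: line.last_used)


lines = [
    CacheLine(tag=0xA, owned=True, last_used=1),   # oldest overall, but owned
    CacheLine(tag=0xB, owned=False, last_used=5),
    CacheLine(tag=0xC, owned=False, last_used=3),  # oldest line without ownership
    CacheLine(tag=0xD, owned=True, last_used=7),
]
victim = choose_victim(lines)
print(hex(victim.tag))  # prints 0xc
```

Plain LRU would evict line `0xA` here; the ownership-first rule keeps it and evicts `0xC` instead, so no ownership transfer is triggered by the replacement.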

Scratchpad-Memory Management Using NUMA Infrastructure on Linux (Linux 상에서 NUMA 지원을 응용한 스크래치 패드 메모리 관리방법)

  • Park, Byung-Hun;Seo, Dae-Wha
    • Proceedings of the Korea Information Processing Society Conference / 2009.11a / pp.41-42 / 2009
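  • Many embedded SoCs (System-on-Chip) today include on-chip SRAM, that is, scratchpad memory (SPM), to compensate for the drawbacks of cache memory. Unlike cache memory, SPM must by its nature be managed directly by software. This paper proposes an SPM management method for NUMA-capable Linux that is highly portable and simple to implement.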

Simulation-based Design Verification for High-performance Computing System

  • Jeong Taikyeong T.
    • Journal of Korea Multimedia Society / v.8 no.12 / pp.1605-1612 / 2005
  • This paper presents the knowledge and experience we gained from using multiprocessor systems for simulation-based design verification in the study of high-performance computing systems. It also describes a case study of a symmetric multiprocessor (SMP) kernel on a 32-CPU CC-NUMA architecture, using an actual architecture. In CC-NUMA, a high-performance computer system architecture, a small group of CPUs is clustered into a processing node or cluster. Using simulation-based design verification tools, we examine the SMP OS kernel on a CC-NUMA multiprocessor architecture: the kernel accounts for 32% of the total execution time, and remote memory access latency occupies 43% of the OS time. We present our simulation results for the performance of multiprocessor high-performance computing systems.

Load Balancing of Unidirectional Dual-link CC-NUMA System Using Dynamic Routing Method (단방향 이중연결 CC-NUMA 시스템의 동적 부하 대응 경로 설정 기법)

  • Suh Hyo-Joon
    • The KIPS Transactions:PartA / v.12A no.6 s.96 / pp.557-562 / 2005
  • The throughput and latency of the interconnection network are important factors in the performance of multiprocessor systems. The dual-link CC-NUMA architecture using point-to-point unidirectional links is one of the popular structures in high-end commercial systems. Because of its inherent multi-path structure, several paths with the optimal hop count exist between nodes. Furthermore, transaction latency between nodes is affected by congestion on the links along the transaction path, so latency may worsen if transactions create a hot spot on some links. In this paper, I propose a dynamic transaction routing algorithm that keeps link utilization balanced while preserving the optimal path length, and I compare its performance with the fixed-path method on dual-link CC-NUMA systems. With the proposed method, real-time path selection alleviates link competition, and consequently the dynamic routing algorithm performs better. Program-driven simulation results show a 1~10% improvement in the fluctuation of link utilization, a 1~3% improvement in link acquisition, and a 1~6% improvement in system performance.
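
One way to picture load-aware selection among equally short paths is the sketch below. It is only an illustration of the general idea: the `pick_path` function, the representation of a path as a list of links, and the choice of "lowest load on the busiest link" as the selection metric are assumptions, not details taken from the paper.

```python
def pick_path(paths, link_load):
    """Choose a path with the minimum hop count and the least congestion.

    paths:     list of candidate paths, each a list of (src, dst) links
    link_load: current utilization of each link, keyed by (src, dst)
    """
    # Keep only the paths that achieve the optimal (minimum) hop count.
    min_hops = min(len(path) for path in paths)
    shortest = [path for path in paths if len(path) == min_hops]
    # Among those, prefer the path whose busiest link is least loaded
    # (hypothetical metric chosen for this sketch).
    return min(shortest, key=lambda path: max(link_load[link] for link in path))


paths = [
    [("A", "B"), ("B", "D")],              # 2 hops, contains a hot link
    [("A", "C"), ("C", "D")],              # 2 hops, moderately loaded
    [("A", "B"), ("B", "C"), ("C", "D")],  # 3 hops, never chosen
]
link_load = {
    ("A", "B"): 0.9,
    ("B", "D"): 0.2,
    ("A", "C"): 0.4,
    ("C", "D"): 0.5,
    ("B", "C"): 0.1,
}
best = pick_path(paths, link_load)
print(best)  # prints [('A', 'C'), ('C', 'D')]
```

A fixed-path scheme would always send A-to-D traffic over the same route; here the choice moves away from link A-B while it is the hot spot, without ever lengthening the path.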

The Effect of Mesh Reordering on Laplacian Smoothing for Nonuniform Memory Access Architecture-based High Performance Computing Systems (NUMA구조를 가진 고성능 컴퓨팅 시스템에서의 메쉬 재배열의 라플라시안 스무딩에 대한 효과)

  • Kim, Jbium
    • Journal of the Institute of Electronics and Information Engineers / v.51 no.3 / pp.82-88 / 2014
  • We study the effect of mesh reordering on Laplacian smoothing for parallel high-performance computing systems. Specifically, we use the Reverse Cuthill-McKee algorithm to reorder meshes and Laplacian smoothing to improve mesh quality on nonuniform memory access (NUMA) architecture-based parallel high-performance computing systems. We first investigate the effect of mesh reordering on Laplacian smoothing on a single-core system and then extend the idea to NUMA-based high-performance computing systems.
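
The two building blocks named in the abstract above can be sketched in pure Python. This is a simplified illustration, not the paper's implementation: the function names, the adjacency-dictionary mesh representation, and the in-place (Gauss-Seidel style) update are assumptions. The sketch shows only the traversal order; the locality benefit studied in the paper would come from renumbering vertices in array storage, which a dictionary-based example does not capture.

```python
from collections import deque


def reverse_cuthill_mckee(adj):
    """Return a vertex ordering from a basic Reverse Cuthill-McKee pass.

    adj maps each vertex to the list of its neighbours (undirected graph).
    """
    degree = {v: len(ns) for v, ns in adj.items()}
    visited, order = set(), []
    # Start each connected component from an unvisited vertex of low degree.
    for start in sorted(adj, key=lambda v: degree[v]):
        if start in visited:
            continue
        visited.add(start)
        queue = deque([start])
        while queue:
            v = queue.popleft()
            order.append(v)
            # Visit neighbours in order of increasing degree.
            for n in sorted(adj[v], key=lambda u: degree[u]):
                if n not in visited:
                    visited.add(n)
                    queue.append(n)
    return order[::-1]  # reversing the Cuthill-McKee order gives RCM


def laplacian_smooth(coords, adj, fixed, order, iterations=10):
    """Move each free vertex to the average position of its neighbours."""
    coords = dict(coords)
    for _ in range(iterations):
        for v in order:
            if v in fixed or not adj[v]:
                continue
            xs = [coords[n][0] for n in adj[v]]
            ys = [coords[n][1] for n in adj[v]]
            coords[v] = (sum(xs) / len(xs), sum(ys) / len(ys))
    return coords


# Unit square with fixed corners 0..3 and one badly placed interior vertex 4.
adj = {0: [1, 3, 4], 1: [0, 2, 4], 2: [1, 3, 4], 3: [0, 2, 4], 4: [0, 1, 2, 3]}
coords = {0: (0.0, 0.0), 1: (1.0, 0.0), 2: (1.0, 1.0), 3: (0.0, 1.0), 4: (0.9, 0.2)}
order = reverse_cuthill_mckee(adj)
smoothed = laplacian_smooth(coords, adj, fixed={0, 1, 2, 3}, order=order)
print(smoothed[4])  # prints (0.5, 0.5)
```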

Providing scalable single-operating-system NUMA abstraction of physically discrete resources

  • Baik Song An;Myung Hoon Cha;Sang-Min Lee;Won Hyuk Yang;Hong Yeon Kim
    • ETRI Journal / v.46 no.3 / pp.501-512 / 2024
  • With the explosive increase in data produced annually, researchers have been attempting to develop systems that can effectively handle large amounts of data. Single-operating-system (OS) non-uniform memory access (NUMA) abstraction is an important technology that ensures the compatibility of single-node programming interfaces across multiple nodes, owing to its higher cost efficiency compared with scale-up systems. However, existing technologies have not been successful in optimizing user performance. In this paper, we introduce a single-OS NUMA abstraction technology that ensures full compatibility with the existing OS while improving performance at both the hypervisor and guest levels. Benchmark results show that the proposed technique can improve performance by up to 4.74× on average in terms of execution time compared with the existing state-of-the-art open-source technology.

Performance Evaluation for a Multiprocessor Computer System Using a Commercial Workload (상용 작업부하를 이용한 다중프로세서 컴퓨터 시스템 성능 평가)

  • 박진원
    • Journal of the Korea Society for Simulation / v.8 no.1 / pp.35-49 / 1999
  • CC-NUMA-based distributed shared memory is an emerging architecture for multiprocessor computer systems because of its scalability and ease of programming. In this paper, we analyze the performance of a ring-based CC-NUMA multiprocessor computer system using a commercial workload targeted at popular OLTP applications. The characteristics of the commercial workload were obtained from traces collected on real machines. The simulation results show that the bottleneck on the ring can be effectively removed by using a dual-ring structure. We believe our simulation methodology and results will help in designing better multiprocessor computer systems for commercial application domains.

Page replication mechanism using adjustable DELAY counter in NUMA multiprocessors (NUMA 다중처리기에서 조정가능한 지연 카운터를 이용한 페이집 복사 기법)

  • 이종우;조유곤
    • Journal of the Korean Institute of Telematics and Electronics B / v.33B no.6 / pp.23-33 / 1996
  • Exploiting locality of reference in shared-memory NUMA multiprocessors is one of the important problems in parallel processing today. In this paper, we propose a revised hardware reference counter to help the operating system manage locality. In contrast to the previous counter, its value can be adjusted dynamically and periodically to adapt the page replication policy to the various memory reference patterns of processors. We use execution-driven simulation of real applications to evaluate the effectiveness of our adjustable DELAY counter. Our main conclusion is that, with the adjustable DELAY counter, the normalized average memory access costs and their variance become smaller for most applications than with the previous counter, and more robust memory management policies can be provided for operating systems.

Dual Ring CC-NUMA System using Repeater Node (리피터 노드를 장착한 이중 링 CC-NUMA 시스템)

  • 경진미;김인석;김봉준;장성태
    • Proceedings of the Korean Information Science Society Conference / 2002.10c / pp.697-699 / 2002
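  • In a CC-NUMA architecture, accesses to remote memory are structurally unavoidable, so the interconnection network is a major factor in performance. The conventional bus is unsuitable for large-scale systems because of its limited bandwidth and physical scalability. The dual-ring structure, which replaces the bus with high-speed point-to-point links, overcomes these limitations but suffers from increased response latency because packets must pass through many nodes. In this paper, we propose a multi-ring that uses repeater nodes to reduce the latency of request and response packets. In the proposed system, the structure between rings is symmetric, so every ring other than the one issuing the request has the same hop count. Compared with the dual ring, the difference between the maximum and minimum hop counts is smaller and the average hop count is lower, which yields better performance. We also propose a repeater node structure for maintaining this topology, and we use a probability-driven simulator to evaluate performance across repeater node structures and node expansion.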

Efficient Processing of Grouped Aggregation on Non-Uniformed Memory Access Architecture (비균등 메모리 접근 구조에서의 효율적인 그룹화 집단 연산의 처리)

  • Choe, Seongjun;Min, Jun-Ki
    • Database Research / v.34 no.3 / pp.14-27 / 2018
  • Recently, Non-Uniform Memory Access (NUMA) architecture was proposed to alleviate the memory bottleneck problem that occurs in Symmetric Multiprocessing (SMP) architecture. In addition, because the aggregation operator is an important operator that provides properties and summaries of data, its efficiency is crucial to the overall performance of a system. In this paper, we therefore propose an efficient aggregation processing technique for NUMA architecture. The proposed technique consists of a partition phase and a merge phase. In the partition phase, the target relation is partitioned into several partial relations according to the grouping attribute. Since each thread can then process the aggregation operator on its partial relation independently, we prevent remote memory access during the merge phase. Furthermore, in the merge phase, we improve aggregation performance by letting each thread compute the aggregation with a local hash table and by avoiding lock contention when merging the aggregation results generated by all threads into one.
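
The partition-then-merge structure described in the abstract above can be sketched as follows. This is a minimal illustration under stated assumptions: the function names, the use of a SUM aggregate, and hash partitioning on the grouping key are choices made for the example, and Python threads do not model NUMA memory placement, so only the data flow is shown.

```python
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor


def partition(rows, num_parts):
    """Partition phase: route each (key, value) row by its grouping key,
    so that all rows of one group land in the same partition."""
    parts = [[] for _ in range(num_parts)]
    for key, value in rows:
        parts[hash(key) % num_parts].append((key, value))
    return parts


def aggregate_local(part):
    """Each worker sums values per group in its own private hash table."""
    table = defaultdict(int)
    for key, value in part:
        table[key] += value
    return dict(table)


def grouped_sum(rows, num_threads=4):
    parts = partition(rows, num_threads)
    with ThreadPoolExecutor(max_workers=num_threads) as pool:
        partials = list(pool.map(aggregate_local, parts))
    # Merge phase: the partitions hold disjoint groups, so the partial
    # results can be combined without locks and without key collisions.
    result = {}
    for partial in partials:
        result.update(partial)
    return result


rows = [("a", 1), ("b", 2), ("a", 3), ("c", 4), ("b", 5)]
totals = grouped_sum(rows)
print(sorted(totals.items()))  # prints [('a', 4), ('b', 7), ('c', 4)]
```

Because a group never spans two partitions, no worker needs to read or lock another worker's hash table, which is the property the abstract relies on to avoid remote accesses and lock contention.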