Search | Korea Science

Performance and Scalability of OpenMP Programs on Chip-MultiThreading Server (칩 멀티쓰레딩 서버에서 OpenMP 프로그램의 성능과 확장성)

Lee Myung-Ho;Kim Yong-Kyu
- The KIPS Transactions:PartA
- /
- v.13A no.2 s.99
- /
- pp.137-146
- /
- 2006
Shared Memory Multiprocessor (SMP) systems adopting Chip-level MultiThreading (CMT) technology are becoming mainstream servers in commercial applications and High Performance Computining (HPC) applications as well. OpenMP has become the standard paradigm to parallelize applications for SMP mostly because of its ease of use. As the demand for more computing power in HPC applications is growing rapidly, obtaining high performance and scalability for these applications parallelized using OpenMP API's will become more important. In this paper, we study the performance and scalability of HPC applications parallelized using OpenMP, SPEC OMPL (standard OpenMP benchmark suite), on the Sun Fire E25K server which adopts CMT technology. We also study the effect of CMT on SPEC OMPL.
https://doi.org/10.3745/KIPSTA.2006.13A.2.137 인용 PDF KSCI

An Optimized Cache Coherence Protocol in Multiprocessor System Connected by Slotted Ring (슬롯링으로 연결된 다중처리기 시스템에서 최적화된 캐쉬일관성 프로토콜)

Min, Jun-Sik;Chang, Tae-Mu
- The Transactions of the Korea Information Processing Society
- /
- v.7 no.12
- /
- pp.3964-3975
- /
- 2000
There are two policies for maintaining consistency among the multiple processor caches in a multiprocessor system: Write invalidate and Write update. In the write invalidate policy, whenever a processor attempt to write its cached block, it has to invalidate all the same copies of the updated block in the system. As a results of this frequent invalidations, this policy results in high cache miss ratio. On the other hand, the write update policy renew them, instead of invalidating all the same copies. This policy has to transfer the updated contents through interconnection network, whether the updated block is ptivate or not. Therefore the network suffer from heavy transaction traffic. In this paper we present an efficient cache coherence protocol for shared memory multiprocessor system connected by slotted ring. This protocol is based on the write update policy, but the updated contents are transferred only in case of updating the shared block. Otherwise, if the updated block is private, the updated contents are not transferred. We analyze the proposed protocol and enforce simulation to compare it with previous version.
PDF

Design of Real Time Task Scheduling for Line Controller of Continuous Manufacturing Process Automation (연속 공정 자동화를 위한 라인 제어기에서의 실시간 작업 스케쥴링에 관한 연구)

Lee, Joon-Soo;Cho, Young-Jo;Lim, Mee-Seub;Park, Jung-Min;Choy, Ick;Lim, Jun-Hong;Kim, Kwang-Bae
- Proceedings of the KIEE Conference
- /
- 1992.07a
- /
- pp.365-368
- /
- 1992
This paper presents an approach to the design of real time task scheduling for a line controller of continuous manufacturing process automation. The line controller has multiprocessor-based architecture with shared memory and is operated by firmware. This firmware contains menu-driven software supporting real-time database management and fuction-block control language. The multitasking line control processor performs the following three functions: 1) interprets the function block control language by virtue of shared memory in the database; 2) invokes an interupt service routine as required by external hardware; 3) detects errors and notifies the user. We propose real time task scheduling method.
PDF

Performance Analysis of A Distributed Shared Memory Multiprocessor System Using PASEC (PARSEC을 이용한 분산공유메모리 다중프로세서 시스템의 성능분석)

Park, Joon-Seok;Jeon, Chang-Ho
- The Transactions of the Korea Information Processing Society
- /
- v.7 no.10
- /
- pp.3049-3054
- /
- 2000
In this paper, the effects of the hardware components and runtime environments on the overall performance of a distributed shared memory system are analyzed through simulation. In simulation, the system is modeled using PARSE[1.2] closely to the real runtime environment and the 2D FFT is virtually executed on it. The results of simulation show that the minor hardware components such as bus interfaces and local bus of a processor, which are usuallyignored or neglected when analyzing performance. have significant impacts on the overall system performance. Performance variations caused from runtime environments such as loop overhead and code optimuzatio are also analyzed quantitatively.
PDF

Comparative and Combined Performance Studies of OpenMP and MPI Codes (OpenMP와 MPI 코드의 상대적, 혼합적 성능 고찰)

Lee Myung-Ho
- The KIPS Transactions:PartA
- /
- v.13A no.2 s.99
- /
- pp.157-162
- /
- 2006
Recent High Performance Computing (HPC) platforms can be classified as Shared-Memory Multiprocessors (SMP), Massively Parallel Processors (MPP), and Clusters of computing nodes. These platforms are deployed in many scientific and engineering applications which require very high demand on computing power. In order to realize an optimal performance for these applications, it is crucial to find and use the suitable computing platforms and programming paradigms. In this paper, we use SPEC HPC 2002 benchmark suite developed in various parallel programming models (MPI, OpenMP, and hybrid of MPI/OpenMP) to find an optimal computing environments and programming paradigms for them through their performance analyses.
https://doi.org/10.3745/KIPSTA.2006.13A.2.157 인용 PDF KSCI

A Study on Modular Min (Modular MIN에 관한 연구)

장창수;최창훈;유창하
- The Journal of the Korea Contents Association
- /
- v.2 no.2
- /
- pp.103-111
- /
- 2002
In parallel application programs with a localized communication, even if the MINs have lour diameters, overall system performance degrades when compared to the hypercube and tree structure. The reason is that it is impossible for MINs to provide some mechanisms for clustering to exploit the locality of reference. However proposed MIN can be constructed suitable for localized communication by providing the shortcut path and multiple paths inside the processor-memory duster which has frequent data communications. Therefore proposed MIN achieves enhanced performance in parallel application program with a localized communication.
PDF

A Study on Direct Cache-to-Cache Transfer for Hybrid Cache Architecture to Reduce Write Operations (쓰기 횟수 감소를 위한 하이브리드 캐시 구조에서의 캐시간 직접 전송 기법에 대한 연구)

Juhee Choi
- Journal of the Semiconductor & Display Technology
- /
- v.23 no.1
- /
- pp.65-70
- /
- 2024
Direct cache-to-cache transfer has been studied to reduce the latency and bandwidth consumption related to the shared data in multiprocessor system. Even though these studies lead to meaningful results, they assume that caches consist of SRAM. For example, if the system employs the non-volatile memory, the one of the most important parts to consider is to decrease the number of write operations. This paper proposes a hybrid write avoidance cache coherence protocol that considers the hybrid cache architecture. A new state is added to finely control what is stored in the non-volatile memory area, and experimental results showed that the number of writes was reduced by about 36% compared to the existing schemes.
PDF

An Efficient Processor Synchronization Scheme on Shared Memory Multiprocessor (공유메모리 다중처리기에서 효율적인 프로세서 동기화 기법)

윤석한;원철호;김덕진
- Journal of the Korean Institute of Telematics and Electronics B
- /
- v.32B no.5
- /
- pp.683-692
- /
- 1995
Many kinds of large scale multiprocessing and parallel-processing systems have recently been developed. The contention on the shared data caused by multiple processors may degrade system performance. So, processor synchronization has become one of the important issues in these systems. To solve the synchornization issues, a lot of software and hardware schemes based on spin lock have been proposed. Although software schemes are easy to implement, hardware schemes are preferred in many systems to gain optimized performance. This paper proposes an efficient processor synchronization scheme, called QCX,and describes its design considerations, hardware, algorithm, protocol. Also, in this paper, the performance of QCX has been evaluated with QOLB[5] and LBP[7] using a simulation. The simulation, with varying the number of processor and the contention on shared variables, measured the average execution times of a workload. The simulation results show that the performances of QCX is best when practicability is considered. QCX is more efficient than QOLB and LBP in two aspects. First, the hardware of QCX is more simple and cost-effective because the cache structure need not be changed. Secondly, QCX is more general because it uses a generic atomic instruction.
PDF

Directory Cache Coherence Scheme using the Number-Balanced Binary Tree (수 평형 이진트리를 이용한 디렉토리 캐쉬 일관성 유지 기법)

Seo, Dae-Wha
- The Transactions of the Korea Information Processing Society
- /
- v.4 no.3
- /
- pp.821-830
- /
- 1997
The directory-based cache coherence scheme is an attractive approach to solve the caceh coherence problem in a large-scale shared-memory multiprocessor.However, the exsting directory-based schemes have some problens such as the enormous storage overhead for a directory, the long invalidation latency, the heavy network condes-tion, and the low scalability.For resolving these problems, we propose a new directroy- based caceh coherence scheme which is suitable for building scalable, shred-memory multiprocessors.In this scheme, each directory en-try ofr a given memory block is a number-balanced binaty tree(NBBT) stucture.The NBBT has several proper-ties to effciently maintain the directory for the cache consistency such that the shape is unique, the maximum depth is [log$_2$n], and the tree has the minimum number of leaf nodes among the binarry tree with n nodes.Therefore, this scheme can reduce the storage overhead, the network traffic, and the inbalidation latency and can ensutr the high- scalability the large-scale shared-memory multiprocessors.
PDF

Incremental Design of MIN using Unit Module (단위 모듈을 이용한 MIN의 점증적 설계)

Choi, Chang-Hoon;Kim, Sung-Chun
- Journal of KIISE:Computer Systems and Theory
- /
- v.27 no.2
- /
- pp.149-159
- /
- 2000
In this paper, we propose a new class of MIN (Multistage Interconnection Network) called SCMIN(ShortCut MIN) which can form a cheap and efficient packet switching interconnection network. SCMIN satisfies full access capability(FAC) and has multiple redundant paths between processor-memory pairs even though SCMIN is constructed with 2.5N-4 SEs which is far fewer SEs than that of MINs. SCMIN can be constructed suitable for localized communication by providing the shortcut path and multiple paths inside the processor-memory cluster which has frequent data communications. Therefore, SCMIN can be used as an attractive interconnection network for parallel applications with a localized communication pattern in shared-memory multiprocessor systems.
PDF

Search Result 52, Processing Time 0.021 seconds

이메일무단수집거부

이용약관

제 1 장 총칙

제 2 장 이용계약의 체결

제 3 장 계약 당사자의 의무

제 4 장 서비스의 이용

제 5 장 계약 해지 및 이용 제한

제 6 장 손해배상 및 기타사항

Detail Search

Image Search (β)