Integrated Search | Korea Science

Performance Improvement of Parallel Processing System through Runtime Adaptation

박대연;한재선
- 한국정보과학회논문지:시스템및이론
- /
- Vol. 26, No. 7
- /
- pp.752-765
- /
- 1999
On most parallel processing systems, performance parameters are complex and change unpredictably during program execution, so it is difficult for a compiler to determine the optimal values of these parameters at compile time. This paper proposes adaptive execution, in which program or control execution adapts to changes in machine conditions so that overall performance is optimized. Adaptive program execution lets a program adapt itself at run time to obtain the best performance, with the help of the hardware and the compiler; we propose a theoretical methodology and an implementation approach for it. For adaptive control execution, we apply the adaptive scheme to the granularity of sharing (adaptive granularity): at run time, the size of shared data is varied according to how the data are shared among processors in a parallel distributed shared memory system. Adaptive granularity is a communication scheme that transparently integrates bulk transfer into the shared memory paradigm, with a granularity that varies with the sharing behavior. Simulation results show that adaptive granularity improves performance by up to 43% over a hardware implementation of distributed shared memory.

Scheduler for parallel processing with finely grained tasks

Hosoi, Takafumi;Kondoh, Hitoshi;Hara, Shinji
- 제어로봇시스템학회:학술대회논문집
- /
- 제어로봇시스템학회, Proceedings of the 1991 Korean Automatic Control Conference (International Session); KOEX, Seoul; 22-24 Oct. 1991
- /
- pp.1817-1822
- /
- 1991
A method of reducing the overhead caused by processor synchronization and common memory accesses in finely grained tasks is described. We propose a scheduler that takes the preparation time into account during its search in order to minimize redundant accesses to shared memory. Because the suggested hardware (synchronizer) determines the access order of the processors and performs bus arbitration simultaneously, by folding the synchronization process into the bus arbitration process, the synchronization time vanishes. This synchronizer therefore incurs no overhead from processor synchronization [1]. The proposed scheduler algorithm is itself processed in parallel: the processes share the upper bound derived by each search, and the lower bound function is built with the preparation time in mind, in order to eliminate as many searches as possible. Applying the proposed method to a multi-DSP system that calculates inverse dynamics for robot arms showed that the sampling time can be halved compared with the conventional one.

Discolored Metal Pad Image Classification Based on Gabor Texture Features Using GPU

최학남;박은수;김준철;김학일
- 제어로봇시스템학회논문지
- /
- Vol. 15, No. 8
- /
- pp.778-785
- /
- 2009
This paper presents a Gabor texture feature extraction method for classifying discolored metal pad images using a GPU (Graphics Processing Unit). The proposed algorithm extracts texture information using Gabor filters and constructs a pattern map from the extracted information. Finally, the golden pad images are classified using feature vectors extracted from the constructed pattern map. To evaluate the performance of the GPU-based Gabor texture feature extraction algorithm, sequential and OpenMP-parallel CPU versions of the algorithm were also implemented. In addition, the proposed algorithm was implemented using both global memory and shared memory on the GPU. The experimental results demonstrate that the method using shared memory on the GPU provides the best performance. To evaluate the effectiveness of the extracted Gabor texture features, an experimental validation was conducted on a database of 20 metal pad images, and the experiment showed no misclassification.
https://doi.org/10.5302/J.ICROS.2009.15.8.778
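The Gabor filtering step mentioned in the abstract above can be illustrated with a minimal CPU sketch of the standard real-valued Gabor kernel (a Gaussian envelope multiplied by a cosine carrier). The function name `gabor_kernel` and all parameter names are hypothetical; the abstract does not state the filter parameters, kernel sizes, or the GPU implementation used in the paper.

```python
import math


def gabor_kernel(size, wavelength, theta, sigma, gamma=1.0, phase=0.0):
    """Return a size x size real Gabor kernel as a list of rows.

    All parameters are illustrative assumptions, not values from the paper:
    wavelength - period of the cosine carrier, in pixels
    theta      - orientation of the filter, in radians
    sigma      - standard deviation of the Gaussian envelope
    gamma      - spatial aspect ratio of the envelope
    phase      - phase offset of the carrier
    """
    if size % 2 == 0:
        raise ValueError("size must be odd so the kernel has a center pixel")
    half = size // 2
    kernel = []
    for y in range(-half, half + 1):
        row = []
        for x in range(-half, half + 1):
            # Rotate the coordinates into the filter's orientation.
            xr = x * math.cos(theta) + y * math.sin(theta)
            yr = -x * math.sin(theta) + y * math.cos(theta)
            envelope = math.exp(-(xr ** 2 + (gamma * yr) ** 2) / (2 * sigma ** 2))
            carrier = math.cos(2 * math.pi * xr / wavelength + phase)
            row.append(envelope * carrier)
        kernel.append(row)
    return kernel
```

Convolving an image with a bank of such kernels at several orientations and wavelengths yields texture responses. Each output pixel depends only on a small neighborhood, which is what makes the computation a natural fit for GPU parallelism.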

Call-Site Tracing-based Shared Memory Allocator for False Sharing Reduction in DSM Systems

이종우
- 한국정보과학회논문지:시스템및이론
- /
- Vol. 32, No. 7
- /
- pp.349-358
- /
- 2005
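False sharing occurs in shared-memory multiprocessor systems when several processors share the same unit of memory used for coherence maintenance; it contributes nothing to the correctness of memory coherence and is a major factor that only increases its cost. The damage grows as the coherence unit becomes larger. To reduce false sharing in page-based distributed shared memory (DSM) systems, it is essential to predict in advance the characteristics of the objects allocated to shared pages and to prevent objects with different reference patterns from being mixed in a single shared page. This paper presents a call-site-tracing-based false sharing reduction technique (CSTallocator) that traces the location in a parallel application's code from which the shared memory allocator was called and prevents shared objects requested from different call sites from being allocated to the same shared page. CSTallocator is based on the assumption that shared objects requested from different code locations will exhibit different reference patterns. To verify the effectiveness of the technique, we compared it with existing false-sharing-reducing allocation techniques and found that it eliminates considerably more false sharing faults than the existing methods. The experiments used execution-driven simulation based on real parallel applications.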
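The core idea described above, keeping objects requested from different call sites on different shared pages, can be sketched as a toy allocator. This is only an illustration under stated assumptions: the class `CallSiteAllocator`, the explicit `call_site` argument, and the page bookkeeping are hypothetical and are not the CSTallocator implementation. A real allocator would derive the call site from the caller's return address rather than take it as a string.

```python
class CallSiteAllocator:
    """Toy page allocator that never mixes call sites within one page."""

    def __init__(self, page_size=4096):
        self.page_size = page_size
        self.next_page = 0
        # call_site -> [page number, bytes already used in that page]
        self.current = {}

    def allocate(self, call_site, size):
        """Return (page, offset) for an object requested at `call_site`."""
        if size > self.page_size:
            raise ValueError("object larger than a page is not handled here")
        state = self.current.get(call_site)
        if state is None or state[1] + size > self.page_size:
            # Open a fresh page that belongs to this call site only.
            state = [self.next_page, 0]
            self.next_page += 1
            self.current[call_site] = state
        page, offset = state
        state[1] += size
        return page, offset
```

With a 64-byte page, two 16-byte requests from one call site share page 0, while a request from a second call site is placed on page 1. The cost of this separation is internal fragmentation, since partially filled pages are not shared across call sites.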

A Study on N-Channel Data Correlators for Multirate in IMT-2000

김종엽;이선근;김환용
- 대한전자공학회:학술대회논문집
- /
- 대한전자공학회, Proceedings of the 2000 Summer Conference (1)
- /
- pp.49-52
- /
- 2000
Multi-Code CDMA systems, proposed as an effective transmission methodology for IMT-2000 systems, allow higher-rate services under the IS-95 CDMA infrastructure. Multi-Code CDMA systems convert higher-rate data into lower-rate streams by serial-to-parallel conversion and spread the converted data streams with multiple Walsh codes, so the mobile receiver needs multiple Walsh generators and data correlators to demodulate multiple Walsh code channels simultaneously. The number of data correlators therefore increases with the number of traffic channels. In this paper, we propose a new data correlator structure that uses Walsh overlay coding, a shared accumulator, and the FWHT (Fast Walsh-Hadamard Transform) algorithm to reduce the bottleneck caused by the increase in the number of data correlators.
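To show why the FWHT can stand in for a bank of separate correlators, the following minimal sketch correlates a received chip sequence against all Walsh codes of a given length in one pass. It is a textbook butterfly implementation in natural (Hadamard) order, not the hardware structure proposed in the paper; the function names `fwht` and `walsh_code` are illustrative.

```python
def fwht(values):
    """Fast Walsh-Hadamard transform in natural (Hadamard) order.

    Output element k is the correlation of `values` with row k of the
    Sylvester Hadamard matrix, computed in O(n log n) additions.
    """
    n = len(values)
    if n == 0 or n & (n - 1):
        raise ValueError("length must be a power of two")
    out = list(values)
    h = 1
    while h < n:
        for start in range(0, n, 2 * h):
            for i in range(start, start + h):
                a, b = out[i], out[i + h]
                out[i], out[i + h] = a + b, a - b  # butterfly step
        h *= 2
    return out


def walsh_code(index, n):
    """Row `index` of the n x n Sylvester Hadamard matrix as +1/-1 chips."""
    return [1 if bin(index & k).count("1") % 2 == 0 else -1 for k in range(n)]
```

For example, if a symbol +1 is spread with code 3 and a symbol -1 with code 5 (length 8) and the two chip streams are added, `fwht` of the sum is `[0, 0, 0, 8, 0, -8, 0, 0]`: both channels are recovered at once, with no dedicated correlator per code.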

Efficient Implementation of CG and CR Methods for Linear Systems on a Single Processing Node of the HITACHI SR8000

Nishimura, S.;Takahashi, D.;Shigehara, T.;Mizoguchi, H.;Mishima, T.
- 대한전자공학회:학술대회논문집
- /
- 대한전자공학회, 2000 ITC-CSCC -1
- /
- pp.298-301
- /
- 2000
We discuss iterative methods for linear systems on a single processing node of the HITACHI SR8000. Each processing node of the SR8000 is a shared memory parallel computer composed of eight RISC processors with a pseudo-vector facility. We implement highly optimized codes for basic linear operations, including a matrix-vector product, and apply them to the conjugate gradient (CG) and conjugate residual (CR) methods for linear systems. Our tuned codes for both methods achieve nearly 50% of the theoretical peak performance, which is the best possible in the sense that it corresponds to the asymptotic performance of the inner product.
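For readers unfamiliar with the method, here is a minimal, unoptimized sketch of the conjugate gradient iteration for a symmetric positive definite system. It is plain Python for illustration only and bears no relation to the tuned SR8000 codes; the helper names `dot` and `matvec` are illustrative. It does show that each iteration is built from one matrix-vector product and a few inner products and vector updates, the basic operations the abstract refers to.

```python
def dot(u, v):
    """Inner product of two vectors."""
    return sum(a * b for a, b in zip(u, v))


def matvec(A, x):
    """Dense matrix-vector product, A given as a list of rows."""
    return [dot(row, x) for row in A]


def conjugate_gradient(A, b, tol=1e-10, max_iter=1000):
    """Solve A x = b for symmetric positive definite A."""
    x = [0.0] * len(b)
    r = list(b)          # residual b - A x for the initial guess x = 0
    p = list(r)          # first search direction
    rs_old = dot(r, r)
    for _ in range(max_iter):
        if rs_old ** 0.5 < tol:
            break
        Ap = matvec(A, p)
        alpha = rs_old / dot(p, Ap)
        x = [xi + alpha * pi for xi, pi in zip(x, p)]
        r = [ri - alpha * api for ri, api in zip(r, Ap)]
        rs_new = dot(r, r)
        # New direction, conjugate to the previous ones with respect to A.
        p = [ri + (rs_new / rs_old) * pi for ri, pi in zip(r, p)]
        rs_old = rs_new
    return x
```

For `A = [[4, 1], [1, 3]]` and `b = [1, 2]`, the iteration converges to approximately `[1/11, 7/11]`.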

Optimal Redundant Units and Load in Parallel Systems

윤원영;김귀래
- 한국경영과학회지
- /
- Vol. 23, No. 1
- /
- pp.97-107
- /
- 1998
This paper is concerned with a parallel system that sustains a time-independent load and consists of n components with exponential lifetimes. It is assumed that the total load is shared by the working components and that each component failure raises the failure rates of the surviving components according to the relationship between load and failure rate. Among several load-failure rate relationships, the power rule model is considered. System efficiency is measured by the expected profit earned by the system per unit time: a high load yields a high gain but also causes frequent system failures. The goal of the system engineer is to determine the load and the number of redundant units that maximize the expected profit per unit time. First, the system reliability function is obtained and the optimization problem of the load-sharing parallel system is formulated. Given the number of redundant units, the existence of an optimal load is proved analytically; given the load, the optimal number of redundant units is also obtained analytically. The optimal load and number of redundant units are obtained simultaneously by numerical computation. Some numerical examples are studied.
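The load-sharing effect can be made concrete with a small sketch that computes the mean time to system failure. It rests on assumptions not spelled out in the abstract: the load is split equally among working components, a component carrying load x fails at rate a * x**b (one common form of the power rule), and the system fails when the last component fails. The symbols `a` and `b` and the function name are hypothetical, and the paper's profit function is not modeled here.

```python
def mean_time_to_failure(n, total_load, a, b):
    """Expected lifetime of an n-component load-sharing parallel system.

    Assumed model (not taken from the paper): while k components work,
    each carries total_load / k and fails at rate a * (total_load / k) ** b.
    The time until the next failure is then exponential with rate
    k * a * (total_load / k) ** b, and by the memoryless property the
    stage durations simply add up.
    """
    mttf = 0.0
    for k in range(n, 0, -1):
        per_component_rate = a * (total_load / k) ** b
        mttf += 1.0 / (k * per_component_rate)
    return mttf
```

With `b = 1` every stage has the same total failure rate `a * total_load`, so the result reduces to `n / (a * total_load)`. With `b > 1` the later stages, where fewer components carry more load each, become progressively shorter, which is the trade-off between load and redundancy that the optimization in the paper addresses.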

On-the-fly Detection of the First Races for Reducing Bottlenecks by Summary Report Method

김정시;전용기
- 한국정보과학회논문지:시스템및이론
- /
- Vol. 26, No. 9
- /
- pp.1042-1054
- /
- 1999
Detecting races is important for debugging shared-memory parallel programs, because races not only produce erroneous results but also cause unintended nondeterministic executions, which makes debugging difficult. Detecting the first races is especially important, because removing them can make other races disappear. Existing on-the-fly race detection techniques are based on per-access reporting: they check every access by concurrent threads to a shared variable against a single shared structure called the access history, which creates a serious central bottleneck. This bottleneck, however, can be reduced considerably when only the first races need to be detected. This paper presents a new on-the-fly technique that detects the first races with a reduced bottleneck: each access is checked against a private, unshared access history kept per thread, and the shared access history is used only at thread join points, where races are reported. The technique therefore makes on-the-fly race detection more efficient and practical.
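The contrast between per-access checking and join-time summary reporting can be sketched as follows. This is a deliberately simplified illustration: each thread records its accesses in a private history without synchronization, and conflicts are computed only when the threads join. The names `ThreadHistory` and `report_races_at_join` are hypothetical, and the sketch reports every conflicting pair of concurrent sibling threads; it does not implement the paper's criteria for identifying first races.

```python
class ThreadHistory:
    """Private access history of one thread; no locking is needed."""

    def __init__(self, name):
        self.name = name
        self.reads = set()
        self.writes = set()

    def read(self, var):
        self.reads.add(var)

    def write(self, var):
        self.writes.add(var)


def report_races_at_join(histories):
    """Compare the private histories of concurrent threads at a join point.

    Two threads conflict on a variable if at least one of them wrote it
    and the other read or wrote it.
    """
    races = []
    for i, first in enumerate(histories):
        for second in histories[i + 1:]:
            conflicts = (first.writes & (second.reads | second.writes)) | (
                second.writes & first.reads
            )
            for var in sorted(conflicts):
                races.append((var, first.name, second.name))
    return races
```

Because the shared comparison happens once per join rather than once per access, threads do not contend on a shared structure while they run, which is the bottleneck reduction the abstract describes.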

OpenMP Parallel Performance of a CFD Code on Multi-Core Systems

김종관;장근진;김태영;조덕래;김성돈;최정열
- 한국전산유체공학회지
- /
- Vol. 18, No. 1
- /
- pp.83-90
- /
- 2013
OpenMP is becoming more and more useful as a simple parallel processing paradigm in SMP (Shared Memory Multi-Processor) computing environments with the development of multi-core processors. However, very little data is publicly available on OpenMP performance in CFD (Computational Fluid Dynamics). In the present study, a CFD test suite is prepared for the performance evaluation of OpenMP on various multi-core systems. The test suite is composed of two-dimensional numerical simulations of inviscid/viscous and reacting/non-reacting flows using three different levels of grid systems. One to five test runs were carried out on various systems, from dual-core, dual-thread to 16-core, 32-thread systems, by changing the number of threads engaged for each test up to 80. The tests reveal some interesting results, and the lessons learned should be quite helpful for the further use of OpenMP in CFD studies on multi-core processor systems.
https://doi.org/10.6112/kscfe.2013.18.1.083

Adaptive Execution Techniques for Parallel Programs

이재진
- 한국정보과학회논문지:시스템및이론
- /
- Vol. 31, No. 8
- /
- pp.421-431
- /
- 2004
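This paper introduces a method for executing parallel programs that avoids the performance degradation caused by running parallel loops with a small amount of computation in parallel: the performance of each parallel loop is predicted with a performance prediction model at compile time or at run time, and the program is then run using adaptive execution techniques. The performance prediction algorithm and the adaptive execution algorithm are implemented in a compiler preprocessor, which inserts into the original parallel program code that decides, at compile time or at run time, how each parallel loop is executed. We evaluated the proposed method by running five representative scientific numerical parallel benchmark programs written in Fortran77 on a distributed shared memory parallel computer with 32 processors (SGI Origin2000). With the proposed techniques applied, the programs ran 26%, 20%, 16%, and 10% faster than the original parallel programs on 32, 16, 8, and 4 processors, respectively. One of the programs ran more than twice as fast as the original parallel program on 32 processors.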
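The kind of decision that the inserted code makes can be sketched with a very simple cost model: a loop is run in parallel only when its predicted parallel time, including a fixed fork/join overhead, beats its predicted serial time. The model, the function names, and the parameters below are assumptions for illustration; the abstract does not describe the paper's actual performance prediction model.

```python
def predict_serial_time(iterations, cost_per_iteration):
    """Predicted time to run the loop on one processor."""
    return iterations * cost_per_iteration


def predict_parallel_time(iterations, cost_per_iteration, processors, overhead):
    """Predicted time with perfect work division plus fork/join overhead."""
    return overhead + iterations * cost_per_iteration / processors


def should_run_parallel(iterations, cost_per_iteration, processors, overhead):
    """Guard that a preprocessor could insert in front of a parallel loop."""
    serial = predict_serial_time(iterations, cost_per_iteration)
    parallel = predict_parallel_time(
        iterations, cost_per_iteration, processors, overhead
    )
    return parallel < serial
```

With a 1 ms overhead and 1 µs per iteration on 32 processors, a 100-iteration loop is predicted to be faster serially, whereas a loop with one million iterations is predicted to benefit from parallel execution. When the iteration count is known at compile time the guard can be resolved statically; otherwise it is evaluated at run time.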

