• Title/Summary/Keyword: Parallel Overhead

Search Result 157, Processing Time 0.026 seconds

Two-Level Tries: A General Acceleration Structure for Parallel Routing Table Accesses

  • Mingche, Lai;Lei, Gao
    • Journal of Communications and Networks
    • /
    • v.13 no.4
    • /
    • pp.408-417
    • /
    • 2011
  • The stringent performance requirement for the high efficiency of routing protocols on the Internet can be satisfied by exploiting the threaded border gateway protocol (TBGP) on multi-cores, but the state-of-the-art TBGP performance is restricted by a mass of contentions when racing to access the routing table. To this end, the highly-efficient parallel access approach appears to be a promising solution to achieve ultra-high route processing speed. This study proposes a general routing table structure consisting of two-level tries for fast parallel access, and it presents a heuristic-based divide-and-recombine algorithm to solve a mass of contentions, thereby accelerating the parallel route updates of multi-threading and boosting the TBGP performance. As a projected TBGP, this study also modifies the table operations such as insert and lookup, and validates their correctness according to the behaviors of the traditional routing table. Our evaluations on a dual quad-core Xeon server show that the parallel access contentions decrease sharply by 92.5% versus the traditional routing table, and the maximal update time of a thread is reduced by 56.8 % on average with little overhead. The convergence time of update messages are improved by 49.7%.

Multi-core-based Parallel Query of 3D Point Cloud Indexed in Octree (옥트리로 색인한 3차원 포인트 클라우드의 다중코어 기반 병렬 탐색)

  • Han, Soohee
    • Journal of the Korean Society of Surveying, Geodesy, Photogrammetry and Cartography
    • /
    • v.31 no.4
    • /
    • pp.301-310
    • /
    • 2013
  • The aim of the present study is to enhance query speed of large 3D point cloud indexed in octree by parallel query using multi-cores. Especially, it is focused on developing methods of accessing multiple leaf nodes in octree concurrently to query points residing within a radius from a given coordinates. To the end, two parallel query methods are suggested using different strategies to distribute query overheads to each core: one using automatic division of 'for routines' in codes controlled by OpenMP and the other considering spatial division. Approximately 18 million 3D points gathered by a terrestrial laser scanner are indexed in octree and tested in a system with a 8-core CPU to evaluate the performances of a non-parallel and the two parallel methods. In results, the performances of the two parallel methods exceeded non-parallel one by several times and the two parallel rivals showed competing aspects confronting various query radii. Parallel query is expected to be accelerated by anticipated improvements of distribution strategies of query overhead to each core.

An Efficient Distributed Shared Memory System for Parallel GIS (병렬 GIS를 위한 효율적인 분산공유메모리 시스템)

  • Jeong, Sang-Hwa;Ryu, Gwang-Yeol;Go, Yun-Yeong;Gwak, Min-Seok
    • Journal of KIISE:Computing Practices and Letters
    • /
    • v.5 no.6
    • /
    • pp.700-707
    • /
    • 1999
  • 본 논문에서는 GIS 관련 연산을 실시간에 효율적으로 처리하기 위한 분산공유메모리 기반 병렬처리 시스템을 제안한다. 본 논문의 분산공유메모리 시스템은 메시지전달 방식의 분산메모리 MIMD 컴퓨터 상에 소프트웨어 기반 분산공유메모리 모듈을 탑재함으로써 구현되었다. 또한 GIS 연산의 기본이 되는 공간 객체를 공유의 기본 단위로 설정하고, GIS 데이타의 특성을 반영하여 읽기전용 공유데이타 타입을 추가하였으며, 네트워크 오버헤드를 줄이기 위하여 복수의 객체를 한번에 읽어오는 bulk access가 가능하도록 하였다. 본 시스템에서는 GIS 데이타의 효율적인 분배를 위하여 부하균등화 기법으로 guided self scheduling을 사용하였다. 실험결과 본 시스템은 네트워크 캐쉬의 효율적인 활용을 통하여 소프트웨어 기반 분산메모리 시스템의 오버헤드에도 불구하고 MPI 기반 메시지전달 방식에 비하여 향상된 성능을 얻을 수 있었다.Abstract In this paper, we propose a distributed shared memory(DSM) based parallel processing system to process GIS related computations efficiently in real time. The system is based on a software DSM module implemented on top of a distributed MIMD computer. In the DSM system, spatial object, which is a fundamental structure to represent GIS data, is used as a basic unit for sharing, and a read-only shared data type is added to reflect the characteristics of GIS data. In addition, a bulk access to multiple shared data is made possible to reduce the network overhead. A guided self scheduling method is devised for efficient load balancing in distributing GIS data to parallel processors. The experimental results show that the DSM system performs better than an MPI based message-passing system through the efficient utilization of network cache in spite of the system's software overhead.

Development of Real time Air Quality Prediction System

  • Oh, Jai-Ho;Kim, Tae-Kook;Park, Hung-Mok;Kim, Young-Tae
    • Proceedings of the Korean Environmental Sciences Society Conference
    • /
    • 2003.11a
    • /
    • pp.73-78
    • /
    • 2003
  • In this research, we implement Realtime Air Diffusion Prediction System which is a parallel Fortran model running on distributed-memory parallel computers. The system is designed for air diffusion simulations with four-dimensional data assimilation. For regional air quality forecasting a series of dynamic downscaling technique is adopted using the NCAR/Penn. State MM5 model which is an atmospheric model. The realtime initial data have been provided daily from the KMA (Korean Meteorological Administration) global spectral model output. It takes huge resources of computation to get 24 hour air quality forecast with this four step dynamic downscaling (27km, 9km, 3km, and lkm). Parallel implementation of the realtime system is imperative to achieve increased throughput since the realtime system have to be performed which correct timing behavior and the sequential code requires a large amount of CPU time for typical simulations. The parallel system uses MPI (Message Passing Interface), a standard library to support high-level routines for message passing. We validate the parallel model by comparing it with the sequential model. For realtime running, we implement a cluster computer which is a distributed-memory parallel computer that links high-performance PCs with high-speed interconnection networks. We use 32 2-CPU nodes and a Myrinet network for the cluster. Since cluster computers more cost effective than conventional distributed parallel computers, we can build a dedicated realtime computer. The system also includes web based Gill (Graphic User Interface) for convenient system management and performance monitoring so that end-users can restart the system easily when the system faults. Performance of the parallel model is analyzed by comparing its execution time with the sequential model, and by calculating communication overhead and load imbalance, which are common problems in parallel processing. Performance analysis is carried out on our cluster which has 32 2-CPU nodes.

  • PDF

Design and Implementation of Real-Time Parallel Engine for Discrete Event Wargame Simulation (이산사건 워게임 시뮬레이션을 위한 실시간 병렬 엔진의 설계 및 구현)

  • Kim, Jin-Soo;Kim, Dae-Seog;Kim, Jung-Guk;Ryu, Keun-Ho
    • The KIPS Transactions:PartA
    • /
    • v.10A no.2
    • /
    • pp.111-122
    • /
    • 2003
  • Military wargame simulation models must support the HLA in order to facilitate interoperability with other simulations, and using parallel simulation engines offer efficiency in reducing system overhead generated by propelling interoperability. However, legacy military simulation model engines process events using sequential event-driven method. This is due to problems generated by parallel processing such as synchronous reference to global data domains. Additionally. using legacy simulation platforms result in insufficient utilization of multiple CPUs even if a multiple CPU system is under use. Therefore, in this paper, we propose conversing the simulation engine to an object model-based parallel simulation engine to ensure military wargame model's improved system processing capability, synchronous reference to global data domains, external simulation time processing, and the sequence of parallel-processed events during a crash recovery. The converted parallel simulation engine is designed and implemented to enable parallel execution on a multiple CPU system (SMP).

A New Synchronization Scheme for Parallel Processing of Loop with Constant and Variable Dependence Distance (불변 및 가변 종속거리를 갖는 루프의 병렬처리를 위한 새로운 동기화 기법)

  • 이광형;황종선;박두순
    • Journal of the Korean Institute of Telematics and Electronics B
    • /
    • v.32B no.5
    • /
    • pp.693-701
    • /
    • 1995
  • In most application programs, loops usually comprise most of the computation in a program and are the most important source of parallelism. When loops are executed on multiprocessors, the cross iteration data dependences need to be enforced by synchronization between processors. Existing synchronization schemes have been studied mainly on the loop with constant dependence distance. When these schemes are applied to the loop with variable dependence distance, there exists lots of overhead by the use of unnecessary synchronization variables and execution of unuseful synchronization instructions. Even though there exist various variable synchronization schemes, they have a lot of run-time overhead to compute synchronization information. In this paper, we present a new synchronization scheme, Synch-Free/Synch-Hold for managing synchronization efficiently on the loop with constant and variable dependence distance.

  • PDF

Loop Splitting for On-the-fly Race Detection of Sharded-memory Parallel Programs (공유 메모리 병렬 프로그램의 수행중 오류 탐지를 위한 루프 분리)

  • Song, Tae-Seob
    • Journal of the Korea Institute of Information and Communication Engineering
    • /
    • v.16 no.3
    • /
    • pp.391-398
    • /
    • 2012
  • Detecting races is important for debugging shared-memory parallel programs, because the races result in unintended non-deterministic executions of the programs. Previous on-the-fly techniques to detect races in parallel programs with general inter-thread coordination show serious space overhead which depends on the maximum parallelism of the program. Therefore, this paper presents a loop splitting technique for on-the-fly race detection of parallel programs which is more efficient in space complexity than previous techniques. This loop splitting technique is the debugged program which preserves semantics of the original program. Monitering loop splitting program in on-the-fly can detect first races.

Transform Nested Loops into MultiThread in Java Programming Language for Parallel Processing (자바 프로그래밍에서 병렬처리를 위한 중첩 루프 구조의 다중스레드 변환)

  • Hwang, Deuk-Young;Choi, Young-Keun
    • The Transactions of the Korea Information Processing Society
    • /
    • v.5 no.8
    • /
    • pp.1997-2012
    • /
    • 1998
  • It is necessary to find out the parallelism in tlle sequential Java program to execute it on the parallel machine. The loop is a fundamental source to exploit parallelism as it process a large portion of total execution time in sequential Java program on the parallel machine. However, a complete parallel execution can hardly be achieved due to data dependence. This paper proposes the method of exploiting the implicit parallelism by structuring a dependence graph through the analysis of data dependence in the existing Java programming language having a nested loop structure. The parallel code generation method through the restructuring compiler and also the translation method of Java source program into multithread statement. which is supported by the Java programming language itself, are proposed here. The perforance evaluatlun of the program translaed into the thread statement is conducted using the trip cunt of loop and the trip Count of luop and the thread count as parameters The resttucturing compiler provides efficient way of exploiting parallelism by reducing manual overhead conveliing sequential Java program into parallel code. The execution time for the Java program as a result can be reduced un the parallel machine.

  • PDF

A Hardware Barrier Synchronization using Multi -drop Scheme in Parallel Computer Systems (병렬 컴퓨터 시스템에서의 Multi-drop 방식을 사용한 하드웨어 장벽 동기화)

  • Lee, June-Bum;Kim, Sung-Chun
    • Journal of KIISE:Computer Systems and Theory
    • /
    • v.27 no.5
    • /
    • pp.485-495
    • /
    • 2000
  • The parallel computer system that uses parallel program on the application such as a large scale business or complex operation is required. One of crucial operation of parallel computer system is synchronization. A representative method of synchronization is barrier synchronization. A barrier forces all process to wait until all the process reach the barrier and then releases all of the processes. There are software schemes, hardware scheme, or combinations of these mechanism to achieve barrier synchronization which tends to use hardware scheme. Besides, barrier synchronization lets parallel computer system fast because it has fewer start-up overhead. In this paper, we propose a new switch module that can implement fast and fault-tolerant barrier synchronization in hardware scheme. A proposed barrier synchronization is operated not in full-switch-driven method but in processor-driven method. An effective barrier synchronization is executed with inexpensive hardware supports. Therefore, a new proposed hardware barrier synchronization is designed that it is operated in arbitrary network topology. In this paper, we only show comparison of barrier synchronization on Multistage Interconnection Network. This research results in 24.6-24.8% reduced average delay. Through this result, we can expect lower average delay in irregular network.

  • PDF

Parallel Distributed Implementation of GHT on Ethernet Multicluster (이더넷 다중 클러스터에서 GHT의 병렬 분산 구현)

  • Kim, Yeong-Soo;Kim, Myung-Ho;Choi, Heung-Moon
    • Journal of the Institute of Electronics Engineers of Korea CI
    • /
    • v.46 no.3
    • /
    • pp.96-106
    • /
    • 2009
  • Extending the scale of the distributed processing in a single Ethernet cluster is physically restricted by maximum ports per switch. This paper presents an implementation of MPI-based multicluster consisting of multiple Ethernet switches for extending the scale of distributed processing, and a asymptotical analysis for communication overhead through execution-time analysis model. To determine an optimum task partitioning, we analyzed the processing time for various partitioning schemes, and AAP(accumulator array partitioning) scheme was finally chosen to minimize the overall communication overhead. The scope of data partitioned in AAP was modified to fit for incremented nodes, and suitable load balancing algorithm was implemented. We tried to alleviate the communication overhead through exploiting the pipelined broadcast and flat-tree based result gathering, and overlapping of the communication and the computation time. We used the linear pipeline broadcast to reduce the communication overhead in intercluster which is interconnected by a single link. Experimental results shows nearly linear speedup by the proposed parallel distributed GHT implemented on MPI-based Ethernet multicluster with four 100Mbps Ethernet switches and up to 128 nodes of Pentium PC.