• Title/Summary/Keyword: pipelined broadcast

Low Cost Hardware Engine of Atomic Pipeline Broadcast Based on Processing Node Status (프로세서 노드 상황을 고려하는 저비용 파이프라인 브로드캐스트 하드웨어 엔진)

  • Park, Jongsu
    • Journal of the Korea Institute of Information and Communication Engineering / v.24 no.8 / pp.1109-1112 / 2020
  • This paper presents a low-cost hardware message-passing engine for an enhanced atomic pipelined broadcast that takes processing node status into account. The algorithm modifies the previous atomic pipelined broadcast algorithm to reduce the waiting time before the next broadcast communication. To do so, the processor changes the transmission order of the processing nodes based on the nodes' communication channels. The hardware message-passing engine architecture for the proposed algorithm is also modified so that it can be adopted in multi-core processors. The synthesized logic area of the proposed hardware message-passing engine is about 16% smaller than that of the pre-existing hardware message-passing engine.

Pipelined Broadcast with Enhanced Wormhole Routers (개선된 윔홀 라우터를 이용한 파이프라인 브로드캐스트)

  • Jeon, Min-Soo;Kim, Dong-Seung
    • Journal of KIISE:Computer Systems and Theory / v.29 no.1 / pp.10-15 / 2002
  • This paper proposes the Pipelined Broadcast, which broadcasts a message of size m in O(m+n-1) time on an n-dimensional hypercube. It is based on the replication tree, which is derived from the reachable sets. It greatly improves on the O(m[n/log(n+1)]) time of Ho-Kao's algorithm. Communication in the broadcast uses all-port wormhole routers with message replication capability. The paper presents the algorithm together with performance comparisons against previous schemes in a practical implementation.
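
To get a feel for the two bounds quoted in this abstract, the following minimal sketch evaluates them as if they were exact step counts. That is a simplification, since big-O notation hides constant factors. Reading the square brackets as a ceiling and the logarithm as base 2 are assumptions, and the function names are hypothetical, not taken from the paper.

```python
import math


def pipelined_steps(m: int, n: int) -> int:
    """Evaluate m + n - 1, the bound quoted for the Pipelined Broadcast."""
    return m + n - 1


def ho_kao_steps(m: int, n: int) -> int:
    """Evaluate m * ceil(n / log2(n + 1)), the bound quoted for Ho-Kao.

    Assumption: the brackets in the abstract denote a ceiling and the
    logarithm is base 2; the abstract does not state either.
    """
    return m * math.ceil(n / math.log2(n + 1))


if __name__ == "__main__":
    m = 64  # message size, arbitrary illustrative value
    for n in (3, 7, 10):  # hypercube dimension
        print(n, pipelined_steps(m, n), ho_kao_steps(m, n))
```

Under these assumptions, for m = 64 and n = 7 the expressions evaluate to 70 and 192 respectively, which illustrates why an additive dependence on m and n scales better than a multiplicative one.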

A Design of Pipeline Chain Algorithm Based on Circuit Switching for MPI Broadcast Communication System (MPI 브로드캐스트 통신을 위한 서킷 스위칭 기반의 파이프라인 체인 알고리즘 설계)

  • Yun, Heejun;Chung, Wonyoung;Lee, Yong-Surk
    • The Journal of Korean Institute of Communications and Information Sciences / v.37B no.9 / pp.795-805 / 2012
  • This paper proposes an algorithm and a hardware architecture for broadcast communication, which is the worst bottleneck in multiprocessors that use distributed memory architectures. In conventional systems, the pipelined broadcast algorithm takes advantage of the maximum bandwidth of the communication bus. However, because the pipelined broadcast sends the data divided into many parts, unnecessary synchronization processes are repeated. In this paper, an MPI unit for a pipeline chain algorithm based on circuit switching, which removes the redundant synchronization, was designed, and the proposed architecture was evaluated by modeling it in SystemC. The proposed architecture improved broadcast communication performance by up to 3.3 times over systems using the conventional pipelined broadcast algorithm, and it can use almost the maximum bandwidth of the transmission bus. It was then implemented in Verilog HDL, synthesized with a TSMC 0.18um library, and implemented in a chip. The synthesized design occupies 4,700 gates (2-input NAND gate equivalent), and it accounts for 2.4% of the total area. The proposed architecture improves the overall performance of the MPSoC while occupying a relatively small area.
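
The trade-off described in this abstract can be illustrated with a toy model of the conventional pipelined broadcast on a linear chain of nodes. This is a minimal sketch, not the paper's circuit-switched pipeline chain or its MPI unit. The function name, the one-hop-per-step timing, and the idea of counting one synchronization per segment transfer are all assumptions made for illustration.

```python
def chain_broadcast(num_nodes: int, num_segments: int) -> tuple[int, int]:
    """Simulate a segmented broadcast along a chain: node 0 -> 1 -> ... -> last.

    Returns (steps, transfers). Each transfer is one segment moving one hop;
    in this toy model every transfer stands for one synchronization.
    """
    # received[i] is how many segments node i holds; the root holds all of them.
    received = [num_segments] + [0] * (num_nodes - 1)
    steps = 0
    transfers = 0
    while received[-1] < num_segments:
        # Snapshot the state so a segment advances at most one hop per step.
        before = list(received)
        for i in range(num_nodes - 1):
            if before[i] > before[i + 1]:  # node i has a segment its neighbor lacks
                received[i + 1] += 1
                transfers += 1
        steps += 1
    return steps, transfers


if __name__ == "__main__":
    for segments in (1, 4, 16):
        print(segments, chain_broadcast(num_nodes=8, num_segments=segments))
```

In this model a chain of p nodes delivers k segments in k + p - 2 steps, so finer segmentation keeps more links busy at once, while the number of per-segment transfers grows as k(p - 1). That growth is the kind of repeated synchronization the abstract identifies as redundant.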

Parallel Distributed Implementation of GHT on MPI-based PC Cluster (MPI 기반 PC 클러스터에서 GHT의 병렬 분산 구현)

  • Kim, Yeong-Soo;Kim, Jeong-Sahm;Choi, Heung-Moon
    • Journal of the Institute of Electronics Engineers of Korea CI / v.44 no.3 / pp.81-89 / 2007
  • This paper presents a parallel distributed implementation of the GHT (generalized Hough transform) for fast processing on an MPI-based PC cluster. We aimed for a higher speedup mainly by alleviating the communication overhead through a pipelined broadcast and an accumulator array partitioning strategy, and by overlapping communication and computation over the entire process. Experimental results show that nearly linear speedup is achievable with the proposed method on MPI-based PC clusters connected through a 100Mbps Ethernet switch.

Scalable Broadcast Switch Architecture (가변형 방송 스위치 구조)

  • 정갑중;이범철
    • Proceedings of the Korean Institute of Information and Communication Sciences Conference / 2004.05b / pp.291-294 / 2004
  • In this paper, we consider a broadcast switch architecture for high-performance multicast packet switching. For an input- and output-buffered switch, we propose a new switch architecture that supports high throughput in broadcast packet switching using switch planes of single-input, multiple-output crossbars. The proposed switch architecture has a central arbiter that arbitrates requests from multiple input ports and generates multiple grant signals to multiple output ports in a packet transmission slot. It provides high-speed pipelined arbitration and large-scale switching capacity.

Performance Optimization of Parallel Algorithms

  • Hudik, Martin;Hodon, Michal
    • Journal of Communications and Networks / v.16 no.4 / pp.436-446 / 2014
  • The high intensity of research and modeling in mathematics, physics, biology, and chemistry requires new computing resources. Because of the high computational complexity of such tasks, computing time is long and costly. The most efficient way to increase efficiency is to adopt parallel principles. The purpose of this paper is to present the issue of parallel computing, with emphasis on the analysis of parallel systems and the impact of communication delays on their efficiency and on overall execution time. The paper focuses on finite algorithms for solving systems of linear equations, namely matrix manipulation by the Gauss elimination method (GEM). Algorithms are designed for shared-memory architectures (open multiprocessing, OpenMP), distributed-memory architectures (message passing interface, MPI), and their combination (MPI + OpenMP). The properties of the algorithms were determined analytically and verified experimentally. Conclusions are drawn for theory and practice.
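
For readers unfamiliar with the Gauss elimination method named in this abstract, here is a minimal sequential sketch in pure Python. It is not the paper's OpenMP or MPI implementation. Partial pivoting is an addition made here for numerical robustness, and the function name `gauss_solve` is hypothetical.

```python
def gauss_solve(a, b):
    """Solve a x = b by Gauss elimination with partial pivoting.

    a is a list of n rows of n numbers, b is a list of n numbers.
    """
    n = len(a)
    # Work on an augmented copy [a | b] so the inputs are not modified.
    m = [[float(v) for v in row] + [float(rhs)] for row, rhs in zip(a, b)]

    # Forward elimination: zero out the entries below each pivot.
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        if m[pivot][col] == 0.0:
            raise ValueError("matrix is singular")
        m[col], m[pivot] = m[pivot], m[col]
        # For a fixed pivot the row updates are independent of each other,
        # which is what parallel versions can distribute across workers.
        for row in range(col + 1, n):
            factor = m[row][col] / m[col][col]
            for k in range(col, n + 1):
                m[row][k] -= factor * m[col][k]

    # Back substitution on the resulting upper-triangular system.
    x = [0.0] * n
    for row in range(n - 1, -1, -1):
        known = sum(m[row][k] * x[k] for k in range(row + 1, n))
        x[row] = (m[row][n] - known) / m[row][row]
    return x


if __name__ == "__main__":
    # 2x + y = 3 and x + 3y = 5
    print(gauss_solve([[2, 1], [1, 3]], [3, 5]))
```

The elimination steps for successive pivots must run in order, while the work within each step can be shared out, so communication delays between steps affect the overall execution time that the abstract discusses.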

A full-Hardwired Low-Power MPEG4@SP Video Encoder for Mobile Applications (모바일 향 저전력 동영상 압축을 위한 고집적 MPEG4@SP 동영상 압축기)

  • Shin, Sun Young;Park, Hyun Sang
    • Journal of Broadcast Engineering / v.10 no.3 / pp.392-400 / 2005
  • A highly integrated MPEG-4@SP video compression engine, VideoCore, is proposed for mobile applications. The primary components of video compression require high memory bandwidth because they access external memory frequently. They include motion estimation, motion compensation, quantization, the discrete cosine transform, and variable length coding, among others. The motion estimation processor adopted in VideoCore uses small local memories so that the video compression system accesses external memory as infrequently as possible. The entire video compression system is divided into two distinct sub-systems, the integer-unit motion estimation part and the rest, and both operate concurrently in a pipelined architecture. VideoCore thus enables real-time, high-quality video compression at a relatively low operating frequency.

Parallel Distributed Implementation of GHT on Ethernet Multicluster (이더넷 다중 클러스터에서 GHT의 병렬 분산 구현)

  • Kim, Yeong-Soo;Kim, Myung-Ho;Choi, Heung-Moon
    • Journal of the Institute of Electronics Engineers of Korea CI / v.46 no.3 / pp.96-106 / 2009
  • Extending the scale of distributed processing in a single Ethernet cluster is physically restricted by the maximum number of ports per switch. This paper presents an implementation of an MPI-based multicluster consisting of multiple Ethernet switches for extending the scale of distributed processing, and an asymptotic analysis of the communication overhead through an execution-time analysis model. To determine an optimal task partitioning, we analyzed the processing time for various partitioning schemes, and the AAP (accumulator array partitioning) scheme was finally chosen to minimize the overall communication overhead. The scope of the data partitioned in AAP was modified to fit the increased number of nodes, and a suitable load balancing algorithm was implemented. We tried to alleviate the communication overhead by exploiting the pipelined broadcast and flat-tree based result gathering, and by overlapping communication and computation time. We used the linear pipeline broadcast to reduce the communication overhead in intercluster communication, where clusters are interconnected by a single link. Experimental results show nearly linear speedup for the proposed parallel distributed GHT implemented on an MPI-based Ethernet multicluster with four 100Mbps Ethernet switches and up to 128 Pentium PC nodes.