• Title/Summary/Keyword: Computation-Communication Overlapping


Overlapping Effects of Circular Shift Communication and Computation (원형 쉬프트 통신의 중첩 효과 분석)

  • Kim, Jung-Hwan;Rho, Jung-Kyu;Song, Ha-Yoon
    • The KIPS Transactions: Part A / v.9A no.2 / pp.197-206 / 2002
  • Many researchers have been interested in optimizing parallel programs through latency hiding, i.e., overlapping communication with computation. We analyzed the overlapping effects of circular shift communication, one of the collective communications frequently used in data-parallel programs. On an Ethernet switch-based cluster system, we measured the portion of the overall circular shift communication period that can be overlapped and the portion that cannot. The result from each platform may be used as input for optimizing compilers. Previous performance models usually have one of two drawbacks: one kind is based only on point-to-point communication, so it is not appropriate for analyzing the overall effects of collective communications; the other kind models the performance of collective communication but not the overlapping effect. In this paper we extend the previous models and analyze the experimental results against the extended model.
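A minimal sketch of the kind of overlap measured in the entry above, written with mpi4py nonblocking point-to-point calls; the array size and the interior/boundary computations are placeholder assumptions, not the authors' benchmark.

```python
# Sketch: overlapping a circular-shift exchange with local computation (mpi4py).
# Sizes and the "interior work" below are illustrative only.
import numpy as np
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank, size = comm.Get_rank(), comm.Get_size()
left, right = (rank - 1) % size, (rank + 1) % size

local = np.random.rand(1_000_000)          # this rank's block of the distributed array
incoming = np.empty_like(local)

# Post the circular shift (send to the right neighbor, receive from the left).
send_req = comm.Isend(local, dest=right, tag=0)
recv_req = comm.Irecv(incoming, source=left, tag=0)

# Overlappable region: work that does not touch the incoming block.
interior_result = np.sum(local * local)

# Non-overlappable region: wait for the shift, then use the received data.
MPI.Request.Waitall([send_req, recv_req])
boundary_result = np.dot(local, incoming)
```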

Computation-Communication Overlapping in AES-CCM Using Thread-Level Parallelism on a Multi-Core Processor (멀티코어 프로세서의 쓰레드-수준 병렬성을 활용한 AES-CCM 계산-통신 중첩화)

  • Lee, Eun-Ji;Lee, Sung-Ju;Chung, Yong-Wha;Lee, Myung-Ho;Min, Byoung-Ki
    • Journal of KIISE: Computing Practices and Letters / v.16 no.8 / pp.863-867 / 2010
  • Multi-core processors are becoming increasingly popular. As they are widely adopted in embedded systems as well as desktop PCs, many multimedia applications are being parallelized on multi-core platforms. However, it is difficult to parallelize applications with inherent data dependencies, such as encryption algorithms for multimedia data. To overcome this limitation, we propose a technique that overlaps computation and communication using an otherwise idle core. In particular, we treat multimedia computation and communication as a pipeline design problem at the application program level, and derive an optimal number of stages for the pipeline.
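The pipeline idea in the entry above can be sketched as a two-stage software pipeline; the encryption and transmission bodies below are placeholders (not AES-CCM or the authors' code), and the bounded queue plays the role of the buffer between cores.

```python
# Two-stage pipeline: one thread "encrypts" block i while another "transmits"
# block i-1, so an otherwise idle core hides the communication time.
import queue
import threading

def encrypt_stage(blocks, out_q):
    for block in blocks:
        ciphertext = bytes(b ^ 0x5A for b in block)   # placeholder for AES-CCM
        out_q.put(ciphertext)
    out_q.put(None)                                    # end-of-stream marker

def send_stage(in_q):
    while (ciphertext := in_q.get()) is not None:
        pass                                           # placeholder for a socket send

blocks = [bytes(range(16))] * 1024
q = queue.Queue(maxsize=8)                             # bounded queue = pipeline depth
t1 = threading.Thread(target=encrypt_stage, args=(blocks, q))
t2 = threading.Thread(target=send_stage, args=(q,))
t1.start(); t2.start(); t1.join(); t2.join()
```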

A Communication and Computation Overlapping Model through Loop Sub-partitioning and Dynamic Scheduling in Data Parallel Programs (데이타 병렬 프로그램에서 루프 세부 분할 및 동적 스케쥴링을 통한 통신과 계산의 중첩 모델)

  • Kim, Jung-Hwan;Han, Sang-Yong;Cho, Seung-Ho;Kim, Heung-Hwan
    • Journal of KIISE: Computer Systems and Theory / v.27 no.1 / pp.23-33 / 2000
  • We propose a model that overlaps communication with computation for efficient communication in the data-parallel programming paradigm. The overlapping model divides a given loop partition into several sub-partitions to obtain computation that can be overlapped with communication. A loop partition sometimes refers to other data partitions, but not all iterations in the loop partition require non-local data. So, a loop partition may be divided into a set of loop iterations that require non-local data and a set of loop iterations that do not. Each loop sub-partition is dynamically scheduled depending on the arrival of the associated message. Experimental results for a few benchmarks on an IBM SP2 show enhanced performance with our overlapping model.

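A minimal sketch of the loop sub-partitioning described in the entry above, under the assumption of a one-element halo stencil (an invented example, not the paper's benchmarks): the local-only sub-partition runs while the non-local value is in flight, and the boundary iteration is scheduled once its message arrives.

```python
# Loop sub-partitioning with dynamic scheduling on message arrival (mpi4py sketch).
import numpy as np
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank, size = comm.Get_rank(), comm.Get_size()
left, right = (rank - 1) % size, (rank + 1) % size

local = np.random.rand(10_000)           # this rank's data partition
halo = np.empty(1)                       # non-local element owned by the left neighbor
result = np.empty_like(local)

recv_req = comm.Irecv(halo, source=left, tag=1)          # request the non-local data
send_req = comm.Isend(local[-1:], dest=right, tag=1)

result[1:] = 0.5 * (local[1:] + local[:-1])   # sub-partition that needs only local data

recv_req.Wait()                               # scheduling point: the message has arrived
result[0] = 0.5 * (local[0] + halo[0])        # sub-partition that needed non-local data
send_req.Wait()
```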

Hybrid All-Reduce Strategy with Layer Overlapping for Reducing Communication Overhead in Distributed Deep Learning (분산 딥러닝에서 통신 오버헤드를 줄이기 위해 레이어를 오버래핑하는 하이브리드 올-리듀스 기법)

  • Kim, Daehyun;Yeo, Sangho;Oh, Sangyoon
    • KIPS Transactions on Computer and Communication Systems / v.10 no.7 / pp.191-198 / 2021
  • Since training datasets have become large and models deeper in pursuit of high accuracy in deep learning, training a deep neural network requires a great deal of computation and takes too long on a single node. Distributed deep learning has therefore been proposed to reduce training time by spreading the computation across multiple nodes. In this study, we propose a hybrid all-reduce strategy that considers the characteristics of each layer, combined with a communication-computation overlapping technique, for gradient synchronization in distributed deep learning. Since a convolution layer has fewer parameters than a fully-connected layer and is located in the front part of the network, only a short overlapping window is available for it; thus, butterfly all-reduce is used to synchronize the convolution layers. The fully-connected layers, on the other hand, are synchronized using ring all-reduce. Empirical results on PyTorch show that the proposed scheme reduces training time by up to 33% compared to the PyTorch baseline.
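A rough sketch of the layer-wise overlap of gradient all-reduce with the backward pass, which is the overlapping half of the scheme above (the butterfly/ring hybrid split is omitted). It assumes PyTorch >= 2.1 for register_post_accumulate_grad_hook and an already initialized default process group (e.g., launched via torchrun); this is essentially what DDP's gradient bucketing automates.

```python
# Launch an async all-reduce for each layer's gradient as soon as it is ready,
# so the reductions overlap with the remainder of the backward pass.
import torch
import torch.distributed as dist

def attach_overlapped_allreduce(model, handles):
    for p in model.parameters():
        def hook(param):
            # Reduction for this layer starts while backward keeps computing
            # gradients for earlier layers.
            handles.append((param, dist.all_reduce(param.grad, async_op=True)))
        p.register_post_accumulate_grad_hook(hook)

model = torch.nn.Sequential(torch.nn.Linear(512, 512), torch.nn.ReLU(),
                            torch.nn.Linear(512, 10))
handles = []                              # clear this list at every iteration in real use
attach_overlapped_allreduce(model, handles)

loss = model(torch.randn(64, 512)).sum()
loss.backward()                           # all-reduces overlap with this call

world = dist.get_world_size()
for param, work in handles:
    work.wait()                           # drain outstanding reductions
    param.grad /= world                   # average before the optimizer step
```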

The Efficient Execution of Functional Language Loops on the Multithreaded Architectures (다중스레드 구조에서 함수 언어 루프의 효과적 실행)

  • Ha, Sang-Ho
    • The Transactions of the Korea Information Processing Society / v.7 no.3 / pp.962-970 / 2000
  • Multithreading is attractive in that it can tolerate memory latency and synchronization delays by effectively overlapping communication with computation. While several compiler techniques have been developed to produce multithreaded code from functional language programs, much work remains to implement loops effectively. Executing loops in a multithreaded style usually incurs overheads that can severely reduce the benefit of multithreading. This paper suggests several architectural and compiler methods that can optimize loop execution under multithreading. We then simulate and analyze them on a matrix multiplication program.

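A loose software illustration (not the paper's compiler or architecture technique) of the latency hiding the entry above targets: while one block of a matrix multiplication is being computed, a helper thread fetches the next block, with the sleep standing in for remote-memory or message latency.

```python
# Blocked matrix multiplication with a prefetch thread hiding fetch latency.
import time
from concurrent.futures import ThreadPoolExecutor
import numpy as np

def fetch_block(k):
    time.sleep(0.01)                  # stand-in for remote-memory / message latency
    return np.random.rand(256, 256)   # the k-th block of the right-hand operand

A = np.random.rand(256, 256)
C = np.zeros((256, 256))
NBLOCKS = 8

with ThreadPoolExecutor(max_workers=1) as pool:
    pending = pool.submit(fetch_block, 0)               # prefetch the first block
    for k in range(NBLOCKS):
        B_k = pending.result()                          # wait only if still in flight
        if k + 1 < NBLOCKS:
            pending = pool.submit(fetch_block, k + 1)   # overlap next fetch with the GEMM
        C += A @ B_k                                    # computation overlapped with the fetch
```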

Virtual Reality Image Shooting for Single Person Broadcasting with Multiple Smartphones

  • Budiman, Sutanto Edward;Lee, Suk-Ho
    • International Journal of Internet, Broadcasting and Communication / v.11 no.2 / pp.43-49 / 2019
  • Nowadays, one-person media broadcasting has become popular, and with its growing popularity, the multimedia techniques that support such broadcasting are also becoming more advanced. One of the most prominent emerging techniques in this field is virtual reality technology, which presents the one-person broadcasting setting as a virtual reality environment. However, since such an environment requires expensive equipment, it is not easy for ordinary individuals to set one up. Therefore, in this paper we propose how to construct virtual reality-like panoramas with multiple smartphones. For this purpose, we designed a special rig that firmly holds eight smartphone cameras with overlapping views of the environment so that panorama stitching becomes possible. To reduce the computational cost, we precomputed the homography matrices and used 1-D pointer structures to store the computed coordinate values.
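A brief sketch of the precomputation idea in the entry above, using standard OpenCV calls; the feature detector, matcher, and canvas size are assumptions, and the per-frame path reduces to a single warp once the homography is cached.

```python
# Estimate each camera's homography once (offline), then only warp per frame.
import cv2
import numpy as np

def precompute_homography(ref_img, src_img):
    orb = cv2.ORB_create()
    kp1, des1 = orb.detectAndCompute(ref_img, None)
    kp2, des2 = orb.detectAndCompute(src_img, None)
    matches = cv2.BFMatcher(cv2.NORM_HAMMING, crossCheck=True).match(des1, des2)
    dst = np.float32([kp1[m.queryIdx].pt for m in matches]).reshape(-1, 1, 2)
    src = np.float32([kp2[m.trainIdx].pt for m in matches]).reshape(-1, 1, 2)
    H, _ = cv2.findHomography(src, dst, cv2.RANSAC)
    return H                                             # reused for every later frame

def warp_frame(frame, H, canvas_size):
    return cv2.warpPerspective(frame, H, canvas_size)    # cheap per-frame step
```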

A linear array SliM-II image processor chip (선형 어레이 SliM-II 이미지 프로세서 칩)

  • 장현만;선우명훈
    • Journal of the Korean Institute of Telematics and Electronics C / v.35C no.2 / pp.29-35 / 1998
  • This paper describes the architecture and design of a SIMD-type parallel image processing chip called SliM-II. The chip has a linear array of 64 processing elements (PEs), operates at 30 MHz in worst-case simulation, and delivers at least 1.92 GIPS. In contrast to existing array processors such as IMAP, MGAP-2, and VIP, each PE has a multiplier, which is quite effective for convolution, template matching, and similar operations. The instruction set can execute an ALU operation, data I/O, and inter-PE communication simultaneously in a single instruction cycle. In addition, during ALU/multiplier operations, SliM-II provides parallel moves between the register file and on-chip memory, as in DSP chips. SliM-II can greatly reduce the inter-PE communication overhead thanks to the idea of sliding, a technique of overlapping inter-PE communication with computation. Moreover, the bandwidth of data I/O and inter-PE communication increases due to bit-parallel data paths. We used the COMPASS™ 3.3 V 0.6 µm standard cell library (v8r4.10). The total number of transistors is about 1.5 million, the core size is 13.2 × 13.0 mm², and the package type is a 208-pin PQ2 (Power Quad 2). The performance evaluation shows that, compared to existing array processors, the proposed architecture gives a significant improvement for algorithms requiring multiplication.

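A purely behavioral sketch of the sliding technique described above, modeling a 1-D convolution on a linear PE array in which each step's shift (the inter-PE communication) is conceptually issued in the same cycle as the multiply-accumulate; this is a simulation for illustration, not the chip's instruction set.

```python
# Each step: all PEs multiply-accumulate the value they hold while the row
# "slides" one PE over, so on the real chip the shift is hidden behind the MAC.
import numpy as np

def sliding_conv1d(row, weights):
    pe_regs = row.copy()                      # one register value per PE
    acc = np.zeros_like(row, dtype=float)     # one accumulator per PE
    for w in weights:
        acc += w * pe_regs                    # MAC executed by all PEs in parallel
        pe_regs = np.roll(pe_regs, -1)        # the slide that overlaps the MAC on-chip
    return acc

row = np.arange(8, dtype=float)
print(sliding_conv1d(row, weights=[0.25, 0.5, 0.25]))
```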

Parallel Distributed Implementation of GHT on MPI-based PC Cluster (MPI 기반 PC 클러스터에서 GHT의 병렬 분산 구현)

  • Kim, Yeong-Soo;Kim, Jeong-Sahm;Choi, Heung-Moon
    • Journal of the Institute of Electronics Engineers of Korea CI / v.44 no.3 / pp.81-89 / 2007
  • This paper presents a parallel distributed implementation of the GHT (generalized Hough transform) for fast processing on an MPI-based PC cluster. We achieve higher speedup mainly by alleviating the communication overhead through a pipelined broadcast and an accumulator array partitioning strategy, and by overlapping communication with computation throughout the entire process. Experimental results show that nearly linear speedup is achievable with the proposed method on MPI-based PC clusters connected through a 100 Mbps Ethernet switch.
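A condensed sketch (not the authors' code) of two of the ideas above: accumulator array partitioning, where each rank votes only into its own band of the accumulator, and overlapping the non-blocking broadcast of the next frame's edge points with the current voting loop. The voting kernel and sizes are placeholders, and the nonblocking broadcast assumes an MPI-3 library.

```python
# Partitioned GHT accumulator with broadcast/compute overlap (mpi4py sketch).
import numpy as np
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank, size = comm.Get_rank(), comm.Get_size()

ROWS, COLS, NUM_EDGES = 240, 240, 5000
rows_per_rank = ROWS // size                  # assumes ROWS is divisible by the comm size
row_lo = rank * rows_per_rank
acc_part = np.zeros((rows_per_rank, COLS), dtype=np.int32)

def vote(edges, acc_part, row_lo):
    # Stand-in for GHT voting: a real implementation indexes via the R-table.
    xs, ys = edges[:, 0], edges[:, 1]
    mask = (xs >= row_lo) & (xs < row_lo + acc_part.shape[0])
    np.add.at(acc_part, (xs[mask] - row_lo, ys[mask]), 1)

current = np.random.randint(0, ROWS, (NUM_EDGES, 2)).astype(np.int32)
comm.Bcast(current, root=0)                   # this frame's edges (root's values win)
nxt = np.random.randint(0, ROWS, (NUM_EDGES, 2)).astype(np.int32)
req = comm.Ibcast(nxt, root=0)                # next frame's broadcast runs in the background

vote(current, acc_part, row_lo)               # computation overlapped with the broadcast
req.Wait()

full = np.empty((ROWS, COLS), dtype=np.int32) if rank == 0 else None
comm.Gather(acc_part, full, root=0)           # gather the partitioned accumulator bands
```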

Straight Line Detection Using PCA and Hough Transform (주성분 분석과 허프 변환을 이용한 직선 검출)

  • Oh, Jeong-su
    • Journal of the Korea Institute of Information and Communication Engineering / v.22 no.2 / pp.227-232 / 2018
  • In the Hough transform, a representative algorithm for straight line detection, the large number of edge pixels generated from noisy or complex images causes an enormous amount of computation and produces pseudo straight lines. This paper proposes a two-step straight line detection algorithm that improves the conventional Hough transform. In the first step, the proposed algorithm divides an image into non-overlapping blocks and, using principal component analysis (PCA), extracts straight-line information from the edge pixels in each block. In the second step, it detects straight lines by applying the Hough transform, restricted to a limited slope range, to the pixels associated with a straight line. Simulation results show that the proposed algorithm reduces the average number of ρ computations by 94.6% and suppresses pseudo straight lines, although some additional computation is needed.
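A compact sketch of the two-step algorithm above with simplified details: per-block PCA of edge-pixel coordinates estimates the dominant line direction, and Hough voting for those pixels is then restricted to a narrow angle band around it. The block size, band width, and accumulator resolution are assumed values, not the paper's settings.

```python
# Block-wise PCA followed by angle-limited Hough voting.
import numpy as np

def block_pca_angle(ys, xs):
    pts = np.stack([xs, ys], axis=1).astype(float)
    pts -= pts.mean(axis=0)
    cov = pts.T @ pts / max(len(pts) - 1, 1)
    vx, vy = np.linalg.eigh(cov)[1][:, -1]        # principal direction of the edge pixels
    return np.arctan2(vy, vx)                     # orientation of the candidate line

def limited_hough(edge_img, block=32, band=np.deg2rad(10), n_theta=180, rho_res=1.0):
    H, W = edge_img.shape
    diag = int(np.hypot(H, W))
    thetas = np.linspace(-np.pi / 2, np.pi / 2, n_theta, endpoint=False)
    acc = np.zeros((2 * diag + 1, n_theta), dtype=np.int32)
    for by in range(0, H, block):
        for bx in range(0, W, block):
            ys, xs = np.nonzero(edge_img[by:by + block, bx:bx + block])
            if len(ys) < 2:
                continue
            normal = block_pca_angle(ys + by, xs + bx) + np.pi / 2   # theta of the normal
            sel = np.abs(np.angle(np.exp(2j * (thetas - normal)))) / 2 < band
            for t_idx in np.nonzero(sel)[0]:      # vote only in the restricted theta band
                rho = (xs + bx) * np.cos(thetas[t_idx]) + (ys + by) * np.sin(thetas[t_idx])
                np.add.at(acc, (np.round(rho / rho_res).astype(int) + diag, t_idx), 1)
    return acc, thetas
```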

Parallel Distributed Implementation of GHT on Ethernet Multicluster (이더넷 다중 클러스터에서 GHT의 병렬 분산 구현)

  • Kim, Yeong-Soo;Kim, Myung-Ho;Choi, Heung-Moon
    • Journal of the Institute of Electronics Engineers of Korea CI / v.46 no.3 / pp.96-106 / 2009
  • Extending the scale of distributed processing within a single Ethernet cluster is physically restricted by the maximum number of ports per switch. This paper presents an implementation of an MPI-based multicluster consisting of multiple Ethernet switches to extend the scale of distributed processing, together with an asymptotic analysis of the communication overhead based on an execution-time model. To determine the optimal task partitioning, we analyzed the processing time of various partitioning schemes, and the AAP (accumulator array partitioning) scheme was chosen to minimize the overall communication overhead. The scope of the data partitioned in AAP was adjusted to the increased number of nodes, and a suitable load-balancing algorithm was implemented. We alleviate the communication overhead by exploiting a pipelined broadcast, flat-tree-based result gathering, and overlapping of communication with computation. We used a linear pipelined broadcast to reduce the communication overhead between clusters, which are interconnected by a single link. Experimental results show nearly linear speedup for the proposed parallel distributed GHT implemented on an MPI-based Ethernet multicluster with four 100 Mbps Ethernet switches and up to 128 Pentium PC nodes.
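A minimal sketch of a linear pipelined broadcast along a node chain, the technique named above for the single inter-cluster link: the root splits the message into chunks, and each node forwards chunk i downstream while it can already be receiving chunk i+1, so later nodes do not wait for the whole message. The chunk count and size are assumptions.

```python
# Linear pipelined broadcast over a chain of ranks (mpi4py sketch).
import numpy as np
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank, size = comm.Get_rank(), comm.Get_size()

CHUNKS, CHUNK_LEN = 16, 4096
data = (np.arange(CHUNKS * CHUNK_LEN, dtype=np.float64) if rank == 0
        else np.empty(CHUNKS * CHUNK_LEN, dtype=np.float64))

send_reqs = []
for i in range(CHUNKS):
    chunk = data[i * CHUNK_LEN:(i + 1) * CHUNK_LEN]
    if rank > 0:
        comm.Recv(chunk, source=rank - 1, tag=i)                   # chunk i from upstream
    if rank < size - 1:
        send_reqs.append(comm.Isend(chunk, dest=rank + 1, tag=i))  # forward downstream
                                                                   # while chunk i+1 arrives
MPI.Request.Waitall(send_reqs)
```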