• Title/Summary/Keyword: computation-intensive

Distributed In-Memory Caching Method for ML Workload in Kubernetes (쿠버네티스에서 ML 워크로드를 위한 분산 인-메모리 캐싱 방법)

  • Dong-Hyeon Youn;Seokil Song
    • Journal of Platform Technology / v.11 no.4 / pp.71-79 / 2023
  • In this paper, we analyze the characteristics of machine learning workloads and, based on that analysis, propose a distributed in-memory caching technique to improve their performance. The core of a machine learning workload is model training, which is a computationally intensive task. Running machine learning workloads in a Kubernetes-based cloud environment in which the computing framework and storage are separated allows resources to be allocated effectively, but delays can occur because I/O must be performed over the network. The proposed technique targets workloads running in such an environment. In particular, we propose a new method that precaches the data required by machine learning workloads into the distributed in-memory cache by taking into account Kubeflow Pipelines, a Kubernetes-based machine learning pipeline management tool.

On the computation of low-subsonic turbulent pipe flow noise with a hybrid LES/LPCE method

  • Hwang, Seungtae;Moon, Young J.
    • International Journal of Aeronautical and Space Sciences / v.18 no.1 / pp.48-55 / 2017
  • Aeroacoustic computation of a fully developed turbulent pipe flow at $Re_{\tau}=175$ and $M = 0.1$ is conducted with an LES/LPCE hybrid method. The generation and propagation of acoustic waves are computed by solving the linearized perturbed compressible equations (LPCE), with the acoustic source $DP(x,t)/Dt$ obtained from the incompressible large eddy simulation (LES). The computed acoustic power spectral density is compared closely with the wall shear-stress dipole source of a turbulent channel flow at $Re_{\tau}=175$. A constant decay rate of the acoustic power spectrum, $f^{-8/5}$, is found to be related to the turbulent bursts of correlated longitudinal structures such as hairpin vortices and their merged structures (hairpin packets). The power spectra of the streamwise velocity fluctuations across the turbulent boundary layer indicate that the most intensive noise, at $\omega^+ < 0.1$, is produced in the buffer layer by fluctuations of the longitudinal structures ($k_z R < 1.5$).

A fast full search algorithm for multiple reference image motion estimation (다중 참조 영상 움직임 추정을 위한 고속 전역탐색법)

  • Kang Hyun-Soo;Park Seong-Mo
    • Journal of the Institute of Electronics Engineers of Korea SP / v.43 no.1 s.307 / pp.1-8 / 2006
  • This paper presents a fast full search algorithm for motion estimation applicable to multiple reference images. The proposed method extends the rate-constrained successive elimination algorithm (RSEA) to multiple reference frame applications. We show that motion estimation for the reference images temporally preceding the first reference image can be less computationally intensive than that for the first reference image. To reduce computation, we derive a new condition that leads to a smaller number of candidate blocks for the best-matched block. Simulation results show that our method reduces computational complexity while achieving the same quality as RSEA.
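
The elimination idea behind this family of methods can be sketched as follows. This is a minimal illustration of the basic successive elimination test, not the paper's RSEA extension: a candidate block is skipped whenever the absolute difference of the block sums, which is a lower bound on the SAD, already reaches the best SAD found so far. The function names, the single reference frame, and the absence of a rate term are all assumptions made for the example.

```python
def sad(block_a, block_b):
    """Sum of absolute differences between two equally sized blocks."""
    return sum(abs(x - y)
               for row_a, row_b in zip(block_a, block_b)
               for x, y in zip(row_a, row_b))


def block_at(frame, top, left, size):
    """Extract a size x size block whose top-left corner is (top, left)."""
    return [row[left:left + size] for row in frame[top:top + size]]


def sea_full_search(cur_block, ref_frame, size):
    """Full search with successive elimination (hypothetical helper).

    Returns (best position, best SAD, number of SADs actually computed).
    Block sums are recomputed per candidate for clarity; a real encoder
    would maintain them incrementally.
    """
    cur_sum = sum(map(sum, cur_block))
    best_sad, best_pos, sad_evaluations = None, None, 0
    for top in range(len(ref_frame) - size + 1):
        for left in range(len(ref_frame[0]) - size + 1):
            cand = block_at(ref_frame, top, left, size)
            cand_sum = sum(map(sum, cand))
            # |sum(cur) - sum(cand)| <= SAD(cur, cand), so a candidate
            # whose bound reaches the current best cannot improve on it.
            if best_sad is not None and abs(cur_sum - cand_sum) >= best_sad:
                continue
            sad_evaluations += 1
            score = sad(cur_block, cand)
            if best_sad is None or score < best_sad:
                best_sad, best_pos = score, (top, left)
    return best_pos, best_sad, sad_evaluations
```

Because the test only discards candidates that cannot beat the current best, the minimum SAD is the same as that of an exhaustive search; only the number of full SAD evaluations changes.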

A Scalable Heuristic for Pickup-and-Delivery of Splittable Loads and Its Application to Military Cargo-Plane Routing

  • Park, Myoung-Ju;Lee, Moon-Gul
    • Management Science and Financial Engineering / v.18 no.1 / pp.27-37 / 2012
  • This paper is motivated by a military cargo-plane routing problem, which is a pickup-and-delivery problem in which load splits and node revisits are allowed (PDPLS). Although this recent evolution of the VRP model enhances the efficiency of routing, finding a solution method is more of a challenge because node revisits entail closed walks in modeling vehicle routes. For such a case, even a compact IP formulation is not available, and an effective method had been lacking until Nowak et al. (2008b) proposed a heuristic based on tabu search. Their method provides very reasonable solutions, as demonstrated by the experiments not only in their paper (Nowak et al., 2008b) but also in ours. However, the computation time seems intensive, especially for the class of problems with dynamic transportation requests, which includes the military cargo-plane routing problem. This paper proposes a more scalable algorithm that hybridizes a tabu search for the pricing subproblem, posed as a single-vehicle routing problem, with a column generation approach based on Dantzig-Wolfe decomposition. As tested on a wide variety of instances, our algorithm produces, on average, a solution of equivalent quality in 10~20% of the computation time of the previous method.

Efficient Resource Allocation Strategies Based on Nash Bargaining Solution with Linearized Constraints (선형 제약 조건화를 통한 내쉬 협상 해법 기반 효율적 자원 할당 방법)

  • Choi, Jisoo;Jung, Seunghyun;Park, Hyunggon
    • The Transactions of The Korean Institute of Electrical Engineers / v.65 no.3 / pp.463-468 / 2016
  • The overall performance of multiuser systems depends significantly on how effectively and fairly the resources shared by the users are managed. Efficient resource management strategies are even more important for multimedia users, since multimedia data is delay-sensitive and massive. In this paper, we focus on resource allocation based on a game-theoretic approach, referred to as the Nash bargaining solution (NBS), to provide a quality of service (QoS) guarantee for each user. While the NBS is known as a fair and optimal resource management strategy, finding it efficiently is challenging because the task is computationally intensive. To reduce the computation required for the NBS, we propose an approach with significantly lower complexity, even when networks consist of a large number of users and a large amount of resources. The proposed approach linearizes the utility function of each user and formulates the problem of finding the NBS as a convex optimization, leading to a nearly optimal solution with significantly reduced computational complexity. Simulation results confirm the effectiveness of the proposed approach.
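
To illustrate why linear utilities make the NBS cheap to compute, consider a toy model that is an assumption for this example and not the paper's formulation: one divisible resource of size `total`, user i with linear utility `gains[i] * x_i` and disagreement utility `disagreement[i]`. Maximizing the Nash product of `gains[i] * x_i - disagreement[i]` subject to the allocations summing to `total` then has a closed form: each user receives the minimum share that meets its disagreement utility, plus an equal split of the remaining resource.

```python
def nash_bargaining_allocation(gains, disagreement, total):
    """Closed-form NBS for the toy linear-utility model (hypothetical helper).

    gains[i] is the slope a_i of user i's utility, disagreement[i] is d_i,
    and total is the resource shared by all users.
    """
    minimum = [d / a for a, d in zip(gains, disagreement)]
    surplus = total - sum(minimum)
    if surplus <= 0:
        raise ValueError("resource cannot satisfy every disagreement point")
    # The KKT conditions give x_i - d_i / a_i equal for every user.
    return [m + surplus / len(gains) for m in minimum]


def nash_product(gains, disagreement, allocation):
    """Product of utility gains over the disagreement point."""
    product = 1.0
    for a, d, x in zip(gains, disagreement, allocation):
        product *= a * x - d
    return product
```

For example, with `gains=[1.0, 2.0]`, `disagreement=[1.0, 2.0]`, and `total=6.0`, each user needs one unit to reach its disagreement utility, and the remaining four units are split equally.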

Performance analysis of local exit for distributed deep neural networks over cloud and edge computing

  • Lee, Changsik;Hong, Seungwoo;Hong, Sungback;Kim, Taeyeon
    • ETRI Journal / v.42 no.5 / pp.658-668 / 2020
  • In edge computing, most procedures, including data collection, data processing, and service provision, are handled at edge nodes and not in the central cloud. This decreases the processing burden on the central cloud, enabling fast responses to end-device service requests in addition to reducing bandwidth consumption. However, edge nodes have restricted computing, storage, and energy resources to support computation-intensive tasks such as processing deep neural network (DNN) inference. In this study, we analyze the effect of models with single and multiple local exits on DNN inference in an edge-computing environment. Our test results show that a single-exit model performs better with respect to the number of local exited samples, inference accuracy, and inference latency than a multi-exit model at all exit points. These results signify that higher accuracy can be achieved with less computation when a single-exit model is adopted. In edge computing infrastructure, it is therefore more efficient to adopt a DNN model with only one or a few exit points to provide a fast and reliable inference service.
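
A local exit can be sketched as a confidence check at the edge node. The criterion below (softmax confidence against a fixed threshold) is an assumption chosen for illustration; the abstract does not state which exit rule the authors use, and `edge_model`, `cloud_model`, and `threshold` are hypothetical names.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    peak = max(logits)
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def infer_with_local_exit(sample, edge_model, cloud_model, threshold=0.9):
    """Return (predicted class, where the sample exited).

    edge_model and cloud_model are callables mapping a sample to logits.
    The sample exits locally when the edge prediction is confident enough;
    otherwise it is forwarded to the cloud model.
    """
    probs = softmax(edge_model(sample))
    confidence = max(probs)
    if confidence >= threshold:
        return probs.index(confidence), "edge"
    cloud_probs = softmax(cloud_model(sample))
    return cloud_probs.index(max(cloud_probs)), "cloud"
```

Raising the threshold sends more samples to the cloud, which trades latency and bandwidth for accuracy.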

Snapshot-Based Offloading for Web Applications with HTML5 Canvas (HTML5 캔버스를 활용하는 웹 어플리케이션의 스냅샷 기반 연산 오프로딩)

  • Jeong, InChang;Jeong, Hyuk-Jin;Moon, Soo-Mook
    • Journal of KIISE / v.44 no.9 / pp.871-877 / 2017
  • A vast amount of research has been carried out for executing compute-intensive applications on resource-constrained mobile devices. Computation offloading is a method in which heavy computations are dynamically migrated from a mobile device to a server, exploiting the powerful hardware of the server to perform complex computations. An important issue for offloading is the complexity of reconciling the execution state of applications between the server and the client. To address this issue, snapshot-based offloading has recently been proposed, which utilizes the snapshot of a web app as the portable description of the execution state. However, for web applications using the HTML5 canvas, snapshot-based offloading does not function correctly, because the snapshot cannot capture the state of the canvas. In this paper, we propose a code generation technique to save the canvas state as part of a snapshot, so that the snapshot-based offloading can be applied to web applications using the canvas.

Reconfigurable Architecture Design for H.264 Motion Estimation and 3D Graphics Rendering of Mobile Applications (이동통신 단말기를 위한 재구성 가능한 구조의 H.264 인코더의 움직임 추정기와 3차원 그래픽 렌더링 가속기 설계)

  • Park, Jung-Ae;Yoon, Mi-Sun;Shin, Hyun-Chul
    • Journal of KIISE:Computer Systems and Theory / v.34 no.1 / pp.10-18 / 2007
  • Mobile communication devices such as PDAs and cellular phones need to perform several kinds of computation-intensive functions, including H.264 encoding/decoding and 3D graphics processing. In this paper, a new reconfigurable architecture is described that can perform either motion estimation for H.264 or rendering for 3D graphics. The proposed motion estimation techniques use a new efficient SAD computation ordering, DAU, and FDVS algorithms. The new approach reduces computation by 70% on average compared with JM 8.2, without affecting quality. In 3D rendering, a midline traversal algorithm is used for parallel processing to increase throughput. Memories are partitioned into 8 blocks so that 2.4 Mbits (47%) of memory is shared and selective power shutdown is possible during motion estimation and 3D graphics rendering. Processing elements are also shared, further reducing the chip area by 7%.

An Efficient Algorithm for Improving Calculation Complexity of the MDCT/IMDCT (MDCT/IMDCT의 계산 복잡도를 개선하기 위한 효율적인 알고리즘)

  • 조양기;이원표;김희석
    • Journal of the Institute of Electronics Engineers of Korea SP / v.40 no.6 / pp.106-113 / 2003
  • The modified discrete cosine transform (MDCT) and inverse MDCT (IMDCT) are employed in subband/transform coding schemes as the analysis/synthesis filter bank based on time domain aliasing cancellation (TDAC). The MDCT and IMDCT are the most computationally intensive operations in Layer III of the MPEG audio coding standard. In this paper, we propose a new efficient algorithm for MDCT/IMDCT computation in various audio coding systems. It builds on the MDCT/IMDCT computation algorithm that uses discrete cosine transforms (DCTs), and it employs two discrete cosine transforms of type II (DCT-II) to compute the MDCT/IMDCT. In addition, it takes advantage of the case in which the length of a data block is divisible by 4. The proposed algorithm requires less computation than the existing method. It can also be implemented with a parallel structure, which makes it particularly suitable for VLSI realization.
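
For reference, the transform pair and the TDAC property can be sketched directly from the textbook definition. This is the plain O(N²) form, not the DCT-II-based fast algorithm the paper proposes; the function names, the 1/N scaling of the inverse, and the unwindowed overlap-add are assumptions made for the example.

```python
import math


def mdct(block):
    """Direct MDCT: 2N input samples -> N coefficients."""
    half = len(block) // 2
    shift = 0.5 + half / 2
    return [sum(block[n] * math.cos(math.pi / half * (n + shift) * (k + 0.5))
                for n in range(2 * half))
            for k in range(half)]


def imdct(coeffs):
    """Direct IMDCT: N coefficients -> 2N time-aliased samples."""
    half = len(coeffs)
    shift = 0.5 + half / 2
    return [sum(coeffs[k] * math.cos(math.pi / half * (n + shift) * (k + 0.5))
                for k in range(half)) / half
            for n in range(2 * half)]


def tdac_roundtrip(signal, half):
    """Transform 50%-overlapped blocks and overlap-add the inverses.

    len(signal) must be a multiple of half. The signal is zero-padded by
    one hop on each side so that every sample is covered by two blocks,
    whose aliasing terms cancel when added.
    """
    padded = [0.0] * half + list(signal) + [0.0] * half
    out = [0.0] * len(padded)
    for start in range(0, len(padded) - 2 * half + 1, half):
        block = padded[start:start + 2 * half]
        for i, value in enumerate(imdct(mdct(block))):
            out[start + i] += value
    return out[half:half + len(signal)]
```

A single inverse transform does not return its input block; the original samples reappear only after overlapping neighbours are added, which is the aliasing cancellation that TDAC refers to.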

Fast Array Architecture with Improved Reconfigurability (향상된 재구성능력을 가진 고속 어레이 구조)

  • Lee Jae-Ic;Kim Jinsang;Cho Won-Kyung;Kim Youngsoo
    • Proceedings of the IEEK Conference / 2004.06b / pp.451-454 / 2004
  • Reconfigurable architectures are increasingly important for the design of multi-mode communication systems and computation-intensive DSP systems. The proposed coarse-grain architecture is based on a reconfigurable processing element consisting of a MAC unit, a register file, a context data register, and PE interconnect control blocks. The main feature of the proposed architecture is the loop context, which enables faster configuration. We also propose another area-efficient reconfigurable architecture with improved reconfigurability. The SystemC modeling results show that the proposed architecture saves 9 clock cycles in computing the 2D DCT compared with existing architectures.