• Title/Summary/Keyword: Multicores.

Search Result 8, Processing Time 0.026 seconds

Tile Partitioning-based HEVC Parallel Decoding Optimization for Asymmetric Multicore Processor (비대칭 멀티코어 시스템 상의 HEVC 병렬 디코딩 최적화를 위한 타일 분할 기법)

  • Ryu, Yeongil;Roh, Hyun-Joon;Ryu, Eun-Seok
    • Journal of KIISE
    • /
    • v.43 no.9
    • /
    • pp.1060-1065
    • /
    • 2016
  • Recently, there is an emerging need for parallel UHD video processing, and the usage of computing systems that have an asymmetric processor such as ARM big.LITTLE is actively increasing. Thus, a new parallel UHD video processing method that is optimized for the asymmetric multicore systems is needed. This paper proposes a novel HEVC tile partitioning method for parallel processing by analyzing the computational power of asymmetric multicores. The proposed method analyzes (1) the computing power of asymmetric multicores and (2) the regression model of computational complexity per video resolution. Finally, the model (3) determines the optimal HEVC tile resolution for each core and partitions/allocates the tiles to suitable cores. The proposed method minimizes the gap in the decoding time between the fastest CPU core and the slowest CPU core. Experimental results with the 4K UHD official test sequences show average 20% improvement in the decoding speedup on the ARM asymmetric multicore system.

Parallel Processing of k-Means Clustering Algorithm for Unsupervised Classification of Large Satellite Images: A Hybrid Method Using Multicores and a PC-Cluster (대용량 위성영상의 무감독 분류를 위한 k-Means Clustering 알고리즘의 병렬처리: 다중코어와 PC-Cluster를 이용한 Hybrid 방식)

  • Han, Soohee;Song, Jeong Heon
    • Journal of the Korean Society of Surveying, Geodesy, Photogrammetry and Cartography
    • /
    • v.37 no.6
    • /
    • pp.445-452
    • /
    • 2019
  • In this study, parallel processing codes of k-means clustering algorithm were developed and implemented in a PC-cluster for unsupervised classification of large satellite images. We implemented intra-node code using multicores of CPU (Central Processing Unit) based on OpenMP (Open Multi-Processing), inter-nodes code using a PC-cluster based on message passing interface, and hybrid code using both. The PC-cluster consists of one master node and eight slave nodes, and each node is equipped with eight multicores. Two operating systems, Microsoft Windows and Canonical Ubuntu, were installed in the PC-cluster in turn and tested to compare parallel processing performance. Two multispectral satellite images were tested, which are a medium-capacity LANDSAT 8 OLI (Operational Land Imager) image and a high-capacity Sentinel 2A image. To evaluate the performance of parallel processing, speedup and efficiency were measured. Overall, the speedup was over N / 2 and the efficiency was over 0.5. From the comparison of the two operating systems, the Ubuntu system showed two to three times faster performance. To confirm that the results of the sequential and parallel processing coincide with the other, the center value of each band and the number of classified pixels were compared, and result images were examined by pixel to pixel comparison. It was found that care should be taken to avoid false sharing of OpenMP in intra-node implementation. To process large satellite images in a PC-cluster, code and hardware should be designed to reduce performance degradation caused by file I / O. Also, it was found that performance can differ depending on the operating system installed in a PC-cluster.

A PARALLEL IMPLEMENTATION OF A RELAXED HSS PRECONDITIONER FOR SADDLE POINT PROBLEMS FROM THE NAVIER-STOKES EQUATIONS

  • JANG, HO-JONG;YOUN, KIHANG
    • Journal of the Korean Society for Industrial and Applied Mathematics
    • /
    • v.22 no.3
    • /
    • pp.155-162
    • /
    • 2018
  • We describe a parallel implementation of a relaxed Hermitian and skew-Hermitian splitting preconditioner for the numerical solution of saddle point problems arising from the steady incompressible Navier-Stokes equations. The equations are linearized by the Picard iteration and discretized with the finite element and finite difference schemes on two-dimensional and three-dimensional domains. We report strong scalability results for up to 32 cores.

Static Timing Analysis of Shared Caches for Multicore Processors

  • Zhang, Wei;Yan, Jun
    • Journal of Computing Science and Engineering
    • /
    • v.6 no.4
    • /
    • pp.267-278
    • /
    • 2012
  • The state-of-the-art techniques in multicore timing analysis are limited to analyze multicores with shared instruction caches only. This paper proposes a uniform framework to analyze the worst-case performance for both shared instruction caches and data caches in a multicore platform. Our approach is based on a new concept called address flow graph, which can be used to model both instruction and data accesses for timing analysis. Our experiments, as a proof-of-concept study, indicate that the proposed approach can accurately compute the worst-case performance for real-time threads running on a dual-core processor with a shared L2 cache (either to store instructions or data).

Virtual Machine Scheduling for Multicores Considering Effects of Shared On-chip Last Level Cache Interference (공유 말단 캐시에서의 간섭의 영향을 고려한 멀티코어 프로세서를 위한 가상 머신 스케줄링)

  • Kim, Shin-gyu;Choi, Chanho;Eom, Hyeonsang;Yeom, Heon Y.
    • Proceedings of the Korea Information Processing Society Conference
    • /
    • 2012.04a
    • /
    • pp.134-136
    • /
    • 2012
  • 클라우드 컴퓨팅 서비스 시장이 성장하면서, 서비스 제공자들은 전력 사용량 감소와 서비스 수준을 보장하는 등의 여러 가지 문제와 맞딱드리게 되었다. 이런 문제에 대한 원인 중 하나는 자원 효율성을 높이기 위해 도입한 가상머신 기반의 서버 통합 정책이다. 현재의 가상머신 기술들은 아직까지 완벽한 격리수준을 제공하지 못하기 때문에, 같은 노드에 배치된 가상머신들은 자원을 공유하면서 서로 간에 간섭을 일으키게 된다. 본 연구에서는 가상머신끼리 공유하는 자원 중 프로세서의 말단 캐시(Last-level Cache, LLC)에서의 간섭을 최대한 줄여서 성능을 극대화하기 위한 방법을 제안한다.

An Efficient Load Balancing Technique in a Multicore Mobile System (멀티코어 모바일 시스템에서 효과적인 부하 균등화 기법)

  • Cho, Jungseok;Cho, Doosan
    • KIPS Transactions on Computer and Communication Systems
    • /
    • v.4 no.5
    • /
    • pp.153-160
    • /
    • 2015
  • The effectiveness of multicores depends on how well a scheduler can assign tasks onto the cores efficiently. In a heterogeneous multicore platform, the execution time of an application depends on which core it executes on. That is to say, the effectiveness of task assignment is one of the important components for a multicore systems' performance. This work proposes a load scheduling technique that analyzes execution time of each task by profiling. The profiling result provides a basic information to predict which task-to-core mapping is likely to provide the best performance. By using such information, the proposed technique is about 26% performance gain.

Performance Evaluation and Optimization of Journaling File Systems with Multicores and High-Performance Flash SSDs (멀티코어 및 고성능 플래시 SSD 환경에서 저널링 파일 시스템의 성능 평가 및 최적화)

  • Han, Hyuck
    • The Journal of the Korea Contents Association
    • /
    • v.18 no.4
    • /
    • pp.178-185
    • /
    • 2018
  • Recently, demands for computer systems with multicore CPUs and high-performance flash-based storage devices (i.e., flash SSD) have rapidly grown in cloud computing, surer-computing, and enterprise storage/database systems. Journaling file systems running on high-performance systems do not exploit the full I/O bandwidth of high-performance SSDs. In this article, we evaluate and analyze the performance of the Linux EXT4 file system with high-performance SSDs and multicore CPUs. The system used in this study has 72 cores and Intel NVMe SSD, and the flash SSD has performance up to 2800/1900 MB/s for sequential read/write operations. Our experimental results show that checkpointing in the EXT4 file system is a major overhead. Furthermore, we optimize the checkpointing procedure and our optimized EXT4 file system shows up to 92% better performance than the original EXT4 file system.

Improving Multi-DNN Computational Performance of Embedded Multicore Processors through a Global Queue (글로벌 큐를 통한 임베디드 멀티코어 프로세서의 멀티 DNN 연산 성능 향상)

  • Cho, Ho-jin;Kim, Myung-sun
    • Journal of the Korea Institute of Information and Communication Engineering
    • /
    • v.24 no.6
    • /
    • pp.714-721
    • /
    • 2020
  • DNN is expanding its use in embedded systems such as robots and autonomous vehicles. For high recognition accuracy, computational complexity is greatly increased, and multiple DNNs are running aperiodically. Therefore, the ability processing multiple DNNs in embedded environments is a crucial issue. Accordingly, multicore based platforms are being released. However, most DNN models are operated in a batch process, and when multiple DNNs are operated in multicore together, the execution time deviation between each DNN may be large and the end-to-end execution time of the whole DNNs could be long depending on how they are allocated to the cores. In this paper, we solve these problems by providing a framework that decompose each DNN into individual layers and then distribute to multicores through a global queue. As a result of the experiment, the total DNN execution time was reduced by 31%, and when operating multiple identical DNNs, the deviation in execution time was reduced by up to 95.1%.