• Title/Summary/Keyword: Thread-Level Parallelism

Search Result 24, Processing Time 0.027 seconds

Thread-Level Parallelism using Java Thread and Network Resources (자바 스레드와 네트워크 자원을 이용한 병렬처리)

  • Kim, Tae-Yong
    • Journal of Advanced Navigation Technology
    • /
    • v.14 no.6
    • /
    • pp.984-989
    • /
    • 2010
  • In this paper, parallel programming technique by using Java Thread is introduced so as to develop parallel design tool to analyze the small micro flow sensor. To estimate computing time for Thread-level parallelism, the performances of two experimental models for potential problem subject to Thermal transfer equation are examined. As a result, if the number of network PC is increase, computing time for parallelism on network environment is enhanced to be almost n times. The micro sensor design tool based on distributed computing can be utilized to analyze a large scale problem.

A Performance Evaluation of Parallel Color Conversion based on the Thread Number on Multi-core Systems (멀티코어 시스템에서 쓰레드 수에 따른 병렬 색변환 성능 검증)

  • Kim, Cheong Ghil
    • Journal of Satellite, Information and Communications
    • /
    • v.9 no.4
    • /
    • pp.73-76
    • /
    • 2014
  • With the increasing popularity of multi-core processors, they have been adopted even in embedded systems. Under this circumstance many multimedia applications can be parallelized on multi-core platforms because they usually require heavy computations and extensive memory accesses. This paper proposes an efficient thread-level parallel implementation for color space conversion on multi-core CPU. Thread-level parallelism has been becoming very useful parallel processing paradigm especially on shared memory computing systems. In this work, it is exploited by allocating different input pixels to each thread for concurrent loop executions. For the performance evaluation, this paper evaluate the performace improvements for color conversion on multi-core processors based on the processing speed comparison between its serial implementation and parallel ones. The results shows that thread-level parallel implementations show the overall similar ratios of performance improvements regardless of different multi-cores.

Exploiting Implicit Parallelism for Single Loops in Java Programming Language (JAVA 프로그래밍 언어에서 단일루프구조의 무시적 병렬성 검출)

  • Kwon, Oh-Jin
    • Journal of Information Management
    • /
    • v.29 no.3
    • /
    • pp.1-26
    • /
    • 1998
  • The loop is a fundamental for the parallelism exploiting as it has a large portion of execution time for sequential Java program on the parallel machine. This paper proposes the method of exploiting the implicit parallelism through the analysis of data dependence in the existing Java programming language having a single loop structure. The parallel code generation method through the restructuring compiler and the translation method of Java source program into multithread statement, which is supported in the level of the Java programming language, are also proposed here. The performance test of the program translated into the thread statement is conducted using the trip count of loop and the thread count as parameters. The restructuring compiler makes it possible for users to reduce overhead and exploit parallelism efficiently in the Java programming.

  • PDF

Performance Analysis of Embedded Applications (임베디드 응용 프로그램 성능 분석)

  • 김선욱;오재근;한영선;최홍욱;김철우
    • Proceedings of the IEEK Conference
    • /
    • 2003.07d
    • /
    • pp.1355-1358
    • /
    • 2003
  • This paper presents performance analysis of the embedded application, called EEMBC consisting of 5 categories and total 34 applications. We measured various performance metrics, such as code sizes, TLP(Thread Level Parallelism) using OpenMP API and ILP(Instruction Level Parallelism) on ARM-modeled SimpleScalar in detail. We show that the embedded applications have the similar characteristics as integer applications to deliver low ILP and TLP in our environment. They have many small loops, which result in large instruction overhead in TLP and loop control overhead in ILP.

  • PDF

Electromagnetic Field Analysis based on Java Multi-Threads (자바 스레드 기반 고속 전자계 해석)

  • Kim, Tae Yong;Lee, Hoon-Jae
    • Proceedings of the Korean Institute of Information and Commucation Sciences Conference
    • /
    • 2013.05a
    • /
    • pp.53-55
    • /
    • 2013
  • In general, various numerical techniques such as MoM, TLM, FEM, and BEM can be introduced to analyze some microwave device and electromagnetic propagation problems. Numerical simulator based on Java thread-level parallelism for improving computation performance is implemented and estimated.

  • PDF

Computation-Communication Overlapping in AES-CCM Using Thread-Level Parallelism on a Multi-Core Processor (멀티코어 프로세서의 쓰레드-수준 병렬성을 활용한 AES-CCM 계산-통신 중첩화)

  • Lee, Eun-Ji;Lee, Sung-Ju;Chung, Yong-Wha;Lee, Myung-Ho;Min, Byoung-Ki
    • Journal of KIISE:Computing Practices and Letters
    • /
    • v.16 no.8
    • /
    • pp.863-867
    • /
    • 2010
  • Multi-core processors are becoming increasingly popular. As they are widely adopted in embedded systems as well as desktop PC's, many multimedia applications are being parallelized on multi-core platforms. However, it is difficult to parallelize applications with inherent data dependencies such as encryption algorithms for multimedia data. In order to overcome this limit, we propose a technique to overlap computation and communication using an otherwise idle core in this paper. In particular, we interpret the problem of multimedia computation and communication as a pipeline design problem at the application program level, and derive an optimal number of stages in the pipeline.

Speculative Parallelism Characterization Profiling in General Purpose Computing Applications

  • Wang, Yaobin;An, Hong;Liu, Zhiqin;Li, Li;Yu, Liang;Zhen, Yilu
    • Journal of Computing Science and Engineering
    • /
    • v.9 no.1
    • /
    • pp.20-28
    • /
    • 2015
  • General purpose computing applications have not yet been thoroughly explored in procedure level speculation, especially in the light-weighted profiling way. This paper proposes a light-weighted profiling mechanism to analyze speculative parallelism characterization in several classic general purpose computing applications from SPEC CPU2000 benchmark. By comparing the key performance factors in loop and procedure-level speculation, it includes new findings on the behaviors of loop and procedure-level parallelism under these applications. The experimental results are as follows. The best gzip application can only achieve a 2.4X speedup in loop level speculation, while the best mcf application can achieve almost 3.5X speedup in procedure level. It proves that our light-weighted profiling method is also effective. It is found that between the loop-level and procedure-level TLS, the latter is better on several cases, which is against the conventional perception. It is especially shown in the applications where their 'hot' procedure body is concluded as 'hot' loops.

Tile-level and Frame-level Parallel Encoding for HEVC (타일 및 프레임 수준의 HEVC 병렬 부호화)

  • Kim, Younhee;Seok, Jinwuk;Jung, Soon-heung;Kim, Huiyong;Choi, Jin Soo
    • Journal of Broadcast Engineering
    • /
    • v.20 no.3
    • /
    • pp.388-397
    • /
    • 2015
  • High Efficiency Video Coding (HEVC)/H.265 is a new video coding standard which is known as high compression ratio compared to the previous standard, Advanced Video Coding (AVC)/H.264. Due to achievement of high efficiency, HEVC sacrifices the time complexity. To apply HEVC to the market applications, one of the key requirements is the fast encoding. To achieve the fast encoding, exploiting thread-level parallelism is widely chosen mechanism since multi-threading is commonly supported based on the multi-core computer architecture. In this paper, we implement both the Tile-level parallelism and the Frame-level parallelism for HEVC encoding on multi-core platform. Based on the implementation, we present two approaches in combining the Tile-level parallelism with Frame-level parallelism. The first approach creates the fixed number of tile per frame while the second approach creates the number of tile per frame adaptively according to the number of frame in parallel and the number of available worker threads. Experimental results show that both improves the parallel scalability compared to the one that use only tile-level parallelism and the second approach achieves good trade-off between parallel scalability and coding efficiency for both Full-HD (1080 x 1920) and 4K UHD (3840 x 2160) sequences.

A Novel Cooperative Warp and Thread Block Scheduling Technique for Improving the GPGPU Resource Utilization (GPGPU 자원 활용 개선을 위한 블록 지연시간 기반 워프 스케줄링 기법)

  • Thuan, Do Cong;Choi, Yong;Kim, Jong Myon;Kim, Cheol Hong
    • KIPS Transactions on Computer and Communication Systems
    • /
    • v.6 no.5
    • /
    • pp.219-230
    • /
    • 2017
  • General-Purpose Graphics Processing Units (GPGPUs) build massively parallel architecture and apply multithreading technology to explore parallelism. By using programming models like CUDA, and OpenCL, GPGPUs are becoming the best in exploiting plentiful thread-level parallelism caused by parallel applications. Unfortunately, modern GPGPU cannot efficiently utilize its available hardware resources for numerous general-purpose applications. One of the primary reasons is the inefficiency of existing warp/thread block schedulers in hiding long latency instructions, resulting in lost opportunity to improve the performance. This paper studies the effects of hardware thread scheduling policy on GPGPU performance. We propose a novel warp scheduling policy that can alleviate the drawbacks of the traditional round-robin policy. The proposed warp scheduler first classifies the warps of a thread block into two groups, warps with long latency and warps with short latency and then schedules the warps with long latency before the warps with short latency. Furthermore, to support the proposed warp scheduler, we also propose a supplemental technique that can dynamically reduce the number of streaming multiprocessors to which will be assigned thread blocks when encountering a high contention degree at the memory and interconnection network. Based on our experiments on a 15-streaming multiprocessor GPGPU platform, the proposed warp scheduling policy provides an average IPC improvement of 7.5% over the baseline round-robin warp scheduling policy. This paper also shows that the GPGPU performance can be improved by approximately 8.9% on average when the two proposed techniques are combined.

Latency Hiding based Warp Scheduling Policy for High Performance GPUs

  • Kim, Gwang Bok;Kim, Jong Myon;Kim, Cheol Hong
    • Journal of the Korea Society of Computer and Information
    • /
    • v.24 no.4
    • /
    • pp.1-9
    • /
    • 2019
  • LRR(Loose Round Robin) warp scheduling policy for GPU architecture results in high warp-level parallelism and balanced loads across multiple warps. However, traditional LRR policy makes multiple warps execute long latency operations at the same time. In cases that no more warps to be issued under long latency, the throughput of GPUs may be degraded significantly. In this paper, we propose a new warp scheduling policy which utilizes latency hiding, leading to more utilized memory resources in high performance GPUs. The proposed warp scheduler prioritizes memory instruction based on GTO(Greedy Then Oldest) policy in order to provide reduced memory stalls. When no warps can execute memory instruction any more, the warp scheduler selects a warp for computation instruction by round robin manner. Furthermore, our proposed technique achieves high performance by using additional information about recently committed warps. According to our experimental results, our proposed technique improves GPU performance by 12.7% and 5.6% over LRR and GTO on average, respectively.