• Title/Summary/Keyword: Thread Block

Search Result 25, Processing Time 0.022 seconds

SimTBS: Simulator For GPGPU Thread Block Scheduling (SimTBS: GPGPU 스레드블록 스케줄링 시뮬레이터)

  • Cho, Kyung-Woon;Bahn, Hyokyung
    • The Journal of the Institute of Internet, Broadcasting and Communication
    • /
    • v.20 no.4
    • /
    • pp.87-92
    • /
    • 2020
  • Although GPGPU (General-Purpose GPU) can maximize performance by parallelizing a task with tens of thousands of threads, those threads are internally grouped into a thread block, which is a base unit for processing and resource allocation. A thread block scheduler is a specialized hardware gadget whose role is to allocate thread blocks to GPGPU processing hardware in a round-robin manner. However, round-robin is a sequential allocation policy and is not optimized for GPGPU resource utilization. In this paper, we propose a thread block scheduler model which can analyze and quantify performances for various thread block scheduling policies. Experiment results from the implemented simulator of our model show that the legacy hardware thread block scheduling does not behave well when workload becomes heavy.

An Implementation of Embedded SIP User Agent under Wireless LAN Area (Wireless LAN 환경에서 임베디드 SIP User Agent 구현)

  • Park Seung-Hwan;Lee Jae-Heung
    • Journal of the Korea Institute of Information and Communication Engineering
    • /
    • v.9 no.3
    • /
    • pp.493-497
    • /
    • 2005
  • This paper is about the research of the User Agent implementation under wireless embedded environment, using SIP which is one of protocol components construct the VoIP system. The User Agent is made of the User Agent configuration block, the device thread block to control devices and the SIP stack block to process SIP messages. The device thread consists of the RTP thread and the sound lard device processing block. Futhermore, the SIP stack consist of the worker thread to process proxy events, the SIP transceiver and SIP thread to transfer and receive SIP messages. The H/W platform is a board included the Intel's XScale PXA255 processor, flash memory, SDRAM, Audio CODEC module and wireless LAN threough PCMCIA socket, furthermore a microphone and headphone is used by the audio 1/0. The system has embedded linux kernel 2.4.19. For embedded environment, the function of User Agent and SIP method is diminished. Finally, the resource of system could be reduced about $12.9\%$, compared to overall system resource, by minimizing peripherals control and excepting TCP.

Thread Block Scheduling for Multi-Workload Environments in GPGPU (다중 워크로드 환경을 위한 GPGPU 스레드 블록 스케줄링)

  • Park, Soyeon;Cho, Kyung-Woon;Bahn, Hyokyung
    • The Journal of the Institute of Internet, Broadcasting and Communication
    • /
    • v.22 no.2
    • /
    • pp.71-76
    • /
    • 2022
  • Round-robin is widely used for the scheduling of large-scale parallel workloads in the computing units of GPGPU. Round-robin is easy to implement by sequentially allocating tasks to each computing unit, but the load balance between computing units is not well achieved in multi-workload environments like cloud. In this paper, we propose a new thread block scheduling policy to resolve this situation. The proposed policy manages thread blocks generated by various GPGPU workloads with multiple queues based on their computation loads and tries to maximize the resource utilization of each computing unit by selecting a thread block from the queue that can maximally utilize the remaining resources, thereby inducing load balance between computing units. Through simulation experiments under various load environments, we show that the proposed policy improves the GPGPU performance by 24.8% on average compared to Round-robin.

A Novel Cooperative Warp and Thread Block Scheduling Technique for Improving the GPGPU Resource Utilization (GPGPU 자원 활용 개선을 위한 블록 지연시간 기반 워프 스케줄링 기법)

  • Thuan, Do Cong;Choi, Yong;Kim, Jong Myon;Kim, Cheol Hong
    • KIPS Transactions on Computer and Communication Systems
    • /
    • v.6 no.5
    • /
    • pp.219-230
    • /
    • 2017
  • General-Purpose Graphics Processing Units (GPGPUs) build massively parallel architecture and apply multithreading technology to explore parallelism. By using programming models like CUDA, and OpenCL, GPGPUs are becoming the best in exploiting plentiful thread-level parallelism caused by parallel applications. Unfortunately, modern GPGPU cannot efficiently utilize its available hardware resources for numerous general-purpose applications. One of the primary reasons is the inefficiency of existing warp/thread block schedulers in hiding long latency instructions, resulting in lost opportunity to improve the performance. This paper studies the effects of hardware thread scheduling policy on GPGPU performance. We propose a novel warp scheduling policy that can alleviate the drawbacks of the traditional round-robin policy. The proposed warp scheduler first classifies the warps of a thread block into two groups, warps with long latency and warps with short latency and then schedules the warps with long latency before the warps with short latency. Furthermore, to support the proposed warp scheduler, we also propose a supplemental technique that can dynamically reduce the number of streaming multiprocessors to which will be assigned thread blocks when encountering a high contention degree at the memory and interconnection network. Based on our experiments on a 15-streaming multiprocessor GPGPU platform, the proposed warp scheduling policy provides an average IPC improvement of 7.5% over the baseline round-robin warp scheduling policy. This paper also shows that the GPGPU performance can be improved by approximately 8.9% on average when the two proposed techniques are combined.

An Optimization Method for Hologram Generation on Multiple GPU-based Parallel Processing (다중 GPU기반 홀로그램 생성을 위한 병렬처리 성능 최적화 기법)

  • Kook, Joongjin
    • Smart Media Journal
    • /
    • v.8 no.2
    • /
    • pp.9-15
    • /
    • 2019
  • Since the computational complexity for hologram generation increases exponentially with respect to the size of the point cloud, parallel processing using CUDA and/or OpenCL library based on multiple GPUs has recently become popular. The CUDA kernel for parallelization needs to consist of threads, blocks, and grids properly in accordance with the number of cores and the memory size in the GPU. In addition, in case of multiple GPU environments, the distribution in grid-by-grid, in block-by-block, or in thread-by-thread is needed according to the number of GPUs. In order to evaluate the performance of CGH generation, we compared the computational speed in CPU, in single GPU, and in multi-GPU environments by gradually increasing the number of points in a point cloud from 10 to 1,000,000. We also present a memory structure design and a calculation method required in the CUDA-based parallel processing to accelerate the CGH (Computer Generated Hologram) generation operation in multiple GPU environments.

Thread Block Scheduling for GPGPU based on Fine-Grained Resource Utilization (상세 자원 이용률에 기반한 병렬 가속기용 스레드 블록 스케줄링)

  • Bahn, Hyokyung;Cho, Kyungwoon
    • The Journal of the Institute of Internet, Broadcasting and Communication
    • /
    • v.22 no.5
    • /
    • pp.49-54
    • /
    • 2022
  • With the recent widespread adoption of general-purpose GPUs (GPGPUs) in cloud systems, maximizing the resource utilization through multitasking in GPGPU has become an important issue. In this article, we show that resource allocation based on the workload classification of computing-bound and memory-bound is not sufficient with respect to resource utilization, and present a new thread block scheduling policy for GPGPU that makes use of fine-grained resource utilizations of each workload. Unlike previous approaches, the proposed policy reduces scheduling overhead by separating profiling and scheduling, and maximizes resource utilizations by co-locating workloads with different bottleneck resources. Through simulations under various virtual machine scenarios, we show that the proposed policy improves the GPGPU throughput by 130.6% on average and up to 161.4%.

WRF Physics Models Using GP-GPUs with CUDA Fortran (WRF 물리 과정의 GP-GPU 계산을 위한 CUDA Fortran 프로그램 구현)

  • Kim, Youngtae;Lee, Yong Hee;Chung, Kwan-Young
    • Atmosphere
    • /
    • v.23 no.2
    • /
    • pp.231-235
    • /
    • 2013
  • We parallelized WRF major physics routines for Nvidia GP-GPUs with CUDA Fortran. GP-GPUs are originally designed for graphic processing, but show high performance with low electricity for calculating numerical models. In the CUDA environment, a data domain is allocated into thread blocks and threads in each thread block are computing in parallel. We parallelized the WRF program to use of thread blocks efficiently. We validated the GP-GPU program with the original CPU program, and the WRF model using GP-GPUs shows efficient speedup.

A Design of Mobile Web Server Framework for SOAP Transaction and Performance Enhancement in Web2.0 (웹2.0에서 SOAP 처리와 성능 향상을 위한 모바일 웹 서버 프레임워크의 설계)

  • Kim, Yong-Tae;Jeong, Yoon-Su;Park, Gil-Cheol
    • Journal of the Korea Institute of Information and Communication Engineering
    • /
    • v.12 no.10
    • /
    • pp.1866-1874
    • /
    • 2008
  • Existing web server lowers the whole capacity of system because of the problem on the processing load of server by closing connection increasing code handshake operation, and remarkable decrease of server capacity if it is the state of overload. Also, there occurs disadvantages of increasing connection tine about client's request and response time because handling of client's multi-requests is not smooth because of thread block and it requests a lot of time and resources for revitalization of thread. Therefore, this paper proposes the extended web server which provides the technique for delay handling and improves the overload of server for better system capacity, communication support, and the unification which is the advantage of web service. And it evaluates the existing system(implemented at Tomcat 5.5) and the proposed mobile web server architecture. The extended server architecture provides excellent exchange condition for system capacity and evaluates improved web server architecture which combines multi-thread with thread pool. The proposed web service architecture in this paper got the better result of improved capacity benefit than the evaluation result of original Tomcat 5.5.

A Study on the Tracking and Blocking of Malicious Actors through Thread-Based Monitoring (스레드 기반 모니터링을 통한 악의적인 행위 주체 추적 및 차단에 관한 연구)

  • Ko, Boseung;Choi, Wonhyok;Jeong, Dajung
    • Journal of the Korea Institute of Information Security & Cryptology
    • /
    • v.30 no.1
    • /
    • pp.75-86
    • /
    • 2020
  • With the recent advancement of malware, the actors performing malicious tasks are often not processes. Malicious code injected into the process that is installed by default in the operating system works thread by thread in the same way as DLL / code injection. In this case, diagnosing and blocking the process as malicious can cause serious problems with system operation. This white paper lists the problems of how to use process-based monitoring information to identify and block the malicious state of a process and presents an improved solution.

Distributed Shared Memory Scheme for Multi-thread programming (다중쓰레드 프로그래밍을 위한 분산공유메모리 관리 기법)

  • Seo, Dae-Wha
    • The Transactions of the Korea Information Processing Society
    • /
    • v.3 no.4
    • /
    • pp.791-802
    • /
    • 1996
  • In this paper, we discuss a distributed shared memory management scheme based on multi-threaded programming model for a large-scale loosely coupled multiprocessor system. The scheme covers three major issues in the distribued shared memory;the address translation table management, the block coherence maintenance, and the block placement policy. The scheme efficiently resolves the general problems occurred in the distributed shared memory such as a false sharing, an unnecessary replication, a block bouncing, and an address aliasing phenomenon. It also provides the application transparency, good scalability, easy implementation, and multithreaded programming model to users.

  • PDF