Search | Korea Science

KAWS: Coordinate Kernel-Aware Warp Scheduling and Warp Sharing Mechanism for Advanced GPUs

Vo, Viet Tan;Kim, Cheol Hong
- Journal of Information Processing Systems
- /
- v.17 no.6
- /
- pp.1157-1169
- /
- 2021
Modern graphics processor unit (GPU) architectures offer significant hardware resource enhancements for parallel computing. However, without software optimization, GPUs continuously exhibit hardware resource underutilization. In this paper, we indicate the need to alter different warp scheduler schemes during different kernel execution periods to improve resource utilization. Existing warp schedulers cannot be aware of the kernel progress to provide an effective scheduling policy. In addition, we identified the potential for improving resource utilization for multiple-warp-scheduler GPUs by sharing stalling warps with selected warp schedulers. To address the efficiency issue of the present GPU, we coordinated the kernel-aware warp scheduler and warp sharing mechanism (KAWS). The proposed warp scheduler acknowledges the execution progress of the running kernel to adapt to a more effective scheduling policy when the kernel progress attains a point of resource underutilization. Meanwhile, the warp-sharing mechanism distributes stalling warps to different warp schedulers wherein the execution pipeline unit is ready. Our design achieves performance that is on an average higher than that of the traditional warp scheduler by 7.97% and employs marginal additional hardware overhead.
https://doi.org/10.3745/JIPS.01.0084 인용 PDF KSCI

Design of 1-D DCT processor using a new efficient computation sharing multiplier (새로운 연산 공유 승산기를 이용한 1차원 DCT 프로세서의 설계)

Lee, Tae-Wook;Cho, Sang-Bock
- The KIPS Transactions:PartA
- /
- v.10A no.4
- /
- pp.347-356
- /
- 2003
The OCT algorithm needs efficient hardware architecture to compute inner product. The conventional methods have large hardware complexity. Because of this reason. a computation sharing multiplier was proposed for implementing inner product. However, the existing multiplier has inefficient hardware architecture in precomputer and select units. Therefore it degrades the performance of the multiplier. In this paper, we proposed a new efficient computation sharing multiplier and applied it to implementation of 1-D DCT processor. The comparison results show that the new multiplier is more efficient than an old one when hardware architectures and logic synthesis results were compared. The designed 1-D DCT processor by using the proposed multiplier is more high performance than typical design methods.
https://doi.org/10.3745/KIPSTA.2003.10A.4.347 인용 PDF KSCI

Efficient VLSI architecture for one-dimensional discrete wavelet transform using a sealable data reorder unit

Park, Taegeun
- Proceedings of the IEEK Conference
- /
- 2002.07a
- /
- pp.353-356
- /
- 2002
In this paper, we design an efficient, scalable one-dimensional discrete wavelet transform (1DDWT) filter using data reorder unit (DRU). At each level, the required hardware is optimized by sharing multipliers and adders because the input rate is reduced by a factor of two at each level due to decimation. The proposed architecture shows 100% hardware utilization by balancing hardware with input rate. Furthermore, sharing the coefficients of the highpass and the lowpass filters using the mirror filter property reduces the number of multipliers and adders by half. We designed a scalable DRU that efficiently reorders and feeds inputs to highpass and lowpass filters. The proposed DRU-based architecture is so regular and scalable that it can be easily extended to an arbitrary 1D DWT structure with M taps and J levels. Compared to other architectures, the proposed DWT filter shows efficiency in performance with relatively less hardware.
PDF

Optimization Design Method for Inner Product Using CSHM Algorithm and its Application to 1-D DCT Processor (연산공유 승산 알고리즘을 이용한 내적의 최적화 및 이를 이용한 1차원 DCT 프로세서 설계)

이태욱;조상복
- The Transactions of the Korean Institute of Electrical Engineers D
- /
- v.53 no.2
- /
- pp.86-93
- /
- 2004
The DCT algorithm needs an efficient hardware architecture to compute inner product. The conventional design method, like ROM-based DA(Distributed Arithmetic), has large hardware complexity. Because of this reason, a CSHM(Computation Sharing Multiplication) was proposed for implementing inner product by Park. However, the Park's CSHM has inefficient hardware architecture in the precomputer and select units. Therefore it degrades the performance of the multiplier. In this paper, we presents the optimization design method for inner product using CSHM algorithm and applied it to implementation of 1-D DCT processor. The experimental results show that the proposed multiplier is more efficient than Park's when hardware architectures and logic synthesis results were compared. The designed 1-D DCT processor by using proposed design method is more high performance than typical methods.
PDF KSCI

A bitwidth optimization algorithm for efficient hardware sharing (효율적인 하드웨어 공유를 위한 단어길이 최적화 알고리듬)

최정일;전홍신;이정주;김문수;황선영
- The Journal of Korean Institute of Communications and Information Sciences
- /
- v.22 no.3
- /
- pp.454-468
- /
- 1997
This paper presents a bitwidth optimization algorithm for efficient hardware sharing in digital signal processing system. The proposed algorithm determines the fixed-point representation for each signal through bitwidth optimization to generate the hardware requiring less area. To reduce the operator area, the algorithm partitions the abstract operations in the design description into several groups, such that the operations in the same group can share an operator. The partitioning result are fed to a high-level synthesis system to generate the pipelined fixed-point datapaths. The proposed algorithm has been implemented in SODAS-DSP an automatic synthesis system for fixed-point DSP hardware. Accepting the models of DSP algorithms in schematics, the system automatically generates the fixed-point datapath and controller satisfying the design constraints in area, speed, and SNR(Signal-to-Noise Ratio). Experimental results show that the efficiency of the proposed algorithm by generates the area-efficient DSP hardwares satisfying performance constraints.
PDF

Low-power/high-speed DCT structure using common sub-expression sharing (Common sub-expression sharing을 이용한 고속/저전력 DCT 구조)

Jang, Young-Beom;Yang, Se-Jung
- The Journal of Korean Institute of Communications and Information Sciences
- /
- v.29 no.1C
- /
- pp.119-128
- /
- 2004
In this paper, a low-power 8-point DCT structure is proposed using add and shift operations. Proposed structure adopts 4 cycles for complete 8-point DCT in order to minimize size of hardware and to enable high-speed processing. In the structure, hardware for the first cycle can be shared in the next 3 cycles since all columns in the DCT coefficient matrix are common except sign. Conventional DCT structures implemented with only add and shift operation use CSD(Canonic Signed Digit) form coefficients to reduce the number of adders. To reduce the number of adders further, we propose a new structure using common sub-expression sharing techniques. With this techniques, the proposed 8-point DCT structure achieves 19.5% adder reduction comparison to the conventional structure using only CSD coefficient form.
PDF KSCI

Design of MD5 Hash Processor with Hardware Sharing and Carry Save Addition Scheme (하드웨어 공유와 캐리 보존 덧셈을 이용한 MDS 해쉬 프로세서의 설계)

최병윤;박영수
- Journal of the Korea Institute of Information Security & Cryptology
- /
- v.13 no.4
- /
- pp.139-149
- /
- 2003
In this paper a hardware design of area-efficient hash processor which implements MD5 algorithm using hardware sharing and carry-save addition schemes is described. To reduce area, the processor adopts hardware sharing scheme in which 1 step operation is divided into 2 substeps and then each substep is executed using the same hardware. Also to increase clock frequency, three serial additions of substep operation are transformed into two carry-save additions and one carry propagation addition. The MD5 hash processor is designed using 0.25 $\mu\textrm{m}$CMOS technology and consists of about 13,000 gates. From timing simulation results, the designed MD5 hash processor has 465 Mbps hash rates for 512-bit input message data under 120 MHz operating frequency.
https://doi.org/10.13089/JKIISC.2003.13.4.139 인용 PDF KSCI HTML

Design and Implementation of Unified Hardware for 128-Bit Block Ciphers ARIA and AES

Koo, Bon-Seok;Ryu, Gwon-Ho;Chang, Tae-Joo;Lee, Sang-Jin
- ETRI Journal
- /
- v.29 no.6
- /
- pp.820-822
- /
- 2007
ARIA and the Advanced Encryption Standard (AES) are next generation standard block cipher algorithms of Korea and the US, respectively. This letter presents an area-efficient unified hardware architecture of ARIA and AES. Both algorithms have 128-bit substitution permutation network (SPN) structures, and their substitution and permutation layers could be efficiently merged. Therefore, we propose a 128-bit processor architecture with resource sharing, which is capable of processing ARIA and AES. This is the first architecture which supports both algorithms. Furthermore, it requires only 19,056 logic gates and encrypts data at 720 Mbps and 1,047 Mbps for ARIA and AES, respectively.
PDF

Low Complexity Gradient Magnitude Calculator Hardware Architecture Using Characteristic Analysis of Projection Vector and Hardware Resource Sharing (정사영 벡터의 특징 분석 및 하드웨어 자원 공유기법을 이용한 저면적 Gradient Magnitude 연산 하드웨어 구현)

Kim, WooSuk;Lee, Juseong;An, Ho-Myoung
- The Journal of Korea Institute of Information, Electronics, and Communication Technology
- /
- v.9 no.4
- /
- pp.414-418
- /
- 2016
In this paper, a hardware architecture of low area gradient magnitude calculator is proposed. For the hardware complexity reduction, the characteristic of orthogonal projection vector and hardware resource sharing technique are applied. The proposed hardware architecture can be implemented without degradation of the gradient magnitude data quality since the proposed hardware is implemented with original algorithm. The FPGA implementation result shows the 15% of logic elements and 38% embedded multiplier savings compared with previous work using Altera Cyclone VI (EP4CE115F29C7N) FPGA and Quartus II v15.0 environment.
https://doi.org/10.17661/jkiiect.2016.9.4.414 인용 PDF KSCI

A design of compact and high-performance AES processor using composite field based S-Box and hardware sharing (합성체 기반의 S-Box와 하드웨어 공유를 이용한 저면적/고성능 AES 프로세서 설계)

Yang, Hyun-Chang;Shin, Kyung-Wook
- Journal of the Institute of Electronics Engineers of Korea SD
- /
- v.45 no.8
- /
- pp.67-74
- /
- 2008
A compact and high-performance AES(Advanced Encryption Standard) encryption/decryption processor is designed by applying various hardware sharing and optimization techniques. In order to achieve minimized hardware complexity, sharing the S-Boxes for round transformation with the key scheduler, as well as merging and reusing datapaths for encryption and decryption are utilized, thus the area of S-Boxes is reduced by 25%. Also, the S-Boxes which require the largest hardware in AES processor is designed by applying composite field arithmetic on $GF(((2^2)^2)^2)$, thus it further reduces the area of S-Boxes when compared to the design based on $GF(2^8)$ or $GF((2^4)^2)$. By optimizing the operation of the 64-bit round transformation and round key scheduling, the round transformation is processed in 3 clock cycles and an encryption of 128-bit data block is performed in 31 clock cycles. The designed AES processor has about 15,870 gates, and the estimated throughput is 412.9 Mbps at 100 MHz clock frequency.
PDF KSCI

Search Result 170, Processing Time 0.027 seconds

이메일무단수집거부

이용약관

제 1 장 총칙

제 2 장 이용계약의 체결

제 3 장 계약 당사자의 의무

제 4 장 서비스의 이용

제 5 장 계약 해지 및 이용 제한

제 6 장 손해배상 및 기타사항

Detail Search

Image Search (β)