통합 검색 | Korea Science

KAWS: Coordinate Kernel-Aware Warp Scheduling and Warp Sharing Mechanism for Advanced GPUs

Vo, Viet Tan;Kim, Cheol Hong
- Journal of Information Processing Systems
- /
- 제17권6호
- /
- pp.1157-1169
- /
- 2021
Modern graphics processor unit (GPU) architectures offer significant hardware resource enhancements for parallel computing. However, without software optimization, GPUs continuously exhibit hardware resource underutilization. In this paper, we indicate the need to alter different warp scheduler schemes during different kernel execution periods to improve resource utilization. Existing warp schedulers cannot be aware of the kernel progress to provide an effective scheduling policy. In addition, we identified the potential for improving resource utilization for multiple-warp-scheduler GPUs by sharing stalling warps with selected warp schedulers. To address the efficiency issue of the present GPU, we coordinated the kernel-aware warp scheduler and warp sharing mechanism (KAWS). The proposed warp scheduler acknowledges the execution progress of the running kernel to adapt to a more effective scheduling policy when the kernel progress attains a point of resource underutilization. Meanwhile, the warp-sharing mechanism distributes stalling warps to different warp schedulers wherein the execution pipeline unit is ready. Our design achieves performance that is on an average higher than that of the traditional warp scheduler by 7.97% and employs marginal additional hardware overhead.
https://doi.org/10.3745/JIPS.01.0084 인용 PDF KSCI

새로운 연산 공유 승산기를 이용한 1차원 DCT 프로세서의 설계 (Design of 1-D DCT processor using a new efficient computation sharing multiplier)

이태욱;조상복
- 정보처리학회논문지A
- /
- 제10A권4호
- /
- pp.347-356
- /
- 2003
DCT 알고리즘은 내적을 효율적으로 처리할 수 있는 하드웨어 구조가 필수적이다. 내적 연산을 위한 기존의 방법들은 하드웨어 복잡도가 높기 때문에, 이론 줄이기 위한 방법으로 연산 공유 승산기가 제안되었다. 하지만 기존의 연산 공유 승산기는 전처리기 및 선택기의 비효율적 구조로 인한 성능저하의 문제점을 가지고 있다. 본 논문에서는 새로운 연산 공유 승산기를 제안하고 이를 1차원 DCT 프로세서에 적용하여 구현하였다. 연산 공유 승산기의 구조 및 논리 합성 비교 시 새로운 승산기는 기존에 비해 효율적인 하드웨어 구성이 가능함을 확인하였고, 1차원 DCT 프로세서 설계 시 기존 구현 방식들에 비해 우수한 성능을 나타내었다.
https://doi.org/10.3745/KIPSTA.2003.10A.4.347 인용 PDF KSCI

Efficient VLSI architecture for one-dimensional discrete wavelet transform using a sealable data reorder unit

Park, Taegeun
- 대한전자공학회:학술대회논문집
- /
- 대한전자공학회 2002년도 ITC-CSCC -1
- /
- pp.353-356
- /
- 2002
In this paper, we design an efficient, scalable one-dimensional discrete wavelet transform (1DDWT) filter using data reorder unit (DRU). At each level, the required hardware is optimized by sharing multipliers and adders because the input rate is reduced by a factor of two at each level due to decimation. The proposed architecture shows 100% hardware utilization by balancing hardware with input rate. Furthermore, sharing the coefficients of the highpass and the lowpass filters using the mirror filter property reduces the number of multipliers and adders by half. We designed a scalable DRU that efficiently reorders and feeds inputs to highpass and lowpass filters. The proposed DRU-based architecture is so regular and scalable that it can be easily extended to an arbitrary 1D DWT structure with M taps and J levels. Compared to other architectures, the proposed DWT filter shows efficiency in performance with relatively less hardware.
PDF

연산공유 승산 알고리즘을 이용한 내적의 최적화 및 이를 이용한 1차원 DCT 프로세서 설계 (Optimization Design Method for Inner Product Using CSHM Algorithm and its Application to 1-D DCT Processor)

이태욱;조상복
- 대한전기학회논문지:시스템및제어부문D
- /
- 제53권2호
- /
- pp.86-93
- /
- 2004
The DCT algorithm needs an efficient hardware architecture to compute inner product. The conventional design method, like ROM-based DA(Distributed Arithmetic), has large hardware complexity. Because of this reason, a CSHM(Computation Sharing Multiplication) was proposed for implementing inner product by Park. However, the Park's CSHM has inefficient hardware architecture in the precomputer and select units. Therefore it degrades the performance of the multiplier. In this paper, we presents the optimization design method for inner product using CSHM algorithm and applied it to implementation of 1-D DCT processor. The experimental results show that the proposed multiplier is more efficient than Park's when hardware architectures and logic synthesis results were compared. The designed 1-D DCT processor by using proposed design method is more high performance than typical methods.
PDF KSCI

효율적인 하드웨어 공유를 위한 단어길이 최적화 알고리듬 (A bitwidth optimization algorithm for efficient hardware sharing)

최정일;전홍신;이정주;김문수;황선영
- 한국통신학회논문지
- /
- 제22권3호
- /
- pp.454-468
- /
- 1997
This paper presents a bitwidth optimization algorithm for efficient hardware sharing in digital signal processing system. The proposed algorithm determines the fixed-point representation for each signal through bitwidth optimization to generate the hardware requiring less area. To reduce the operator area, the algorithm partitions the abstract operations in the design description into several groups, such that the operations in the same group can share an operator. The partitioning result are fed to a high-level synthesis system to generate the pipelined fixed-point datapaths. The proposed algorithm has been implemented in SODAS-DSP an automatic synthesis system for fixed-point DSP hardware. Accepting the models of DSP algorithms in schematics, the system automatically generates the fixed-point datapath and controller satisfying the design constraints in area, speed, and SNR(Signal-to-Noise Ratio). Experimental results show that the efficiency of the proposed algorithm by generates the area-efficient DSP hardwares satisfying performance constraints.
PDF

Common sub-expression sharing을 이용한 고속/저전력 DCT 구조 (Low-power/high-speed DCT structure using common sub-expression sharing)

장영범;양세정
- 한국통신학회논문지
- /
- 제29권1C호
- /
- pp.119-128
- /
- 2004
이 논문에서는 곱셈기를 사용하지 않고 덧셈기 만을 사용하여 DCT를 효과적으로 수행하는 저전력 구조를 제안하였다. 고속처리가 가능하면서도 구현 하드웨어의 크기를 최소화하기 위하여 8-point DCT를 4 cycle에 수행하는 구조를 사용하였다. 즉, 첫 번째 cycle에서 사용한 계수용 하드웨어를 두 번째부터 네 번째까지의 계산에서도 공통으로 사용할 수 있는 구조를 채택하였다. 덧셈기 만을 사용하는 기존의 구조들은 CSD(Canonic signed digit)형의 계수를 사용하여 덧셈의 수를 줄이고 있다. 본 논문에서는 Common subexpression sharing 방식을 채용함으로서 하드웨어를 더욱 감소시킬 수 있는 구조를 제안하였다. 그 결과 8-point DCT의 경우에 CSD 만을 사용한 구조와 비교하여 19.5%의 덧셈 수 감소 효과를 달성하였다.
PDF KSCI

하드웨어 공유와 캐리 보존 덧셈을 이용한 MDS 해쉬 프로세서의 설계 (Design of MD5 Hash Processor with Hardware Sharing and Carry Save Addition Scheme)

최병윤;박영수
- 정보보호학회논문지
- /
- 제13권4호
- /
- pp.139-149
- /
- 2003
본 논문에서는 하드웨어 공유와 캐리 보존 덧셈 연산을 이용하여 MD5 알고리즘을 구현하는 면적 효율적인 해쉬 프로세서를 하드웨어로 설계하였다. 면적을 최소화하기 위해, MD5의 1 단계 동작을 2개의 부분 단계로 세분화하고, 각각의 부분 단계 동작을 동일 하드웨어로 구현하는 방식으로 하드웨어 공유를 극대화하였다. 그리고 MD5의 부분 단계를 구성하는 3개의 직렬 캐리 전달 덧셈 동작을 2개의 캐리 보존 덧셈과 1개의 캐리 전달 덧셈으로 변환하여 동작 주파수를 증가시켰다. MD5 해쉬 프로세서는 0.25$\mu\textrm{m}$ CMOS 표준 셀 라이브러리로 합성한 결과 약 13,000개의 게이트 수로 구성되며, 타이밍 분석 결과 설계된 MD5 해쉬 프로세서는 120 MHz의 동작 주파수에서 512 비트 입력 메시지에 대해 465 Mbps의 성능을 갖는다.
https://doi.org/10.13089/JKIISC.2003.13.4.139 인용 PDF KSCI HTML

Design and Implementation of Unified Hardware for 128-Bit Block Ciphers ARIA and AES

Koo, Bon-Seok;Ryu, Gwon-Ho;Chang, Tae-Joo;Lee, Sang-Jin
- ETRI Journal
- /
- 제29권6호
- /
- pp.820-822
- /
- 2007
ARIA and the Advanced Encryption Standard (AES) are next generation standard block cipher algorithms of Korea and the US, respectively. This letter presents an area-efficient unified hardware architecture of ARIA and AES. Both algorithms have 128-bit substitution permutation network (SPN) structures, and their substitution and permutation layers could be efficiently merged. Therefore, we propose a 128-bit processor architecture with resource sharing, which is capable of processing ARIA and AES. This is the first architecture which supports both algorithms. Furthermore, it requires only 19,056 logic gates and encrypts data at 720 Mbps and 1,047 Mbps for ARIA and AES, respectively.
PDF

정사영 벡터의 특징 분석 및 하드웨어 자원 공유기법을 이용한 저면적 Gradient Magnitude 연산 하드웨어 구현 (Low Complexity Gradient Magnitude Calculator Hardware Architecture Using Characteristic Analysis of Projection Vector and Hardware Resource Sharing)

김우석;이주성;안호명
- 한국정보전자통신기술학회논문지
- /
- 제9권4호
- /
- pp.414-418
- /
- 2016
본 논문은 저면적 gradient magnitude 연산을 위한 하드웨어 구조를 제안한다. 하드웨어 복잡도를 줄이기 위해 정사영 벡터의 특징 및 하드웨어 자원 공유기법을 이용했다. 제안된 하드웨어 구조는 gradient magnitude 연산 알고리즘의 변형 없이 구현되었기 때문에 gradient magnitude 데이터 품질의 열화 없이 구현될 수 있다. 제안된 저면적 gradient magnitude 연산 하드웨어는 Altera Quartus II v15.0 환경에서 Altera Cyclone VI (EP4CE115F29C7N) FPGA를 이용하여 구현되었다. 구현 결과, 기존 하드웨어 구조를 이용하여 구현한 연산기와의 비교에서 15%의 logic elements 및 38%의 embedded multiplier 절감 효과가 있음을 확인했다.
https://doi.org/10.17661/jkiiect.2016.9.4.414 인용 PDF KSCI

합성체 기반의 S-Box와 하드웨어 공유를 이용한 저면적/고성능 AES 프로세서 설계 (A design of compact and high-performance AES processor using composite field based S-Box and hardware sharing)

양현창;신경욱
- 대한전자공학회논문지SD
- /
- 제45권8호
- /
- pp.67-74
- /
- 2008
다양한 하드웨어 공유 및 최적화 방법을 적용하여 저면적/고성능 AES(Advanced Encryption Standard) 암호/복호 프로세서를 설계하였다. 라운드 변환블록 내부에 암호연산과 복호연산 회로의 공유 및 재사용과 함께 라운드 변환블록과 키 스케줄러의 S-Box 공유 등을 통해 회로 복잡도가 최소화되도록 하였으며, 이를 통해 S-Box의 면적을 약 25% 감소시켰다. 또한, AES 프로세서에서 가장 큰 면적을 차지하는 S-Box를 합성체 $GF(((2^2)^2)^2)$ 연산을 적용하여 구현함으로써 $GF(2^8)$ 또는 $GF((2^4)^2)$ 기반의 설계에 비해 S-Box의 면적이 더욱 감소되도록 하였다. 64-비트 데이터패스의 라운드 변환블록과 라운드 키 생성기의 동작을 최적화시켜 라운드 연산이 3 클록주기에 처리되도록 하였으며, 128비트 데이터 블록의 암호화가 31 클록주기에 처리되도록 하였다. 설계된 AES 암호/복호 프로세서는 약 15,870 게이트로 구현되었으며, 100 MHz 클록으로 동작하여 412.9 Mbps의 성능이 예상된다.
PDF KSCI

검색결과 170건 처리시간 0.02초

이메일무단수집거부

이용약관

제 1 장 총칙

제 2 장 이용계약의 체결

제 3 장 계약 당사자의 의무

제 4 장 서비스의 이용

제 5 장 계약 해지 및 이용 제한

제 6 장 손해배상 및 기타사항

자세히 찾기

이미지 검색 (β)