Search | Korea Science

Exploring GEMM Optimization Techniques for PIM Architecture: A Case Study on UPMEM (PIM 아키텍처를 위한 GEMM 최적화 기법 탐구: UPMEM 사례 연구)

Chan Lee;Heelim Choi;Hanjun Kim
- Proceedings of the Korea Information Processing Society Conference / 2024.05a / pp.65-68 / 2024
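This study explores optimization techniques for General Matrix Multiplication (GEMM) on Processing-in-Memory (PIM) architectures, using UPMEM PIM as a case study. It confirms the potential benefits of PIM for GEMM, which come from overcoming the memory bandwidth limits experienced on CPUs and from exploiting the parallel processing structure. It also evaluates the efficiency of three consecutive matrix multiplications and finds that data transfer time is the main bottleneck for performance optimization. By increasing the amount of data sent from the CPU to the UPMEM kernel in each transfer while reducing the number of transfers, performance improved by up to 6.57 times compared with the CPU.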
https://doi.org/10.3745/PKIPS.y2024m05a.65
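The batching idea in this abstract (fewer, larger host-to-kernel transfers) can be illustrated with a simple cost model. This is only a sketch: the function name `transfer_time`, the fixed per-call overhead, and the bandwidth figure are all hypothetical assumptions for illustration, not measurements or APIs from the paper or from UPMEM.

```python
import math


def transfer_time(total_bytes, bytes_per_call, per_call_overhead_s, bandwidth_bytes_per_s):
    """Estimate transfer time as (number of calls * fixed overhead) + streaming time.

    All parameters are hypothetical; the model only shows why fewer, larger
    transfers can reduce total time when each call carries a fixed cost.
    """
    calls = math.ceil(total_bytes / bytes_per_call)
    return calls * per_call_overhead_s + total_bytes / bandwidth_bytes_per_s


# Same payload, sent as many small transfers vs. one large transfer.
payload = 1 << 20  # 1 MiB (illustrative)
many_small = transfer_time(payload, 1 << 10, 1e-5, 1e9)
one_large = transfer_time(payload, 1 << 20, 1e-5, 1e9)
print(many_small > one_large)
```

Under this model the streaming term is identical in both cases, so any difference comes entirely from the per-call overhead.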

Energy-Efficient Signal Processing Using FPGAs (FPGA 상에서 에너지 효율이 높은 병렬 신호처리 기법)

Jang Ju-wook;Hwang Yunil;Scrofano Ronald;Prasanna Viktor K.
- The KIPS Transactions:PartA / v.12A no.4 s.94 / pp.305-312 / 2005
In this paper, we present algorithm-level techniques for energy-efficient design using FPGAs. We then use these techniques to create energy-efficient designs for two signal processing kernels: the fast Fourier transform (FFT) and matrix multiplication. We evaluate the performance of FPGAs on these tasks in terms of both latency and energy efficiency. Using a Xilinx Virtex-II as the target FPGA, we compare our designs with those from the Xilinx library and with conventional algorithms run on the PowerPC core embedded in the Virtex-II Pro and on the Texas Instruments TMS320C6415. Our evaluations use both high-level estimation based on energy and latency equations and low-level simulation. For FFT, our designs dissipated on average 50% less energy than the design from the Xilinx library and 56% less than the DSP. Our designs showed a 10-fold improvement in the EAT factor over the embedded processor. These results provide concrete evidence that FPGAs can outperform DSPs and embedded processors in signal processing. Further, they show that FPGAs can achieve this performance while dissipating less energy than the other two types of devices.
https://doi.org/10.3745/KIPSTA.2005.12A.4.305

A Practical Synthesis Technique for Optimal Arithmetic Hardware based on Carry-Save-Adders (캐리-세이브 가산기에 기초한 연산 하드웨어 최적화를 위한 실질적 합성 기법)

Kim, Tae-Hwan;Eom, Jun-Hyeong
- Journal of KIISE:Computer Systems and Theory / v.28 no.10 / pp.520-529 / 2001
The carry-save adder (CSA) is one of the most effective operation cells for implementing arithmetic hardware with high performance and small circuit area. A fundamental drawback of existing CSA applications is that they are limited to the local parts of an arithmetic circuit that are directly converted to additions. To resolve this limitation, we propose a set of new CSA transformation techniques: optimizing arithmetic with multiplexors, optimizing arithmetic across multiple designs, and optimizing arithmetic with multiplications. We then design a new CSA transformation algorithm that integrates the proposed techniques, so that CSAs can be used more globally. Extensive experiments on practical designs are provided to show the effectiveness of the proposed algorithm over conventional CSA techniques.
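For readers unfamiliar with the cell itself, the following sketch shows what a carry-save adder does: it compresses three operands into a sum word and a carry word without propagating carries, so only one ordinary addition is needed at the end. It is a generic illustration for non-negative integers, not the transformation algorithm proposed in the paper; the function names are hypothetical.

```python
def carry_save_add(a, b, c):
    """3:2 compression: return (sum_bits, carry_bits) with a + b + c == sum_bits + carry_bits."""
    sum_bits = a ^ b ^ c                             # per-bit sum without carries
    carry_bits = ((a & b) | (a & c) | (b & c)) << 1  # per-bit majority, shifted to the next position
    return sum_bits, carry_bits


def csa_tree_sum(operands):
    """Add many non-negative integers using CSA steps and one final carry-propagating add."""
    ops = list(operands)
    while len(ops) > 2:
        a, b, c = ops.pop(), ops.pop(), ops.pop()
        ops.extend(carry_save_add(a, b, c))  # three operands become two
    return sum(ops)                          # the single carry-propagating addition


print(csa_tree_sum([3, 5, 7, 9]) == 3 + 5 + 7 + 9)
```

Each loop iteration replaces three operands with two, which is why a multi-operand sum needs only one slow carry-propagating addition.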

A Study on GPGPU Performance Improvement Technique on GCN Architecture Using OpenCL API (GCN 아키텍쳐 상에서의 OpenCL을 이용한 GPGPU 성능향상 기법 연구)

Woo, DongHee;Kim, YoonHo
- The Journal of Society for e-Business Studies / v.23 no.1 / pp.37-45 / 2018
The systems on which a variety of programs run have continuously expanded from conventional single-core and multi-core systems to many-core and heterogeneous systems. However, existing research has focused mostly on parallelizing programs with the CUDA framework and rarely on optimization for AMD GCN-based GPUs. To address this gap, our study focuses on optimization techniques for the GCN architecture in a GPGPU environment and achieves a performance improvement. Specifically, using the techniques we propose, we reduce the computation time of matrix multiplication and convolution algorithms on GPGPU by more than 30%. We also increase kernel throughput by more than 40%.
https://doi.org/10.7838/jsebs.2018.23.1.037

Pipelined Wake-Up Scheme to Reduce Power-Line Noise of MTCMOS Megablock Shutdown for Low-Power VLSI Systems (저전력 VLSI 시스템에서 MTCMOS 블록 전원 차단 시의 전원신 잡음을 줄인 파이프라인 전원 복귀 기법)

이성주;연규성;전치훈;장용주;조지연;위재경
- Journal of the Institute of Electronics Engineers of Korea SD / v.41 no.10 / pp.77-83 / 2004
In low-power VLSI systems, it is effective to suppress leakage current by shutting down megablocks in idle states. Recently, multi-threshold voltage CMOS (MTCMOS) has been widely adopted to shut down the power supply. However, a shorter wake-up time is required as the operating frequency increases. This causes a large current surge during the wake-up process, which often leads to system malfunction due to severe power-line noise. In this paper, a novel wake-up scheme is proposed to solve this problem. It uses a pipelined wake-up strategy with several stages, which reduces the maximum current on the power line and the corresponding power-line noise. To evaluate its efficiency, the proposed scheme was applied to a multiplier block in a Compact Flash memory controller chip. Power-line noise during the shutdown and wake-up processes was simulated and analyzed. The simulation results show that the proposed scheme greatly reduces power-line noise compared with conventional schemes.

A Design and Performance Evaluation of Path Search by Simplification of Estimated Values based on Variable Heuristic (가변 휴리스틱 기반 추정치 간소화를 통한 경로탐색 기법의 설계 및 성능 평가)

Kim, Jin-Deog
- Journal of the Korea Institute of Information and Communication Engineering / v.10 no.11 / pp.2002-2007 / 2006
A path search method in a telematics system should consider the traffic flow of roads as well as the shortest time, because the optimal path with minimized travel time can change continuously with the traffic flow. Existing path search methods cannot cope efficiently with changes in traffic flow. A search method that uses traffic information also needs more computation time than an existing shortest-path search. In this paper, a method for improving the efficiency of path search is implemented and its performance is evaluated. The method employs a fixed grid for a heuristic that adjusts to traffic flow. Moreover, to simplify the computation of estimated values, it only adds graded decimal values, taking into account the gradient between the departure point and the destination, instead of multiplying floating-point numbers. The experimental results show that it achieves both high accuracy and short execution time.

Development of Signal Processing Technique of Digital Speckle Tomography for Analysis of Three-Dimensional Density Distributions of Unsteady and Asymmetric Gas Flow (비정상 비대칭 기체 유동의 3차원 밀도 분포 분석을 위한 디지털 스펙클 토모그래피 기법의 신호 처리 기술 개발)

Baek, Seung-Hwan;Kim, Yong-Jae;Ko, Han-Seo
- Journal of the Korean Society for Nondestructive Testing / v.26 no.2 / pp.108-114 / 2006
Transient and asymmetric density distributions of butane flow have been investigated from laser image signals using a newly developed three-dimensional digital speckle tomography technique. Because the flows were asymmetric and transient, the displaced speckle signals were captured simultaneously by multiple CCD images at three viewing angles. The speckle displacements between the no-flow condition and downward butane flow from a circular half opening were calculated by a cross-correlation tracking method, so that the distances could be converted to deflection angles of laser rays for density gradients. The three-dimensional density fields were reconstructed from the fringe shift signal, which is integrated from the deflection angle, by a real-time multiplicative algebraic reconstruction technique (MART).
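As background on the reconstruction step named in this abstract, the sketch below shows one common textbook form of the MART update, in which each cell value is multiplied by a power of the ratio between a measured projection and its current estimate. The exact variant, weights, and relaxation used in the paper are not given in the abstract, so the function `mart`, its parameters, and the tiny two-cell system are assumptions for illustration only.

```python
def mart(weights, projections, sweeps=50, relaxation=1.0):
    """Multiplicative ART sketch for a small system W f = p with positive data.

    weights[i][j] is the (assumed) contribution of cell j to ray i.
    Update per ray i: f_j <- f_j * (p_i / q_i) ** (relaxation * w_ij),
    where q_i is the projection of the current estimate along ray i.
    """
    f = [1.0] * len(weights[0])  # strictly positive initial guess
    for _ in range(sweeps):
        for w_row, p in zip(weights, projections):
            q = sum(w * fj for w, fj in zip(w_row, f))
            if q <= 0.0:
                continue  # skip rays that cross no active cells
            ratio = p / q
            f = [fj * ratio ** (relaxation * w) for w, fj in zip(w_row, f)]
    return f


# Two cells observed by two rays; the consistent solution is f = [1, 2].
estimate = mart([[1.0, 1.0], [1.0, 0.0]], [3.0, 1.0])
print([round(v, 6) for v in estimate])
```

Because the correction is multiplicative, cells that start positive stay positive, which suits physical quantities such as density.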

Side-Channel Attacks on Square Always Exponentiation Algorithm (Square Always 멱승 알고리듬에 대한 부채널 공격)

Jung, Seung-Gyo;Ha, Jae-Cheol
- Journal of the Korea Institute of Information Security & Cryptology / v.24 no.3 / pp.477-489 / 2014
Many side-channel attacks have attempted to extract the secret private key by exploiting flaws that arise when a public-key cryptosystem is implemented in an embedded security device. Although cryptographic exponentiation is basically composed of a sequence of multiplications and squarings, a new Square Always exponentiation algorithm, based on trading multiplications for squarings, was recently presented as a countermeasure against side-channel attacks. In this paper, we propose a Known Power Collision Analysis and a modified Doubling attack to break the Right-to-Left Square Always exponentiation algorithm, which is known to be resistant to existing side-channel attacks. We also present a Collision-based Combined Attack, which combines a fault attack with power collision analysis. Furthermore, we verify by computer simulation that the Square Always algorithm is vulnerable to the proposed side-channel attacks.
https://doi.org/10.13089/JKIISC.2014.24.3.477
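The phrase "trading multiplications for squarings" in this abstract can be illustrated with the identity a·b = ((a+b)² − a² − b²)/2. The sketch below plugs that identity into a plain left-to-right square-and-multiply loop so that every multiplication becomes squarings. It is an illustration under stated assumptions (odd modulus, hypothetical function names), not the Right-to-Left algorithm attacked in the paper, and it is not a side-channel-hardened implementation.

```python
def mul_by_squarings(a, b, n):
    """Return a*b mod n (n odd) using only squarings: a*b = ((a+b)^2 - a^2 - b^2) / 2."""
    inv2 = (n + 1) // 2  # inverse of 2 modulo an odd n; in hardware this is a halving step
    s = (pow(a + b, 2, n) - pow(a, 2, n) - pow(b, 2, n)) % n
    return (s * inv2) % n


def modexp_square_always(base, exponent, n):
    """Left-to-right exponentiation in which every multiplication is replaced by squarings."""
    result = 1
    for bit in bin(exponent)[2:]:
        result = pow(result, 2, n)                      # squaring step
        if bit == "1":
            result = mul_by_squarings(result, base, n)  # "multiplication" done via squarings
    return result


print(modexp_square_always(7, 123, 1009) == pow(7, 123, 1009))
```

The goal of such a construction is that every modular operation is a squaring, so multiplications and squarings can no longer be told apart by their power signatures.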

Compressive Sensing of the FIR Filter Coefficients for Multiplierless Implementation (무곱셈 구현을 위한 FIR 필터 계수의 압축 센싱)

Kim, Seehyun
- Journal of the Korea Institute of Information and Communication Engineering / v.18 no.10 / pp.2375-2381 / 2014
When the coefficient set of an FIR filter is represented in the canonic signed digit (CSD) format with only a few nonzero digits, high-data-rate digital filters can be implemented at low hardware cost. Designing an FIR filter with CSD coefficients whose number of nonzero signed digits is minimal is equivalent to finding a sparse set of nonzero signed digits in the coefficient set that satisfies the target frequency response with minimal maximum error. In this paper, a compressive-sensing-based design algorithm for FIR filters with CSD coefficients is proposed for multiplierless, high-speed implementation. Design examples show that multiplierless FIR filters whose frequency response approximates the target can be designed using fewer than two additions per tap on average, making them suitable for high-speed filtering applications.
https://doi.org/10.6109/jkiice.2014.18.10.2375
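To make the CSD format mentioned in this abstract concrete, the sketch below converts an integer coefficient to signed digits in {-1, 0, 1} with no two adjacent nonzero digits, then multiplies by that coefficient using only shifts and additions or subtractions. It shows the number format and the multiplierless idea only; it is not the compressive-sensing design algorithm proposed in the paper, and the function names are hypothetical.

```python
def to_csd(value):
    """Return the canonic signed digit form of an integer, least significant digit first."""
    digits = []
    while value != 0:
        if value & 1:
            digit = 2 - (value & 3)  # +1 if value mod 4 == 1, -1 if value mod 4 == 3
            value -= digit           # makes the next digit zero
        else:
            digit = 0
        digits.append(digit)
        value >>= 1
    return digits


def multiply_shift_add(x, csd_digits):
    """Multiply integer x by a CSD coefficient using only shifts, adds, and subtracts."""
    acc = 0
    for shift, digit in enumerate(csd_digits):
        if digit == 1:
            acc += x << shift
        elif digit == -1:
            acc -= x << shift
    return acc


print(to_csd(7))  # 7 = 8 - 1, so two nonzero digits instead of three ones in binary
print(multiply_shift_add(13, to_csd(23)) == 13 * 23)
```

Each nonzero digit costs one adder or subtractor, which is why minimizing the number of nonzero digits reduces hardware cost.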

Implementation of Neural Network Accelerator for Rendering Noise Reduction on OpenCL (OpenCL을 이용한 랜더링 노이즈 제거를 위한 뉴럴 네트워크 가속기 구현)

Nam, Kihun
- The Journal of the Convergence on Culture Technology / v.4 no.4 / pp.373-377 / 2018
In this paper, we propose an implementation of a neural network accelerator for reducing rendering noise using OpenCL. Among rendering algorithms, we select ray tracing to ensure high-quality graphics. Ray tracing renders by casting rays, and using fewer rays results in noise. Using more rays produces a higher-quality image but takes longer. To reduce operation time while using fewer rays, a learning-based filtering algorithm using a neural network was applied, although it does not always produce an optimal result. In this paper, we present a new approach to matrix multiplication, based on General Matrix Multiplication, for improved performance. As the development environment, we used OpenCL, which specializes in high-speed parallel processing. The proposed architecture was verified using a Kintex UltraScale XKU6909T-2FDFG1157C FPGA board. Calculating the parameters is about 1.12 times faster than with the Verilog-HDL structure.
https://doi.org/10.17703/JCCT.2018.4.4.373

