• Title/Summary/Keyword: Matrix Multiplication

Search Result 167, Processing Time 0.027 seconds

GPU-based Sparse Matrix-Vector Multiplication Schemes for Random Walk with Restart: A Performance Study (랜덤워크 기법을 위한 GPU 기반 희소행렬 벡터 곱셈 방안에 대한 성능 평가)

  • Yu, Jae-Seo;Bae, Hong-Kyun;Kang, Seokwon;Yu, Yongseung;Park, Yongjun;Kim, Sang-Wook
    • Proceedings of the Korea Information Processing Society Conference
    • /
    • 2020.11a
    • /
    • pp.96-97
    • /
    • 2020
  • 랜덤워크 기반 노드 랭킹 방식 중 하나인 RWR(Random Walk with Restart) 기법은 희소행렬 벡터 곱셈 연산과 벡터 간의 합 연산을 반복적으로 수행하며, RWR 의 수행 시간은 희소행렬 벡터 곱셈 연산 방법에 큰 영향을 받는다. 본 논문에서는 CSR5(Compressed Sparse Row 5) 기반 희소행렬 벡터 곱셈 방식과 CSR-vector 기반 희소행렬 곱셈 방식을 채택한 GPU 기반 RWR 기법 간의 비교 실험을 수행한다. 실험을 통해 데이터 셋의 특징에 따른 RWR 의 성능 차이를 분석하고, 적합한 희소행렬 벡터 곱셈 방안 선택에 관한 가이드라인을 제안한다.

A Comparative Analysis of Whole Blood Cadmium by Atomic Absorption Spectrophotometry with a Graphite Furnace (흑연로 원자흡수분광법에 의한 혈액중 카드뮴 정량분석)

  • Park, Jong An;Oh, Hye Jeong;Lee, Jong Hwa
    • Journal of Korean Society of Occupational and Environmental Hygiene
    • /
    • v.6 no.2
    • /
    • pp.301-312
    • /
    • 1996
  • This study was performed to search a optimal analyzing method of cadmium in whole-blood. Cadmium was determined by graphite furnace atomic absorption spectrometry(GFAAS). We investigated the effect of ashing temperature on the absorbance of cadmium in a simple dilution(ten-fold) method with triton X-100 and matrix modifier methods treated with $NH_4H_2PO_4$(1 and 3%) and $Pd(NO_3)_2$(0.00l and 0.005%) as matrix modifier. We also compared the reported reference values of standard blood with values resulted from optimal analyzing conditions of this study. In case of a simple dilution method, when ashing temperature was set at $450^{\circ}C$, the absorbance of sample and background were $0.334{\pm}0.012$ and $1.382{\pm}0.245$, respectively. Background level was higher than the value(0.8) that can be corrected by $D_2$ background correction method. As ashing temperature was rised to $500^{\circ}C$, the absorbance of sample and background were $0.178{\pm}0.008$ and $0.711{\pm}0.223$ respectively. The higher ashing temperature($450^{\circ}C-650^{\circ}C$) was, the lower the absorbance of sample was. In case of a matrix modifier method with $NH_4H_2PO_4$(1 and 3%), when ashing temperature was rised from $500^{\circ}C$ to $650^{\circ}C$, the absorbance of sample slightly changed. The absorbances of sample at $600^{\circ}C$ were $0.230{\pm}0.017$ and $0.137{\pm}0.012$, respectively. These values were larger than that of simple dilution method. But the absorbance of background was higher than the level that can be corrected by $D_2$ method. In case of a matrix modifier method with $Pd(NO_3)_2$(0.001 and 0.005%), the absorbance of sample and background were higher than those of other methods and were stable and reproducible. When ashing temperature was over $550^{\circ}C$, the absorbance of sample was significantly decreased. In case of 0.005% $Pd(NO_3)_2$ carbon residue remained in graphite tube affected the absorbance of sample and background. From these results, We propose that in case of a simple dilution(ten-fold) method with triton X-100 ashing temperature must be maintained below $400^{\circ}C$. In order to diminish the absorbance of background, the alternative method is attenuation of injection volume or multiplication of dilution ratio. We recommend $Pd(NO_3)_2$ than $NH_4H_2PO_4$ as a matrix modifier. In case of a matrix modifier method with $Pd(NO_3)_2$ ashing temperature might be maintained below $550^{\circ}C$.

  • PDF

Random Partial Haar Wavelet Transformation for Single Instruction Multiple Threads (단일 명령 다중 스레드 병렬 플랫폼을 위한 무작위 부분적 Haar 웨이블릿 변환)

  • Park, Taejung
    • Journal of Digital Contents Society
    • /
    • v.16 no.5
    • /
    • pp.805-813
    • /
    • 2015
  • Many researchers expect the compressive sensing and sparse recovery problem can overcome the limitation of conventional digital techniques. However, these new approaches require to solve the l1 norm optimization problems when it comes to signal reconstruction. In the signal reconstruction process, the transform computation by multiplication of a random matrix and a vector consumes considerable computing power. To address this issue, parallel processing is applied to the optimization problems. In particular, due to huge size of original signal, it is hard to store the random matrix directly in memory, which makes one need to design a procedural approach in handling the random matrix. This paper presents a new parallel algorithm to calculate random partial Haar wavelet transform based on Single Instruction Multiple Threads (SIMT) platform.

Fast Mask Operators for the edge Detection in Vision System (시각시스템의 Edge 검출용 고속 마스크 Operator)

  • 최태영
    • The Journal of Korean Institute of Communications and Information Sciences
    • /
    • v.11 no.4
    • /
    • pp.280-286
    • /
    • 1986
  • A newmethod of fast mask operators for edge detection is proposed, which is based on the matrix factorization. The output of each component in the multi-directional mask operator is obtained adding every image pixels in the mask area weighting by corresponding mask element. Therefore, it is same as the result of matrix-vector multiplication like one dimensional transform, i, e, , trasnform of an image vector surrounded by mask with a transform matrix consisted of all the elements of eack mask row by row. In this paper, for the Sobel and Prewitt operators, we find the transform matrices, add up the number of operations factoring these matrices and compare the performances of the proposed method and the standard method. As a result, the number of operations with the proposed method, for Sobel and prewitt operators, without any extra storage element, are reduced by 42.85% and 50% of the standard operations, respectively and in case of an image having 100x100 pixels, the proposed Sobel operator with 301 extra storage locations can be computed by 35.93% of the standard method.

  • PDF

Analysis of Tensor Processing Unit and Simulation Using Python (텐서 처리부의 분석 및 파이썬을 이용한 모의실행)

  • Lee, Jongbok
    • The Journal of the Institute of Internet, Broadcasting and Communication
    • /
    • v.19 no.3
    • /
    • pp.165-171
    • /
    • 2019
  • The study of the computer architecture has shown that major improvements in price-to-energy performance stems from domain-specific hardware development. This paper analyzes the tensor processing unit (TPU) ASIC which can accelerate the reasoning of the artificial neural network (NN). The core device of the TPU is a MAC matrix multiplier capable of high-speed operation and software-managed on-chip memory. The execution model of the TPU can meet the reaction time requirements of the artificial neural network better than the existing CPU and the GPU execution models, with the small area and the low power consumption even though it has many MAC and large memory. Utilizing the TPU for the tensor flow benchmark framework, it can achieve higher performance and better power efficiency than the CPU or CPU. In this paper, we analyze TPU, simulate the Python modeled OpenTPU, and synthesize the matrix multiplication unit, which is the key hardware.

Variable Radix-Two Multibit Coding and Its VLSI Implementation of DCT/IDCT (가변길이 다중비트 코딩을 이용한 DCT/IDCT의 설계)

  • 김대원;최준림
    • Journal of the Institute of Electronics Engineers of Korea SD
    • /
    • v.39 no.12
    • /
    • pp.1062-1070
    • /
    • 2002
  • In this paper, variable radix-two multibit coding algorithm is presented and applied in the implementation of discrete cosine transform(DCT) and inverse discrete cosine transform(IDCT). Variable radix-two multibit coding means the 2k SD (signed digit) representation of overlapped multibit scanning with variable shift method. SD represented by 2k generates partial products, which can be easily implemented with shifters and adders. This algorithm is most powerful for the hardware implementation of DCT/IDCT with constant coefficient matrix multiplication. This paper introduces the suggested algorithm, it's proof and the implementation of DCT/IDCT The implemented IDCT chip with 8 PEs(Processing Elements) and one transpose memory runs at a tate of 400 Mpixels/sec at 54MHz frequency for high speed parallel signal processing, and it's verified in HDTV and MPEG decoder.

Design of an Automatic Generation System for Cycle-accurate Instruction-set Simulators for DSP Processors (DSP 프로세서용 인스트럭션 셋 시뮬레이터 자동생성기의 설계에 관한 연구)

  • Hong, Sung-Min;Park, Chang-Soo;Hwang, Sun-Young
    • The Journal of Korean Institute of Communications and Information Sciences
    • /
    • v.32 no.9A
    • /
    • pp.931-939
    • /
    • 2007
  • This paper describes the system which automatically generates instruction-set simulators cores using the SMDL. SMDL describes structure and instruction-set information of a target DSP machine. Analyzing behavioral information of each pipeline stage of all instructions on a target ASIPS, the proposed system automatically generates a cycle-accurate instruction set simulator in C++ for a target processor. The proposed system has been tested by generating instruction-set simulators for ARM9E-S, ADSP-TS20x, and TMS320C2x architectures. Experiments were performed by checking the functions of the $4{\times}4$ matrix multiplication, 16-bit IIR filter, 32-bit multiplication, and the FFT using the generated simulators. Experimental results show the functional accuracy of the generated simulators.

Experimental Design of S box and G function strong with attacks in SEED-type cipher (SEED 형식 암호에서 공격에 강한 S 박스와 G 함수의 실험적 설계)

  • 박창수;송홍복;조경연
    • Journal of the Korea Institute of Information and Communication Engineering
    • /
    • v.8 no.1
    • /
    • pp.123-136
    • /
    • 2004
  • In this paper, complexity and regularity of polynomial multiplication over $GF({2^n})$ are defined by using Hamming weight of rows and columns of the matrix ever GF(2) which represents polynomial multiplication. It is shown experimentally that in order to construct the block cipher robust against differential cryptanalysis, polynomial multiplication of substitution layer and the permutation layer should have high complexity and high regularity. With result of the experiment, a way of constituting S box and G function is suggested in the block cipher whose structure is similar to SEED, which is KOREA standard of 128-bit block cipher. S box can be formed with a nonlinear function and an affine transform. Nonlinear function must be strong with differential attack and linear attack, and it consists of an inverse number over $GF({2^8})$ which has neither a fixed pout, whose input and output are the same except 0 and 1, nor an opposite fixed number, whose output is one`s complement of the input. Affine transform can be constituted so that the input/output correlation can be the lowest and there can be no fixed point or opposite fixed point. G function undergoes linear transform with 4 S-box outputs using the matrix of 4${\times}$4 over $GF({2^8})$. The components in the matrix of linear transformation have high complexity and high regularity. Furthermore, G function can be constituted so that MDS(Maximum Distance Separable) code can be formed, SAC(Strict Avalanche Criterion) can be met, and there can be no weak input where a fixed point an opposite fixed point, and output can be two`s complement of input. The primitive polynomials of nonlinear function affine transform and linear transformation are different each other. The S box and G function suggested in this paper can be used as a constituent of the block cipher with high security, in that they are strong with differential attack and linear attack with no weak input and they are excellent at diffusion.

A Model-based Methodology for Application Specific Energy Efficient Data path Design Using FPGAs (FPGA에서 에너지 효율이 높은 데이터 경로 구성을 위한 계층적 설계 방법)

  • Jang Ju-Wook;Lee Mi-Sook;Mohanty Sumit;Choi Seonil;Prasanna Viktor K.
    • The KIPS Transactions:PartA
    • /
    • v.12A no.5 s.95
    • /
    • pp.451-460
    • /
    • 2005
  • We present a methodology to design energy-efficient data paths using FPGAs. Our methodology integrates domain specific modeling, coarse-grained performance evaluation, design space exploration, and low-level simulation to understand the tradeoffs between energy, latency, and area. The domain specific modeling technique defines a high-level model by identifying various components and parameters specific to a domain that affect the system-wide energy dissipation. A domain is a family of architectures and corresponding algorithms for a given application kernel. The high-level model also consists of functions for estimating energy, latency, and area that facilitate tradeoff analysis. Design space exploration(DSE) analyzes the design space defined by the domain and selects a set of designs. Low-level simulations are used for accurate performance estimation for the designs selected by the DSE and also for final design selection We illustrate our methodology using a family of architectures and algorithms for matrix multiplication. The designs identified by our methodology demonstrate tradeoffs among energy, latency, and area. We compare our designs with a vendor specified matrix multiplication kernel to demonstrate the effectiveness of our methodology. To illustrate the effectiveness of our methodology, we used average power density(E/AT), energy/(area x latency), as themetric for comparison. For various problem sizes, designs obtained using our methodology are on average $25\%$ superior with respect to the E/AT performance metric, compared with the state-of-the-art designs by Xilinx. We also discuss the implementation of our methodology using the MILAN framework.

Energy-Efficient Signal Processing Using FPGAs (FPGA 상에서 에너지 효율이 높은 병렬 신호처리 기법)

  • Jang Ju-wook;Hwang Yunil;Scrofano Ronald;Prasanna Viktor K.
    • The KIPS Transactions:PartA
    • /
    • v.12A no.4 s.94
    • /
    • pp.305-312
    • /
    • 2005
  • In this paper, we present algorithm-level techniques for energy-efficient design at the algorithm level using FPGAs. We then use these techniques to create energy-efficient designs for two signal processing kernel applications: fast Fourier transform(FFT) and matrix multiplication. We evaluate the performance, in terms of both latency and energy efficiency, of FPGAs in performing these tasks. Using a Xilinx Virtex-II as the target FPGA, we compare the performance of our designs to those from the Xilinx library as well as to conventional algorithms run on the PowerPC core embedded in the Virtex-II Pro and the Texas Instruments TMS320C6415. Our evaluations are done both through estimation based on energy and latency equations on high-level and through low-level simulation. For FFT, our designs dissipated an average of $50\%$ less energy than the design from the Xilinx library and $56\%$ less than the DSP. Our designs showed an EAT factor of 10 times improvement over the embedded processor. These results provide a concrete evidence to substantiate the idea that FPGAs can outperform DSPs and embedded processors in signal processing. Further, they show that PFGAs can achieve this performance while still dissipating less energy than the other two types of devices.