• Title/Summary/Keyword: Matrix Multiplication

A Design Exploration Technique of a Hybrid Memory for Artificial Intelligence Applications (인공지능 응용을 위한 하이브리드 메모리 설계 탐색 기법)

  • Cho, Doo-San
    • Journal of the Korean Society of Industry Convergence / v.24 no.5 / pp.531-536 / 2021
  • As artificial intelligence technology advances, it is being applied in a growing range of fields. It performs particularly well in image recognition and classification, and chip designs specialized for this field are being actively studied. Such AI-specific chips are designed to deliver optimal performance for their target applications, and memory component optimization is becoming an important issue in the design task. This study presents an optimal algorithm for memory size exploration; the optimal memory size is an important factor in producing a design that meets performance, cost, and power consumption requirements.

CSR Sparse Matrix Vector Multiplication Using Zero Copy (Zero Copy를 이용한 CSR 희소행렬 연산)

  • Yoon, SangHyeuk;Jeon, Dayun;Park, Neungsoo
    • Proceedings of the Korea Information Processing Society Conference / 2021.05a / pp.45-47 / 2021
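  • An APU (Accelerated Processing Unit) is a processor that integrates a CPU and a GPU, which share the same memory space. In conventional heterogeneous computing environments where the CPU and GPU are separate, data must be copied from CPU memory to GPU memory before the GPU can process it. Because an APU uses a single memory space, both devices can reach the same physical address through virtual address allocation without any memory copy; this is called Zero Copy. Sparse matrix operations were used to test Zero Copy performance: compared with conventional memory copy, it was about 4.67 times faster for large data and about 6.27 times faster for small data.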
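
For readers unfamiliar with the operation named in the title, the following is a minimal CPU-side sketch of a CSR sparse matrix-vector product. It is not the authors' Zero Copy implementation; the function name `csr_spmv` and the array names `values`, `col_idx`, and `row_ptr` are illustrative assumptions.

```python
def csr_spmv(values, col_idx, row_ptr, x):
    """Compute y = A @ x for a matrix A stored in CSR form.

    values  : nonzero entries, stored row by row
    col_idx : column index of each entry in `values`
    row_ptr : row i owns entries row_ptr[i] .. row_ptr[i + 1] - 1
    """
    n_rows = len(row_ptr) - 1
    y = [0.0] * n_rows
    for i in range(n_rows):
        acc = 0.0
        for k in range(row_ptr[i], row_ptr[i + 1]):
            acc += values[k] * x[col_idx[k]]
        y[i] = acc
    return y


# Hypothetical 3x3 example: A = [[10, 0, 0], [0, 20, 30], [0, 0, 40]]
values = [10.0, 20.0, 30.0, 40.0]
col_idx = [0, 1, 2, 2]
row_ptr = [0, 1, 3, 4]
y = csr_spmv(values, col_idx, row_ptr, [1.0, 2.0, 3.0])
# y == [10.0, 130.0, 120.0]
```

Only the three CSR arrays and the input vector need to be visible to the compute device, which is why the cost of copying them (or avoiding the copy) matters for this kernel.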

GPU-based Sparse Matrix-Vector Multiplication Schemes for Random Walk with Restart: A Performance Study (랜덤워크 기법을 위한 GPU 기반 희소행렬 벡터 곱셈 방안에 대한 성능 평가)

  • Yu, Jae-Seo;Bae, Hong-Kyun;Kang, Seokwon;Yu, Yongseung;Park, Yongjun;Kim, Sang-Wook
    • Proceedings of the Korea Information Processing Society Conference / 2020.11a / pp.96-97 / 2020
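  • RWR (Random Walk with Restart), one of the random-walk-based node ranking methods, repeatedly performs a sparse matrix-vector multiplication followed by a vector addition, so its execution time depends heavily on how the sparse matrix-vector multiplication is carried out. This paper experimentally compares GPU-based RWR implementations that adopt CSR5 (Compressed Sparse Row 5)-based and CSR-vector-based sparse matrix-vector multiplication. Through the experiments, we analyze how RWR performance varies with the characteristics of the dataset and propose guidelines for choosing a suitable sparse matrix-vector multiplication scheme.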
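
The iteration described in this abstract (one sparse matrix-vector product plus one vector addition per step) can be sketched as follows. This is a plain CPU sketch under assumptions: the matrix is column-normalized and stored in ordinary CSR, and the names `rwr`, `restart`, and `tol` are hypothetical. It does not reproduce the CSR5 or CSR-vector GPU kernels the paper compares.

```python
def spmv(values, col_idx, row_ptr, x):
    """y = W @ x for W stored in CSR form."""
    return [
        sum(values[k] * x[col_idx[k]] for k in range(row_ptr[i], row_ptr[i + 1]))
        for i in range(len(row_ptr) - 1)
    ]


def rwr(values, col_idx, row_ptr, seed, restart=0.15, tol=1e-10, max_iter=1000):
    """Iterate r <- (1 - restart) * W r + restart * q until the L1 change < tol."""
    n = len(row_ptr) - 1
    q = [0.0] * n
    q[seed] = 1.0          # restart vector: all mass on the seed node
    r = q[:]
    for _ in range(max_iter):
        wr = spmv(values, col_idx, row_ptr, r)                            # SpMV step
        new_r = [(1 - restart) * wr[i] + restart * q[i] for i in range(n)]  # vector add
        diff = sum(abs(a - b) for a, b in zip(new_r, r))
        r = new_r
        if diff < tol:
            break
    return r


# Hypothetical path graph 0 - 1 - 2, column-normalized adjacency in CSR.
values = [0.5, 1.0, 1.0, 0.5]
col_idx = [1, 0, 2, 1]
row_ptr = [0, 1, 3, 4]
scores = rwr(values, col_idx, row_ptr, seed=0)
```

Because `spmv` dominates each iteration, swapping in a different sparse format or kernel changes the total running time without changing the scores.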

Exploring GEMM Optimization Techniques for PIM Architecture: A Case Study on UPMEM (PIM 아키텍처를 위한 GEMM 최적화 기법 탐구: UPMEM 사례 연구)

  • Chan Lee;Heelim Choi;Hanjun Kim
    • Proceedings of the Korea Information Processing Society Conference / 2024.05a / pp.65-68 / 2024
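  • This study explores optimization techniques for General Matrix Multiplication (GEMM) on a PIM (Processing-in-Memory) architecture, using UPMEM PIM as the platform. It confirms the potential benefit of PIM for GEMM, which comes from overcoming the memory bandwidth limit experienced on the CPU and from exploiting a parallel processing structure. It also evaluates the efficiency of three consecutive matrix multiplications and finds that data transfer time is the main bottleneck for performance optimization. By increasing the amount of data sent from the CPU to the UPMEM kernel in each transfer while reducing the number of transfers, performance improved by up to 6.57 times over the CPU.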

A Comparative Analysis of Whole Blood Cadmium by Atomic Absorption Spectrophotometry with a Graphite Furnace (흑연로 원자흡수분광법에 의한 혈액중 카드뮴 정량분석)

  • Park, Jong An;Oh, Hye Jeong;Lee, Jong Hwa
    • Journal of Korean Society of Occupational and Environmental Hygiene / v.6 no.2 / pp.301-312 / 1996
  • This study sought an optimal method for determining cadmium in whole blood. Cadmium was determined by graphite furnace atomic absorption spectrometry (GFAAS). We investigated the effect of ashing temperature on cadmium absorbance for a simple ten-fold dilution method with Triton X-100 and for matrix modifier methods using $NH_4H_2PO_4$ (1 and 3%) and $Pd(NO_3)_2$ (0.001 and 0.005%) as the matrix modifier. We also compared the reported reference values of standard blood with the values obtained under the optimal analytical conditions of this study. With the simple dilution method at an ashing temperature of $450^{\circ}C$, the sample and background absorbances were $0.334{\pm}0.012$ and $1.382{\pm}0.245$, respectively; the background level exceeded the value (0.8) that the $D_2$ background correction method can correct. When the ashing temperature was raised to $500^{\circ}C$, the sample and background absorbances were $0.178{\pm}0.008$ and $0.711{\pm}0.223$, respectively. Over the range $450^{\circ}C-650^{\circ}C$, the higher the ashing temperature, the lower the sample absorbance. With the $NH_4H_2PO_4$ (1 and 3%) matrix modifier method, the sample absorbance changed only slightly as the ashing temperature was raised from $500^{\circ}C$ to $650^{\circ}C$. The sample absorbances at $600^{\circ}C$ were $0.230{\pm}0.017$ and $0.137{\pm}0.012$, respectively. These values were larger than those of the simple dilution method, but the background absorbance was still higher than the level that the $D_2$ method can correct. With the $Pd(NO_3)_2$ (0.001 and 0.005%) matrix modifier method, the sample and background absorbances were higher than those of the other methods and were stable and reproducible. When the ashing temperature exceeded $550^{\circ}C$, the sample absorbance decreased significantly. With 0.005% $Pd(NO_3)_2$, carbon residue remaining in the graphite tube affected the sample and background absorbances.
From these results, we propose that for the simple ten-fold dilution method with Triton X-100 the ashing temperature must be kept below $400^{\circ}C$. To reduce the background absorbance, the alternatives are reducing the injection volume or multiplying the dilution ratio. We recommend $Pd(NO_3)_2$ over $NH_4H_2PO_4$ as a matrix modifier. With the $Pd(NO_3)_2$ matrix modifier method, the ashing temperature should be kept below $550^{\circ}C$.

Random Partial Haar Wavelet Transformation for Single Instruction Multiple Threads (단일 명령 다중 스레드 병렬 플랫폼을 위한 무작위 부분적 Haar 웨이블릿 변환)

  • Park, Taejung
    • Journal of Digital Contents Society / v.16 no.5 / pp.805-813 / 2015
  • Many researchers expect compressive sensing and sparse recovery to overcome the limitations of conventional digital techniques. However, these approaches require solving l1-norm optimization problems for signal reconstruction. In the reconstruction process, the transform computed by multiplying a random matrix by a vector consumes considerable computing power. To address this issue, parallel processing is applied to the optimization problems. In particular, because the original signal is very large, the random matrix is hard to store directly in memory, so a procedural approach to handling the random matrix is needed. This paper presents a new parallel algorithm that computes the random partial Haar wavelet transform on a Single Instruction Multiple Threads (SIMT) platform.
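
The idea of handling the matrix procedurally instead of storing it can be illustrated with a small serial sketch. It assumes that "random partial" means a random subset of rows of the orthonormal Haar matrix and that the signal length is a power of two; the function names are hypothetical, and this is not the paper's SIMT algorithm.

```python
import math
import random


def haar_coefficient(x, k):
    """Inner product of x with row k of the orthonormal Haar matrix.

    The row is generated on the fly from k, so the matrix is never stored.
    Assumes len(x) is a power of two.
    """
    n = len(x)
    if k == 0:
        return sum(x) / math.sqrt(n)      # constant (scaling) row
    j = k.bit_length() - 1                # scale: 2**j <= k < 2**(j + 1)
    m = k - (1 << j)                      # position within that scale
    width = n >> j                        # support length of the wavelet
    start = m * width
    half = width // 2
    amp = math.sqrt((1 << j) / n)
    pos = sum(x[start:start + half])              # +amp on the first half
    neg = sum(x[start + half:start + width])      # -amp on the second half
    return amp * (pos - neg)


def random_partial_haar(x, num_rows, seed=0):
    """Evaluate a random subset of Haar rows; each row is independent work."""
    rng = random.Random(seed)
    rows = sorted(rng.sample(range(len(x)), num_rows))
    return rows, [haar_coefficient(x, k) for k in rows]


signal = [4.0, 2.0, 5.0, 5.0]
rows, coeffs = random_partial_haar(signal, num_rows=2)
```

Each selected row depends only on its index `k` and the signal, so the per-row computations could be mapped to separate threads.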

Fast Mask Operators for the edge Detection in Vision System (시각시스템의 Edge 검출용 고속 마스크 Operator)

  • 최태영
    • The Journal of Korean Institute of Communications and Information Sciences / v.11 no.4 / pp.280-286 / 1986
  • A new method of fast mask operators for edge detection is proposed, based on matrix factorization. The output of each component of a multi-directional mask operator is obtained by summing the image pixels in the mask area, each weighted by the corresponding mask element. It is therefore equivalent to a matrix-vector multiplication, as in a one-dimensional transform, i.e., the transform of the image vector covered by the mask with a transform matrix formed from all the elements of each mask, row by row. In this paper, for the Sobel and Prewitt operators, we find the transform matrices, count the operations after factoring these matrices, and compare the performance of the proposed method with that of the standard method. As a result, without any extra storage, the number of operations for the Sobel and Prewitt operators is reduced by 42.85% and 50% of the standard operations, respectively; for an image of 100x100 pixels, the proposed Sobel operator with 301 extra storage locations can be computed with 35.93% of the operations of the standard method.
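
The equivalence between applying several masks and a single matrix-vector product can be sketched as below. This shows only the unfactored form with the standard Sobel masks; the factorization that yields the reported savings is not reproduced, and the names `T` and `apply_masks` are illustrative assumptions.

```python
SOBEL_X = [[-1, 0, 1],
           [-2, 0, 2],
           [-1, 0, 1]]
SOBEL_Y = [[-1, -2, -1],
           [0, 0, 0],
           [1, 2, 1]]


def flatten(mask):
    """Read a 3x3 array row by row into a length-9 list."""
    return [v for row in mask for v in row]


# Transform matrix: one row per directional mask.
T = [flatten(SOBEL_X), flatten(SOBEL_Y)]


def apply_masks(transform, patch):
    """Return transform @ vec(patch): one output per directional mask."""
    p = flatten(patch)
    return [sum(t * v for t, v in zip(row, p)) for row in transform]


# Hypothetical 3x3 patch with a vertical edge on its right side.
patch = [[0, 0, 10],
         [0, 0, 10],
         [0, 0, 10]]
gx, gy = apply_masks(T, patch)
# gx == 40, gy == 0
```

Writing the operator as a matrix makes it possible to look for a factorization of `T` that shares partial sums between masks.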

Analysis of Tensor Processing Unit and Simulation Using Python (텐서 처리부의 분석 및 파이썬을 이용한 모의실행)

  • Lee, Jongbok
    • The Journal of the Institute of Internet, Broadcasting and Communication / v.19 no.3 / pp.165-171 / 2019
  • Studies of computer architecture have shown that major improvements in price-energy performance stem from domain-specific hardware development. This paper analyzes the tensor processing unit (TPU), an ASIC that accelerates inference in artificial neural networks (NN). The core of the TPU is a MAC matrix multiplier capable of high-speed operation, together with software-managed on-chip memory. The TPU's execution model can meet the response-time requirements of artificial neural networks better than existing CPU and GPU execution models, with a small area and low power consumption even though it has many MACs and a large memory. For the TensorFlow benchmark framework, the TPU can achieve higher performance and better power efficiency than a CPU or GPU. In this paper, we analyze the TPU, simulate OpenTPU, which is modeled in Python, and synthesize the matrix multiplication unit, the key hardware.

Variable Radix-Two Multibit Coding and Its VLSI Implementation of DCT/IDCT (가변길이 다중비트 코딩을 이용한 DCT/IDCT의 설계)

  • 김대원;최준림
    • Journal of the Institute of Electronics Engineers of Korea SD / v.39 no.12 / pp.1062-1070 / 2002
  • This paper presents a variable radix-two multibit coding algorithm and applies it to the implementation of the discrete cosine transform (DCT) and inverse discrete cosine transform (IDCT). Variable radix-two multibit coding is the $2^k$ SD (signed digit) representation obtained by overlapped multibit scanning with a variable shift method. The $2^k$ SD representation generates partial products that can be easily implemented with shifters and adders. The algorithm is most powerful for hardware implementation of the DCT/IDCT, which relies on constant-coefficient matrix multiplication. This paper introduces the proposed algorithm, its proof, and the implementation of the DCT/IDCT. The implemented IDCT chip, with 8 PEs (processing elements) and one transpose memory, runs at a rate of 400 Mpixels/sec at 54 MHz for high-speed parallel signal processing, and it has been verified in HDTV and MPEG decoders.
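
To illustrate why a signed-digit representation turns constant multiplication into shifts and adds, here is a sketch using canonical signed digit (non-adjacent form) recoding. This is a swapped-in, well-known technique chosen for brevity, not the paper's variable radix-two multibit coding; the function names are hypothetical.

```python
def to_csd(n):
    """Signed digits of a non-negative integer n, least significant first.

    Each digit is -1, 0, or 1, and no two adjacent digits are both nonzero.
    """
    digits = []
    while n > 0:
        if n & 1:
            d = 2 - (n & 3)   # +1 if n % 4 == 1, -1 if n % 4 == 3
            n -= d
        else:
            d = 0
        digits.append(d)
        n >>= 1
    return digits


def multiply_by_constant(x, digits):
    """Multiply x by the constant encoded in `digits` using shifts and adds."""
    acc = 0
    for shift, d in enumerate(digits):
        if d == 1:
            acc += x << shift    # add a shifted partial product
        elif d == -1:
            acc -= x << shift    # subtract a shifted partial product
    return acc


digits = to_csd(7)                           # 7 = 8 - 1
product = multiply_by_constant(5, digits)
# digits == [-1, 0, 0, 1], product == 35
```

Here the constant 7 needs two partial products (8x and -x) instead of the three that plain binary (4x + 2x + x) would need, which is the kind of saving that matters when the DCT coefficients are fixed in hardware.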

Design of an Automatic Generation System for Cycle-accurate Instruction-set Simulators for DSP Processors (DSP 프로세서용 인스트럭션 셋 시뮬레이터 자동생성기의 설계에 관한 연구)

  • Hong, Sung-Min;Park, Chang-Soo;Hwang, Sun-Young
    • The Journal of Korean Institute of Communications and Information Sciences / v.32 no.9A / pp.931-939 / 2007
  • This paper describes a system that automatically generates instruction-set simulator cores using SMDL. SMDL describes the structure and instruction-set information of a target DSP machine. By analyzing the behavioral information of each pipeline stage for all instructions of a target ASIP, the proposed system automatically generates a cycle-accurate instruction-set simulator in C++ for the target processor. The system has been tested by generating instruction-set simulators for the ARM9E-S, ADSP-TS20x, and TMS320C2x architectures. Experiments checked the functions of $4{\times}4$ matrix multiplication, a 16-bit IIR filter, 32-bit multiplication, and the FFT using the generated simulators. The experimental results show the functional accuracy of the generated simulators.