• Title/Abstract/Keywords: matrix-vector multiplication

35 results (processing time: 0.023 s)

PF-GEMV: Utilization maximizing architecture in fast matrix-vector multiplication for GPT-2 inference

  • Hyeji Kim;Yeongmin Lee;Chun-Gi Lyuh
    • ETRI Journal
    • /
    • Vol. 46, No. 5
    • /
    • pp.817-828
    • /
    • 2024
  • Owing to the widespread advancement of transformer-based artificial neural networks, artificial intelligence (AI) processors are now required to perform matrix-vector multiplication in addition to conventional matrix-matrix multiplication. However, current AI processor architectures are optimized for general matrix-matrix multiplications (GEMMs), which causes significant throughput degradation when processing general matrix-vector multiplications (GEMVs). In this study, we propose a port-folding GEMV (PF-GEMV) scheme that employs multiformat and low-precision techniques while reusing an outer-product-based processor optimized for conventional GEMM operations. This approach achieves 93.7% utilization in GEMV operations with an 8-bit format on an 8 × 8 processor, resulting in a 7.5× increase in throughput over the original scheme. Furthermore, when applied to the matrix operations of the GPT-2 large model, it achieves a 7× speedup in single-batch inference.

A Parallel-Architecture Processor Design for the Fast Multiplication of Homogeneous Transformation Matrices

  • 권두올;정태상
    • 대한전기학회논문지:시스템및제어부문D
    • /
    • Vol. 54, No. 12
    • /
    • pp.723-731
    • /
    • 2005
  • The $4{\times}4$ homogeneous transformation matrix is a compact representation of the orientation and position of an object in robotics and computer graphics. A coordinate transformation is accomplished through successive multiplications of homogeneous matrices, each of which represents the orientation and position of the corresponding link. Thus, real-time control applications in robotics and animation in computer graphics demand fast multiplication of homogeneous matrices. In this paper, a parallel-architecture vector processor is designed for this purpose. The processor has several key features. For computational accuracy in real applications, the operands are floating-point numbers based on IEEE Standard 754. For parallelism and reduced hardware redundancy, the processor takes column vectors of homogeneous matrices as its unit of multiplication. To further improve throughput, the processor structure and its control are pipelined. Since the designed processor can be used as a special-purpose coprocessor in robotics and computer graphics, it includes, in addition to the special matrix/matrix and matrix/vector multiplications, several other useful instructions for various transformation algorithms. The suggested instruction set will serve as a standard in future processor design for robotics and computer graphics. The design is verified using an FPGA implementation. The performance improvement of the proposed design over a uniprocessor approach is also studied to assess its suitability for real-time applications.
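
The column-vector formulation mentioned in the abstract can be illustrated in software. The following is a minimal sketch, not the paper's hardware design: it forms the product of two $4{\times}4$ homogeneous matrices one column at a time, so each column is an independent matrix-vector multiplication. The helper names (`mat_vec`, `mat_mul_by_columns`, `rot_z`, `translate`) are hypothetical.

```python
import math

def mat_vec(a, v):
    """Multiply a 4x4 matrix (list of rows) by a 4-element column vector."""
    return [sum(a[i][k] * v[k] for k in range(4)) for i in range(4)]

def mat_mul_by_columns(a, b):
    """Compute a @ b column by column: column j of the product is a @ (column j of b)."""
    cols = [mat_vec(a, [b[k][j] for k in range(4)]) for j in range(4)]
    # Transpose the list of result columns back into row-major form.
    return [[cols[j][i] for j in range(4)] for i in range(4)]

def rot_z(theta):
    """Homogeneous rotation about the z axis by theta radians."""
    c, s = math.cos(theta), math.sin(theta)
    return [[c, -s, 0.0, 0.0],
            [s, c, 0.0, 0.0],
            [0.0, 0.0, 1.0, 0.0],
            [0.0, 0.0, 0.0, 1.0]]

def translate(x, y, z):
    """Homogeneous translation by (x, y, z)."""
    return [[1.0, 0.0, 0.0, x],
            [0.0, 1.0, 0.0, y],
            [0.0, 0.0, 1.0, z],
            [0.0, 0.0, 0.0, 1.0]]

# Two-link chain: rotate 90 degrees about z, then move 1 unit along the rotated x axis.
chain = mat_mul_by_columns(rot_z(math.pi / 2), translate(1.0, 0.0, 0.0))
# Position of the end of the chain in the base frame (approximately (0, 1, 0)).
origin = mat_vec(chain, [0.0, 0.0, 0.0, 1.0])
```

Because the four columns do not depend on one another, they can be computed in parallel or streamed through a pipeline, which is the property the abstract relies on.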

An Implementation of Digital Neural Network Using Systolic Array Processor

  • 윤현식;조원경
    • 전자공학회논문지B
    • /
    • Vol. 30B, No. 2
    • /
    • pp.44-50
    • /
    • 1993
  • In this paper, we present an array processor for implementing digital neural networks. The back-propagation model can be formulated as a sequence of matrix-vector multiplications with a prespecified thresholding operation. This procedure is well suited to an array processor because it can be executed recursively and repeatedly. A systolic array architecture based on the residue number system is suggested to realize an efficient arithmetic circuit for matrix-vector multiplication and for computing the sigmoid function. The proposed design method is expected to be adopted in neural network applications because it can be realized with currently available VLSI technology.
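
As a rough illustration of why a residue number system suits matrix-vector multiplication, the sketch below performs the multiply-accumulate independently in each residue channel and reconstructs the result with the Chinese remainder theorem. The moduli, the function names, and the restriction to non-negative integers are assumptions made for this example; the sigmoid step and the systolic dataflow described in the abstract are not modeled.

```python
MODULI = (7, 11, 13)   # assumed pairwise-coprime moduli
M = 7 * 11 * 13        # dynamic range: results must lie in [0, 1001)

def to_rns(n):
    """Encode a non-negative integer as its residues modulo each modulus."""
    return tuple(n % m for m in MODULI)

def from_rns(residues):
    """Decode residues back to an integer with the Chinese remainder theorem."""
    total = 0
    for r, m in zip(residues, MODULI):
        partial = M // m
        # pow(partial, -1, m) is the modular inverse (Python 3.8+).
        total += r * partial * pow(partial, -1, m)
    return total % M

def rns_mat_vec(w, x):
    """Compute w @ x with every multiply-accumulate done per residue channel."""
    w_r = [[to_rns(e) for e in row] for row in w]
    x_r = [to_rns(e) for e in x]
    out = []
    for row in w_r:
        acc = [0] * len(MODULI)
        for w_e, x_e in zip(row, x_r):
            for c, m in enumerate(MODULI):
                # Channels never exchange carries, so each can use a small, fast circuit.
                acc[c] = (acc[c] + w_e[c] * x_e[c]) % m
        out.append(from_rns(acc))
    return out

y = rns_mat_vec([[1, 2], [3, 4]], [5, 6])   # [17, 39]
```

The result is only valid while every output stays below the dynamic range `M`; a real design would choose moduli large enough for the weights and activations involved.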

NEW ALGORITHMS FOR SOLVING ODES BY PSEUDOSPECTRAL METHOD

  • Darvishi, M.T.
    • Journal of applied mathematics & informatics
    • /
    • Vol. 7, No. 2
    • /
    • pp.439-451
    • /
    • 2000
  • To compute derivatives using the matrix-vector multiplication method, new algorithms were introduced in [1, 2]. With these algorithms, we reduced the roundoff error in computing derivatives using Chebyshev collocation methods (CCM). In this paper, some applications of these algorithms are presented.
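
For context, the sketch below shows the standard Chebyshev differentiation matrix, with which a derivative at the collocation points is obtained by a single matrix-vector multiplication. It is not the algorithm of [1, 2], whose details the abstract does not give. The diagonal is computed as the negative sum of the off-diagonal entries in each row, a commonly used way to reduce roundoff error. The function name `cheb` is an assumption.

```python
import math

def cheb(n):
    """Return (D, x): the (n+1)x(n+1) Chebyshev differentiation matrix and its nodes."""
    x = [math.cos(math.pi * j / n) for j in range(n + 1)]       # Chebyshev-Gauss-Lobatto points
    c = [2.0 if j in (0, n) else 1.0 for j in range(n + 1)]     # endpoint weights
    d = [[0.0] * (n + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        for j in range(n + 1):
            if i != j:
                d[i][j] = (c[i] / c[j]) * (-1) ** (i + j) / (x[i] - x[j])
        # Negative-sum trick: each row of D must annihilate constants.
        d[i][i] = -sum(d[i][j] for j in range(n + 1) if j != i)
    return d, x

# Differentiate f(x) = x^2 at the nodes; the exact derivative is 2x.
d, x = cheb(8)
f = [xi ** 2 for xi in x]
df = [sum(d[i][j] * f[j] for j in range(len(x))) for i in range(len(x))]
max_err = max(abs(df[i] - 2.0 * x[i]) for i in range(len(x)))
```

For a polynomial of degree at most `n` the result is exact up to roundoff, so `max_err` gives a direct view of the roundoff error that such algorithms aim to reduce.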

GPU-Based ECC Decode Unit for Efficient Massive Data Reception Acceleration

  • Kwon, Jisu;Seok, Moon Gi;Park, Daejin
    • Journal of Information Processing Systems
    • /
    • Vol. 16, No. 6
    • /
    • pp.1359-1371
    • /
    • 2020
  • In transmitting and receiving a large amount of data, reliable data communication is crucial for the normal operation of a device and for preventing abnormal operations caused by errors. Therefore, this paper assumes that an error correction code (ECC), which can detect and correct errors by itself, is used in an environment where massive data is received sequentially. Because an embedded system has limited resources, such as a low-performance processor or a small memory, its applications must operate efficiently. In this paper, we propose an accelerated ECC-decoding technique that uses a graphics processing unit (GPU) built into the embedded system when receiving a large amount of data. In the matrix-vector multiplication that forms the Hamming code used in the ECC operation, the matrix is expressed in compressed sparse row (CSR) format, and a sparse matrix-vector product is used. The multiplication is performed in a GPU kernel, and the Hamming code computation is also accelerated so that the ECC operation can be performed in parallel. The proposed technique is implemented with CUDA on a GPU-embedded target board, the NVIDIA Jetson TX2, and its execution time is compared with that of the CPU.
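
To make the CSR-based product concrete, here is a minimal CPU-side sketch that computes a Hamming(7,4) syndrome as a sparse matrix-vector product modulo 2. The choice of parity-check matrix (column j is the binary form of j+1), the variable names, and the use of plain Python instead of CUDA are assumptions for illustration; this is not the paper's kernel.

```python
# Parity-check matrix H of Hamming(7,4), stored in CSR form.
# Dense H (row 0 = most significant syndrome bit):
#   0 0 0 1 1 1 1
#   0 1 1 0 0 1 1
#   1 0 1 0 1 0 1
VALUES = [1] * 12
COL_IDX = [3, 4, 5, 6, 1, 2, 5, 6, 0, 2, 4, 6]
ROW_PTR = [0, 4, 8, 12]

def csr_mat_vec_mod2(values, col_idx, row_ptr, x):
    """Sparse matrix-vector product over GF(2); each row is independent of the others."""
    y = []
    for r in range(len(row_ptr) - 1):
        acc = 0
        for k in range(row_ptr[r], row_ptr[r + 1]):
            acc += values[k] * x[col_idx[k]]
        y.append(acc % 2)
    return y

def correct_single_error(received):
    """Use the syndrome as the 1-based position of a single flipped bit."""
    s = csr_mat_vec_mod2(VALUES, COL_IDX, ROW_PTR, received)
    pos = s[0] * 4 + s[1] * 2 + s[2]
    fixed = list(received)
    if pos:
        fixed[pos - 1] ^= 1
    return fixed

codeword = [1, 1, 1, 0, 0, 0, 0]   # satisfies H @ codeword = 0 (mod 2)
received = [1, 1, 1, 0, 1, 0, 0]   # bit at index 4 flipped in transit
syndrome = csr_mat_vec_mod2(VALUES, COL_IDX, ROW_PTR, received)   # [1, 0, 1]
corrected = correct_single_error(received)                        # equals codeword
```

Each output row reads only its own slice of `COL_IDX`, so rows (and separate received words) can be processed by independent threads, which is what makes the operation a natural fit for a GPU.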

A Study on the Incoherent Optical Vector-Matrix Multiplier (IOVMM) Using an LED Array

  • 최평석;박한규
    • 한국통신학회논문지
    • /
    • Vol. 9, No. 3
    • /
    • pp.127-131
    • /
    • 1984
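  • An incoherent optical vector-matrix multiplier (IOVMM), which can process large amounts of information at high speed by performing vector-matrix multiplication with an incoherent light source, was constructed, and the experimental results were compared with theoretical values. The elements of the input vector and matrix were restricted to positive real numbers; the input vector was represented by an LED array, and the matrix was encoded on a mask by area modulation. The product of the two was detected by a photodiode array through a lens system, and an analog multiplexer was used to observe the output signals on a single channel.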

Design of Low Complexity and High Throughput Encoder for Structured LDPC Codes

  • 정용민;정윤호;김재석
    • 대한전자공학회논문지SD
    • /
    • Vol. 46, No. 10
    • /
    • pp.61-69
    • /
    • 2009
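  • This paper proposes an LDPC encoder architecture that supports low complexity and high throughput. To address the high complexity of LDPC encoders, a simplified matrix multiplier is proposed in place of the conventional high-complexity matrix multiplier. In addition, to support high throughput, row-wise operation and partially parallel processing are applied to the matrix multiplication. Compared with a conventional five-stage pipelined encoder architecture, the logic gate count and memory usage of the proposed encoder are reduced by 37.4% and 56.7%, respectively. Furthermore, at a 40 MHz clock frequency, it supports a maximum throughput of 800 Mbps, more than three times that of the conventional encoder.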

Algorithm for Efficient D-Class Computation

  • 한재일
    • 한국IT서비스학회지
    • /
    • Vol. 6, No. 1
    • /
    • pp.151-158
    • /
    • 2007
  • D-class computation requires multiplying three Boolean matrices for every possible triple of $n{\times}n$ Boolean matrices and searching for equivalent $n{\times}n$ Boolean matrices according to a specific equivalence relation. Even multiplying every $n{\times}n$ Boolean matrix by itself has exponential time complexity, and D-class computation has remained an unsolved problem because of its computational complexity. The vector-based multiplication theory shows that the multiplication of three Boolean matrices for every possible triple of $n{\times}n$ Boolean matrices can be done much more efficiently. However, D-class computation requires computing equivalence classes in addition to efficient multiplication. This paper discusses a theory and an algorithm for efficient D-class computation and presents execution results of the algorithm.
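
The abstract does not spell out the vector-based multiplication theory, so the sketch below shows only the plain Boolean matrix product it builds on, written in a row-vector form: each row is stored as an integer bitmask, and a product row is the OR of the rows of the right-hand matrix selected by the set bits of the left-hand row. The representation and the function names are assumptions for this example.

```python
def bool_mat_mul(a, b):
    """Boolean product of two matrices whose rows are int bitmasks (bit j = column j)."""
    out = []
    for row in a:
        acc = 0
        k = 0
        while row:
            if row & 1:
                acc |= b[k]   # entry (i, k) is 1, so OR in row k of b
            row >>= 1
            k += 1
        out.append(acc)
    return out

def triple_product(a, b, c):
    """Boolean product a * b * c, the basic operation in D-class computation."""
    return bool_mat_mul(bool_mat_mul(a, b), c)

# 2x2 example: A = [[1, 0], [1, 1]], B = [[0, 1], [0, 0]]
A = [0b01, 0b11]
B = [0b10, 0b00]
AB = bool_mat_mul(A, B)        # [[0, 1], [0, 1]] -> [0b10, 0b10]
ABA = triple_product(A, B, A)  # [[1, 1], [1, 1]] -> [0b11, 0b11]

# There are 2 ** (n * n) Boolean n x n matrices, so enumerating them all is exponential in n.
count_3x3 = 2 ** (3 * 3)       # 512
```

The exponential count in the last line is why exhaustive enumeration over all triples quickly becomes infeasible as `n` grows.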

Rendering of Sweep Surfaces using Programmable Graphics Hardware

  • 고대현;윤승현;이지은
    • 한국컴퓨터그래픽스학회논문지
    • /
    • Vol. 16, No. 4
    • /
    • pp.11-16
    • /
    • 2010
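  • This paper proposes an efficient algorithm for rendering sweep surfaces using graphics hardware. A sweep surface is represented by a cross-sectional curve moving along a spline motion. This representation is computed as a matrix-vector product, which maps easily onto programmable graphics hardware. The spline motion and cross-sectional curve data are stored in texture memory. The vertex processor of the graphics hardware takes the two surface parameters as a 2D vertex and computes the vertex coordinates and normal vector of the sweep surface with a single matrix multiplication. The proposed GPU-based sweep surface rendering was 10 to 40 times faster than CPU-based rendering.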

Strain Decomposition Method in Hull Stress Monitoring System for Container Ship

  • Park, Jae-Woong;Kang, Yun-Tae
    • Journal of Ship and Ocean Technology
    • /
    • Vol. 7, No. 3
    • /
    • pp.56-65
    • /
    • 2003
  • The hull monitoring systems of container ships with four long-base gauges give enough information to identify hull girder loads such as bending and torsional moments. However, a load-identification method for container ships has not been established. In this paper, a load-identification method is suggested in terms of a linear matrix equation in which the measured strain vector equals the product of the transformation matrix and the desired strain component vector. The equation is proved to be mathematically complete by the property of positive-definite determinant of the transformation matrix. The method is applied to a hull stress monitoring system for an 8100 TEU container ship during sea trial, and the estimated external loads are reasonable in comparison with the pre-estimated results. This moment decomposition concept has also been tested under real operating conditions. The typical phenomena observed over the Suez Canal agreed well with physical understanding. Hence, the proposed concept can be used effectively to monitor hull girder loads such as bending and torsional moments.
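
The linear relation in the abstract (measured strains = transformation matrix × strain components) can be inverted numerically once the transformation matrix is known. The sketch below uses a made-up 4×4 sign-pattern matrix and made-up component values purely for illustration; the paper's actual matrix and the physical meaning of each component are not given in the abstract. The solver is ordinary Gaussian elimination with partial pivoting.

```python
def solve(a, b):
    """Solve a @ x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    aug = [list(row) + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < 1e-12:
            raise ValueError("transformation matrix is singular")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            factor = aug[r][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[r][c] -= factor * aug[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        tail = sum(aug[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (aug[r][n] - tail) / aug[r][r]
    return x

# Hypothetical transformation matrix: one row per gauge, one column per strain component.
T = [[1, 1, 1, 1],
     [1, 1, -1, -1],
     [1, -1, 1, -1],
     [1, -1, -1, 1]]

true_components = [10.0, 4.0, -2.0, 1.0]   # made-up values
# Forward model: each gauge reading is a matrix-vector product row.
measured = [sum(t * c for t, c in zip(row, true_components)) for row in T]
recovered = solve(T, measured)             # approximately equals true_components
```

The decomposition is unique only when the transformation matrix is invertible, which is why the abstract's statement about its determinant matters; the solver raises an error otherwise.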