• Title/Abstract/Keywords: Matrix Multiplication

169 search results (processing time: 0.023 s)

Fast Sequential Optimal Normal Bases Multipliers over Finite Fields

  • 김용태
    • 한국전자통신학회논문지 / Vol. 8, No. 8 / pp.1207-1212 / 2013
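  • Because finite field arithmetic is widely used in coding theory and cryptography, arithmetic units that lower its complexity are badly needed. The complexity of such a unit depends on how the elements of the finite field are represented, and the best representation known so far for reducing complexity is the optimal normal basis. In this paper, we develop an algorithm that minimizes the number of 1s in the multiplication matrix constructed when multiplying elements represented in an optimal normal basis, and we propose a multiplier that minimizes time and space.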

Design and Implementation of low-power short-length running convolution filter using filter banks

  • 장영범
    • 한국산학기술학회논문지 / Vol. 7, No. 4 / pp.625-634 / 2006
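  • In this paper, we propose an efficient direct-form fast algorithm that reduces the amount of computation in FIR (Finite Impulse Response) filters. The proposed algorithm can be parallelized for an arbitrary downsampling factor, and once the downsampling factor is fixed, the structure is easy to derive. In particular, experiments demonstrate that the proposed algorithm not only reduces the theoretical number of multiplications per sample but is also effective in actual implementation. To show the theoretical reduction in computation, the number of subfilters and the number of multiplications per sample are compared with those of existing fast algorithms. To demonstrate the practical benefit, the number of hardware elements and a Verilog-HDL (Hardware Description Language) implementation are compared with existing methods, showing that the proposed structure is effective.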


New Memristor-Based Crossbar Array Architecture with 50-% Area Reduction and 48-% Power Saving for Matrix-Vector Multiplication of Analog Neuromorphic Computing

  • Truong, Son Ngoc;Min, Kyeong-Sik
    • JSTS: Journal of Semiconductor Technology and Science / Vol. 14, No. 3 / pp.356-363 / 2014
  • In this paper, we propose a new memristor-based crossbar array architecture in which a single memristor array and a constant-term circuit represent both the plus-polarity and minus-polarity matrices. This differs from the previous crossbar array architecture, which uses two memristor arrays to represent the plus-polarity and minus-polarity connection matrices, respectively. The proposed crossbar architecture is tested and verified to have the same performance as the previous crossbar architecture in character recognition applications. In areal density, however, the proposed architecture is twice as good as the previous one, because only a single memristor array is used instead of two crossbar arrays. Moreover, the power consumption of the proposed architecture can be 48% lower than that of the previous one, because the number of memristors is halved. Given its high areal density and high energy efficiency, the newly proposed crossbar array architecture is well suited to analog neuromorphic computing applications that demand high areal density and low energy consumption.
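The idea of replacing two polarity arrays with one array plus a constant term can be illustrated arithmetically. The sketch below is not the paper's circuit. It assumes, purely for illustration, that signed weights are shifted by a constant `offset` so that every stored value is non-negative (as a conductance would be), and that the shared correction `offset * sum(vector)` is subtracted afterwards. All function names are hypothetical.

```python
# Sketch: signed matrix-vector product from a single non-negative array.
# Assumption (not from the paper): weights are shifted by a constant `offset`
# so that every stored value is non-negative, like a conductance.

def matvec(matrix, vector):
    """Plain matrix-vector product on nested lists."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]

def signed_matvec_single_array(weights, vector):
    """Compute weights @ vector using one non-negative array plus a constant term."""
    offset = max(0, -min(min(row) for row in weights))  # makes all entries >= 0
    stored = [[w + offset for w in row] for row in weights]  # the single array
    constant_term = offset * sum(vector)  # identical for every output row
    return [y - constant_term for y in matvec(stored, vector)]

weights = [[1, -2, 0], [-1, 3, 2]]
vector = [1, 2, 3]
result = signed_matvec_single_array(weights, vector)
print(result)
```

Because the correction term depends only on the input vector, it is computed once and shared by all rows, which is why a second full array is not needed in this simplified picture.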

Construction of Highly Performance Switching Circuit

  • 박춘명
    • 전자공학회논문지 / Vol. 53, No. 12 / pp.88-93 / 2016
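  • This paper proposes a method for efficiently constructing linear digital switching functions over GF(P), based on the mathematical properties of finite fields and on graph theory. The method derives a DCG from the relationship between the inputs and outputs of an arbitrary given digital switching function and then factorizes the number of nodes. The matrix equation is factorized into irreducible polynomials of lower degree, each factor is realized as a partial circuit, and the partial circuits are linearly combined to form the final linear digital switching function. As a result, the construction is considerably simpler than with existing methods, and the circuit can be readily realized using adders and coefficient multipliers defined over the finite field GF(P).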

High-throughput and low-area implementation of orthogonal matching pursuit algorithm for compressive sensing reconstruction

  • Nguyen, Vu Quan;Son, Woo Hyun;Parfieniuk, Marek;Trung, Luong Tran Nhat;Park, Sang Yoon
    • ETRI Journal / Vol. 42, No. 3 / pp.376-387 / 2020
  • The massive computation required by the reconstruction algorithm for compressive sensing (CS) has been a major concern for its real-time application. In this paper, we propose a novel high-speed architecture for the orthogonal matching pursuit (OMP) algorithm, which is the algorithm most frequently used to reconstruct compressively sensed signals. The proposed design offers a very high throughput and includes an innovative pipeline architecture and scheduling algorithm. Least-squares problem solving, which requires a huge amount of computation in the OMP, is implemented by using systolic arrays with four new processing elements. In addition, a distributed-arithmetic-based circuit for matrix multiplication is proposed to counterbalance the area overhead caused by the multi-stage pipelining. The results of logic synthesis show that the proposed design reconstructs signals nearly 19 times faster while occupying an area only 1.06 times larger than the existing designs for N = 256, M = 64, and m = 16, where N is the number of original samples, M is the length of the measurement vector, and m is the sparsity level of the signal.

A Study on GPGPU Performance Improvement Technique on GCN Architecture Using OpenCL API

  • 우동희;김윤호
    • 한국전자거래학회지 / Vol. 23, No. 1 / pp.37-45 / 2018
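  • The systems on which programs run are expanding beyond conventional single-core and multi-core environments to many-core, coprocessor, and heterogeneous environments. However, existing research has focused mainly on NVIDIA architectures and parallelization with CUDA, while research on improving performance on GCN, AMD's general-purpose GPU architecture, has been limited. With this in mind, this paper studies performance improvement techniques in OpenCL, the GPGPU environment of the GCN architecture, and demonstrates actual performance gains. Specifically, applying the proposed techniques to GPGPU programs for matrix multiplication and convolution reduced execution time by 30% or more in the best case and raised kernel utilization by more than 40%.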

Compression of 3D Mesh Geometry and Vertex Attributes for Mobile Graphics

  • Lee, Jong-Seok;Choe, Sung-Yul;Lee, Seung-Yong
    • Journal of Computing Science and Engineering / Vol. 4, No. 3 / pp.207-224 / 2010
  • This paper presents a compression scheme for mesh geometry that is suitable for mobile graphics. The main focus is to enable real-time decoding of compressed vertex positions while providing reasonable compression ratios. Our scheme is based on local quantization of vertex positions with mesh partitioning. To prevent visual seams along the partitioning boundaries, we constrain the locally quantized cells of all mesh partitions to have the same size and aligned local axes. We propose a mesh partitioning algorithm to minimize the size of locally quantized cells, which relates to the distortion of a restored mesh. Vertex coordinates are stored in main memory and transmitted to graphics hardware for rendering in the quantized form, saving memory space and system bus bandwidth. The decoding operation is combined with model geometry transformation, and the only overhead to restore vertex positions is one matrix multiplication for each mesh partition. In our experiments, a 32-bit floating point vertex coordinate is quantized into an 8-bit integer, which is the smallest data size supported in a mobile graphics library. With this setting, the distortions of the restored meshes are comparable to 11-bit global quantization of vertex coordinates. We also apply the proposed approach to compression of vertex attributes, such as vertex normals and texture coordinates, and show that gains similar to vertex geometry can be obtained through local quantization with mesh partitioning.
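A minimal sketch of local quantization for one mesh partition follows. It is not the paper's codec. It assumes, for illustration, a per-partition `origin` and a uniform `step` shared by all three axes, and the function names are hypothetical. Decoding is a scale plus a translation, which is the kind of affine map that can be folded into a model transformation matrix.

```python
# Sketch: 8-bit local quantization of vertex positions in one partition.
# Assumptions (not from the paper): uniform step on all axes, origin at the
# partition's minimum corner, hypothetical function names.

def quantize_partition(points, bits=8):
    """Map each coordinate to an integer code in [0, 2**bits - 1]."""
    levels = (1 << bits) - 1
    origin = [min(p[d] for p in points) for d in range(3)]
    extent = max(max(p[d] for p in points) - origin[d] for d in range(3))
    step = extent / levels if extent > 0 else 1.0  # guard for a degenerate partition
    codes = [[round((p[d] - origin[d]) / step) for d in range(3)] for p in points]
    return codes, origin, step

def decode(codes, origin, step):
    """Restore positions: a scale by `step` followed by a translation by `origin`."""
    return [[origin[d] + c[d] * step for d in range(3)] for c in codes]

points = [[0.0, 0.0, 0.0], [1.0, 0.5, 0.25], [0.3, 0.9, 0.6]]
codes, origin, step = quantize_partition(points)
restored = decode(codes, origin, step)
max_error = max(abs(r[d] - p[d]) for r, p in zip(restored, points) for d in range(3))
print(codes, max_error)
```

In this sketch the reconstruction error per coordinate is bounded by half a quantization step, so a smaller partition (smaller `extent`) gives lower distortion for the same number of bits.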

Refined fixed granularity algorithm on Networks of Workstations

  • 구본근
    • 정보처리학회논문지A / Vol. 8A, No. 2 / pp.117-124 / 2001
  • In NOW (Networks of Workstations) environments, load sharing plays a very important role in improving performance. The known load sharing strategies are fixed-granularity, variable-granularity, and adaptive-granularity. The variable-granularity algorithm is sensitive to various parameters, whereas the Send algorithm, which implements the fixed-granularity strategy, is robust to task granularity, and the performance difference between Send and the variable-granularity algorithm is not substantial. In the Send algorithm, however, computing time and communication time do not overlap, so long network latency affects the execution time of the parallel program. In this paper, we propose the preSend algorithm, in which the master node sends data to the slave nodes in advance, without waiting for partial results from the slaves. Because the master has already sent the next data, the slave nodes can process it without idle time. The preSend algorithm thus overlaps computing time with communication time, which reduces the influence of long network latency and the execution time of the parallel program on the NOW. To compare the execution times of the two algorithms, we use $320{\times}320$ matrix multiplication. The comparison shows that the preSend algorithm has a shorter execution time than the Send algorithm.

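The benefit of overlapping communication with computation in the Send/preSend abstract above can be seen in a toy timing model. This is a simplification under stated assumptions, not the paper's measurement setup: one master and one slave, a fixed per-task communication time `comm` and computation time `compute`, and no cost for returning results. The function names are hypothetical.

```python
# Sketch: toy timing model contrasting non-overlapped and overlapped transfers.
# Assumptions (not from the paper): one slave, fixed per-task costs, and the
# time to return results is ignored.

def send_time(tasks, comm, compute):
    """Each task is sent, then computed; nothing overlaps."""
    return tasks * (comm + compute)

def presend_time(tasks, comm, compute):
    """The next task is sent while the current one is being computed."""
    if tasks == 0:
        return 0
    # First transfer cannot be hidden; afterwards the slower stage dominates.
    return comm + (tasks - 1) * max(comm, compute) + compute

tasks, comm, compute = 10, 2, 5
print(send_time(tasks, comm, compute), presend_time(tasks, comm, compute))
```

In this model the saving grows with the communication time, which matches the abstract's point that overlapping reduces the impact of long network latency.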

Bit-level Array Structure Representation of Weight and Optimization Method to Design Pre-Trained Neural Network

  • 임국찬;곽우영;이현수
    • 대한전자공학회논문지SD / Vol. 39, No. 9 / pp.37-44 / 2002
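  • A pre-trained neural network has fixed weights. This paper exploits that property to propose an efficient design method for digital neural network hardware. The operations of the network's PEs (Processing Elements) are expressed as matrix-vector multiplication, and the relationship between the fixed weights and the input data is represented as a bit-level array structure. Optimization through node elimination and through shared nodes set according to weight bit patterns then minimizes the number of nodes needed for the computation. FPGA simulation results confirm that, when designing hardware with full accuracy, the hardware cost is substantially reduced and the operating frequency is high. Moreover, because the proposed design method allows a large number of PEs to be implemented in a limited area, on-chip implementation of large neural network models becomes possible.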
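The following sketch illustrates why fixed weights allow node elimination at the bit level. It is not the paper's design flow. It assumes non-negative integer weights and models each set weight bit as one shift-and-add node, so zero bits need no node at all. The sharing of nodes across identical bit patterns described in the abstract is not implemented here, and the function names are hypothetical.

```python
# Sketch: matrix-vector product with fixed integer weights using shifts and adds.
# Assumptions (not from the paper): non-negative integer weights, one adder
# node per set weight bit, no node sharing between weights.

def constant_multiply(x, weight):
    """Multiply x by a fixed non-negative integer weight without a multiplier."""
    total = 0
    bit = 0
    while weight >> bit:
        if (weight >> bit) & 1:
            total += x << bit  # a node exists only where the weight bit is 1
        bit += 1
    return total

def fixed_weight_matvec(weights, inputs):
    """Matrix-vector product built from constant multiplications."""
    return [sum(constant_multiply(x, w) for w, x in zip(row, inputs)) for row in weights]

def adder_nodes(weights):
    """Nodes remaining after eliminating every zero weight bit."""
    return sum(bin(w).count("1") for row in weights for w in row)

weights = [[5, 0, 3], [2, 6, 1]]
inputs = [7, 4, 9]
outputs = fixed_weight_matvec(weights, inputs)
print(outputs, adder_nodes(weights))
```

With 3-bit weights, a full bit-level array for these six weights would have 18 positions, while only the set bits contribute nodes in this sketch.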

Improved Computation of L-Classes for Efficient Computation of J Relations

  • 한재일;김영만
    • 한국IT서비스학회지 / Vol. 9, No. 4 / pp.219-229 / 2010
  • Green's equivalence relations have played a fundamental role in the development of semigroup theory. They are concerned with mutual divisibility of various kinds, and all of them reduce to the universal equivalence in a group. Boolean matrices have been used successfully in various areas, and much research has been carried out on them. Studying Green's relations on a monoid of Boolean matrices will reveal important characteristics of Boolean matrices, which may be useful in diverse applications. Although there are known algorithms that can compute Green's relations, most of them find a single equivalence class in a specific Green's relation, and only a few algorithms, which appeared quite recently, deal with finding the whole D or J equivalence relation on the monoid of all $n{\times}n$ Boolean matrices. However, their results are far from satisfactory because their computational complexity is exponential: the computation requires multiplying three Boolean matrices for each possible triple of $n{\times}n$ Boolean matrices, and the size of the monoid of all $n{\times}n$ Boolean matrices grows exponentially as n increases. In an effort to reduce the execution time, this paper shows an isomorphism between the R relation and the L relation on the monoid of all $n{\times}n$ Boolean matrices in terms of transposition, introduces theorems based on it, discusses an improved algorithm for the J relation computation whose design reflects those theorems, and gives its execution results.
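The link between the L and R relations through transposition can be checked by brute force for small n. The sketch below is not the paper's improved algorithm. It uses the standard definitions (A L B when A = XB and B = YA for some X, Y in the monoid; A R B when A = BX and B = AY) and verifies on all 2×2 Boolean matrices that A L B holds exactly when the transposes are R-related. The function names are hypothetical.

```python
# Sketch: brute-force check that A L B  <=>  A^T R B^T for 2x2 Boolean matrices.
# This is an exhaustive illustration, not the paper's improved algorithm.
from itertools import product

def bool_matmul(a, b):
    """Boolean matrix product: OR of ANDs."""
    n = len(a)
    return tuple(
        tuple(int(any(a[i][k] and b[k][j] for k in range(n))) for j in range(n))
        for i in range(n)
    )

def transpose(a):
    return tuple(zip(*a))

def all_matrices(n):
    """Every n x n Boolean matrix as a tuple of row tuples."""
    return [
        tuple(tuple(bits[i * n:(i + 1) * n]) for i in range(n))
        for bits in product((0, 1), repeat=n * n)
    ]

def l_related(a, b, monoid):
    """A L B: each is a left multiple of the other."""
    return (any(bool_matmul(x, b) == a for x in monoid)
            and any(bool_matmul(y, a) == b for y in monoid))

def r_related(a, b, monoid):
    """A R B: each is a right multiple of the other."""
    return (any(bool_matmul(b, x) == a for x in monoid)
            and any(bool_matmul(a, y) == b for y in monoid))

monoid = all_matrices(2)
holds = all(
    l_related(a, b, monoid) == r_related(transpose(a), transpose(b), monoid)
    for a in monoid for b in monoid
)
print(len(monoid), holds)
```

The check rests on the identity (XB)^T = B^T X^T, which also holds for Boolean matrix products. The exhaustive loop is only practical for very small n, since the monoid has 2^(n·n) elements.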