• Title/Summary/Keyword: Matrix Multiplication

A Model-based Methodology for Application Specific Energy Efficient Data path Design Using FPGAs (FPGA에서 에너지 효율이 높은 데이터 경로 구성을 위한 계층적 설계 방법)

  • Jang Ju-Wook;Lee Mi-Sook;Mohanty Sumit;Choi Seonil;Prasanna Viktor K.
    • The KIPS Transactions: Part A / v.12A no.5 s.95 / pp.451-460 / 2005
  • We present a methodology for designing energy-efficient data paths using FPGAs. Our methodology integrates domain-specific modeling, coarse-grained performance evaluation, design space exploration, and low-level simulation to understand the tradeoffs among energy, latency, and area. The domain-specific modeling technique defines a high-level model by identifying the components and parameters specific to a domain that affect system-wide energy dissipation. A domain is a family of architectures and corresponding algorithms for a given application kernel. The high-level model also includes functions for estimating energy, latency, and area that facilitate tradeoff analysis. Design space exploration (DSE) analyzes the design space defined by the domain and selects a set of designs. Low-level simulations are used for accurate performance estimation of the designs selected by the DSE and for final design selection. We illustrate our methodology using a family of architectures and algorithms for matrix multiplication. The designs identified by our methodology demonstrate tradeoffs among energy, latency, and area. To demonstrate the effectiveness of our methodology, we compare our designs with a vendor-specified matrix multiplication kernel, using average power density (E/AT), that is, energy/(area × latency), as the metric for comparison. For various problem sizes, designs obtained using our methodology are on average 25% superior with respect to the E/AT metric compared with the state-of-the-art designs by Xilinx. We also discuss the implementation of our methodology using the MILAN framework.
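
The E/AT metric named in the abstract is simple enough to sketch. The snippet below is a minimal illustration, not the authors' tooling: the `Design` class, the units, and all numbers are hypothetical placeholders, and it assumes that a lower E/AT value is preferable, which the abstract does not state explicitly.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Design:
    """Hypothetical design record; field names and units are assumptions."""
    name: str
    energy_nj: float    # energy per matrix multiplication
    area_slices: float  # occupied FPGA area
    latency_us: float   # time per matrix multiplication


def power_density(d: Design) -> float:
    """E/AT = energy / (area * latency)."""
    return d.energy_nj / (d.area_slices * d.latency_us)


def relative_improvement(ours: Design, baseline: Design) -> float:
    """Fractional reduction of E/AT relative to the baseline (assumes lower is better)."""
    return 1.0 - power_density(ours) / power_density(baseline)


# Placeholder numbers chosen for illustration, not measurements from the paper.
baseline = Design("baseline", energy_nj=800.0, area_slices=400.0, latency_us=2.0)
candidate = Design("candidate", energy_nj=600.0, area_slices=400.0, latency_us=2.0)
print(f"{relative_improvement(candidate, baseline):.0%}")
```

Because E/AT divides by both area and latency, two designs can only be compared fairly with this metric when they solve the same problem size.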

Energy-Efficient Signal Processing Using FPGAs (FPGA 상에서 에너지 효율이 높은 병렬 신호처리 기법)

  • Jang Ju-wook;Hwang Yunil;Scrofano Ronald;Prasanna Viktor K.
    • The KIPS Transactions: Part A / v.12A no.4 s.94 / pp.305-312 / 2005
  • In this paper, we present algorithm-level techniques for energy-efficient design using FPGAs. We then use these techniques to create energy-efficient designs for two signal processing kernels: the fast Fourier transform (FFT) and matrix multiplication. We evaluate the performance of FPGAs on these tasks in terms of both latency and energy efficiency. Using a Xilinx Virtex-II as the target FPGA, we compare the performance of our designs with that of designs from the Xilinx library and with conventional algorithms run on the PowerPC core embedded in the Virtex-II Pro and on the Texas Instruments TMS320C6415. Our evaluations are done both through high-level estimation based on energy and latency equations and through low-level simulation. For FFT, our designs dissipated on average 50% less energy than the design from the Xilinx library and 56% less than the DSP. Our designs showed a 10-fold improvement in the EAT factor over the embedded processor. These results provide concrete evidence that FPGAs can outperform DSPs and embedded processors in signal processing. Further, they show that FPGAs can achieve this performance while dissipating less energy than the other two types of devices.

An Efficient Matrix-Vector Product Algorithm for the Analysis of General Interconnect Structures (일반적인 연결선 구조의 해석을 위한 효율적인 행렬-벡터 곱 알고리즘)

  • Jung, Seung-Ho;Baek, Jong-Humn;Kim, Joon-Hee;Kim, Seok-Yoon
    • Journal of the Institute of Electronics Engineers of Korea SD / v.38 no.12 / pp.56-65 / 2001
  • This paper proposes an algorithm for capacitance extraction of general three-dimensional conductors in an ideal uniform dielectric. To enhance accuracy, it combines a high-order quadrature approximation with the typical first-order collocation method; to achieve efficiency, it adopts an efficient matrix-vector product algorithm for model-order reduction. The quadrature method improves accuracy for interconnects containing corners and vias, where charge density concentrates. Efficiency comes from reducing the model order by exploiting the fact that large parts of the system matrices are numerically low rank. The technique combines an SVD-based algorithm for compressing rank-deficient matrices with the Gram-Schmidt algorithm of a Krylov-subspace iterative technique for rapid matrix multiplication. The performance evaluation shows that combining these two techniques yields a more efficient algorithm than Gaussian elimination or other standard iterative schemes within a given error tolerance.
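
Why low rank helps a matrix-vector product can be shown in a few lines. The sketch below assumes a block of the system matrix is already available in factored form U V (for example from a truncated SVD, which is not shown here); the factors, sizes, and function names are hypothetical and this is not the paper's implementation.

```python
from typing import List

Matrix = List[List[float]]
Vector = List[float]


def matvec(a: Matrix, x: Vector) -> Vector:
    """Dense matrix-vector product: rows * cols multiplications."""
    return [sum(a_ij * x_j for a_ij, x_j in zip(row, x)) for row in a]


def low_rank_matvec(u: Matrix, v: Matrix, x: Vector) -> Vector:
    """Compute (U V) x as U (V x) without forming the m x n product U V."""
    return matvec(u, matvec(v, x))


def matmul(a: Matrix, b: Matrix) -> Matrix:
    """Dense product, used here only to check the factored result."""
    cols = list(zip(*b))
    return [[sum(p * q for p, q in zip(row, col)) for col in cols] for row in a]


# Hypothetical rank-2 block: U is 4 x 2, V is 2 x 5.
u = [[1.0, 0.0], [2.0, 1.0], [0.0, 3.0], [1.0, 1.0]]
v = [[1.0, 2.0, 0.0, 1.0, 3.0], [0.0, 1.0, 1.0, 2.0, 0.0]]
x = [1.0, -1.0, 2.0, 0.5, 0.0]

fast = low_rank_matvec(u, v, x)  # 10 + 8 = 18 multiplications
slow = matvec(matmul(u, v), x)   # 20 multiplications once the block is formed
print(fast == slow)
```

For an m × n block of rank r, the factored product costs r(m + n) multiplications instead of mn, so the saving grows as the block gets larger relative to its rank.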

Preprocessing Method for Handling Multi-Way Join Continuous Queries over Data Streams (데이터 스트림에서 다중 조인 연속질의의 효과적인 처리를 위한 전처리 기법)

  • Seo, Ki-Yeon;Lee, Joo-Il;Lee, Won-Suk
    • Journal of Internet Computing and Services / v.13 no.3 / pp.93-105 / 2012
  • A data stream is a series of tuples generated in real time in a continuous, massive, and volatile manner. As new information technologies emerge, stream processing methods are needed to handle data streams efficiently. In particular, an efficient evaluation method for multi-way joins would considerably improve the performance of a data stream management system, because the join is one of the most resource-consuming operators in query evaluation. In this paper, to evaluate a multi-way join continuous query efficiently, we propose a novel method that decreases the cost of the query by eliminating unsuccessful intermediate results. To this end, we propose a matrix-based structure for monitoring data streams, estimate the number of final result tuples of the query, and identify unsuccessful tuples using matrix multiplication operations. Using this information, we process the multi-way join continuous query efficiently by filtering out the unsuccessful tuples before the actual evaluation of the query.
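
The abstract does not describe its matrix structure in detail, so the following is only one plausible reading. It assumes three hypothetical streams R(b), S(b, c), and T(c) joined on b and c, with per-value tuple counts kept in a vector `r`, a matrix `s`, and a vector `t`. Under that assumption the number of join results is the product rᵀ S t, and an R tuple is "unsuccessful" when its key has no partner chain through S and T.

```python
from typing import List


def matvec(m: List[List[int]], v: List[int]) -> List[int]:
    """Matrix-vector product over integer counts."""
    return [sum(m_ij * v_j for m_ij, v_j in zip(row, v)) for row in m]


def join_size(r: List[int], s: List[List[int]], t: List[int]) -> int:
    """Number of results of R JOIN S JOIN T, computed as r^T (S t)."""
    partners = matvec(s, t)
    return sum(r_b * p_b for r_b, p_b in zip(r, partners))


def unsuccessful_r_keys(s: List[List[int]], t: List[int]) -> List[int]:
    """Join-key values b for which an R tuple cannot reach any final result."""
    return [b for b, p in enumerate(matvec(s, t)) if p == 0]


# Hypothetical window contents: b ranges over {0, 1, 2}, c over {0, 1}.
r = [2, 1, 4]                 # r[b]: R tuples with key b
s = [[1, 0], [0, 2], [0, 0]]  # s[b][c]: S tuples with keys (b, c)
t = [0, 3]                    # t[c]: T tuples with key c

print(join_size(r, s, t))         # 1 * 2 * 3 = 6 results, all through b = 1, c = 1
print(unsuccessful_r_keys(s, t))  # R tuples with b = 0 or b = 2 can be filtered early
```

Filtering on the zero entries of S t before the join runs avoids building intermediate R ⋈ S results that T would later discard.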

A Study on High Speed Image Rotation Algorithm using CUDA (CUDA를 이용한 고속 영상 회전 알고리즘에 관한 연구)

  • Kwon, Hee-Choul;Cho, Hyung-Jin;Kwon, Hee-Yong
    • The Journal of the Institute of Internet, Broadcasting and Communication / v.16 no.5 / pp.1-6 / 2016
  • Image rotation is one of the main pre-processing steps in image processing and image pattern recognition. It is conventionally implemented by multiplication with a rotation matrix. However, this requires many floating-point arithmetic operations and trigonometric function evaluations, so execution takes a long time. We propose a new high-speed image rotation algorithm that avoids these two major time-consuming operations. It uses just two shear translation operations, so it is very fast. In addition, we apply a parallel computing technique with CUDA, a massively parallel computing architecture for the GPUs that have recently become prevalent. Because the GPU is a dedicated graphics processor, it is excellent for parallel processing of pixels. We compare the proposed algorithm with the conventional rotation algorithm on images of various sizes. Experimental results show that the proposed algorithm is superior to the conventional one.
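
The abstract does not spell out its two shear operations. As a related and well-known illustration, not the authors' method, a rotation matrix factors exactly into three shears: R(θ) = Sx(−tan(θ/2)) · Sy(sin θ) · Sx(−tan(θ/2)), valid for |θ| < π. The check below verifies this numerically; all function names are made up for the sketch.

```python
import math
from typing import List

Matrix2 = List[List[float]]


def matmul2(a: Matrix2, b: Matrix2) -> Matrix2:
    """Product of two 2 x 2 matrices."""
    return [[sum(a[i][k] * b[k][j] for k in range(2)) for j in range(2)]
            for i in range(2)]


def shear_x(a: float) -> Matrix2:
    """Horizontal shear: x' = x + a * y, y' = y."""
    return [[1.0, a], [0.0, 1.0]]


def shear_y(b: float) -> Matrix2:
    """Vertical shear: x' = x, y' = b * x + y."""
    return [[1.0, 0.0], [b, 1.0]]


def rotation(theta: float) -> Matrix2:
    """Conventional counter-clockwise rotation matrix."""
    c, s = math.cos(theta), math.sin(theta)
    return [[c, -s], [s, c]]


def three_shear_rotation(theta: float) -> Matrix2:
    """Same rotation expressed as shear_x * shear_y * shear_x."""
    a = -math.tan(theta / 2.0)
    b = math.sin(theta)
    return matmul2(shear_x(a), matmul2(shear_y(b), shear_x(a)))


theta = math.radians(30.0)
direct = rotation(theta)
sheared = three_shear_rotation(theta)
max_error = max(abs(direct[i][j] - sheared[i][j])
                for i in range(2) for j in range(2))
print(max_error < 1e-12)
```

Each shear moves pixels along a single axis by an offset that is constant per row or per column, which is why shear-based rotation maps well onto row-parallel or column-parallel GPU threads.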

A Low-Power 2-D DCT/IDCT Architecture through Dynamic Control of Data Driven and Fine-Grain Partitioned Bit-Slices (데이터에 의한 구동과 세분화된 비트-슬라이스의 동적제어를 통한 저전력 2-D DCT/IDCT 구조)

  • Kim Kyeounsoo;Ryu Dae-Hyun
    • Journal of Korea Multimedia Society / v.8 no.2 / pp.201-210 / 2005
  • This paper proposes a power-efficient 2-dimensional DCT/IDCT architecture driven by the input data being processed. The architecture achieves low power by taking advantage of the typically large fraction of zero and small-valued inputs in video and image data compression. In particular, it skips multiplication by zero and dynamically activates or deactivates the required bit-slices of fine-grain bit-partitioned adders within multipliers and accumulators using simple input ANDing and bit-slice MASKing. The results of the 1-D DCT/IDCT carry no unnecessary sign extension bits (SEBs), which further reduces power in the matrix transposer. Results from bit-level transition activity simulations indicate a significant power reduction compared with conventional designs.
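
The zero-skipping idea can be mimicked in software to see how much work it removes. The sketch below is a software analogy only: it models the skipped multiplications of a 1-D inverse DCT and does not model the bit-slice activation or the hardware described in the abstract. The function names and the sample input are hypothetical.

```python
import math
from typing import List, Tuple


def dct_matrix(n: int) -> List[List[float]]:
    """Orthonormal DCT-II basis; row k is the k-th basis vector."""
    rows = []
    for k in range(n):
        scale = math.sqrt((1.0 if k == 0 else 2.0) / n)
        rows.append([scale * math.cos(math.pi * (2 * i + 1) * k / (2 * n))
                     for i in range(n)])
    return rows


def idct_zero_skip(coeffs: List[float]) -> Tuple[List[float], int]:
    """1-D inverse DCT that skips zero coefficients.

    Returns the reconstructed samples and the number of multiplications done.
    """
    n = len(coeffs)
    basis = dct_matrix(n)
    out = [0.0] * n
    mults = 0
    for k, c in enumerate(coeffs):
        if c == 0:
            continue  # data-driven skip: a zero input contributes nothing
        for i in range(n):
            out[i] += c * basis[k][i]
            mults += 1
    return out, mults


# Hypothetical block with only a DC coefficient, as is common after quantization.
samples, mults = idct_zero_skip([8.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0])
print(mults)  # 8 multiplications instead of 64 for a dense 8-point transform
```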


Change Area Detection using Color and Edge Gradient Covariance Features (색상과 에지 공분산 특징을 이용한 변화영역 검출)

  • Kim, Dong-Keun;Hwang, Chi-Jung
    • Journal of the Korea Academia-Industrial cooperation Society / v.17 no.1 / pp.717-724 / 2016
  • This paper proposes a change detection method based on the covariance matrices of color and edge gradient in a color video. The YCbCr color format was used instead of RGB. The color covariance matrix was calculated from the Cb and Cr channels, and the edge gradient covariance matrix was calculated from the Y channel. The covariance matrices were calculated efficiently at each pixel by computing the sum, the squared sum, and the sum of products of two values over a rectangular area using integral images from a background image. The background image was updated by a running average between the background image and the current frame. Change areas in the current frame against the background were detected using the Mahalanobis distance, a statistical distance measure based on covariance matrices. Experimental results on an expressway color video showed that the proposed approach can effectively detect change regions for color and edge gradients against the background.
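
The integral-image step can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' code: the two tiny channels are made-up data, the helper names are hypothetical, and the population covariance E[ab] − E[a]E[b] is used. The squared sum mentioned in the abstract is the special case where both channels are the same.

```python
from typing import List, Tuple

Image = List[List[float]]


def integral_image(img: Image) -> Image:
    """Summed-area table with an extra zero row and column."""
    h, w = len(img), len(img[0])
    sat = [[0.0] * (w + 1) for _ in range(h + 1)]
    for y in range(h):
        for x in range(w):
            sat[y + 1][x + 1] = (img[y][x] + sat[y][x + 1]
                                 + sat[y + 1][x] - sat[y][x])
    return sat


def rect_sum(sat: Image, top: int, left: int, bottom: int, right: int) -> float:
    """Sum over rows top..bottom-1 and columns left..right-1 in four lookups."""
    return sat[bottom][right] - sat[top][right] - sat[bottom][left] + sat[top][left]


def build_tables(a: Image, b: Image) -> Tuple[Image, Image, Image]:
    """Integral images of a, b, and the pixel-wise product a * b (built once)."""
    prod = [[p * q for p, q in zip(ra, rb)] for ra, rb in zip(a, b)]
    return integral_image(a), integral_image(b), integral_image(prod)


def rect_covariance(tables: Tuple[Image, Image, Image],
                    top: int, left: int, bottom: int, right: int) -> float:
    """Population covariance of the two channels over a rectangle."""
    sat_a, sat_b, sat_ab = tables
    n = (bottom - top) * (right - left)
    mean_a = rect_sum(sat_a, top, left, bottom, right) / n
    mean_b = rect_sum(sat_b, top, left, bottom, right) / n
    mean_ab = rect_sum(sat_ab, top, left, bottom, right) / n
    return mean_ab - mean_a * mean_b


# Made-up 2 x 2 channels standing in for Cb and Cr.
cb = [[1.0, 2.0], [3.0, 4.0]]
cr = [[2.0, 4.0], [6.0, 8.0]]
tables = build_tables(cb, cr)
print(rect_covariance(tables, 0, 0, 2, 2))
```

Once the three tables exist, the covariance of any rectangle costs a fixed twelve lookups regardless of its size, which is what makes a per-pixel covariance matrix affordable.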

High Expression of KIFC1 in Glioma Correlates with Poor Prognosis

  • Pengfei Xue;Juan Zheng;Rongrong Li;Lili Yan;Zhaohao Wang;Qingbin Jia;Lianqun Zhang;Xin Li
    • Journal of Korean Neurosurgical Society / v.67 no.3 / pp.364-375 / 2024
  • Objective: Kinesin family member C1 (KIFC1), a non-essential kinesin-like motor protein, has been found to serve a crucial role in supernumerary centrosome clustering and the progression of several human cancer types. However, the role of KIFC1 in glioma has rarely been reported. Thus, the present study aimed to investigate the role of KIFC1 in glioma progression. Methods: Online bioinformatics analysis was performed to determine the association between KIFC1 expression and clinical outcomes in glioma. Immunohistochemical staining was conducted to analyze the expression levels of KIFC1 in glioma and normal brain tissues. Furthermore, KIFC1 expression was knocked down in the glioma cell lines U251 and U87MG, and the functional roles of KIFC1 in cell proliferation, invasion and migration were analyzed using cell multiplication, wound healing and Transwell invasion assays, respectively. The autophagic flux and the expression levels of matrix metalloproteinase-2 (MMP2) were also determined using imaging flow cytometry, western blotting and a gelatin zymography assay. Results: KIFC1 expression levels were significantly upregulated in glioma tissues compared with normal brain tissues, and the expression levels were positively associated with tumor grade. Patients with glioma with low KIFC1 expression levels had a more favorable prognosis than patients with high KIFC1 expression levels. In vitro, KIFC1 knockdown not only inhibited the proliferation, migration and invasion of glioma cells, but also increased the autophagic flux and downregulated the expression levels of MMP2. Conclusion: Upregulation of KIFC1 expression may promote glioma progression, and KIFC1 may serve as a potential prognostic biomarker and possible therapeutic target for glioma.

A Multi-Level Accumulation-Based Rectification Method and Its Circuit Implementation

  • Son, Hyeon-Sik;Moon, Byungin
    • KSII Transactions on Internet and Information Systems (TIIS) / v.11 no.6 / pp.3208-3229 / 2017
  • Rectification is an essential procedure that simplifies the disparity extraction of stereo matching algorithms by removing vertical mismatches between left and right images. To support real-time stereo matching, studies have introduced several look-up table (LUT)-based and computational logic (CL)-based rectification approaches. However, to support high-resolution images, the LUT-based approach requires considerable memory resources, and the CL-based approach requires numerous hardware resources for its circuit implementation. Thus, this paper proposes a multi-level accumulation-based rectification method as a simple CL-based method, together with its circuit implementation. The proposed method, which includes distortion correction, reduces addition operations by 29% and removes multiplication operations by replacing the complex matrix computations and high-degree polynomial calculations of conventional rectification with simple multi-level accumulations. The proposed rectification circuit can rectify 1,280 × 720 stereo images at a frame rate of 135 fps at a clock frequency of 125 MHz. Because the circuit is fully pipelined, it continuously generates a pair of left and right rectified pixels every cycle after a 13-cycle latency plus the initial image buffering time. Experimental results show that the proposed method requires significantly fewer hardware resources than the conventional method, while the differences between the results of the proposed and conventional full rectifications are negligible.
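
The quoted frame rate follows from the clock frequency and image size if the pipeline emits one left/right pixel pair per cycle, as the abstract states. The short calculation below makes that arithmetic explicit. It ignores the 13-cycle latency, the initial buffering time, and any blanking intervals, which are assumptions of this sketch; the function name is hypothetical.

```python
def max_frame_rate(clock_hz: float, width: int, height: int,
                   pixel_pairs_per_cycle: float = 1.0) -> float:
    """Upper bound on frames per second for a fully pipelined circuit."""
    return clock_hz * pixel_pairs_per_cycle / (width * height)


fps = max_frame_rate(125e6, 1280, 720)
print(f"{fps:.2f} fps")  # 125,000,000 / 921,600 pixels per frame
```

The result is slightly above 135 fps, consistent with the figure reported in the abstract.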

A Systematic Generation of Register-Reuse Chains (레지스터 재활용 사슬의 체계적 생성)

  • Lee, Hyuk-Jae
    • The Transactions of the Korean Institute of Electrical Engineers A / v.48 no.12 / pp.1564-1574 / 1999
  • To improve the efficiency of optimizing compilers, the integration of register allocation and instruction scheduling has been studied extensively. One promising integration technique is register allocation based on register-reuse chains. However, the generation of register-reuse chains in the previous approach was not completely systematic and consequently created unnecessary dependencies that restrict instruction scheduling. This paper proposes a new register allocation technique based on a systematic generation of register-reuse chains. The first phase of the proposed technique generates register-reuse chains that are optimal in the sense that no additional dependencies are created. Thus, register allocation can be done without restricting instruction scheduling. When the optimal register-reuse chains require more registers than are available, the second phase reduces the number of required registers by merging register-reuse chains. Chain merging always generates additional dependencies and consequently constrains the execution order of instructions. A heuristic is developed for the second phase to reduce the additional dependencies created by merging chains. For a matrix multiplication program, the number of registers resulting from the first phase is small enough to fit into the available registers for most basic blocks. In addition, it is shown that the proposed merging heuristic of the second phase reduces the restriction on instruction scheduling.
