• Title/Summary/Keyword: GPU Memory

Large-Scale Ultrasound Volume Rendering using Bricking (블리킹을 이용한 대용량 초음파 볼륨 데이터 렌더링)

  • Kim, Ju-Hwan;Kwon, Koo-Joo;Shin, Byeong-Seok
    • Journal of the Korea Society of Computer and Information / v.13 no.7 / pp.117-126 / 2008
  • Recent advances in medical imaging technology have enabled high-resolution data acquisition, so visualizing such large data sets on standard graphics hardware has become a popular research theme. Among many visualization techniques, we focus on the bricking method, which divides the entire volume into smaller bricks and renders them in order. Since it swaps bricks between main memory and GPU memory on the fly, the number of these memory-swapping operations has to be minimized to achieve good performance. Moreover, because the original bricking algorithm was designed for regular volume data such as CT and MR, applying it to ultrasound volume data, which is based on a toroidal coordinate space, reveals some performance degradation: in areas near brick boundaries, an orthogonal viewing ray can intersect a single brick twice, causing that brick to be uploaded to GPU memory twice in a single frame. To avoid this redundancy, we divide the volume into bricks that overlap one another. In this paper, we suggest a formula to determine the appropriate size of the shared area between bricks. Using our formula, we minimize memory bandwidth and, at the same time, achieve better rendering performance.

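The overlap idea lends itself to a short illustration. The following host-side C++ sketch partitions a volume into bricks that share a fixed number of boundary voxels; the overlap width here is a hypothetical parameter, whereas the paper derives the appropriate value from its formula, which the abstract does not reproduce.

    #include <algorithm>
    #include <cstdio>
    #include <vector>

    struct Brick { int x0, y0, z0, x1, y1, z1; };   // half-open voxel bounds

    // Partition a dim^3 volume into bricks of size brickSize whose origins are
    // spaced (brickSize - overlap) apart, so neighboring bricks share voxels
    // and a viewing ray never has to re-enter a brick it already left.
    std::vector<Brick> makeBricks(int dim, int brickSize, int overlap) {
        std::vector<Brick> bricks;
        int step = brickSize - overlap;
        for (int z = 0; z < dim; z += step)
            for (int y = 0; y < dim; y += step)
                for (int x = 0; x < dim; x += step)
                    bricks.push_back({x, y, z,
                                      std::min(x + brickSize, dim),
                                      std::min(y + brickSize, dim),
                                      std::min(z + brickSize, dim)});
        return bricks;
    }

    int main() {
        // 512^3 volume, 128^3 bricks, hypothetical 8-voxel overlap
        std::printf("%zu bricks\n", makeBricks(512, 128, 8).size());
        return 0;
    }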

A Study on GPU Computing of Bi-conjugate Gradient Method for Finite Element Analysis of the Incompressible Navier-Stokes Equations (유한요소 비압축성 유동장 해석을 위한 이중공액구배법의 GPU 기반 연산에 대한 연구)

  • Yoon, Jong Seon;Jeon, Byoung Jin;Jung, Hye Dong;Choi, Hyoung Gwon
    • Transactions of the Korean Society of Mechanical Engineers B / v.40 no.9 / pp.597-604 / 2016
  • A parallel algorithm for the bi-conjugate gradient method was developed based on CUDA for parallel computation of the incompressible Navier-Stokes equations. The governing equations were discretized using the splitting P2P1 finite element method. An asymmetric stenotic flow problem was solved to validate the proposed algorithm, and the parallel performance of the GPU was then examined by measuring elapsed times. Further, GPU performance for sparse matrix-vector multiplication was investigated with a matrix from a fluid-structure interaction problem. A kernel was written to compute, in parallel, the inner product of each row of the sparse matrix with a vector, and it was optimized using both parallel reduction and memory coalescing. In the kernel construction, the effect of the warp on parallel performance was also examined. The GPU computation was more than 7 times faster than a single CPU in double precision.
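
The kernel design described here (one inner product per sparse-matrix row, optimized with parallel reduction and coalesced reads) matches the well-known warp-per-row CSR SpMV pattern. A minimal CUDA sketch, assuming CSR storage with hypothetical names rowPtr/colIdx/val:

    // y = A*x for a CSR matrix, one warp (32 threads) per row.
    __global__ void spmvCsrWarp(int nRows, const int* rowPtr, const int* colIdx,
                                const double* val, const double* x, double* y) {
        int warpId = (blockIdx.x * blockDim.x + threadIdx.x) / 32;
        int lane   = threadIdx.x & 31;
        if (warpId >= nRows) return;              // warp-uniform exit

        double sum = 0.0;
        // Lanes read 32 consecutive nonzeros at a time: coalesced accesses.
        for (int j = rowPtr[warpId] + lane; j < rowPtr[warpId + 1]; j += 32)
            sum += val[j] * x[colIdx[j]];

        // Warp-level parallel reduction of the 32 partial sums.
        for (int off = 16; off > 0; off >>= 1)
            sum += __shfl_down_sync(0xffffffffu, sum, off);

        if (lane == 0) y[warpId] = sum;           // lane 0 holds the row result
    }

Assigning a warp rather than a single thread to each row is the usual trade-off: it keeps global loads coalesced for rows with many nonzeros at the cost of idle lanes on very short rows.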

Acceleration of Phase Measuring Profilometry using GPU (GPU를 이용한 위상 측정법의 가속화)

  • Kim, Ho-Joong;Cho, Tai-Hoon
    • Journal of the Korea Institute of Information and Communication Engineering / v.21 no.12 / pp.2285-2290 / 2017
  • Automation systems have been evolving in many areas of industry in recent years, and the need for height inspection of objects by 3D measurement is gradually increasing. Among the various 3D measurement methods, this paper discusses phase measuring profilometry (PMP). PMP obtains the height of an object from the phase values of a projected fringe pattern. Since PMP is an algorithm requiring a large amount of computation, a method for solving it efficiently is needed. In this paper, we propose using NVIDIA's CUDA to address this problem, together with the pinned memory and streams that CUDA provides. This greatly improves measurement speed while maintaining accuracy. Finally, we demonstrate the performance of the proposed method through experiments.
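
To illustrate the pinned-memory-plus-streams idea this abstract mentions, here is a minimal CUDA sketch that overlaps host-device transfers with computation across two streams. The kernel body is a placeholder, not the actual PMP phase computation.

    #include <cuda_runtime.h>

    // Placeholder per-pixel kernel; the real PMP phase math is not shown here.
    __global__ void computePhase(const float* in, float* out, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) out[i] = 0.5f * in[i];
    }

    int main() {
        const int N = 1 << 20, CHUNK = N / 2;
        float *hIn, *hOut, *dIn, *dOut;
        // Pinned (page-locked) host buffers enable truly asynchronous copies.
        cudaHostAlloc((void**)&hIn,  N * sizeof(float), cudaHostAllocDefault);
        cudaHostAlloc((void**)&hOut, N * sizeof(float), cudaHostAllocDefault);
        cudaMalloc((void**)&dIn,  N * sizeof(float));
        cudaMalloc((void**)&dOut, N * sizeof(float));

        cudaStream_t s[2];
        for (int k = 0; k < 2; ++k) cudaStreamCreate(&s[k]);

        // Each stream handles one half: its H2D copy, kernel, and D2H copy
        // overlap with the other stream's work.
        for (int k = 0; k < 2; ++k) {
            int off = k * CHUNK;
            cudaMemcpyAsync(dIn + off, hIn + off, CHUNK * sizeof(float),
                            cudaMemcpyHostToDevice, s[k]);
            computePhase<<<(CHUNK + 255) / 256, 256, 0, s[k]>>>(
                dIn + off, dOut + off, CHUNK);
            cudaMemcpyAsync(hOut + off, dOut + off, CHUNK * sizeof(float),
                            cudaMemcpyDeviceToHost, s[k]);
        }
        cudaDeviceSynchronize();

        for (int k = 0; k < 2; ++k) cudaStreamDestroy(s[k]);
        cudaFreeHost(hIn); cudaFreeHost(hOut);
        cudaFree(dIn); cudaFree(dOut);
        return 0;
    }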

Building a Dynamic Analyzer for CUDA based System.

  • SALAH T. ALSHAMMARI
    • International Journal of Computer Science & Network Security / v.23 no.8 / pp.77-84 / 2023
  • The use of GPUs in general-purpose computing is on the rise owing to their increasing programmability and performance. Tools such as NVIDIA's CUDA allow programmers to implement algorithms in a C-like language for execution on the graphics processing unit (GPU). Unfortunately, parallel programs suffer from many performance and correctness bugs, and CUDA tool support for finding them has not yet been fully realized. A dynamic analyzer that finds performance and correctness bugs in CUDA programs facilitates the development of sophisticated applications, especially under modern computing requirements: race-condition bugs affect program correctness, while shared-memory bank conflicts degrade overall performance. The technique instruments a program so that the memory locations accessed by different threads can be tracked and the code checked for bugs. The instrumented source code runs directly in CUDA's device emulation mode, which reports all detected errors to the user. This degree of automation helps programmers find subtle bugs in highly complex programs that cannot be analyzed manually.
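
As a rough illustration of the kind of instrumentation this abstract describes, the CUDA sketch below shadows shared-memory writes to flag racy accesses and checks a warp's indices for bank conflicts. All names are illustrative; this is not the paper's tool, and __match_any_sync requires compute capability 7.0 or later.

    #include <cstdio>

    // Shadow state: last thread to write each shared-memory word.
    // Assume it is reset to -1 from the host (cudaMemcpyToSymbol) before launch.
    __device__ int lastWriter[1024];

    // Wrap every shared-memory store so conflicting writers are reported.
    __device__ void instrumentedWrite(float* shmem, int idx, float v) {
        int prev = atomicExch(&lastWriter[idx], (int)threadIdx.x);
        if (prev != -1 && prev != (int)threadIdx.x)
            printf("possible race: threads %d and %d wrote word %d\n",
                   prev, (int)threadIdx.x, idx);
        shmem[idx] = v;
    }

    // Bank-conflict check for one warp: lanes that map to the same bank
    // (idx % 32 on current GPUs) but different addresses will serialize.
    __device__ bool hasBankConflict(int idx) {
        unsigned sameBank = __match_any_sync(__activemask(), idx % 32);
        unsigned sameAddr = __match_any_sync(__activemask(), idx);
        return sameBank != sameAddr;   // same bank but not the same word
    }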

Practical methods for GPU-based whole-core Monte Carlo depletion calculation

  • Kyung Min Kim;Namjae Choi;Han Gyu Lee;Han Gyu Joo
    • Nuclear Engineering and Technology / v.55 no.7 / pp.2516-2533 / 2023
  • Several practical methods for accelerating the depletion calculation in the GPU-based Monte Carlo (MC) code PRAGMA are presented, including the multilevel spectral collapse method and a vectorized Chebyshev rational approximation method (CRAM). Generating the microscopic reaction rates for each nuclide needed to construct the depletion matrix of the Bateman equation requires either enormous memory access or tremendous physical memory, both of which are quite burdensome on GPUs. A new method called multilevel spectral collapse is therefore proposed, which combines two types of spectra to generate microscopic reaction rates: an ultrafine spectrum for an entire fuel pin and coarser spectra for each depletion region. Errors in reaction rates introduced by this method are mitigated by a hybrid use of direct online reaction-rate tallies for several important fissile nuclides. The linear systems arising in the CRAM solution process are solved by the Gauss-Seidel method, which can easily be vectorized on GPUs. With the accelerated depletion methods, only about 10% of the MC calculation time is consumed by depletion, so an accurate full-core cycle depletion calculation for a commercial power reactor (BEAVRS) can be done in 16 h with 24 consumer-grade GPUs.
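
The remark that Gauss-Seidel "can easily be vectorized on GPUs" refers to parallelism across the many independent depletion regions rather than within one solve. A hedged CUDA sketch follows, real-valued and dense for brevity (actual CRAM systems involve complex poles and sparse depletion matrices):

    // One thread per depletion system: each performs Gauss-Seidel sweeps on
    // its own small dense n-by-n system A x = b. Parallelism comes from the
    // many independent systems, which is what vectorizes well on a GPU.
    __global__ void gaussSeidelBatch(int nSys, int n, const double* A,
                                     const double* b, double* x, int sweeps) {
        int s = blockIdx.x * blockDim.x + threadIdx.x;
        if (s >= nSys) return;
        const double* As = A + (size_t)s * n * n;   // this system's matrix
        const double* bs = b + (size_t)s * n;
        double* xs = x + (size_t)s * n;

        for (int it = 0; it < sweeps; ++it)
            for (int i = 0; i < n; ++i) {           // one Gauss-Seidel sweep
                double r = bs[i];
                for (int j = 0; j < n; ++j)
                    if (j != i) r -= As[i * n + j] * xs[j];
                xs[i] = r / As[i * n + i];          // uses already-updated xs[j<i]
            }
    }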

Research Prospects on Processing-In-Memory Technology for Cloud Trusted Execution (메모리내 연산 기술의 클라우드 신뢰실행 관련 연구 전망)

  • Suhwan Shin;Hojoon Lee
    • Review of KIISC / v.33 no.5 / pp.11-16 / 2023
  • Today's cloud workloads suffer from a bottleneck in which memory bandwidth cannot keep up with processor speed, driven by the rapid growth of artificial intelligence and big-data applications. Computer architectures and operating systems are evolving to address this so-called Memory Wall Problem. Among the most prominent recent technologies, Processing-In-Memory (PIM) places a processor inside the memory device so that data is processed where it resides, without being moved to the main processor. This speeds up the processing of large data sets and, by reducing the load on the main memory bus, can also improve the overall performance of cloud systems. Meanwhile, cloud architectures are changing in response to another demand: security. Today's computer architectures and accelerators such as GPUs aim to isolate and protect sensitive computations in the cloud through support for Trusted Execution; Intel's SGX and the confidential computing features of NVIDIA GPUs represent this trend. With recent studies proposing new directions for PIM-based security, this paper surveys the latest research applying PIM to cloud Trusted Execution and offers an outlook on future work. By discussing trends in PIM technology, research specializing PIM for security, and the challenges that remain, we summarize the newly emerging PIM-based security technologies and provide future prospects.

Real-time Depth Image Refinement using Hierarchical Joint Bilateral Filter (계층적 결합형 양방향 필터를 이용한 실시간 깊이 영상 보정 방법)

  • Shin, Dong-Won;Ho, Yo-Sung
    • Journal of Broadcast Engineering / v.19 no.2 / pp.140-147 / 2014
  • In this paper, we propose a method for real-time depth image refinement. In order to improve the quality of the depth map acquired from a Kinect camera, we employ constant memory and texture memory, which are well suited to 2D image processing on the graphics processing unit (GPU). In addition, we apply the joint bilateral filter (JBF) in parallel to accelerate the overall execution, and we apply the JBF hierarchically using the compute unified device architecture (CUDA) to further enhance the quality of the depth image. Finally, we obtain the refined depth image. Experimental results show that the proposed real-time depth image refinement algorithm improves the subjective quality of the depth image while processing 260 frames per second.
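
A single-level CUDA sketch of the joint bilateral filter makes the memory choices concrete: the precomputed spatial kernel sits in constant memory and the guide image in a texture, both read-only and heavily reused. The hierarchical (coarse-to-fine) part and the paper's exact parameters are omitted; names and the radius are illustrative.

    #define R 4                                        // filter radius (illustrative)
    __constant__ float cSpatial[2 * R + 1][2 * R + 1]; // precomputed spatial Gaussian

    // Refine a noisy depth map using range weights taken from a color/intensity
    // guide image bound to a texture (cached, read-only 2D access).
    __global__ void jointBilateral(cudaTextureObject_t guide, const float* depth,
                                   float* out, int w, int h, float sigmaR) {
        int x = blockIdx.x * blockDim.x + threadIdx.x;
        int y = blockIdx.y * blockDim.y + threadIdx.y;
        if (x >= w || y >= h) return;

        float gc = tex2D<float>(guide, x + 0.5f, y + 0.5f);
        float sum = 0.0f, wsum = 0.0f;
        for (int dy = -R; dy <= R; ++dy)
            for (int dx = -R; dx <= R; ++dx) {
                int nx = min(max(x + dx, 0), w - 1);
                int ny = min(max(y + dy, 0), h - 1);
                float gn = tex2D<float>(guide, nx + 0.5f, ny + 0.5f);
                // Range weight from the clean guide, not the noisy depth.
                float wr  = __expf(-(gc - gn) * (gc - gn)
                                   / (2.0f * sigmaR * sigmaR));
                float wgt = cSpatial[dy + R][dx + R] * wr;
                sum  += wgt * depth[ny * w + nx];
                wsum += wgt;
            }
        out[y * w + x] = sum / wsum;
    }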

Analysis of Tensor Processing Unit and Simulation Using Python (텐서 처리부의 분석 및 파이썬을 이용한 모의실행)

  • Lee, Jongbok
    • The Journal of the Institute of Internet, Broadcasting and Communication / v.19 no.3 / pp.165-171 / 2019
  • The study of computer architecture has shown that major improvements in cost-energy performance stem from domain-specific hardware. This paper analyzes the tensor processing unit (TPU), an ASIC that accelerates the inference of artificial neural networks (NN). The core of the TPU is a MAC matrix multiply unit capable of high-speed operation, paired with software-managed on-chip memory. The TPU's execution model meets the response-time requirements of artificial neural networks better than existing CPU and GPU execution models, with small area and low power consumption, even though it has many MACs and a large memory. On TensorFlow benchmarks, the TPU achieves higher performance and better power efficiency than the CPU or GPU. In this paper, we analyze the TPU, simulate OpenTPU, a Python model of it, and synthesize the matrix multiplication unit, which is its key hardware.
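
The TPU's MAC matrix unit streams operands through software-managed on-chip buffers; the nearest GPU-side analogue, offered here for intuition only, is a tiled matrix multiply that stages tiles in shared memory. A sketch (not OpenTPU code; assumes n is a multiple of the tile size):

    #define TILE 16
    // Tiled matrix multiply: tiles of A and B are staged in shared memory
    // (the GPU's software-managed on-chip store), then each thread runs a
    // multiply-accumulate chain, loosely mirroring a MAC array's dataflow.
    __global__ void matmulTiled(const float* A, const float* B, float* C, int n) {
        __shared__ float sA[TILE][TILE], sB[TILE][TILE];
        int row = blockIdx.y * TILE + threadIdx.y;
        int col = blockIdx.x * TILE + threadIdx.x;
        float acc = 0.0f;
        for (int t = 0; t < n; t += TILE) {
            sA[threadIdx.y][threadIdx.x] = A[row * n + t + threadIdx.x];
            sB[threadIdx.y][threadIdx.x] = B[(t + threadIdx.y) * n + col];
            __syncthreads();                   // tiles fully staged on chip
            for (int k = 0; k < TILE; ++k)
                acc += sA[threadIdx.y][k] * sB[k][threadIdx.x];
            __syncthreads();                   // done with this pair of tiles
        }
        C[row * n + col] = acc;
    }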

A new warp scheduling technique for improving the performance of GPUs by utilizing MSHR information (GPU 성능 향상을 위한 MSHR 정보 기반 워프 스케줄링 기법)

  • Kim, Gwang Bok;Kim, Jong Myon;Kim, Cheol Hong
    • The Journal of Korean Institute of Next Generation Computing / v.13 no.3 / pp.72-83 / 2017
  • GPUs can provide high throughput and hide latency by executing many warps in parallel. The MSHRs (Miss Status Holding Registers) of the L1 data cache track cache-miss requests until the required data is serviced from lower-level memory. In recent GPUs, excessive requests for cache resources cause GPU resources to be underutilized because cache-resource reservations fail. In this paper, we propose a new warp scheduling technique to reduce stall cycles under MSHR resource shortage. The cache miss rate of each warp is predicted based on the observation that a warp shows a similar miss rate over a long period. Warps with low predicted miss rates, or computation-intensive warps, are given high priority for issue when the MSHR is full. Our proposal improves GPU performance by using cache resources more efficiently, based on cache-miss-rate prediction and monitoring of the MSHR entries. According to our experimental results, reservation-fail cycles are reduced by 25.7% and IPC is increased by 6.2% with the proposed scheduling technique compared to a loose round-robin scheduler.
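
Since the technique lives inside the GPU's warp scheduler (typically modeled in a simulator, not in user code), host-side C++ pseudologic is the honest way to sketch it. The names and the full-MSHR condition below are illustrative, not the paper's exact implementation:

    #include <vector>

    struct Warp {
        int    id;
        bool   ready;              // has an instruction ready to issue
        double predictedMissRate;  // running estimate of this warp's L1 miss rate
    };

    // Pick the next warp to issue. Under MSHR pressure (a new miss could not
    // reserve an entry), prefer warps predicted to miss rarely; otherwise the
    // first ready warp stands in for the baseline round-robin order.
    int pickWarp(const std::vector<Warp>& warps, int mshrUsed, int mshrCapacity) {
        bool mshrFull = (mshrUsed >= mshrCapacity);
        int best = -1;
        for (int i = 0; i < (int)warps.size(); ++i) {
            if (!warps[i].ready) continue;
            if (best < 0 ||
                (mshrFull && warps[i].predictedMissRate
                                 < warps[best].predictedMissRate))
                best = i;
        }
        return best < 0 ? -1 : warps[best].id;
    }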

Performance Enhancement and Evaluation of AES Cryptography using OpenCL on Embedded GPGPU (OpenCL을 이용한 임베디드 GPGPU환경에서의 AES 암호화 성능 개선과 평가)

  • Lee, Minhak;Kang, Woochul
    • KIISE Transactions on Computing Practices / v.22 no.7 / pp.303-309 / 2016
  • Recently, an increasing number of embedded processors, such as the ARM Mali, have begun to support GPGPU programming frameworks such as OpenCL, so GPGPU technologies that have been used in PC and server environments are starting to be applied to embedded systems. However, many embedded systems have architectural characteristics different from traditional PCs, and low power consumption and real-time performance are also important performance metrics in these systems. In this paper, we implement a parallel AES cryptographic algorithm for a modern embedded GPU using OpenCL, a standard parallel computing framework, and compare its performance against various baselines. Experimental results show that the parallel GPU AES implementation can reduce the response time to about 1/150 and the energy consumption to approximately 1/290 of an OpenMP implementation when 1000 KB of input data is applied. Furthermore, an additional 100% performance improvement of the parallel AES algorithm was achieved by exploiting characteristics of embedded GPUs, such as eliminating data copies between GPU and host memory. Our results also show that larger input data yields greater performance improvement.
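
The copy-elimination trick credited here for the extra 100% speedup relies on the embedded GPU sharing physical memory with the CPU. The paper uses OpenCL on ARM Mali; a loosely analogous CUDA sketch using mapped, pinned "zero-copy" memory is shown below, with a placeholder kernel standing in for the real AES rounds.

    #include <cuda_runtime.h>

    // Placeholder kernel: a byte XOR stands in for the real AES rounds.
    __global__ void aesEncryptBlocks(const unsigned char* in, unsigned char* out,
                                     int nBlocks) {
        int b = blockIdx.x * blockDim.x + threadIdx.x;
        if (b >= nBlocks) return;
        for (int i = 0; i < 16; ++i)
            out[b * 16 + i] = in[b * 16 + i] ^ 0xAC;
    }

    int main() {
        const int nBlocks = 1000 * 1024 / 16;   // 1000 KB of 16-byte AES blocks
        unsigned char *hIn, *hOut, *dIn, *dOut;

        cudaSetDeviceFlags(cudaDeviceMapHost);
        // Mapped pinned buffers: the GPU addresses host memory directly, so no
        // explicit cudaMemcpy is needed; on SoCs where CPU and GPU share DRAM
        // this removes the copies entirely.
        cudaHostAlloc((void**)&hIn,  nBlocks * 16, cudaHostAllocMapped);
        cudaHostAlloc((void**)&hOut, nBlocks * 16, cudaHostAllocMapped);
        cudaHostGetDevicePointer((void**)&dIn,  hIn,  0);
        cudaHostGetDevicePointer((void**)&dOut, hOut, 0);

        aesEncryptBlocks<<<(nBlocks + 255) / 256, 256>>>(dIn, dOut, nBlocks);
        cudaDeviceSynchronize();

        cudaFreeHost(hIn); cudaFreeHost(hOut);
        return 0;
    }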