Search | Korea Science

Analysis of Programming Techniques for Creating Optimized CUDA Software (최적화된 CUDA 소프트웨어 제작을 위한 프로그래밍 기법 분석)

Kim, Sung-Soo;Kim, Dong-Heon;Woo, Sang-Kyu;Ihm, In-Sung
- Journal of KIISE: Computing Practices and Letters / v.16 no.7 / pp.775-787 / 2010
Unlike general-purpose CPUs, GPUs are specialized as many-core streaming processors, and they are increasingly replacing CPUs across a growing range of computations thanks to their outstanding parallel computing capacity. In response to this trend, NVIDIA has recently released a parallel computing architecture called CUDA (Compute Unified Device Architecture), which offers a flexible GPU programming environment for GPGPU (General-Purpose GPU) computing. In general, programmers who use the CUDA API need a clear understanding of many aspects of the GPU's computing architecture to produce efficient parallel software. In this article, we explain several optimization techniques for CUDA programming that we have verified through extensive experimentation and trial and error, and we review how those techniques affect execution performance. In particular, we use a specific problem as an example to analyze several factors that affect performance, such as effective access to the hierarchical memory system, processor occupancy, and latency hiding. In conclusion, we present several guidelines that may be applied effectively in CUDA-based parallel programming.
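The abstract above names processor occupancy as one factor that affects CUDA performance. As a rough illustration only, the sketch below estimates occupancy as the ratio of resident warps to the maximum warps a multiprocessor can hold, limited by registers, shared memory, and block slots. The `SMLimits` values and the function name `estimate_occupancy` are hypothetical and do not describe any particular GPU or the authors' method; real hardware also applies allocation granularities that are ignored here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SMLimits:
    """Per-multiprocessor limits. Illustrative values, not a real GPU."""
    max_warps: int = 32
    max_blocks: int = 8
    registers: int = 16384
    shared_mem_bytes: int = 16384
    warp_size: int = 32


def estimate_occupancy(threads_per_block, regs_per_thread, smem_per_block,
                       sm=SMLimits()):
    """Return resident warps / max warps for one multiprocessor (0.0 to 1.0)."""
    # Threads are scheduled in whole warps, so round up.
    warps_per_block = -(-threads_per_block // sm.warp_size)

    # Each resource caps how many blocks can be resident at once.
    limits = [sm.max_blocks, sm.max_warps // warps_per_block]
    if regs_per_thread > 0:
        regs_per_block = regs_per_thread * warps_per_block * sm.warp_size
        limits.append(sm.registers // regs_per_block)
    if smem_per_block > 0:
        limits.append(sm.shared_mem_bytes // smem_per_block)

    resident_blocks = min(limits)
    return resident_blocks * warps_per_block / sm.max_warps
```

Under these assumed limits, a kernel with 256 threads per block, 16 registers per thread, and 4 KB of shared memory per block fills the multiprocessor, while doubling the register count to 32 halves the estimate because registers become the limiting resource.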

Spark Framework Based on a Heterogenous Pipeline Computing with OpenCL (OpenCL을 활용한 이기종 파이프라인 컴퓨팅 기반 Spark 프레임워크)

Kim, Daehee;Park, Neungsoo
- The Transactions of The Korean Institute of Electrical Engineers / v.67 no.2 / pp.270-276 / 2018
Apache Spark is one of the high-performance in-memory computing frameworks for big-data processing. Recently, general-purpose computing on graphics processing units (GPGPU) has been applied to the Apache Spark framework to improve its performance. Previous Spark-GPGPU frameworks focus on overcoming the implementation difficulty that results from the difference between the computation environments of GPGPU and the Spark framework. In this paper, we propose a Spark framework based on heterogeneous pipeline computing with OpenCL to further improve performance. The proposed framework overlaps the CPU's Java-to-native memory copies with CPU-GPU communications (DMA) and GPU kernel computations to hide the CPU idle time. In addition, the CPU-GPU communication buffers are implemented as switching dual buffers, which reduce the mapped memory region and thereby decrease the memory-mapping overhead. Experimental results showed that the proposed framework was up to 2.13 times faster than the previous Spark framework using OpenCL.
https://doi.org/10.5370/KIEE.2018.67.2.270
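The dual-buffer pipelining described in the abstract above can be sketched in a few lines. This is a minimal illustration, not the authors' framework: the producer thread stands in for the Java-to-native copy, the caller-supplied `process` function stands in for the transfer and GPU kernel, and the names `run_pipeline`, `free`, and `filled` are hypothetical. With exactly two preallocated buffers, one can be filled while the other is being processed. Because of Python's global interpreter lock, the sketch shows the hand-off pattern rather than a real speedup.

```python
import queue
import threading


def run_pipeline(chunks, process, buffer_size):
    """Process chunks through two reusable buffers that swap roles."""
    chunks = list(chunks)
    for chunk in chunks:
        if len(chunk) > buffer_size:
            raise ValueError("chunk does not fit into a buffer")

    free = queue.Queue()    # buffers ready to be filled
    filled = queue.Queue()  # buffers ready to be processed
    for _ in range(2):      # exactly two buffers: the "dual buffer"
        free.put(bytearray(buffer_size))

    def producer():
        # Stand-in for the copy stage: fill whichever buffer is free.
        for chunk in chunks:
            buf = free.get()
            n = len(chunk)
            buf[:n] = chunk
            filled.put((buf, n))
        filled.put(None)    # sentinel: no more data

    worker = threading.Thread(target=producer)
    worker.start()

    results = []
    while True:
        item = filled.get()
        if item is None:
            break
        buf, n = item
        # Stand-in for the compute stage; runs while the producer refills
        # the other buffer.
        results.append(process(bytes(buf[:n])))
        free.put(buf)       # hand the buffer back for reuse
    worker.join()
    return results
```

Reusing two fixed buffers rather than allocating one per chunk mirrors the abstract's point that a smaller mapped region reduces mapping overhead.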

Sub-Frame Analysis-based Object Detection for Real-Time Video Surveillance

Jang, Bum-Suk;Lee, Sang-Hyun
- International Journal of Internet, Broadcasting and Communication / v.11 no.4 / pp.76-85 / 2019
We introduce a vision-based object detection method for real-time video surveillance systems in low-end edge computing environments. Recently, the accuracy of object detection has improved thanks to approaches based on deep learning algorithms such as the Region Convolutional Neural Network (R-CNN), which uses two stages for inference. On the other hand, one-stage detection algorithms such as single-shot detection (SSD) and you only look once (YOLO) have been developed at the expense of some accuracy and can be used in real-time systems. However, high-performance hardware such as General-Purpose computing on Graphics Processing Units (GPGPU) is still required to achieve excellent object detection performance and speed. To address this hardware requirement, which is burdensome for low-end edge computing environments, we propose a sub-frame analysis method for object detection. Specifically, we divide a whole image frame into smaller ones and then run inference on them with a Convolutional Neural Network (CNN) based image detection network, which is much faster than a conventional network designed for full-frame images. With the proposed method, we reduced the computational requirement significantly without losing throughput or object detection accuracy.
https://doi.org/10.7236/IJIBC.2019.11.4.76
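The frame-splitting step in the abstract above might look like the following sketch. The abstract does not say how the frame is divided, so a uniform grid is assumed here, and the helper names `split_into_subframes` and `to_full_frame` are hypothetical. Each sub-frame is a `(top, left, bottom, right)` box, and detections found inside a sub-frame are shifted back into full-frame coordinates.

```python
def split_into_subframes(height, width, rows, cols):
    """Return (top, left, bottom, right) boxes tiling the frame in a grid.

    Boundaries use integer division so the boxes cover every pixel exactly
    once, even when the frame size is not divisible by the grid size.
    """
    if rows <= 0 or cols <= 0:
        raise ValueError("rows and cols must be positive")
    boxes = []
    for r in range(rows):
        top = r * height // rows
        bottom = (r + 1) * height // rows
        for c in range(cols):
            left = c * width // cols
            right = (c + 1) * width // cols
            boxes.append((top, left, bottom, right))
    return boxes


def to_full_frame(detection, subframe):
    """Shift a detection box from sub-frame to full-frame coordinates."""
    top, left, _, _ = subframe
    d_top, d_left, d_bottom, d_right = detection
    return (d_top + top, d_left + left, d_bottom + top, d_right + left)
```

A plain grid can cut an object that straddles a boundary in two; overlapping tiles are a common remedy, but whether the paper uses them is not stated in the abstract.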

A Study of The GPGPU Performance (범용 그래픽 처리장치 (GPGPU)의 성능에 대한 연구)

Lee, Jongbok
- The Journal of the Institute of Internet, Broadcasting and Communication / v.18 no.6 / pp.201-206 / 2018
As artificial intelligence and big data technologies have developed recently, the importance of the GPGPU, a general-purpose graphics processing unit, has been emphasized. In addition, demand for mining equipment to obtain bitcoins, an application of blockchain technology, has made GPGPUs scarce and sharply increased their price. If a GPGPU can be simulated precisely, it is possible to conduct experiments on various GPGPU types and analyze their performance without purchasing expensive hardware. In this paper, we investigate the configuration of a GPGPU simulator and measure the performance of various benchmark programs using GPGPU-Sim.
https://doi.org/10.7236/JIIBC.2018.18.6.201

An Edge AI Device based Intelligent Transportation System

Jeong, Youngwoo;Oh, Hyun Woo;Kim, Soohee;Lee, Seung Eun
- Journal of information and communication convergence engineering / v.20 no.3 / pp.166-173 / 2022
Recently, studies have been conducted on intelligent transportation systems (ITS) that provide safety and convenience to humans. The systems that compose an ITS adopt architectures based on cloud computing, which consists of high-performance general-purpose processors or graphics processing units. However, an architecture that uses only cloud computing requires high network bandwidth and consumes much power. Therefore, applying edge computing to ITS is essential for solving these problems. In this paper, we propose an ITS based on an edge artificial intelligence (AI) device. Edge AI, which is applicable to various systems in ITS, has been applied to license plate recognition. We implemented the edge AI on a field-programmable gate array (FPGA). The accuracy of the edge AI for license plate recognition was 0.94. Finally, we synthesized the edge AI logic with Magnachip/Hynix 180 nm CMOS technology, and the power consumption measured using the Synopsys Design Compiler tool was 482.583 mW.
https://doi.org/10.56977/jicce.2022.20.3.166

Non-Photorealistic Rendering Using CUDA-Based Image Segmentation (CUDA 기반 영상 분할을 사용한 비사실적 렌더링)

Yoon, Hyun-Cheol;Park, Jong-Seung
- KIPS Transactions on Software and Data Engineering / v.4 no.11 / pp.529-536 / 2015
When three-dimensional objects and photo images are rendered together, the non-photorealistic rendering (NPR) results are in visual discord because the two kinds of content have their own independent color distributions. This paper proposes a non-photorealistic rendering technique that renders both three-dimensional objects and photo images in styles such as cartoons and sketches. The proposed technique computes the color distribution property of the photo images and reduces the number of colors of both the photo images and the 3D objects. NPR is then performed based on the reduced colormaps and edge features. To enhance the natural scene presentation, an image region segmentation process is preferable when extracting and applying colormaps. However, image segmentation requires many computational operations, so non-photorealistic rendering takes a long time for large frames. To speed up the time-consuming segmentation procedure, we use GPGPU for parallel computing on the GPU. As a result, we significantly improve the execution speed of the algorithm.
https://doi.org/10.3745/KTSDE.2015.4.11.529

A Study on the Underwater Channel Model based on a High-Order Finite Difference Method using GPUs (그래픽 프로세서를 이용한 고차 유한 차분식 기반 수중채널모델 연구)

Bae, Ho Seuk;Kim, Won-Ki;Son, Su-Uk;Ha, Wansoo
- Journal of the Korea Society for Simulation / v.30 no.1 / pp.11-20 / 2021
As unmanned underwater systems have recently emerged, high-speed underwater channel modeling, one of the most important techniques in such systems, has received a lot of attention. In this paper, we propose a high-speed sound propagation model and verify its applicability through quantitative performance analyses. We used a high-order finite difference method (FDM) to model wave propagation in the water, and we adopted a domain decomposition method using multiple general-purpose graphics processing units (GPUs) to increase the calculation efficiency. We compared the results of the proposed model with the analytic solution in half-infinite media and with the results of the Virtual Timeseries Experiment (VirTEX) model, which is based on the ray method. Finally, we analyzed the performance of the model quantitatively using numerical examples. Through quantitative analyses of the improvement in computational performance, we confirmed that the computational speed increases linearly with the number of GPUs. The computation time increases by factors of 2 and 8, respectively, when the computational domain size and the maximum frequency are doubled. We expect that the proposed high-speed underwater channel modeling technique can contribute to the enhancement of national defense as an underwater communication channel model and as an analysis tool for developing underwater communication techniques for unmanned underwater systems.
https://doi.org/10.9709/JKSS.2021.30.1.011
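To give a concrete sense of what a high-order finite difference stencil looks like, the sketch below applies the standard fourth-order central difference for a second derivative on a uniform one-dimensional grid. The abstract does not state the order or dimensionality the authors used, so the fourth-order choice, the 1-D setting, and the function name `second_derivative` are assumptions made purely for illustration.

```python
import math

# Standard fourth-order central-difference weights for d2u/dx2,
# applied to the points u[i-2] .. u[i+2].
COEFFS = (-1.0 / 12.0, 4.0 / 3.0, -5.0 / 2.0, 4.0 / 3.0, -1.0 / 12.0)


def second_derivative(u, dx):
    """Approximate d2u/dx2 at interior points; the two edge points stay 0."""
    n = len(u)
    out = [0.0] * n
    for i in range(2, n - 2):
        acc = 0.0
        for k, c in enumerate(COEFFS):
            acc += c * u[i + k - 2]
        out[i] = acc / (dx * dx)
    return out


# Usage: the second derivative of sin(x) should be close to -sin(x).
dx = 0.01
xs = [i * dx for i in range(200)]
approx = second_derivative([math.sin(x) for x in xs], dx)
max_error = max(abs(approx[i] + math.sin(xs[i])) for i in range(2, len(xs) - 2))
```

The stencil reads two neighbors on each side, so in a domain decomposition each subdomain would need a two-point halo from its neighbors; how the paper exchanges such boundary data between GPUs is not described in the abstract.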

AB9: A neural processor for inference acceleration

Cho, Yong Cheol Peter;Chung, Jaehoon;Yang, Jeongmin;Lyuh, Chun-Gi;Kim, HyunMi;Kim, Chan;Ham, Je-seok;Choi, Minseok;Shin, Kyoungseon;Han, Jinho;Kwon, Youngsu
- ETRI Journal / v.42 no.4 / pp.491-504 / 2020
We present AB9, a neural processor for inference acceleration. AB9 consists of a systolic tensor core (STC) neural network accelerator designed to accelerate artificial intelligence applications by exploiting the data reuse and parallelism characteristics inherent in neural networks while providing fast access to large on-chip memory. Complementing the hardware is an intuitive and user-friendly development environment that includes a simulator and an implementation flow that provides a high degree of programmability with a short development time. Along with a 40-TFLOP STC that includes 32k arithmetic units and over 36 MB of on-chip SRAM, our baseline implementation of AB9 consists of a 1-GHz quad-core setup with other various industry-standard peripheral intellectual properties. The acceleration performance and power efficiency were evaluated using YOLOv2, and the results show that AB9 has superior performance and power efficiency to that of a general-purpose graphics processing unit implementation. AB9 has been taped out in the TSMC 28-nm process with a chip size of 17 × 23 mm². Delivery is expected later this year.
https://doi.org/10.4218/etrij.2020-0134
