• Title/Summary/Keyword: CUDA(CUDA)

Evaluation of GPU Computing Capacity for All-in-view GNSS SDR Implementation

  • Yun Sub, Choi;Hung Seok, Seo;Young Baek, Kim
    • Journal of Positioning, Navigation, and Timing / v.12 no.1 / pp.75-81 / 2023
  • This study designs an optimized Graphics Processing Unit (GPU)-based GNSS signal processing technique, with the goal of implementing a GNSS Software Defined Receiver (SDR) that can operate in real-time all-in-view mode in a multi-constellation, multi-frequency signal environment. In the proposed structure, the correlators of the existing GNSS SDR are processed by the GPU. We designed a memory structure and processing method that minimize memory access bottlenecks and optimize the distribution of GPU memory resources. The designed GNSS SDR can select and operate on only the desired GNSS or the desired satellite signals according to user input. Parameters such as the number of quantization bits, the sampling rate, and the number of signal tracking arms can also be selected. The computing capability of the designed GPU-based GNSS SDR was evaluated, and it was confirmed that up to 2400 channels can be processed in real time. The GPU-based GNSS SDR therefore has sufficient performance to operate in real-time all-in-view mode. In future studies, it will be used for more diverse GNSS signal processing and applied to multipath effect analysis using more tracking arms.

Fundamental Function Design of Real-Time Unmanned Monitoring System Applying YOLOv5s on NVIDIA TX2™ AI Edge Computing Platform

  • LEE, SI HYUN
    • International journal of advanced smart convergence / v.11 no.2 / pp.22-29 / 2022
  • In this paper, to design the fundamental function of an unmanned monitoring system that can detect objects in real time, the YOLOv5s (small) object detection model was applied on the NVIDIA TX2™ AI (Artificial Intelligence) edge computing platform. YOLOv5s was chosen for our real-time unmanned monitoring system on the basis of a performance evaluation of object detection algorithms (for example, R-CNN, SSD, RetinaNet, and YOLOv5). In addition, the performance of the four YOLOv5 models (small, medium, large, and xlarge) was compared and evaluated. Based on these results, the YOLOv5s model, which suits the design purpose of this paper, was ported to the NVIDIA TX2™ AI edge computing system, and it was confirmed to operate normally. The resulting real-time unmanned monitoring system can be applied to various fields, such as security or monitoring systems. Future research will apply NMS (Non-Maximum Suppression) modification, model reconstruction, and parallel programming techniques using CUDA (Compute Unified Device Architecture) to improve object detection speed and performance.

Implementation of GPU Based Polymorphic Worm Detection Method and Its Performance Analysis on Different GPU Platforms (GPU를 이용한 Polymorphic worm 탐지 기법 구현 및 GPU 플랫폼에 따른 성능비교)

  • Lee, Sunwon;Song, Chihwan;Lee, Injoon;Joh, Taewon;Kang, Jaewoo
    • Proceedings of the Korea Information Processing Society Conference / 2010.11a / pp.1458-1461 / 2010
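  • The damage caused by malicious code grows every year, as illustrated by the DDoS attack of July 7 last year. Polymorphic worms are especially dangerous because existing methods have difficulty detecting them during the first attack. This study introduces local alignment, a method used in bioinformatics to find similarities and characteristic features among genes, and applies it to polymorphic worm detection. Parallelizing the execution and modifying the algorithm overcome the $O(n^4)$ running time that is the drawback of the existing algorithm. Parallelization was carried out with CUDA programming on NVIDIA GPUs and with OpenCL programming on AMD GPUs. The performance of the local-alignment-based polymorphic worm detection algorithm was then compared across the two GPGPU platforms.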
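
As a rough illustration of what local alignment computes, the sketch below scores two sequences with the classical Smith-Waterman recurrence. It is an assumption that this textbook form is a fair stand-in: the abstract does not say which local alignment variant the authors use, and it describes a modified, parallel algorithm that is not reproduced here. The function name and the scoring parameters (`match`, `mismatch`, `gap`) are hypothetical.

```python
def smith_waterman(a, b, match=2, mismatch=-1, gap=-1):
    """Return (best_score, end_i, end_j) for the best local alignment of a and b.

    Textbook Smith-Waterman with a linear gap penalty; the scoring values
    are illustrative assumptions, not taken from the paper.
    """
    rows, cols = len(a) + 1, len(b) + 1
    # score[i][j] = best score of a local alignment ending at a[i-1], b[j-1]
    score = [[0] * cols for _ in range(rows)]
    best = (0, 0, 0)
    for i in range(1, rows):
        for j in range(1, cols):
            diag = score[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            up = score[i - 1][j] + gap
            left = score[i][j - 1] + gap
            # Clamping at 0 is what makes the alignment local rather than global.
            score[i][j] = max(0, diag, up, left)
            if score[i][j] > best[0]:
                best = (score[i][j], i, j)
    return best


# Two payloads that share the invariant fragment "XYZ".
print(smith_waterman("AAAXYZ", "BBXYZC"))  # (6, 6, 5)
```

Each cell depends only on its upper, left, and upper-left neighbours, so all cells on one anti-diagonal can be computed independently. That property is what makes this kind of recurrence a natural candidate for GPU parallelization in general.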

Performance Improvement in HTTP Packet Extraction from Network Traffic using GPGPU (GPGPU 를 이용한 네트워크 트래픽에서의 HTTP 패킷 추출 성능 향상)

  • Han, SangWoon;Kim, Hyogon
    • Proceedings of the Korea Information Processing Society Conference / 2011.11a / pp.718-721 / 2011
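  • Real-time analysis of HTTP (Hypertext Transfer Protocol) traffic, used to detect or block DDoS (Distributed Denial-of-Service) attacks and harmful traffic aimed at web services, is an essential feature of almost every network traffic security solution. However, as the volume of HTTP traffic measured in real time grows exponentially over time, the performance burden of analyzing HTTP traffic packet by packet in real time keeps increasing. Because this burden can no longer be relieved at the application level, most attempts to address it rely on costly software accelerators or on dedicated, hardware-dependent equipment. Building on research into GPGPU (General-Purpose computation on Graphics Processing Units), which puts the GPU (Graphics Processing Unit) of the graphics card found in most current PCs to general-purpose use, this paper uses NVIDIA's CUDA (Compute Unified Device Architecture) to improve, at the application level, the performance of extracting HTTP packets from network traffic. Considering the HTTP packet extraction operation alone, the GPU achieved more than 10 times the performance of the CPU.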

Color2Gray using Conventional Approaches in Black-and-White Photography (전통적 사진 기법에 기반한 컬러 영상의 흑백 변환)

  • Jang, Hyuk-Su;Choi, Min-Gyu
    • Journal of the Korea Computer Graphics Society / v.14 no.3 / pp.1-9 / 2008
  • This paper presents a novel optimization-based, saliency-preserving method for converting color images to grayscale in a manner consistent with the conventional approaches of black-and-white photographers. In black-and-white photography, a colored filter called a contrast filter is commonly mounted on the camera to lighten or darken selected colors. In addition, local exposure controls such as dodging and burning are typically used in the darkroom to change the exposure of local areas of the print without affecting the overall exposure. Our method seeks a digital version of a conventional contrast filter that preserves visually important image features. Conventional burning and dodging techniques are also addressed, together with image similarity weights, to give edge-aware local exposure control over the image space. The method can be optimized efficiently on a GPU. In our experiments, a CUDA implementation converts 1-megapixel color images to grayscale at interactive frame rates.

GPU-Based Acceleration of Quantum-Inspired Evolutionary Algorithm (GPU를 이용한 Quantum-Inspired Evolutionary Algorithm 가속)

  • Ryoo, Ji-Hyun;Park, Han-Min;Choi, Ki-Young
    • Journal of the Institute of Electronics Engineers of Korea SD / v.49 no.8 / pp.1-9 / 2012
  • The Quantum-Inspired Evolutionary Algorithm (QEA) contains enough data-level parallelism to be accelerated naturally on GPUs. To reduce execution time efficiently, however, tasks must be mapped carefully so that the mapping reflects the characteristics of the CPU and the GPU. When deciding which part of the application should run on the GPU, we need to consider the data transfer between the CPU and GPU memory spaces as well as the data-level parallelism. In addition, the use of zero-copy host memory, a proper choice of execution configuration, and a thread organization that takes memory coalescing into account are important for reducing the execution time further. With all these techniques, QEA ran 3.69 times faster on average than a multi-threaded CPU implementation for the 0-1 knapsack problem with 30,000 items.

Performance of the Finite Difference Method Using Cache and Shared Memory for Massively Parallel Systems (대규모 병렬 시스템에서 캐시와 공유메모리를 이용한 유한 차분법 성능)

  • Kim, Hyun Kyu;Lee, Hyo Jong
    • Journal of the Institute of Electronics and Information Engineers / v.50 no.4 / pp.108-116 / 2013
  • Many algorithms have been introduced to improve performance on massively parallel systems, which consist of several hundred processors. A typical example is a GPU system with many processors that use shared memory. For image filtering algorithms, which reference neighboring points, shared memory helps improve performance because adjacent pixels are accessed frequently. However, using shared memory requires rewriting existing code and consequently makes the code more complex. Recent GPU systems support both L1 and L2 caches along with shared memory. Since the L1 cache is located in the same area as the shared memory, a performance improvement can also be expected from using the cache. In this paper, the performance of cache and shared memory was compared. The performance of the cache-based algorithm is very similar to that of the shared-memory algorithm, while the cache-based algorithm avoids the code complexity that shared memory introduces.

An Analytical Model for Performance Prediction of AES on GPU Architecture (GPU 아키텍처의 AES 암호화 성능 예측 분석 모델)

  • Kim, Kyuwoon;Kim, Hyunwoo;Kim, Huijeong;Huh, Taeyoung;Jung, Sanghyuk;Song, Yong Ho
    • Journal of the Institute of Electronics and Information Engineers / v.50 no.4 / pp.89-96 / 2013
  • The graphics processing unit (GPU) has evolved to process not only graphics data but also general-purpose data. It outperforms the CPU on 3D graphics algorithms and parallel programs. To run an algorithm written for the CPU on a GPU, one must understand the GPU architecture and rewrite the program with the GPU's parallel processing capability and memory model in mind. For these reasons, a performance prediction model for the algorithm, and the performance it predicts on a GPU system, are required. Such a model can anticipate problems in GPU application development or serve as a basis for a GPU performance evaluation standard. In this paper, we applied the AES encryption algorithm to our performance model and achieved highly accurate performance prediction under a heavy workload.

Fast Generation of Digital Video Holograms Using Multiple PCs (다수의 PC를 이용한 디지털 비디오 홀로그램의 고속 생성)

  • Park, Hanhoon;Kim, Changseob;Park, Jong-Il
    • Journal of Broadcast Engineering / v.22 no.4 / pp.509-518 / 2017
  • High-resolution digital holograms can be generated quickly by a PC cluster that is based on a server-client architecture and composed of several GPU-equipped PCs. However, the data transmission time between PCs becomes a major obstacle to fast generation of video holograms because it increases linearly with the number of frames. To address this growth in data transmission time, this paper proposes a multi-threading-based method. Hologram generation on each client PC consists of three processes: acquisition of light sources, the CGH operation on the GPUs, and transmission of the result to the server PC. Unlike the previous method, which executes these processes sequentially, the proposed method executes them in parallel using multi-threading and can therefore significantly reduce the share of the total hologram generation time taken by data transmission. Experiments confirmed that the total generation time of a high-resolution video hologram with 150 frames can be reduced by about 30%.
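
The three-process structure described in this abstract (acquisition, CGH computation, transmission) can be sketched as a thread-per-stage pipeline. This is a minimal sketch and not the authors' implementation: `run_pipeline`, the stage callables, and the queue size of 2 are all hypothetical, and the stand-in lambdas replace the real light-source acquisition, GPU computation, and network transfer.

```python
import queue
import threading


def run_pipeline(frames, acquire, compute, transmit):
    """Run three stages in separate threads, connected by bounded queues.

    While frame k is being transmitted, frame k+1 can already be computed
    and frame k+2 acquired. Error handling is omitted for brevity.
    """
    q_acquired = queue.Queue(maxsize=2)   # acquire -> compute
    q_computed = queue.Queue(maxsize=2)   # compute -> transmit
    done = object()                       # sentinel marking end of stream
    sent = []

    def stage_acquire():
        for frame in frames:
            q_acquired.put(acquire(frame))
        q_acquired.put(done)

    def stage_compute():
        while True:
            item = q_acquired.get()
            if item is done:
                q_computed.put(done)
                return
            q_computed.put(compute(item))

    def stage_transmit():
        while True:
            item = q_computed.get()
            if item is done:
                return
            sent.append(transmit(item))

    threads = [threading.Thread(target=stage)
               for stage in (stage_acquire, stage_compute, stage_transmit)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return sent


# Stand-in stages: identity "acquisition", squaring "computation", +1 "transmission".
print(run_pipeline(range(5), lambda f: f, lambda x: x * x, lambda y: y + 1))
# [1, 2, 5, 10, 17]
```

With one thread per stage and FIFO queues, frames leave the pipeline in the order they entered. The bounded queues keep a fast stage from running arbitrarily far ahead of a slow one.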

Research on DNN Modeling using Feature Selection on Frequency Domain for Vital Reaction of Breeding Pig (모돈 생체 반응 신호의 주파수 영역 Feature selection을 통한 DNN 모델링 연구)

  • Cho, Jinho;Oh, Jong-woo;Lee, DongHoon
    • Proceedings of the Korean Society for Agricultural Machinery Conference / 2017.04a / pp.166-166 / 2017
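  • We are conducting research to quantify the health condition of breeding sows as a numerical index. Multiple ultrasonic sensors were used to analyze movement characteristics such as lameness (hoof and limb disorders), poor feeding, and sleep patterns. Frequency-domain analysis was attempted while analyzing the time-series measurement signals to derive the quantitative index. In this process, nonlinear modeling was performed to overcome deviations that depend on the frequency-domain resolution. In addition, if the correlation between adjacent time-series data segments can be analyzed, it should be possible to overcome the delay caused by real-time processing of large volumes of data and to diagnose the expected prognosis early. This study used a deep learning system that combines TensorFlow, provided by Google, with the CUDA engine provided by NVIDIA. For preprocessing, the data sets for each frequency resolution (2, 3, 5, 7, 11, 13, 17, and 19 minutes) were taken as the first stage and the magnitudes of the ten highest-ranked power spectral densities as the second stage, so that a total of 2 to 10 input nodes were selected sequentially; the power spectral densities of adjacent time series were specified in the same manner while varying the rank. Using softmax regression with a multilayer convolutional network, a representative deep learning model, a total of 648 recursive feature selection cases ($8 \times 9 \times 9$) were selected, and the number of epochs was set to 10,000. In calibration modeling, training for a case was stopped once its cost function fell to 10% or below, and to cross-validate the models against one another, a verification test was performed on ${}_8C_2 \times {}_8C_2 \times {}_8C_2$ cases. During calibration, the cost function value was 10% or below in every case, but in the verification test the coefficient of determination was $r^2$ < 0.5 in every case. This can be read as a clear demonstration of the limits of an overfitted deep learning model. Continued and further consideration of suitable feature selection and deep learning models is needed to resolve the overfitting, along with consideration of indexing and quantization before and after the input and output nodes to obtain an effective and usable classification. On this basis, we will continue research on intelligent on-site diagnosis technology for quantifying sow biometric information.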
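
The case counts quoted in this abstract can be checked with a few lines of arithmetic. The variable names are hypothetical, and reading each factor of 9 as the nine possible input-node counts (2 through 10) is an interpretation, since the abstract does not label the factors.

```python
from math import comb

# Frequency resolutions in minutes, as listed in the abstract (8 values).
resolutions = [2, 3, 5, 7, 11, 13, 17, 19]

# Assumption: each factor of 9 is the number of choices from 2 to 10 inclusive.
node_counts = range(2, 11)

selection_cases = len(resolutions) * len(node_counts) * len(node_counts)
verification_cases = comb(8, 2) ** 3   # 8C2 x 8C2 x 8C2

print(selection_cases)      # 648
print(comb(8, 2))           # 28
print(verification_cases)   # 21952
```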