• Title/Summary/Keyword: GPU Memory

Search Result 151, Processing Time 0.024 seconds

Improving the Rendering Speed of 3D Model Animation on Smart Phones

  • Ng, Cong Jie;Hwang, Gi-Hyun;Kang, Dae-Ki
    • Journal of information and communication convergence engineering
    • /
    • v.9 no.3
    • /
    • pp.266-270
    • /
    • 2011
  • The advancement of technology enables smart phones or handheld devices to render complex 3D graphics. However, the processing power and memory of smart phones remain very limited to render high polygon and details 3D models especially on games which requires animation, physic engine, or augmented reality. In this paper, several techniques will be introduced to speed up the computation and reducing the number of vertices of the 3D meshes without losing much detail.

A Study on In-memory based Distributed Frameworks for Deep Learning (인메모리 기반 딥러닝 기술을 위한 분산 프레임워크에 관한 연구)

  • Cho, Hyeyoung;Yu, Jung-Lok
    • Proceedings of the Korea Information Processing Society Conference
    • /
    • 2016.10a
    • /
    • pp.45-46
    • /
    • 2016
  • 최근 GPU를 비롯한 하드웨어의 성능이 급격이 증가하면서 인공지능, 딥러닝 기술에 대한 관심이 높아지고 있다. 또한 데이터가 더욱 방대해 지면서 대용량 데이터를 처리하고 위한 딥러닝 분산 프레임워크에 대한 필요성이 제기되고 있다. 이에 본 논문에서는 대규모의 분산 환경에서 딥러닝 고속 처리를 위한 분산 프레임워크를 비교 분석하였다. 특히 최근 주목받고 있는 인메모리 기반 분산 프레임워크인 Spark, SparkNet, HeteroSpark의 특징을 비교 분석하였다.

Reevaluating the overhead of data preparation for asymmetric multicore system on graphics processing

  • Pei, Songwen;Zhang, Junge;Jiang, Linhua;Kim, Myoung-Seo;Gaudiot, Jean-Luc
    • KSII Transactions on Internet and Information Systems (TIIS)
    • /
    • v.10 no.7
    • /
    • pp.3231-3244
    • /
    • 2016
  • As processor design has been transiting from homogeneous multicore processor to heterogeneous multicore processor, traditional Amdahl's law cannot meet the new challenges for asymmetric multicore system. In order to further investigate the impact factors related to the Overhead of Data Preparation (ODP) for Asymmetric multicore systems, we evaluate an asymmetric multicore system built with CPU-GPU by measuring the overheads of memory transfer, computing kernel, cache missing and synchronization. This paper demonstrates that decreasing the overhead of data preparation is a promising approach to improve the whole performance of heterogeneous system.

RAVIP: Real-Time AI Vision Platform for Heterogeneous Multi-Channel Video Stream

  • Lee, Jeonghun;Hwang, Kwang-il
    • Journal of Information Processing Systems
    • /
    • v.17 no.2
    • /
    • pp.227-241
    • /
    • 2021
  • Object detection techniques based on deep learning such as YOLO have high detection performance and precision in a single channel video stream. In order to expand to multiple channel object detection in real-time, however, high-performance hardware is required. In this paper, we propose a novel back-end server framework, a real-time AI vision platform (RAVIP), which can extend the object detection function from single channel to simultaneous multi-channels, which can work well even in low-end server hardware. RAVIP assembles appropriate component modules from the RODEM (real-time object detection module) Base to create per-channel instances for each channel, enabling efficient parallelization of object detection instances on limited hardware resources through continuous monitoring with respect to resource utilization. Through practical experiments, RAVIP shows that it is possible to optimize CPU, GPU, and memory utilization while performing object detection service in a multi-channel situation. In addition, it has been proven that RAVIP can provide object detection services with 25 FPS for all 16 channels at the same time.

Detecting A Crypto-mining Malware By Deep Learning Analysis

  • Aljehani, Shahad;Alsuwat, Hatim
    • International Journal of Computer Science & Network Security
    • /
    • v.22 no.6
    • /
    • pp.172-180
    • /
    • 2022
  • Crypto-mining malware (known as crypto-jacking) is a novel cyber-attack that exploits the victim's computing resources such as CPU and GPU to generate illegal cryptocurrency. The attacker get benefit from crypto-jacking by using someone else's mining hardware and their electricity power. This research focused on the possibility of detecting the potential crypto-mining malware in an environment by analyzing both static and dynamic approaches of deep learning. The Program Executable (PE) files were utilized with deep learning methods which are Long Short-Term Memory (LSTM). The finding revealed that LTSM outperformed both SVM and RF in static and dynamic approaches with percentage of 98% and 96%, respectively. Future studies will focus on detecting the malware using larger dataset to have more accurate and realistic results.

Text-to-speech with linear spectrogram prediction for quality and speed improvement (음질 및 속도 향상을 위한 선형 스펙트로그램 활용 Text-to-speech)

  • Yoon, Hyebin
    • Phonetics and Speech Sciences
    • /
    • v.13 no.3
    • /
    • pp.71-78
    • /
    • 2021
  • Most neural-network-based speech synthesis models utilize neural vocoders to convert mel-scaled spectrograms into high-quality, human-like voices. However, neural vocoders combined with mel-scaled spectrogram prediction models demand considerable computer memory and time during the training phase and are subject to slow inference speeds in an environment where GPU is not used. This problem does not arise in linear spectrogram prediction models, as they do not use neural vocoders, but these models suffer from low voice quality. As a solution, this paper proposes a Tacotron 2 and Transformer-based linear spectrogram prediction model that produces high-quality speech and does not use neural vocoders. Experiments suggest that this model can serve as the foundation of a high-quality text-to-speech model with fast inference speed.

Implementation of FPGA-based Accelerator for GRU Inference with Structured Compression (구조적 압축을 통한 FPGA 기반 GRU 추론 가속기 설계)

  • Chae, Byeong-Cheol
    • Journal of the Korea Institute of Information and Communication Engineering
    • /
    • v.26 no.6
    • /
    • pp.850-858
    • /
    • 2022
  • To deploy Gate Recurrent Units (GRU) on resource-constrained embedded devices, this paper presents a reconfigurable FPGA-based GRU accelerator that enables structured compression. Firstly, a dense GRU model is significantly reduced in size by hybrid quantization and structured top-k pruning. Secondly, the energy consumption on external memory access is greatly reduced by the proposed reuse computing pattern. Finally, the accelerator can handle a structured sparse model that benefits from the algorithm-hardware co-design workflows. Moreover, inference tasks can be flexibly performed using all functional dimensions, sequence length, and number of layers. Implemented on the Intel DE1-SoC FPGA, the proposed accelerator achieves 45.01 GOPs in a structured sparse GRU network without batching. Compared to the implementation of CPU and GPU, low-cost FPGA accelerator achieves 57 and 30x improvements in latency, 300 and 23.44x improvements in energy efficiency, respectively. Thus, the proposed accelerator is utilized as an early study of real-time embedded applications, demonstrating the potential for further development in the future.

The Advanced Rasterizer and Cache Memory Architecture for Latency Reduction Of 3D GPU (3차원 그래픽 가속기의 지연 감소를 위한 개선된 래스터라이져 및 캐쉬 메모리 구조 제안 및 실험)

  • Park Jin-Hong;Kim Il-San;Park Woo-Chan;Han Tack-Don
    • Proceedings of the Korean Information Science Society Conference
    • /
    • 2005.07a
    • /
    • pp.727-729
    • /
    • 2005
  • 현재 3차원 그래픽 가속기에서 성능 향상에 대한 문제점으로 대두되고 있는 것은 실제 화면에 그려지는 정보가 저장되는 프레임버퍼에 대한 접근 지연이다. 따라서 본 논문은 기존 픽셀 캐쉬가 포함된 래스터라이져 구조에서 캐쉬 읽기 접근 실패 시 발생하는 패널티와 이에 따른 프레임버퍼에 대한 지연이 발생하는 문제점을 개선하고자, 기존 래스터라이져를 래스터라이져와 합성기로 구분하고 그 사이에 캐쉬 읽기 접근 실패 시 프레임 버퍼에서 정보를 읽어오지 않는 깊이 캐쉬와 색상 캐쉬가 쌍을 이룬 픽셀 캐쉬 메모리 시스템으로 구성된 개선된 3차원 그래픽 가속기 구조을 제안하고 실험을 수행하였다. 실험 결과 제안하는 3차원 그래픽 가속기 구조가 기존 구조에 비해 캐쉬 접근 실패율이 약 $23\%$ 감소하였으며, 평균 메모리 접근 사이클이 $10\%-13\%$ 감소하였으며 이는 상당수의 프레임버퍼에 대한 접근 지연을 감소시킨 것이다. 합성기와 메모리 간의 대역폭은 약 $10\%$ 증가하지만 파이프라인의 작업에는 영향을 미치지는 않는다.

  • PDF

Binary Hashing CNN Features for Action Recognition

  • Li, Weisheng;Feng, Chen;Xiao, Bin;Chen, Yanquan
    • KSII Transactions on Internet and Information Systems (TIIS)
    • /
    • v.12 no.9
    • /
    • pp.4412-4428
    • /
    • 2018
  • The purpose of this work is to solve the problem of representing an entire video using Convolutional Neural Network (CNN) features for human action recognition. Recently, due to insufficient GPU memory, it has been difficult to take the whole video as the input of the CNN for end-to-end learning. A typical method is to use sampled video frames as inputs and corresponding labels as supervision. One major issue of this popular approach is that the local samples may not contain the information indicated by the global labels and sufficient motion information. To address this issue, we propose a binary hashing method to enhance the local feature extractors. First, we extract the local features and aggregate them into global features using maximum/minimum pooling. Second, we use the binary hashing method to capture the motion features. Finally, we concatenate the hashing features with global features using different normalization methods to train the classifier. Experimental results on the JHMDB and MPII-Cooking datasets show that, for these new local features, binary hashing mapping on the sparsely sampled features led to significant performance improvements.

Development of a Low-cost Industrial OCR System with an End-to-end Deep Learning Technology

  • Subedi, Bharat;Yunusov, Jahongir;Gaybulayev, Abdulaziz;Kim, Tae-Hyong
    • IEMEK Journal of Embedded Systems and Applications
    • /
    • v.15 no.2
    • /
    • pp.51-60
    • /
    • 2020
  • Optical character recognition (OCR) has been studied for decades because it is very useful in a variety of places. Nowadays, OCR's performance has improved significantly due to outstanding deep learning technology. Thus, there is an increasing demand for commercial-grade but affordable OCR systems. We have developed a low-cost, high-performance OCR system for the industry with the cheapest embedded developer kit that supports GPU acceleration. To achieve high accuracy for industrial use on limited computing resources, we chose a state-of-the-art text recognition algorithm that uses an end-to-end deep learning network as a baseline model. The model was then improved by replacing the feature extraction network with the best one suited to our conditions. Among the various candidate networks, EfficientNet-B3 has shown the best performance: excellent recognition accuracy with relatively low memory consumption. Besides, we have optimized the model written in TensorFlow's Python API using TensorFlow-TensorRT integration and TensorFlow's C++ API, respectively.