Search | Korea Science

Design of an Optimized GPGPU for Data Reuse in DeepLearning Convolution (딥러닝 합성곱에서 데이터 재사용에 최적화된 GPGPU 설계)

Nam, Ki-Hun;Lee, Kwang-Yeob;Jung, Jun-Mo
- Journal of IKEEE / v.25 no.4 / pp.664-671 / 2021
This paper proposes a GPGPU architecture that reduces the number of operations and memory accesses by applying a data reuse method to convolutional neural networks (CNNs). Convolution is a two-dimensional operation on a kernel and input data, performed by sliding the kernel over the input. Instead of loading the kernel from cache memory repeatedly, the proposed method keeps it in internal registers and reuses it until the convolution is complete. Because a GPGPU executes instructions in SIMT fashion, a serial operation method was applied to the convolution to increase the benefit of data reuse. For register-based data reuse, the kernel size was fixed at 4×4, and the GPGPU was designed around the warp size and register banks to support it effectively. To verify the performance of the designed GPGPU on CNNs, we implemented it on an FPGA, ran LeNet, and measured the performance on AlexNet by comparison using TensorFlow. The measured one-iteration training time on AlexNet is 0.468 s and the inference time is 0.135 s.
https://doi.org/10.7471/ikeee.2021.25.4.664
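
The register-based kernel reuse described in this abstract can be illustrated in software. The following is a minimal sketch, not the authors' hardware design: the function name `conv2d_valid` and the load counters are hypothetical, and a local tuple stands in for the register file.

```python
def conv2d_valid(image, kernel):
    """Slide `kernel` over `image` (lists of lists), stride 1, no padding.

    As is usual in CNNs, the kernel is not flipped (cross-correlation).
    """
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1

    # Load the kernel once into a flat local tuple. This stands in for the
    # register file: the 16 weights of a 4x4 kernel stay resident while the
    # window slides, instead of being fetched again for every output element.
    weights = tuple(w for row in kernel for w in row)
    kernel_loads = len(weights)

    output = []
    for r in range(out_h):
        out_row = []
        for c in range(out_w):
            acc = 0
            i = 0
            for dr in range(kh):
                row = image[r + dr]
                for dc in range(kw):
                    acc += row[c + dc] * weights[i]
                    i += 1
            out_row.append(acc)
        output.append(out_row)

    # A scheme without reuse would fetch every weight for each output element.
    naive_loads = out_h * out_w * kh * kw
    return output, kernel_loads, naive_loads
```

For a 5×5 input and a 4×4 kernel, the reused kernel is loaded 16 times in total, while a scheme that refetches weights per output element would issue 64 loads; the gap grows with the output size.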

A Real-Time JPEG2000 Codec Implementation on ARM9 Processor (ARM9 프로세서용 실시간 JPEG2000 코덱의 구현)

Kim, Young-Tae;Cho, Shi-Won;Lee, Dong-Wook
- Journal of the Institute of Convergence Signal Processing / v.8 no.3 / pp.149-155 / 2007
In this paper, we propose a real-time implementation of a JPEG2000 codec on the ARM9 processor. The codec separates control code from data-management code in order to use system resources such as the processor and memory effectively. In embedded settings such as cellular phones, it is especially important to provide good service with a limited processor and limited internal memory. Because ARM9-series processors do not provide floating-point support, operations that require highly repetitive floating-point computation, such as the DWT (discrete wavelet transform), take a large amount of computation time. The proposed codec was programmed using fixed-point arithmetic to overcome this weakness. Cache-aware code optimization was also applied to further improve computation speed.
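
To make the fixed-point idea concrete, here is a minimal sketch of one DWT lifting step computed with integers only. The abstract does not state the Q format or the filter used, so the 14 fractional bits, the helper names, and the choice of the first CDF 9/7 lifting coefficient are assumptions made for illustration.

```python
FRAC_BITS = 14              # fractional bits of the Q format (assumed)
ONE = 1 << FRAC_BITS


def to_fixed(x):
    """Convert a float constant to a scaled integer."""
    return int(round(x * ONE))


def fixed_mul(sample, coeff):
    """Multiply an integer sample by a fixed-point coefficient.

    Adding half an LSB before the shift rounds to nearest.
    """
    return (sample * coeff + (ONE >> 1)) >> FRAC_BITS


# First lifting coefficient of the CDF 9/7 wavelet, stored as an integer.
ALPHA = to_fixed(-1.586134342)


def predict_step(samples):
    """Compute d[i] = odd[i] + alpha * (even[i] + even[i+1]) with integers."""
    even = samples[0::2]
    odd = samples[1::2]
    detail = []
    for i, o in enumerate(odd):
        # Mirror the last even sample at the right edge (symmetric extension).
        right = even[i + 1] if i + 1 < len(even) else even[i]
        detail.append(o + fixed_mul(even[i] + right, ALPHA))
    return detail
```

Each multiplication becomes one integer multiply, one add, and one shift, which is the kind of operation a processor without floating-point support handles natively. The result stays within one unit of the floating-point value for sample ranges like those of image pixels.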

Functionality-based Processing-In-Memory Accelerator for Deep Neural Networks (딥뉴럴네트워크를 위한 기능성 기반의 핌 가속기)

Kim, Min-Jae;Kim, Shin-Dug
- Proceedings of the Korea Information Processing Society Conference / 2020.11a / pp.8-11 / 2020
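With the arrival of the Fourth Industrial Revolution and the ongoing convergence of AI and ICT technologies, requests for AI services have become a reality even on user-level devices. Image-processing AI services are used for subject identification, defect inspection, autonomous driving, and more, and Deep Convolutional Neural Networks (DCNNs) in particular excel at capturing the features of an image. However, as images grow larger and networks grow deeper, computation suffers from low data locality and frequent memory references. As a result, conventional hierarchical system architectures are limited in how scalably and quickly they can process DCNNs. This study proposes a Processing-In-Memory (PIM) accelerator built on a 3D memory structure for scalable, fast DCNN processing. To this end, hardware and software modules were added to an existing 3D memory, the Hybrid Memory Cube (HMC). Specifically, we built a shared cache and software stack that let Processing Elements (PEs) share data, together with a pipelined multiplier and dual prefetch buffers. Evaluated on the well-known DCNNs LeNet, AlexNet, ZFNet, VGGNet, GoogLeNet, and ResNet, the design showed a 40.3% speedup and a 29.4% bandwidth improvement over the baseline HMC.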
https://doi.org/10.3745/PKIPS.y2020m11a.8

Count-Min HyperLogLog : Cardinality Estimation Algorithm for Big Network Data (Count-Min HyperLogLog : 네트워크 빅데이터를 위한 카디널리티 추정 알고리즘)

Sinjung Kang;DaeHun Nyang
- Journal of the Korea Institute of Information Security & Cryptology / v.33 no.3 / pp.427-435 / 2023
Cardinality estimation is used in a wide range of applications and is a fundamental problem in processing large volumes of data. As the internet moves into the era of big data, functions that perform cardinality estimation can use only on-chip cache memory. Various methods have been proposed to use this memory efficiently. However, these algorithms lose accuracy because of noise between estimators, the per-flow data structures. In this paper, we focus on minimizing this noise. We propose using multiple data structures, so that each estimator yields as many estimates as there are structures, and choosing the minimum estimate, which is the one with the least noise. Experiments show that the proposed algorithm outperforms the best existing work under the same tight memory budget, such as 1 bit per flow.
https://doi.org/10.13089/JKIISC.2023.33.3.427
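
The "several shared estimators, take the minimum" idea in this abstract can be sketched by combining a count-min layout with textbook HyperLogLog cells. This is an illustrative sketch under assumptions, not the paper's algorithm: the class names, the `depth`, `width`, and `p` parameters, and the SHA-256 based hash are all hypothetical, and the memory footprint is far larger than the 1 bit per flow the authors report.

```python
import hashlib
import math


def hash64(data, seed):
    """Deterministic 64-bit hash of a string, varied by a seed."""
    digest = hashlib.sha256(f"{seed}:{data}".encode()).digest()
    return int.from_bytes(digest[:8], "big")


class HyperLogLog:
    def __init__(self, p=8):
        self.p = p
        self.m = 1 << p
        self.registers = [0] * self.m

    def add(self, item):
        x = hash64(item, "hll")
        idx = x >> (64 - self.p)                      # first p bits pick a register
        rest = x & ((1 << (64 - self.p)) - 1)         # remaining 64 - p bits
        rank = (64 - self.p) - rest.bit_length() + 1  # position of leftmost 1-bit
        if rank > self.registers[idx]:
            self.registers[idx] = rank

    def estimate(self):
        m = self.m
        alpha = 0.7213 / (1 + 1.079 / m)              # bias constant for m >= 128
        raw = alpha * m * m / sum(2.0 ** -r for r in self.registers)
        zeros = self.registers.count(0)
        if raw <= 2.5 * m and zeros:
            return m * math.log(m / zeros)            # small-range correction
        return raw


class CountMinHLL:
    """`depth` rows of `width` shared HLL cells; a flow maps to one cell per row."""

    def __init__(self, depth=3, width=64, p=8):
        self.width = width
        self.rows = [[HyperLogLog(p) for _ in range(width)] for _ in range(depth)]

    def _cells(self, flow):
        for seed, row in enumerate(self.rows):
            yield row[hash64(flow, seed) % self.width]

    def add(self, flow, element):
        for cell in self._cells(flow):
            cell.add(element)

    def estimate(self, flow):
        # Colliding flows can only inflate a cell, so the smallest
        # estimate across rows is the one with the least noise.
        return min(cell.estimate() for cell in self._cells(flow))
```

Because flows that collide in a cell only add elements to it, every row tends to overestimate, and the minimum over rows discards the rows with the most collisions.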

A Performance Study on CPU-GPU Data Transfers of Unified Memory Device (통합메모리 장치에서 CPU-GPU 데이터 전송성능 연구)

Kwon, Oh-Kyoung;Gu, Gibeom
- KIPS Transactions on Computer and Communication Systems / v.11 no.5 / pp.133-138 / 2022
As GPU performance has improved, GPUs have become more common in HPC and artificial intelligence, but GPU programming remains a major obstacle to productivity. In particular, because managing host memory and GPU memory separately is difficult, research on both convenience and performance is active, and various CPU-GPU memory transfer programming methods have been proposed. Meanwhile, many SoC (System on a Chip) products that bundle a CPU, a GPU, and integrated memory into one large silicon package, such as Apple M1 and NVIDIA Tegra, have recently emerged. This study examines the performance of data transfers between the CPU and GPU on such an integrated memory device, which behaves differently from the conventional environment in which host memory and GPU memory are separate. We compare the performance of each CPU-GPU data transfer method on NVIDIA SoC chips, which are integrated memory devices, and on NVIDIA SXM-based V100 GPU devices. The experimental workload is a two-dimensional matrix transposition example frequently used in HPC applications. We analyzed the following performance factors: the difference in GPU kernel performance by CPU-GPU memory transfer method on each GPU device, the difference in transfer performance between page-locked memory and pageable memory, overall performance, and performance by workload size. The experiments confirm that the NVIDIA Xavier can maximize the benefits of integrated memory in the SoC by supporting I/O cache coherence.
https://doi.org/10.3745/KTCCS.2022.11.5.133

Study on the methods of extracting Electrical parameters on PCB design process (PCB 설계에서 기판의 전기적 파라미터 추출 기법 고찰)

최순신
- Journal of the Korea Computer Industry Society / v.2 no.12 / pp.1533-1540 / 2001
In this paper, we describe a method for extracting electrical parameters and a method for modeling PCB nets in the PCB design process. To analyze the electrical characteristics of a real PCB structure, we selected a cache memory system as the experimental board and designed a 6-layer PCB substrate. To extract the electrical parameters, we divided the circuit elements into conductor types, such as wires, via holes, and BGA balls, and combined the calculated values according to the real net structure to model the PCB nets. We analyzed the electrical characteristics of the PCB nets with the SPICE and XNS simulation tools. The simulation showed a maximum signal delay of 2.6 ns and a maximum crosstalk noise of 281 mV, and we found that the designed substrate met the system specification.

Cache Simulator Design for Optimizing Write Operations of Nonvolatile Memory Based Caches (비휘발성 메모리 기반 캐시의 쓰기 작업 최적화를 위한 캐시 시뮬레이터 설계)

Joo, Yongsoo;Kim, Myeung-Heo;Han, In-Kyu;Lim, Sung-Soo
- IEMEK Journal of Embedded Systems and Applications / v.11 no.2 / pp.87-95 / 2016
Nonvolatile memory (NVM) is being considered as an alternative to traditional memory devices such as SRAM and DRAM, which suffer from various limitations due to the technology scaling of modern integrated circuits. Although NVMs have advantages including nonvolatility, low leakage current, and high density, their inferior write performance in terms of energy and endurance is a major challenge to the successful design of NVM-based memory systems. Overcoming this drawback requires extensive research into energy- and endurance-aware optimization techniques for NVM-based memory systems. However, researchers have had difficulty finding a suitable simulation tool to prototype and evaluate new NVM optimization schemes, because existing simulation tools do not consider the characteristics of NVM devices. In this article, we introduce an NVM-based cache simulator that supports rapid prototyping and evaluation of NVM-based caches, as well as energy- and endurance-aware NVM cache optimization schemes. We demonstrate that the proposed NVM cache simulator can easily prototype a PRAM cache and a PRAM+STT-RAM hybrid cache, and can evaluate various write traffic reduction schemes and wear leveling schemes.
https://doi.org/10.14372/IEMEK.2016.11.2.87

NVM-based Write Amplification Reduction to Avoid Performance Fluctuation of Flash Storage (플래시 스토리지의 성능 지연 방지를 위한 비휘발성램 기반 쓰기 증폭 감소 기법)

Lee, Eunji;Jeong, Minseong;Bahn, Hyokyung
- The Journal of the Institute of Internet, Broadcasting and Communication / v.16 no.4 / pp.15-20 / 2016
Write amplification is a critical factor that limits the stable performance of flash-based storage systems. To reduce write amplification, this paper presents a new technique that cooperatively manages data in flash storage and nonvolatile memory (NVM). Our scheme treats NVM as a cache for flash storage, but allows the original data in flash storage to be invalidated if there is a cached copy in NVM, which can temporarily serve as the original data. This scheme eliminates the copy-out operation for a substantial amount of cached data, thereby enhancing garbage collection efficiency. Experimental results show that the proposed scheme reduces the copy-out overhead of garbage collection by 51.4% and decreases the standard deviation of response time by 35.4% on average.
https://doi.org/10.7236/JIIBC.2016.16.4.15

Improving Log-Structured File System Performance by Utilizing Non-Volatile Memory (비휘발성 메모리를 이용한 로그 구조 파일 시스템의 성능 향상)

Kang, Yang-Wook;Choi, Jong-Moo;Lee, Dong-Hee;Noh, Sam-H.
- Journal of KIISE:Computing Practices and Letters / v.14 no.5 / pp.537-541 / 2008
Log-Structured File System (LFS) is a disk-based file system optimized for write performance. LFS gathers dirty data in memory for as long as possible and flushes all of it sequentially at once. In a real system, however, dirty data held in memory must be flushed to disk to maintain file system consistency, even if more memory is still available. These synchronizations increase the cleaner overhead of LFS and cause LFS to write more metadata to disk. In this paper, we use non-volatile RAM (NV-RAM) and modify LFS and the virtual memory subsystem to guarantee that LFS can gather enough dirty data in memory and reduce small disk writes. As a result, we improve the performance of LFS by around 2.5 times over the original LFS.

Bit-Map Based Hybrid Fast IP Lookup Technique (비트-맵 기반의 혼합형 고속 IP 검색 기법)

Oh Seung-Hyun
- Journal of Korea Multimedia Society / v.9 no.2 / pp.244-254 / 2006
This paper presents an efficient hybrid technique that compacts the trie indexing a huge forwarding table until it is small enough to be stored in cache, speeding up IP lookup. It combines two techniques, an encoding scheme called bit-map and a controlled-prefix expansion scheme, to replace slow memory searches with a few fast-memory accesses and computations. For compaction, the bit-map represents each index and each child pointer with one bit. For example, when one node denotes n bits, the bit-map achieves a high compression rate by consuming $2^{n-1}$ bits for the $2^n$ index and child link pointers branching out of the node. The controlled-prefix expansion scheme determines the number of address bits represented by the root nodes at each trie level. It uses dynamic programming to obtain the smallest trie memory size for a given number of trie levels. By presenting the optimal memory size and lookup speed for each number of trie levels, this paper proposes a criterion for choosing a suitable trie structure according to the system's memory size and the required IP lookup speed.
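
The general idea of replacing per-slot child pointers with one bit each can be sketched as follows. This is a generic bitmap-compressed trie node, offered as an assumption-laden illustration: the `BitmapNode` class and its methods are hypothetical, and it spends $2^n$ bits per node rather than reproducing the paper's exact encoding.

```python
class BitmapNode:
    """Trie node with stride n: a 2**n-bit bitmap plus a packed child list."""

    def __init__(self, stride):
        self.stride = stride
        self.bitmap = 0      # bit i is set when slot i has a child
        self.children = []   # only the present children, ordered by slot

    def _rank(self, slot):
        # The number of set bits below `slot` is the child's position
        # in the packed list.
        return bin(self.bitmap & ((1 << slot) - 1)).count("1")

    def insert(self, slot, child):
        pos = self._rank(slot)
        if (self.bitmap >> slot) & 1:
            self.children[pos] = child       # slot already occupied: overwrite
        else:
            self.bitmap |= 1 << slot
            self.children.insert(pos, child)

    def lookup(self, slot):
        if not (self.bitmap >> slot) & 1:
            return None
        return self.children[self._rank(slot)]
```

A lookup costs one bit test and one population count instead of a read from a mostly empty pointer array, which is what lets a sparse node fit in cache.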

Search Result 242, Processing Time 0.024 seconds