Search | Korea Science

Resolving Memory Bottlenecks in Hardware Accelerators with Data Prefetch

Hyein Lee;Jinoo Joung
- Journal of the Korea Society of Computer and Information, v.29 no.6, pp.1-12, 2024
Faster and more accurate deep learning requires large amounts of storage and computation, so many studies use hardware accelerators for fast, accurate calculation. However, data movement between the hardware accelerator and the CPU creates a performance bottleneck. In this paper, we propose a data prefetch strategy that efficiently reduces this bottleneck. The core idea is to predict the data needed for the next task and load it into local memory while the hardware accelerator (Matrix Multiplication Unit, MMU) performs the current task. The strategy can be enhanced with a dual buffer that lets read and write operations proceed simultaneously, which reduces the latency and execution time of data transfer. Through simulations, we demonstrate a 24% improvement in hardware accelerator performance, achieved by maximizing parallel processing with dual buffers and reducing bottlenecks between memories with data prefetch.
https://doi.org/10.9708/jksci.2024.29.06.001
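
The overlap described in this abstract can be illustrated with a small timing model. This is a minimal sketch, not the authors' simulator: the per-tile load and compute times are made-up numbers, and the names `serial_time` and `prefetch_time` are hypothetical. It assumes two local buffers, so the load of tile i+1 can run while the MMU computes tile i.

```python
def serial_time(load, compute):
    """Total time when each tile is loaded and then computed, with no overlap."""
    return sum(load) + sum(compute)


def prefetch_time(load, compute):
    """Total time with double buffering: tile i+1 loads while tile i computes."""
    if not load:
        return 0
    total = load[0]  # the first load cannot be hidden behind any compute
    for i in range(len(compute)):
        next_load = load[i + 1] if i + 1 < len(load) else 0
        # A stage ends only when both the compute and the overlapped load finish.
        total += max(compute[i], next_load)
    return total


load = [4, 4, 4, 4]      # hypothetical per-tile transfer times
compute = [6, 6, 6, 6]   # hypothetical per-tile MMU compute times
print(serial_time(load, compute), prefetch_time(load, compute))  # prints "40 28"
```

The saving in this toy model depends entirely on the chosen numbers; the 24% figure in the abstract comes from the paper's own simulations.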

Multi-Dimensional Traveling Salesman Problem Scheme Using Top-n Skyline Query (Top-n 스카이라인 질의를 이용한 다차원 외판원 순회문제 기법)

Jin, ChangGyun;Oh, Dukshin;Kim, Jongwan
- KIPS Transactions on Software and Data Engineering, v.9 no.1, pp.17-24, 2020
The traveling salesman problem (TSP) asks for the shortest route on which a salesman visits each city and returns to the starting city. Because of its exponential time complexity, TSP is hard to apply to settings such as amusement parks or delivery. TSP also struggles to meet user demands that involve multi-dimensional attributes such as travel time, interests, and waiting time, because it uses only one attribute, the distance between nodes. This paper proposes the Top-n Skyline-Multi Dimension TSP (TS-MDT) to resolve these problems. The proposed algorithm finds the shortest route faster than the existing method by selecting multi-dimensional nodes according to skyline dominance, which decreases the number of operations. In simulation, we compared the computation time of a dynamic programming algorithm with that of the proposed TS-MDT algorithm, and TS-MDT was faster.
https://doi.org/10.3745/KTSDE.2020.9.1.17
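
The node-selection step in this abstract relies on skyline dominance, which can be sketched as follows. This is an illustration under assumptions, not the TS-MDT algorithm itself: smaller attribute values are taken to be better, and ranking skyline nodes by how many other nodes they dominate is a hypothetical choice for the "top-n" step, since the abstract does not state the ranking rule.

```python
def dominates(a, b):
    """a dominates b if a is no worse on every attribute and better on at least one."""
    return all(x <= y for x, y in zip(a, b)) and any(x < y for x, y in zip(a, b))


def skyline(points):
    """Keep only the points that no other point dominates."""
    return [p for p in points if not any(dominates(q, p) for q in points)]


def top_n_skyline(points, n):
    """Pick n skyline points, ranked by how many points each one dominates (assumed rule)."""
    def score(p):
        return sum(dominates(p, q) for q in points)
    return sorted(skyline(points), key=score, reverse=True)[:n]


# Hypothetical nodes described by (travel time, waiting time).
points = [(2, 9), (4, 4), (9, 1), (5, 5), (6, 8), (10, 10)]
print(skyline(points))           # [(2, 9), (4, 4), (9, 1)]
print(top_n_skyline(points, 2))  # [(4, 4), (2, 9)]
```

Running a route search over only the selected nodes is what reduces the number of operations compared with searching over all nodes.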

Fast Intermode Decision of Scalable Video Coding using Statistical Hypothesis Testing (스케일러블 비디오 부호화에서 통계적 가설 검증 기법을 이용한 프레임 간 모드 결정)

Lee, Bum-Shik;Kim, Mun-Churl;Hahm, Sang-Jin;Lee, Keun-Sik;Park, Keun-Soo
- Proceedings of the Korean Society of Broadcast Engineers Conference, 2006.11a, pp.111-115, 2006
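Scalable Video Coding (SVC) is a new compression standard currently being standardized by the JVT (Joint Video Team) of MPEG (Moving Picture Experts Group) and VCEG (Video Coding Experts Group), and it has a layered structure to support temporal, spatial, and quality scalability. In particular, it adopts a hierarchical B-picture structure for temporal scalability. Because the base layer of SVC is compatible with H.264|AVC, motion estimation and mode decision use blocks of seven different sizes: $16{\times}16,\;16{\times}8,\;8{\times}16,\;8{\times}8,\;8{\times}4,\;4{\times}8$ and $4{\times}4$. The hierarchical B-picture structure used in SVC uses B-pictures for every picture in a GOP (Group of Pictures) except the key pictures, I and P, so compared with H.264|AVC the amount of computation increases and the encoding delay also grows sharply. This is because B-pictures use the bidirectional motion vectors LIST0 and LIST1 and use multiple reference pictures in both directions. This paper introduces a fast inter-frame mode decision algorithm, based on statistical hypothesis testing, that is applicable to scalable video coding. The proposed method applies statistical hypothesis testing to $16{\times}16$ macroblocks and $8{\times}8$ sub-macroblocks; by comparing pixel values between the current block and the reconstructed reference block, it completes RD (Rate Distortion) optimization-based mode decision early and thereby enables fast inter-frame mode decision. By speeding up inter-frame mode decision, the proposed method reduces the computation and complexity of the scalable video encoder by up to 57%. The accompanying bit-rate increase and quality degradation are negligibly small: at most a 1.74% bit-rate increase and a 0.08dB PSNR decrease.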
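
The early-termination idea in this abstract can be sketched with a simple test on pixel differences. The abstract does not say which test statistic the authors use, so this sketch assumes a one-sample t-statistic on the differences between the current block and the reconstructed reference block; the function name `can_stop_early` and the critical value are hypothetical.

```python
from math import sqrt
from statistics import mean, stdev


def can_stop_early(current, reference, critical_value=2.0):
    """Return True if the pixel differences are statistically consistent with zero mean.

    When True, an encoder could accept the current (larger) block mode and
    skip the search over smaller partitions. Blocks are flat lists of pixels.
    """
    diffs = [c - r for c, r in zip(current, reference)]
    s = stdev(diffs)
    if s == 0:
        # All differences are equal: consistent only if they are all zero.
        return mean(diffs) == 0
    t = mean(diffs) / (s / sqrt(len(diffs)))
    return abs(t) < critical_value


# Hypothetical 2x2 blocks flattened to lists.
print(can_stop_early([10, 12, 11, 13], [11, 11, 12, 12]))  # True: differences average to zero
print(can_stop_early([20, 22, 21, 23], [11, 11, 12, 12]))  # False: consistent offset of about 10
```

A real encoder would apply such a test per $16{\times}16$ macroblock and per $8{\times}8$ sub-macroblock and fall back to the full RD search whenever the test fails.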

The Design and Implementation of a Cleaning Algorithm using NAND-Type Flash Memory (NAND-플래시 메모리를 이용한 클리닝 알고리즘의 구현 및 설계)

Koo, Yong-Wan;Han, Dae-Man
- Journal of Internet Computing and Services, v.7 no.6, pp.105-112, 2006
This paper builds a file system around a new i_node structure that decreases write frequency, because reducing the frequency of write operations on flash memory improves file system efficiency. The i_node is designed to implement a cleaning policy for data in order to perform write operations. This paper proposes a cleaning algorithm for write operations based on the new i_node structure. In addition, the algorithm cleans the oldest data first and retains the most recent data longest, based on the experimental observation that recently used application programs and data tend to be executed again, following the spatial and temporal locality that appears naturally when application programs run. Through experiments and an implementation of the flash file system, this paper demonstrates the efficiency of the NAND-type flash file system required in embedded systems.

An Optimized Hardware Design for High Performance Residual Data Decoder (고성능 잔여 데이터 복호기를 위한 최적화된 하드웨어 설계)

Jung, Hong-Kyun;Ryoo, Kwang-Ki
- Journal of the Korea Academia-Industrial cooperation Society, v.13 no.11, pp.5389-5396, 2012
In this paper, an optimized residual data decoder architecture is proposed to improve performance in H.264/AVC. The proposed architecture integrates a parallel inverse transform architecture and a parallel inverse quantization architecture whose common operation units apply new inverse quantization equations. Because the equations contain no division operation, they reduce the execution time and the number of operations of the inverse quantization process. The common operation unit implements the equations with a multiplier and a left shifter. The inverse quantization architecture, with four common operation units, reduces the execution cycle of inverse quantization to one cycle. The inverse transform architecture consists of eight inverse transform operation units and therefore reduces the execution cycle of the inverse transform to one cycle. Because the inverse quantization and inverse transform operations run concurrently, the execution cycle of inverse transform and inverse quantization for one $4{\times}4$ block is one cycle. The proposed architecture is synthesized using Magnachip 0.18um CMOS technology. The gate count and the critical path delay of the architecture are 21.9k and 5.5ns, respectively. The architecture achieves a throughput of 2.89Gpixels/sec at the maximum clock frequency of 181MHz. Measured with data extracted from JM 9.4, the execution cycle of the proposed architecture is about 88.5% less than that of existing designs.
https://doi.org/10.5762/KAIS.2012.13.11.5389
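
A multiply-and-shift form of inverse quantization can be sketched as below. The abstract does not give the paper's new equations, so this sketch shows the standard H.264 rescaling of a $4{\times}4$ residual block, with lookup tables standing in for QP/6 and QP%6 so that the per-coefficient path uses only a multiplier and a left shifter. The table-lookup approach and all function names are assumptions of this sketch, not the authors' design.

```python
# Standard H.264 rescaling factors, indexed by QP % 6, for the three
# coefficient position classes of a 4x4 block.
V = [(10, 16, 13), (11, 18, 14), (13, 20, 16),
     (14, 23, 18), (16, 25, 20), (18, 29, 23)]

# Tables for QP in 0..51, built once offline (a small ROM in hardware), so the
# per-coefficient path needs no divider. This is an assumption of the sketch.
QP_DIV6 = [qp // 6 for qp in range(52)]
QP_MOD6 = [qp % 6 for qp in range(52)]


def position_class(i, j):
    """Class 0: both indices even; class 1: both odd; class 2: all other positions."""
    if i % 2 == 0 and j % 2 == 0:
        return 0
    if i % 2 == 1 and j % 2 == 1:
        return 1
    return 2


def dequantize(z, qp, i, j):
    """Rescale one quantized coefficient with one multiply and one left shift."""
    return (z * V[QP_MOD6[qp]][position_class(i, j)]) << QP_DIV6[qp]


def dequantize_block(block, qp):
    """Rescale a 4x4 block given as a list of four rows."""
    return [[dequantize(block[i][j], qp, i, j) for j in range(4)] for i in range(4)]


print(dequantize(3, 28, 0, 0))  # 3 * 16 << 4 = 768
```

In hardware, four such multiply-shift units working in parallel would correspond to the four common operation units mentioned in the abstract.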

A Common Synthesis Filter for MPEG-2 BC/AAC Audio Using Recursive Structure (Recursive 구조를 이용한 MPEG-2 BC/AAC 오디오 공용 합성 필터)

강명수;박세기;오신범;이채욱
- The Journal of Korean Institute of Communications and Information Sciences, v.29 no.6C, pp.874-882, 2004
The MPEG audio compression algorithm is the international standard for digital compression of high-quality audio; it uses perceptual coding based on psychoacoustic masking. It is necessary to discuss the constraints on designing common filter banks, which map audio signals from the time domain into the frequency domain, for MPEG-2 BC and MPEG-2 AAC decoder systems, which has not been done yet. In this paper, we present an architecture for a common synthesis filter, based on a recursive structure, that can be used in both MPEG-2 BC and MPEG-2 AAC decoders. The proposed algorithm is based on a recursive architecture that effectively performs the common computation.

Delayed Write Scheme to Enhance Write Performance of Flash Memory Based Embedded Database Systems (플래시 메모리 기반 임베디드 데이터베이스 시스템의 쓰기 성능 향상을 위한 지연쓰기 기법)

Song, Ha-Joo;Kwon, Oh-Heum
- Journal of Korea Multimedia Society, v.12 no.2, pp.165-177, 2009
Embedded database systems (EDBMS) based on NAND flash memories are widely adopted for logging data on sensor nodes. Since write and erase operations on flash memory are time-consuming compared with read operations and wear out memory cells, it is important to reduce these operations to enhance EDBMS performance and extend the memory life. In this paper, we propose a delayed write scheme to achieve this goal. The proposed scheme stores the updated parts of database pages in delayed write records to reduce the number of database page writes, which decreases write and erase operations on the flash memory. The proposed scheme therefore enhances the logging performance of a write-intensive EDBMS on a sensor node and extends the flash memory life.
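
The delayed write idea in this abstract can be sketched with a toy in-memory model. This is an illustration under assumptions, not the authors' design: the class `DelayedWriteStore`, its record layout `(page_id, offset, data)`, and the batch size are all hypothetical, and flash is modeled only as a page-write counter.

```python
class DelayedWriteStore:
    """Toy model: small updates go to delayed write records instead of rewriting pages."""

    def __init__(self, records_per_log_page=4):
        self.base_pages = {}   # page_id -> bytearray, as last written in full
        self.log_pages = []    # flushed log pages, each a list of records
        self.pending = []      # delayed write records still buffered in RAM
        self.records_per_log_page = records_per_log_page
        self.flash_page_writes = 0

    def write_page(self, page_id, content):
        """Write a whole database page to flash."""
        self.base_pages[page_id] = bytearray(content)
        self.flash_page_writes += 1

    def update(self, page_id, offset, data):
        """Record only the changed bytes; flush one log page when the batch is full."""
        self.pending.append((page_id, offset, bytes(data)))
        if len(self.pending) == self.records_per_log_page:
            self.log_pages.append(self.pending)
            self.pending = []
            self.flash_page_writes += 1

    def read_page(self, page_id):
        """Rebuild the current page by replaying its records in order."""
        page = bytearray(self.base_pages[page_id])
        for batch in self.log_pages + [self.pending]:
            for pid, offset, data in batch:
                if pid == page_id:
                    page[offset:offset + len(data)] = data
        return bytes(page)


store = DelayedWriteStore(records_per_log_page=4)
store.write_page(0, bytes(16))
for i in range(8):
    store.update(0, i, bytes([i + 1]))
print(store.flash_page_writes)  # 3: one base page plus two log pages
```

Rewriting the full page on each of the eight updates would cost nine page writes in this model; the trade-off is that reads must replay the records.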

A High Speed Block Turbo Code Decoding Algorithm and Hardware Architecture Design (고속 블록 터보 코드 복호 알고리즘 및 하드웨어 구조 설계)

유경철;신형식;정윤호;김근회;김재석
- Journal of the Institute of Electronics Engineers of Korea SD, v.41 no.7, pp.97-103, 2004
In this paper, we propose a high-speed block turbo code decoding algorithm and an efficient hardware architecture. Multimedia wireless data communication systems need channel codes with high-performance error-correcting capabilities. Block turbo codes support variable code rates and packet sizes, and perform well because of the iterative soft-decision decoding of turbo codes. However, block turbo codes have a long decoding time because of the iterative decoding and a complicated extrinsic information operation. The proposed algorithm reduces the long decoding time by using a threshold that represents channel information. After the threshold is decided from a simulation result, the proposed algorithm eliminates the calculation for bits that have good channel information and assigns a high reliability value to those bits. The threshold is decided from the absolute mean and the standard deviation of the LLR (Log Likelihood Ratio), on the assumption that the LLR distribution is Gaussian. The proposed algorithm assigns '1', the highest reliability value, to those bits. The hardware design, written in Verilog HDL, reduces decoding time by about 30% in comparison with the conventional algorithm, and requires about 20K logic gates and 32Kbit of memory.
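
The threshold step in this abstract can be sketched as follows. The abstract says the threshold comes from the absolute mean and the standard deviation of the LLR but does not give the formula, so this sketch assumes `threshold = mean(|LLR|) + k * std(|LLR|)` with a hypothetical weight `k`; the function name and the sign convention for the reliability value are also assumptions.

```python
from statistics import mean, pstdev


def split_by_reliability(llrs, k=1.0):
    """Separate bits that can skip the extrinsic calculation from bits that cannot.

    Returns (threshold, fixed, to_decode), where `fixed` maps a bit index to a
    reliability of magnitude 1 carrying the sign of its LLR.
    """
    magnitudes = [abs(x) for x in llrs]
    threshold = mean(magnitudes) + k * pstdev(magnitudes)  # assumed formula
    fixed, to_decode = {}, []
    for i, x in enumerate(llrs):
        if abs(x) >= threshold:
            fixed[i] = 1.0 if x > 0 else -1.0  # highest reliability, no calculation
        else:
            to_decode.append(i)                # still needs the full calculation
    return threshold, fixed, to_decode


# Hypothetical LLR values for five received bits.
threshold, fixed, to_decode = split_by_reliability([0.2, -0.3, 5.0, 0.1, -0.4])
print(fixed, to_decode)  # {2: 1.0} [0, 1, 3, 4]
```

The more bits land in `fixed`, the less extrinsic information has to be computed in each iteration, which is where the reduction in decoding time comes from.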

New RPWM techniques for three-phase induction motor drive using four-switch three-phase inverter (4-SWITCH 3상인버터를 이용한 3상 유도전동기 구동을 위한 새로운 RPWM 기법)

Lee Hyo-Sang;Kwon Soo-Bum;Park Jong-Jin;Kim Nam-Joon
- Proceedings of the KIPE Conference, 2003.11a, pp.168-172, 2003
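This paper describes a three-phase induction motor drive method based on a new RPWM (Random PWM) technique for the 2-leg inverter, which offers various advantages such as reduced switching loss during high-frequency switching, ease of implementation, and reduced computation time for inverter control. Compared with existing RPWM methods, the proposed RPWM technique spreads the harmonic spectrum of the inverter output current evenly over a wide frequency range (the side bands of specific frequencies) in the high-speed operating region above 10000 rpm, and we aim to demonstrate the superiority of its harmonic reduction effect. To this end, we run experiments on a DSP-controlled IGBT inverter with an algorithm that applies the proposed RPWM technique, and we examine the results to verify the validity of the proposed technique.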

The Spatial View Client-Side Materialization Techniques for Load-Balancing in Server-Side Computing Cost (서버 처리비용 분산을 위한 공간 뷰 클라이언트 실체화 기법)

김태연;정보흥;이재동;배해영
- Proceedings of the Korean Information Science Society Conference, 2001.04b, pp.211-213, 2001
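A spatial database system provides spatial views, composed only of the spatial data a user wants, for data security and user convenience. In a client/server spatial database system, when many clients issue queries on spatial views, the cost of the server's I/O operations for processing large volumes of data, the query processing cost, and the cost of transmitting the result data load the server and slow down query processing. This paper proposes a client-side materialization technique that distributes the spatial view creation process between the server and the client in a client/server spatial database system. Distributing the query processing for spatial view creation between server and client reduces the server load caused by large data volumes and complex spatial operations, and materializing the view on the client reduces the server bottleneck and server load caused by queries on the spatial view, which minimizes user response time.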

