• Title/Summary/Keyword: Memory Buffer


Block-based Adaptive Bit Allocation for Reference Memory Reduction (효율적인 참조 메모리 사용을 위한 블록기반 적응적 비트할당 알고리즘)

  • Park, Sea-Nae;Nam, Jung-Hak;Sim, Dong-Gy;Joo, Young-Hun;Kim, Yong-Serk;Kim, Hyun-Mun
    • Journal of the Institute of Electronics Engineers of Korea SP / v.46 no.3 / pp.68-74 / 2009
  • In this paper, we propose an effective memory reduction algorithm that reduces the size of the reference frame buffer and the memory bandwidth in video encoders and decoders. In general video codecs, previously decoded frames must be stored and referenced to reduce temporal redundancy. Recently, reference frames have been recompressed for memory efficiency and for bandwidth reduction between the main processor and external memory; however, such algorithms can hurt coding efficiency. Several algorithms have been proposed to reduce the amount of reference memory with minimal quality degradation, but they still suffer from quality loss due to fixed-bit allocation. In this paper, we propose an adaptive block-based min-max quantization that considers the local characteristics of the image. In the proposed algorithm, the basic processing unit is an 8×8 block for memory alignment, and adaptive quantization is applied to each 4×4 block to minimize quality degradation. We found that the proposed algorithm obtains around 1.7% BD-bitrate gain and 0.03 dB BD-PSNR gain over the conventional fixed-bit min-max algorithm, with 37.5% memory savings.
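
As a rough illustration of the block-based min-max idea (a sketch under assumed parameters, not the authors' exact algorithm), the following C fragment quantizes each 4×4 block to n bits between its own minimum and maximum; the adaptive part would choose n per block from its dynamic range:

```c
#include <stdint.h>

/* Hypothetical per-4x4-block min-max quantization: each block stores its
   min, max, and 16 n-bit indices; n <= 8 would be picked adaptively. */
typedef struct {
    uint8_t min, max;
    uint8_t idx[16];
} MinMaxBlock4x4;

static void encode_block(const uint8_t px[16], int n, MinMaxBlock4x4 *out)
{
    uint8_t lo = px[0], hi = px[0];
    for (int i = 1; i < 16; i++) {
        if (px[i] < lo) lo = px[i];
        if (px[i] > hi) hi = px[i];
    }
    out->min = lo;
    out->max = hi;
    int range  = hi - lo;
    int levels = (1 << n) - 1;
    for (int i = 0; i < 16; i++)   /* uniform quantization with rounding */
        out->idx[i] = range ? (uint8_t)(((px[i] - lo) * levels + range / 2) / range) : 0;
}

static void decode_block(const MinMaxBlock4x4 *in, int n, uint8_t px[16])
{
    int range  = in->max - in->min;
    int levels = (1 << n) - 1;
    for (int i = 0; i < 16; i++)   /* reconstruct within [min, max] */
        px[i] = (uint8_t)(in->min + (in->idx[i] * range + levels / 2) / levels);
}
```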

The Early Write Back Scheme For Write-Back Cache (라이트 백 캐쉬를 위한 빠른 라이트 백 기법)

  • Chung, Young-Jin;Lee, Kil-Whan;Lee, Yong-Surk
    • Journal of the Institute of Electronics Engineers of Korea SD / v.46 no.11 / pp.101-109 / 2009
  • Generally, the depth cache and pixel cache of a 3D graphics processor are designed with a write-back scheme for efficient use of memory bandwidth. In 3D graphics caches, write-after-read operations on the same address, or write-only access patterns, also occur frequently. When a cache miss is detected, an external memory access for the write-back operation and another access for handling the miss are issued together. Under frequent cache misses, since memory access bandwidth is limited, external memory access time increases due to the memory bottleneck; as a result, the overall performance of the processor or IP decreases, and peak power consumption rises. In this paper, we propose a novel early write-back cache architecture to solve these problems. The proposed architecture controls the point at which the external memory is accessed to copy back a valid data block, and it improves cache performance at the same hit ratio and the same cache capacity. As a result, the proposed architecture alleviates the memory bottleneck by preventing bursts of intensive memory accesses. We evaluated the proposed architecture on the 3D graphics Z cache and pixel cache in an SoC environment embedding an ARM11 core, a 3D graphics accelerator, and various IPs. The simulation results indicate up to a 75% performance increase across the simulation vectors used.
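
A minimal C sketch of the early write-back intuition, assuming a simple set-associative model and a hypothetical mem_write callback: dirty lines are flushed during idle bus slots, so a later miss pays only for the fill rather than a write-back plus a fill.

```c
#include <stdint.h>

typedef struct {
    uint32_t tag;
    int      valid;
    int      dirty;
    uint8_t  data[64];
} CacheLine;

/* Opportunistically clean one dirty line per idle bus slot. */
void early_write_back(CacheLine *set, int ways, int bus_idle,
                      void (*mem_write)(uint32_t tag, const uint8_t *data))
{
    if (!bus_idle) return;            /* only use otherwise-wasted bandwidth */
    for (int w = 0; w < ways; w++) {
        if (set[w].valid && set[w].dirty) {
            mem_write(set[w].tag, set[w].data);
            set[w].dirty = 0;         /* line stays valid; later eviction is free */
            return;                   /* one line per idle slot */
        }
    }
}
```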

Design and Implementation of B-Tree on Flash Memory (플래시 메모리 상에서 B-트리 설계 및 구현)

  • Nam, Jung-Hyun;Park, Dong-Joo
    • Journal of KIISE:Databases / v.34 no.2 / pp.109-118 / 2007
  • Recently, flash memory has been used to store data in mobile computing devices such as PDAs, smart cards, mobile phones, and MP3 players. These devices need index structures like the B-tree to efficiently support operations such as insertion, deletion, and search. The BFTL (B-tree Flash Translation Layer) technique was the first proposed for implementing the B-tree on flash memory. Flash memory has the characteristics that a write operation is more costly than a read operation and that overwriting in place is impossible; therefore, BFTL focuses on minimizing the number of write operations incurred while building the B-tree. However, we show in this paper that there is considerable room for improving the I/O cost of building the B-tree with this method, and that it is impractical because it greatly increases SRAM usage. In this paper, we propose the BOF (B-tree On Flash memory) approach for implementing the B-tree on flash memory efficiently. The core of this approach is to store index units belonging to the same B-tree node in the same flash memory sector when the buffer used to build the B-tree is flushed. We show that our BOF technique outperforms BFTL and other techniques.
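
A speculative C sketch of the core BOF idea as described above (all identifiers are hypothetical): when the build buffer is flushed, the buffered index units belonging to one B-tree node are packed into a single flash sector, so reconstructing that node later costs one sector read.

```c
#define UNITS_PER_SECTOR 16

typedef struct { int node_id; int key; int ptr; } IndexUnit;

/* Collect the buffered units of the victim node and write them together
   into one sector; returns the callback's status code. */
int flush_node_units(const IndexUnit *buf, int n, int victim_node,
                     int (*write_sector)(const IndexUnit *units, int count))
{
    IndexUnit sector[UNITS_PER_SECTOR];
    int count = 0;
    for (int i = 0; i < n && count < UNITS_PER_SECTOR; i++)
        if (buf[i].node_id == victim_node)
            sector[count++] = buf[i];     /* co-locate units of one node */
    return write_sector(sector, count);   /* one sequential flash write */
}
```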

A Receiver-driven TCP Flow Control for Memory Constrained Mobile Receiver (제한된 메모리의 모바일 수신자를 고려한 수신자 기반 TCP 흐름 제어)

  • Lee, Jong-Min;Cha, Ho-Jung
    • Journal of KIISE:Information Networking / v.31 no.1 / pp.91-100 / 2004
  • This paper presents a receiver-driven TCP flow control mechanism, adaptive to wireless conditions, for memory-constrained mobile receivers. Receiver-driven TCP flow control is, in general, achieved by adjusting the size of the advertised window at the receiver. The proposed method constantly measures, at the receiver, both the available wireless bandwidth and the packet round-trip time, and adjusts the advertised window accordingly. Constrained by this adjusted window, which reflects the current state of the wireless network, the sender achieves improved TCP throughput as well as reduced round-trip packet delay. The implementation only affects the protocol stack at the receiver, so neither the sender nor the routers need to be modified. The mechanism has been implemented in a real environment. The experimental results show that in a CDMA2000 1x network, the TCP throughput of the proposed method improved about fivefold over the conventional method when the receiver's buffer size was limited to 2896 bytes. Also, with a 64-Kbyte buffer, the packet round-trip time of the proposed method was halved compared with the conventional method.
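
A minimal C sketch of the receiver-side window computation, assuming the advertised window is sized to the measured bandwidth-delay product and clamped to the free receive buffer (the paper's exact control law may differ):

```c
#include <stdint.h>

/* Size the advertised window from measured bandwidth and RTT. */
uint32_t advertised_window(double bw_bytes_per_s, double rtt_s,
                           uint32_t free_buf_bytes, uint32_t mss)
{
    double bdp = bw_bytes_per_s * rtt_s;   /* bytes the path can hold in flight */
    uint32_t win = (uint32_t)bdp;
    if (win < mss) win = mss;              /* keep at least one segment open */
    if (win > free_buf_bytes)              /* never promise memory we lack */
        win = free_buf_bytes;
    return win;                            /* goes into the TCP rwnd field */
}
```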

Data Stream Storing Techniques for Supporting Hybrid Query (하이브리드 질의를 위한 데이터 스트림 저장 기술)

  • Shin, Jae-Jyn;You, Byeong-Seob;Eo, Sang-Hun;Lee, Dong-Wook;Bae, Hae-Young
    • Journal of Korea Multimedia Society / v.10 no.11 / pp.1384-1397 / 2007
  • This paper proposes fast storage techniques to support hybrid queries over data streams. DSMSs (Data Stream Management Systems) have been researched for processing data streams with bursty input. To process hybrid queries, which retrieve both currently arriving and past data streams, the streams have to be stored on disk. However, because of the high input rate of data streams and the limits of memory and disk space, most research has addressed querying only the currently arriving streams rather than stored ones. The techniques proposed in this paper use a circular buffer to maximize memory utilization and to make non-blocking insertion possible, and data on disk is compressed to maximize the number of stored tuples. Experiments show that the proposed techniques store bursty insertions quickly.
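
A minimal C sketch of a non-blocking circular buffer of the kind described, assuming a single receiver thread appending tuples while a flush thread drains them to disk (identifiers and sizes are illustrative):

```c
#include <stdatomic.h>

#define RING_SIZE 4096   /* power of two, so masking replaces modulo */

typedef struct {
    atomic_uint head;    /* next write slot (receiver thread) */
    atomic_uint tail;    /* next slot to flush to disk (flush thread) */
    char slot[RING_SIZE][64];
} StreamRing;

/* Append one fixed-size tuple without blocking; returns 0 if full. */
int ring_put(StreamRing *r, const char tuple[64])
{
    unsigned h = atomic_load(&r->head);
    unsigned t = atomic_load(&r->tail);
    if (h - t == RING_SIZE) return 0;      /* full: caller may drop or spill */
    for (int i = 0; i < 64; i++)
        r->slot[h & (RING_SIZE - 1)][i] = tuple[i];
    atomic_store(&r->head, h + 1);         /* publish only after the copy */
    return 1;
}
```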

A New Hardware Architecture of High-Speed Motion Estimator for H.264 Video CODEC (H.264 비디오 코덱을 위한 고속 움직임 예측기의 하드웨어 구조)

  • Lim, Jeong-Hun;Seo, Young-Ho;Choi, Hyun-Jun;Kim, Dong-Wook
    • Journal of Broadcast Engineering / v.16 no.2 / pp.293-304 / 2011
  • In this paper, we propose a new hardware architecture for motion estimation (ME), the most time-consuming unit among the H.264 algorithms, and design it as an intellectual property (IP) block. The proposed ME hardware consists of a buffer, a processing unit (PU) array, an SAD (sum of absolute differences) selector, and a motion vector (MV) generator. The PU array is composed of 16 PUs, and each PU consists of 16 processing elements (PEs). The main characteristics of the proposed hardware are that current and reference frame data are reused to reduce the number of accesses to external memory, and that no clock cycles are lost during the SAD operation. The implemented ME hardware occupies 3% of the hardware resources of a Stratix III EP3SE80F1152C2, an FPGA from Altera Inc., and can operate at up to 446.43 MHz; it can therefore process up to 50 frames of 1080p video per second.
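
As a software reference for what one PU computes, here is a C model of the 16×16 SAD between a current block and one reference candidate; the hardware evaluates many such candidates in parallel across the PU array:

```c
#include <stdint.h>
#include <stdlib.h>

/* SAD of a 16x16 current block against one reference candidate.
   `stride` is the row pitch of both frames. */
uint32_t sad_16x16(const uint8_t *cur, const uint8_t *ref, int stride)
{
    uint32_t sad = 0;
    for (int y = 0; y < 16; y++)
        for (int x = 0; x < 16; x++)
            sad += (uint32_t)abs(cur[y * stride + x] - ref[y * stride + x]);
    return sad;   /* the SAD selector keeps the candidate with the minimum */
}
```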

Design of RISC-based Transmission Wrapper Processor IP for TCP/IP Protocol Stack (TCP/IP프로토콜 스택을 위한 RISC 기반 송신 래퍼 프로세서 IP 설계)

  • Choi, Byung-Yun;Jang, Jong-Wook
    • Journal of the Korea Institute of Information and Communication Engineering / v.8 no.6 / pp.1166-1174 / 2004
  • In this paper, the design of a RISC-based transmission wrapper processor for the TCP/IP protocol stack is described. The processor consists of input and output buffer memories with a dual-bank structure, a 32-bit RISC microprocessor core, a DMA unit with on-the-fly checksum capability, and a memory module. To handle the various modes of the TCP/IP protocol, a hardware-software codesign approach based on the RISC processor is used rather than a conventional state-machine design. To eliminate the large delay caused by executing the data transfer and checksum operations sequentially, a DMA module that computes the checksum along with the data transfer is adopted. The designed processor, excluding the variable-size input/output buffers, consists of about 23,700 gates, and its maximum operating frequency is about 167 MHz in a 0.35 μm CMOS technology.
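
A minimal C model of on-the-fly checksumming as the abstract describes it, folding the 16-bit one's-complement Internet checksum into the copy loop instead of making a second pass over the data (word-aligned, even-length data assumed for brevity):

```c
#include <stdint.h>
#include <stddef.h>

/* Copy `words` 16-bit words and accumulate the Internet checksum
   in the same pass, as the DMA unit does in hardware. */
uint16_t copy_with_checksum(uint16_t *dst, const uint16_t *src, size_t words)
{
    uint32_t sum = 0;
    for (size_t i = 0; i < words; i++) {
        dst[i] = src[i];                 /* data transfer */
        sum   += src[i];                 /* checksum in the same pass */
    }
    while (sum >> 16)                    /* fold carries back in */
        sum = (sum & 0xFFFF) + (sum >> 16);
    return (uint16_t)~sum;               /* one's complement of the sum */
}
```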

Resolving Memory Bottlenecks in Hardware Accelerators with Data Prefetch

  • Hyein Lee;Jinoo Joung
    • Journal of the Korea Society of Computer and Information / v.29 no.6 / pp.1-12 / 2024
  • Faster and more accurate deep learning requires large amounts of storage space and heavy computation. Accordingly, many studies use hardware accelerators for fast and accurate calculation. However, a performance bottleneck arises from data movement between the hardware accelerator and the CPU. In this paper, we propose a data prefetch strategy that can efficiently reduce this bottleneck. The core idea of the data prefetch strategy is to predict the data needed for the next task and upload it to local memory while the hardware accelerator (Matrix Multiplication Unit, MMU) performs the current task. The strategy is enhanced by using a dual buffer to perform read and write operations simultaneously, which reduces the latency and execution time of data transfers. Through simulations, we demonstrate a 24% improvement in hardware accelerator performance by maximizing parallel processing with dual buffers and by reducing memory bottlenecks with data prefetch.
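
A minimal C sketch of the dual-buffer prefetch loop, with hypothetical prefetch and mmu_compute callbacks; in hardware the prefetch runs asynchronously alongside the compute, which this sequential model can only indicate by its structure:

```c
/* While the MMU works on buffer `cur`, the next tile is fetched into
   buffer `nxt`; the two swap roles each iteration. */
void run_tiles(int n_tiles,
               void (*prefetch)(int tile, float *dst),   /* async in hardware */
               void (*mmu_compute)(const float *src))
{
    static float buf[2][1024];
    int cur = 0;
    prefetch(0, buf[cur]);                 /* prime the pipeline */
    for (int t = 0; t < n_tiles; t++) {
        int nxt = cur ^ 1;
        if (t + 1 < n_tiles)
            prefetch(t + 1, buf[nxt]);     /* overlaps with compute below */
        mmu_compute(buf[cur]);
        cur = nxt;                         /* swap buffers */
    }
}
```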

Design and Implementation of a Lightweight On-Device AI-Based Real-time Fault Diagnosis System using Continual Learning (연속학습을 활용한 경량 온-디바이스 AI 기반 실시간 기계 결함 진단 시스템 설계 및 구현)

  • Youngjun Kim;Taewan Kim;Suhyun Kim;Seongjae Lee;Taehyoun Kim
    • IEMEK Journal of Embedded Systems and Applications / v.19 no.3 / pp.151-158 / 2024
  • Although on-device artificial intelligence (AI) has gained attention for diagnosing machine faults in real time, most previous studies did not consider the model retraining and redeployment processes that must be performed in real-world industrial environments. Our study addresses this challenge by proposing an on-device AI-based real-time machine fault diagnosis system that utilizes continual learning. The proposed system includes a lightweight convolutional neural network (CNN) model, a continual learning algorithm, and a real-time monitoring service. First, we developed a lightweight 1D CNN model to reduce the cost of model deployment and enable real-time inference on a target edge device with limited computing resources. We then compared the performance of five continual learning algorithms on three public bearing-fault datasets and selected the most effective algorithm for our system. Finally, we implemented a real-time monitoring service using an open-source data visualization framework. In the comparison between continual learning algorithms, we found that the replay-based algorithms outperformed the regularization-based algorithms, and that the experience replay (ER) algorithm had the best diagnostic accuracy. We further tuned the number and length of the data samples kept in the ER algorithm's memory buffer to maximize its performance, and confirmed that performance improves as longer data samples are used. Consequently, the proposed system achieved an accuracy of 98.7% while storing only 16.5% of the previous data in the memory buffer. Our lightweight CNN model was also able to diagnose the fault type of one data sample within 3.76 ms on a Raspberry Pi 4B device.
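
A minimal C sketch of a bounded ER-style memory buffer using reservoir sampling, one common way to keep a uniformly representative subset of past samples; the paper's exact sampling policy and the capacity and sample length used here are assumptions:

```c
#include <stdlib.h>

#define MEM_CAP    512    /* assumed buffer capacity */
#define SAMPLE_LEN 1024   /* assumed sample length (longer worked better here) */

typedef struct { float x[SAMPLE_LEN]; int label; } Sample;
typedef struct { Sample mem[MEM_CAP]; long seen; } ReplayBuffer;

/* Reservoir sampling: after `seen` observations, every past sample
   is retained with equal probability MEM_CAP / seen. */
void er_observe(ReplayBuffer *rb, const Sample *s)
{
    if (rb->seen < MEM_CAP) {
        rb->mem[rb->seen] = *s;            /* buffer not yet full */
    } else {
        long j = rand() % (rb->seen + 1);  /* uniform index in [0, seen] */
        if (j < MEM_CAP) rb->mem[j] = *s;  /* evict a random old sample */
    }
    rb->seen++;
}
```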

Efficient DRAM Buffer Access Scheduling Techniques for SSD Storage System (SSD 스토리지 시스템을 위한 효율적인 DRAM 버퍼 액세스 스케줄링 기법)

  • Park, Jun-Su;Hwang, Yong-Joong;Han, Tae-Hee
    • Journal of the Institute of Electronics Engineers of Korea SD / v.48 no.7 / pp.48-56 / 2011
  • Recently, SSDs (Solid State Disks) based on NAND flash memory have been gradually replacing HDDs (Hard Disk Drives) in mobile devices, and a variety of research efforts are under way to find cost-effective ways to improve their performance. As the number of NAND flash channels is increased to enhance bandwidth through parallel processing, the DRAM buffer, which acts as a buffer cache between the host (PC) and the NAND flash, becomes the bottleneck. To resolve this problem, this paper proposes an efficient low-cost scheme that increases SSD performance by improving DRAM buffer bandwidth through scheduling techniques that utilize DRAM multi-banks. When the host and the NAND flash multi-channels request access to the DRAM buffer concurrently, the proposed technique checks their destinations and schedules the requests appropriately, considering the properties of DRAM. It significantly reduces the overhead of bank-activate time and row latency, and thus optimizes DRAM buffer bandwidth utilization. The results reveal that the proposed technique improves SSD performance by 47.4% for read and 47.7% for write operations compared with conventional methods, with negligible hardware changes and overhead.
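
A minimal C sketch of bank-aware request scheduling in the spirit of the abstract, assuming hypothetical request and open-row bookkeeping: requests that hit an already-open row are served first, so the precharge and activate (row) latency is paid less often:

```c
#define NBANKS 8

typedef struct { int bank; int row; int valid; } Req;

/* Pick the next DRAM request from the pending queue `q` of length `n`.
   `open_row[b]` holds the currently open row of bank b (-1 if none). */
int pick_request(const Req *q, int n, const int open_row[NBANKS])
{
    for (int i = 0; i < n; i++)      /* first pass: prefer row hits */
        if (q[i].valid && open_row[q[i].bank] == q[i].row)
            return i;
    for (int i = 0; i < n; i++)      /* fallback: oldest valid request */
        if (q[i].valid)
            return i;
    return -1;                       /* queue empty */
}
```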