• Title/Summary/Keyword: Main memory access architecture

Search Result 25, Processing Time 0.018 seconds

A Unified Software Architecture for Storage Class Random Access Memory (스토리지 클래스 램을 위한 통합 소프트웨어 구조)

  • Baek, Seung-Jae;Choi, Jong-Moo
    • Journal of KIISE:Computer Systems and Theory
    • /
    • v.36 no.3
    • /
    • pp.171-180
    • /
    • 2009
  • Slowly, but surely, we are seeing the emergence of a variety of embedded systems that are employing Storage Class RAM (SCRAM) such as FeRAM, MRAM and PRAM, SCRAM not only has DRAM-characteristic, that is, random byte-unit access capability, but also Disk-characteristic, that is, non-volatility. In this paper, we propose a new software architecture that allows SCRAM to be used both for main memory and for secondary storage simultaneously- The proposed software architecture has two core modules, one is a SCRAM driver and the other is a SCRAM manager. The SCRAM driver takes care of SCRAM directly and exports low level interfaces required for upper layer software modules including traditional file systems, buddy systems and our SCRAM manager. The SCRAM manager treats file objects and memory objects as a single object and deals with them in a unified way so that they can be interchanged without copy overheads. Experiments conducted on real embedded board with FeRAM have shown that the SCRAM driver indeed supports both the traditional F AT file system and buddy system seamlessly. The results also have revealed that the SCRAM manager makes effective use of both characteristics of SCRAM and performs an order of magnitude better than the traditional file system and buddy system.

An image data processing unit of efficient H/W structure for mask/logic operations (마스크/논리 연산에 효율적인 H/W 구조를 갖는 영상 데이터 처리장치)

  • 이상현;김진헌;박귀태
    • 제어로봇시스템학회:학술대회논문집
    • /
    • 1993.10a
    • /
    • pp.685-691
    • /
    • 1993
  • This paper introduces a PC-based image data processing unit that is composed of preprocessor board and main processor board; The preprocessor contains Inmos A110 processor and efficient H/W architecture for fast mask/logic operations at the speed of video signal rate. It is controlled by the main processor which communicates with the host PC. The main processor board contains TI TMS320C31 digital signal processor, and can access the frame memory of the processor for extra S/W tasks. We test 3*3, 5*5 masks and logic operations on 386/486/DSP and compare the result with that of the proposed unit. The result shows ours are extremely faster than conventional CPU based approach, that is, over several hundred times faster than even DSP.

  • PDF

Remote Cache Replacement Policy using Processor Locality in Multi-Processor System (다중 프로세서 시스템에서 프로세서 지역성을 이용한 원격 캐쉬 교체 정책)

  • Han Sang Yoon;Kwak Jong Wook;Jhang Seong Tae;Jhon Chu Shik
    • Journal of KIISE:Computer Systems and Theory
    • /
    • v.32 no.11_12
    • /
    • pp.541-556
    • /
    • 2005
  • The memory access latency of the system has been a primary factor of performance degradation in single-processor system and multi-processor system. The remote memory access latency takes a lot of overhead over the local memory access latency especially in the distributed shared-memory system. To resolve this problem, the multi-level cache architecture that contains a remote cache in the multi-processor system has been proposed. In this paper, we propose a new cache replacement policy that improves the performance of the multi-processor system with the remote cache. If the multi-level cache keeps the multi-level inclusion(MLI) property and uses the LRU(Least Recently Used) cache replacement policy, the LRU information of the higher-level cache(a processor cache) would be different with that of the lower-level cache(a remote cache). In this situation, the replacement of a remote cache line can induce the exchange of a processor cache line that is used by the processor. It is a main factor of performance degradation in a whole system. To alleviate this disadvantage of the LRU replacement polity, the new policy analyses tht processor's remote memory access pattern of each node and uses this information to reduce the number of invalidations of the useful cache line in the higher-level cache. The new replacement policy of the remote cache can improve the performance by $3.5\%$ in maximum and $2.5\%$ in average on SPLASH-2 benchmarks, compared to the general LRU cache replacement policy.

Join Operation of Parallel Database System with Large Main Memory (대용량 메모리를 가진 병렬 데이터베이스 시스템의 조인 연산)

  • Park, Young-Kyu
    • Journal of the Korea Society of Computer and Information
    • /
    • v.12 no.3
    • /
    • pp.51-58
    • /
    • 2007
  • The shared-nothing multiprocessor architecture has advantages in scalability, this architecture has been adopted in many multiprocessor database system. But, if the data are not uniformly distributed across the processors, load will be unbalanced. Therefore, the whole system performance will deteriorate. This is the data skew problem, which usually occurs in processing parallel hash join. Balancing the load before performing join will resolve this problem efficiently and the whole system performance can be improved. In this paper, we will present an algorithm using merit of very large memory to reduce disk access overhead in performing load balancing and to efficiently solve the data skew problem. Also, we will present analytical model of our new algorithm and present the result of some performance study we made comparing our algorithm with the other algorithms in handling data skew.

  • PDF

A New Hardware Architecture of High-Speed Motion Estimator for H.264 Video CODEC (H.264 비디오 코덱을 위한 고속 움직임 예측기의 하드웨어 구조)

  • Lim, Jeong-Hun;Seo, Young-Ho;Choi, Hyun-Jun;Kim, Dong-Wook
    • Journal of Broadcast Engineering
    • /
    • v.16 no.2
    • /
    • pp.293-304
    • /
    • 2011
  • In this paper, we proposed a new hardware architecture for motion estimation (ME) which is the most time-consuming unit among H.264 algorithms and designed to the type of intellectual property (IP). The proposed ME hardware consists of buffer, processing unit (PU) array, SAD (sum of absolute difference) selector, and motion vector (MVgenerator). PU array is composed of 16 PUs and each PU consists of 16 processing elements (PUs). The main characteristics of the proposed hardware are that current and reference frames are re-used to reduce the number of access to the external memory and that there is no clock loss during SAD operation. The implemented ME hardware occupies 3% hardware resources of StatixIII EP3SE80F1152C2 which is a FPGA of Altera Inc. and can operate at up to 446.43MHz. Therefore it can process up to 50 frames of 1080p in a second.

A Study of FC-NIC Design Using zynq SoC for Host Load Reduction (호스트 부하 경감 달성을 위한 zynq SoC를 적용한 FC-NIC 설계에 관한 연구)

  • Hwang, Byeung-Chang;Seo, Jung-hoon;Kim, Young-Su;Ha, Sung-woo;Kim, Jae-Young;Jang, Sun-geun
    • Journal of Advanced Navigation Technology
    • /
    • v.19 no.5
    • /
    • pp.423-432
    • /
    • 2015
  • This paper shows that design, manufacture and the performance of FC-NIC (fibre channel network interface card) for network unit configuration which is based on one of the 5 main configuration items of the common functional module for IMA (integrated modular Avionics) architecture. Especially, FC-NIC uses zynq SoC (system on chip) for host load reductions. The host merely transmit FC destination address, source memory location and size information to the FC-NIC. After then the FC-NIC read the host memory via DMA (direct memory access). FC upper layer protocol and sequence process at local processor and programmable logic of FC-NIC zynq SoC. It enables to free from host load for external communication. The performance of FC-NIC shows average 5.47 us low end-to-end latency at 2.125 Gbps line speed. It represent that FC-NIC is one of good candidate network for IMA.

An Implementation of 3D Graphic Accelerator for Phong Shading (퐁 음영법을 위한 3차원 그래픽 가속기의 구현)

  • Lee, Hyung;Park, Youn-Ok;Park, Jong-Won
    • Journal of Korea Multimedia Society
    • /
    • v.3 no.5
    • /
    • pp.526-534
    • /
    • 2000
  • There have been many researches on the 3D graphic accelerator for high speed by needs of CAD/CAM,3D modeling, virtual reality or medical image. In this paper, an SIMD processor architecture for 3D graphic accelerator is proposed in order to improve the processing time of the 3D graphics, and a parallel Phong shading algorithm is presented to estimate performance of the proposed architecture. The proposed SIMD processor architecture for 3D graphic accelerator consists of PCI local bus interface, 16 Processing Elements (PE's), and Park's multi-access memory system (NAMS) that has 17 memory modules. A serial algorithm for Phong shading is modified for the architecture and the main key is to divide a polygon into $4\times{4}$ squares. And, for processing a square, 4 PE's are regarded as a PE Grou logically. Since MAMS can support block access type with interval 1, it is possible that 4 PE Groups process a square at a time. In consequence, 16 pixels are processed simultaneously. The proposed SIMD processor architecture is simulated by CADENCE Verilog-XL that is a package for the hardware simulation. With the same simulated results as that of the serial algorithm, the speed enhancement by the parallel algorithm to the serial one is 5.68.

  • PDF

A New Hardware Design for Generating Digital Holographic Video based on Natural Scene (실사기반 디지털 홀로그래픽 비디오의 실시간 생성을 위한 하드웨어의 설계)

  • Lee, Yoon-Hyuk;Seo, Young-Ho;Kim, Dong-Wook
    • Journal of the Institute of Electronics and Information Engineers
    • /
    • v.49 no.11
    • /
    • pp.86-94
    • /
    • 2012
  • In this paper we propose a hardware architecture of high-speed CGH (computer generated hologram) generation processor, which particularly reduces the number of memory access times to avoid the bottle-neck in the memory access operation. For this, we use three main schemes. The first is pixel-by-pixel calculation rather than light source-by-source calculation. The second is parallel calculation scheme extracted by modifying the previous recursive calculation scheme. The last one is a fully pipelined calculation scheme and exactly structured timing scheduling by adjusting the hardware. The proposed hardware is structured to calculate a row of a CGH in parallel and each hologram pixel in a row is calculated independently. It consists of input interface, initial parameter calculator, hologram pixel calculators, line buffer, and memory controller. The implemented hardware to calculate a row of a $1,920{\times}1,080$ CGH in parallel uses 168,960 LUTs, 153,944 registers, and 19,212 DSP blocks in an Altera FPGA environment. It can stably operate at 198MHz. Because of the three schemes, the time to access the external memory is reduced to about 1/20,000 of the previous ones at the same calculation speed.

Model Validation of a Fast Ethernet Controller for Performance Evaluation of Network Processors (네트워크 프로세서의 성능 예측을 위한 고속 이더넷 제어기의 상위 레벨 모델 검증)

  • Lee Myeong-jin
    • Journal of KIISE:Computing Practices and Letters
    • /
    • v.11 no.1
    • /
    • pp.92-99
    • /
    • 2005
  • In this paper, we present a high-level design methodology applied on a network system-on-a-chip(SOC) using SystemC. The main target of our approach is to get optimum performance parameters for high network address translation(NAT) throughput. The Fast Ethernet media access controller(MAC) and its direct memory access(DMA) controller are modeled with SystemC in transaction level. They are calibrated through the cycle-based measurement of the operation of the real Verilog register transfer language(RTL). The NAT throughput of the model is within $\pm$10% error compared to the output of the real evaluation board. Simulation speed of the model is more than 100 times laster than the RTL. The validated models are used for intensive architecture exploration to find the performance bottleneck in the NAT router.

Cache Architecture Design for the Performance Improvement of OpenRISC Core (OpenRISC 코어의 성능향상을 위한 캐쉬 구조 설계)

  • Jung, Hong-Kyun;Ryoo, Kwang-Ki
    • Journal of the Institute of Electronics Engineers of Korea SD
    • /
    • v.46 no.1
    • /
    • pp.68-75
    • /
    • 2009
  • As the recent performance of microprocessor is improving quickly, the necessity of cache is growing because of the increase of the access time of main memory. Every block of direct-mapped cache maps to one cache line. Although the mapping rule is simple, if different blocks map to one cache line, the miss ratio will be higher than the set-associative cache due to conflicts. In this paper, for the improvement of the direct-mapped cache of OpenRISC, 4-way set-associative cache is proposed. Four blocks of the main memory of the proposed cache map to one cache line so that the miss ratio is less than the direct-mapped cache. Pseudo-LRU Policy, which is one of the Line Replacement Policies, is used for decreasing the number of bits that store LRU value. The OpenRISC core including the 4-way set-associative cache was verified with FPGA emulation. As the result of performance measurement using test program, the performance of the OpenRISC core including the 4-way set-associative cache is higher than the previous one by 50% and the decrease of miss ratio is more than 15%.