• Title/Summary/Keyword: embedded chip-multiprocessor

Search Result 6, Processing Time 0.016 seconds

A System Level Network-on-chip Model with MLDesigner

  • Agarwal, Ankur;Shankar, Rabi;Pandya, A.S.;Lho, Young-Uhg
    • Journal of information and communication convergence engineering
    • /
    • v.6 no.2
    • /
    • pp.122-128
    • /
    • 2008
  • Multiprocessor architectures and platforms, such as, a multiprocessor system on chip (MPSoC) recently introduced to extend the applicability of the Moore's law, depend upon concurrency and synchronization in both software and hardware to enhance design productivity and system performance. With the rapidly approaching billion transistors era, some of the main problem in deep sub-micron technologies characterized by gate lengths in the range of 60-90 nm will arise from non scalable wire delays, errors in signal integrity and non-synchronized communication. These problems may be addressed by the use of Network on Chip (NOC) architecture for future System-on-Chip (SoC). We have modeled a concurrent architecture for a customizable and scalable NOC in a system level modeling environment using MLDesigner (from MLD Inc.). Varying network loads under various traffic scenarios were applied to obtain realistic performance metrics. We provide the simulation results for latency as a function of the buffer size. We have abstracted the area results for NOC components from its FPGA implementation. Modeled NOC architecture supports three different levels of quality-of-service (QoS).

Low-power heterogeneous uncore architecture for future 3D chip-multiprocessors

  • Dorostkar, Aniseh;Asad, Arghavan;Fathy, Mahmood;Jahed-Motlagh, Mohammad Reza;Mohammadi, Farah
    • ETRI Journal
    • /
    • v.40 no.6
    • /
    • pp.759-773
    • /
    • 2018
  • Uncore components such as on-chip memory systems and on-chip interconnects consume a large amount of energy in emerging embedded applications. Few studies have focused on next-generation analytical models for future chip-multiprocessors (CMPs) that simultaneously consider the impacts of the power consumption of core and uncore components. In this paper, we propose a convex-optimization approach to design heterogeneous uncore architectures for embedded CMPs. Our convex approach optimizes the number and placement of memory banks with different technologies on the memory layer. In parallel with hybrid memory architecting, optimizing the number and placement of through silicon vias as a viable solution in building three-dimensional (3D) CMPs is another important target of the proposed approach. Experimental results show that the proposed method outperforms 3D CMP designs with hybrid and traditional memory architectures in terms of both energy delay products (EDPs) and performance parameters. The proposed method improves the EDPs by an average of about 43% compared with SRAM design. In addition, it improves the throughput by about 7% compared with dynamic RAM (DRAM) design.

Exploiting Thread-Level Parallelism in Lockstep Execution by Partially Duplicating a Single Pipeline

  • Oh, Jaeg-Eun;Hwang, Seok-Joong;Nguyen, Huong Giang;Kim, A-Reum;Kim, Seon-Wook;Kim, Chul-Woo;Kim, Jong-Kook
    • ETRI Journal
    • /
    • v.30 no.4
    • /
    • pp.576-586
    • /
    • 2008
  • In most parallel loops of embedded applications, every iteration executes the exact same sequence of instructions while manipulating different data. This fact motivates a new compiler-hardware orchestrated execution framework in which all parallel threads share one fetch unit and one decode unit but have their own execution, memory, and write-back units. This resource sharing enables parallel threads to execute in lockstep with minimal hardware extension and compiler support. Our proposed architecture, called multithreaded lockstep execution processor (MLEP), is a compromise between the single-instruction multiple-data (SIMD) and symmetric multithreading/chip multiprocessor (SMT/CMP) solutions. The proposed approach is more favorable than a typical SIMD execution in terms of degree of parallelism, range of applicability, and code generation, and can save more power and chip area than the SMT/CMP approach without significant performance degradation. For the architecture verification, we extend a commercial 32-bit embedded core AE32000C and synthesize it on Xilinx FPGA. Compared to the original architecture, our approach is 13.5% faster with a 2-way MLEP and 33.7% faster with a 4-way MLEP in EEMBC benchmarks which are automatically parallelized by the Intel compiler.

  • PDF

The Design of Hardware MPI Units for MPSoC (MPSoC를 위한 저비용 하드웨어 MPI 유닛 설계)

  • Jeong, Ha-Young;Chung, Won-Young;Lee, Yong-Surk
    • The Journal of Korean Institute of Communications and Information Sciences
    • /
    • v.36 no.1B
    • /
    • pp.86-92
    • /
    • 2011
  • In this paper, we propose a novel hardware MPI(Message Passing Interface) unit which supports message passing in multiprocessor system which use distributed memory architecture. MPI Hardware unit processes data synchronization, transmission and completion, and it supports processor non-blocking operation so it reduces overhead according to synchronization. Additionally, MPI hardware unit combines ready entry, request entry, reserve entry which save and manage the synchronized messages and performs the multiple outstanding issue and out of order completion. According to BFM(Bus Functional Model) simulation result, the performance is increased by 25% on many to many communication. After we designed MPI unit using HDL, with synopsys design compiler we synthesized, and for synthesis library we used MagnaChip $0.18{\mu}m$. And then we making prototype chip. The proposed message transmission interface hardware shows high performance for its increase in size. Thus, as we consider low-cost design and scalability, MPI hardware unit is useful in increasing overall performance of embedded MPSoC(Multi-Processor System-on-Chip).

Design and Implementation of an InfiniBand System Interconnect for High-Performance Cluster Systems (고성능 클러스터 시스템을 위한 인피니밴드 시스템 연결망의 설계 및 구현)

  • Mo, Sang-Man;Park, Kyung;Kim, Sung-Nam;Kim, Myung-Jun;Im, Ki-Wook
    • The KIPS Transactions:PartA
    • /
    • v.10A no.4
    • /
    • pp.389-396
    • /
    • 2003
  • InfiniBand technology is being accepted as the future system interconnect to serve as the high-end enterprise fabric for cluster computing. This paper presents the design and implementation of the InfiniBand system interconnect, focusing on an InfiniBand host channel adapter (HCA) based on dual ARM9 processor cores The HCA is an SoC tailed KinCA which connects a host node onto the InfiniBand network both in hardware and in software. Since the ARM9 processor core does not provide necessary features for multiprocessor configuration, novel inter-processor communication and interrupt mechanisms between the two processors were designed and embedded within the KinCA chip. Kinch was fabricated as a 564-pin enhanced BGA (Bail Grid Array) device using 0.18${\mu}{\textrm}{m}$ CMOS technology Mounted on host nodes, it provides 10 Gbps outbound and inbound channels for transmit and receive, respectively, resulting in a high-performance cluster system.

Parallel SystemC Cosimulation using Virtual Synchronization (가상 동기화 기법을 이용한 SystemC 통합시뮬레이션의 병렬 수행)

  • Yi, Young-Min;Kwon, Seong-Nam;Ha, Soon-Hoi
    • Journal of KIISE:Computer Systems and Theory
    • /
    • v.33 no.12
    • /
    • pp.867-879
    • /
    • 2006
  • This paper concerns fast and time accurate HW/SW cosimulation for MPSoC(Multi-Processor System-on-chip) architecture where multiple software and/or hardware components exist. It is becoming more and more common to use MPSoC architecture to design complex embedded systems. In cosimulation of such architecture, as the number of the component simulators participating in the cosimulation increases, the time synchronization overhead among simulators increases, thereby resulting in low overall cosimulation performance. Although SystemC cosimulation frameworks show high cosimulation performance, it is in inverse proportion to the number of simulators. In this paper, we extend the novel technique, called virtual synchronization, which boosts cosimulation speed by reducing time synchronization overhead: (1) SystemC simulation is supported seamlessly in the virtual synchronization framework without requiring the modification on SystemC kernel (2) Parallel execution of component simulators with virtual synchronization is supported. We compared the performance and accuracy of the proposed parallel SystemC cosimulation framework with MaxSim, a well-known commercial SystemC cosimulation framework, and the proposed one showed 11 times faster performance for H.263 decoder example, while the accuracy was maintained below 5%.