• Title/Summary/Keyword: On-Chip Memory

Search Result 296, Processing Time 0.026 seconds

Hardware Design of SURF-based Feature extraction and description for Object Tracking (객체 추적을 위한 SURF 기반 특이점 추출 및 서술자 생성의 하드웨어 설계)

  • Do, Yong-Sig;Jeong, Yong-Jin
    • Journal of the Institute of Electronics and Information Engineers
    • /
    • v.50 no.5
    • /
    • pp.83-93
    • /
    • 2013
  • Recently, the SURF algorithm, which is conjugated for object tracking system as part of many computer vision applications, is a well-known scale- and rotation-invariant feature detection algorithm. The SURF, due to its high computational complexity, there is essential to develop a hardware accelerator in order to be used on an IP in embedded environment. However, the SURF requires a huge local memory, causing many problems that increase the chip size and decrease the value of IP in ASIC and SoC system design. In this paper, we proposed a way to design a SURF algorithm in hardware with greatly reduced local memory by partitioning the algorithms into several Sub-IPs using external memory and a DMA. To justify validity of the proposed method, we developed an example of simplified object tracking algorithm. The execution speed of the hardware IP was about 31 frame/sec, the logic size was about 74Kgate in the 30nm technology with 81Kbytes local memory in the embedded system platform consisting of ARM Cortex-M0 processor, AMBA bus(AHB-lite and APB), DMA and a SDRAM controller. Hence, it can be used to the hardware IP of SoC Chip. If the image processing algorithm akin to SURF is applied to the method proposed in this paper, it is expected that it can implement an efficient hardware design for target application.

Design, Implementation, and Performance Evaluation of File System on a Chip (파일시스템을 내장한 저장장치의 설계, 구현 및 성능분석)

  • Ahn Seongiun;Choi Jongmoo;Lee Donghee;Noh Sam H.;Min Sang Lyul;Cho Yookun
    • Journal of KIISE:Computing Practices and Letters
    • /
    • v.10 no.6
    • /
    • pp.448-459
    • /
    • 2004
  • Interoperability is an important requirement of portable storage devices that are used to exchange and share data among diverse hosts. However, the required interoperability cannot be provided if different host systems use different file systems. To address this problem, we propose a new type of storage device called FSOC(File System On a Chip) that contains the file system within the storage device. In this paper, we give an example of the design and implementation of a flash memory-based FSOC and propose the performance models of the conventional storage device and the FSOC. We also analyze the performance characteristics of the conventional storage device and the FSOC based on the proposed performance models, and provide several experimental results using real applications that validate the performance models.

Design of Crossbar Switch On-chip Bus for Performance Improvement of SoC (SoC의 성능 향상을 위한 크로스바 스위치 온칩 버스 설계)

  • Heo, Jung-Burn;Ryoo, Kwang-Ki
    • Journal of the Korea Institute of Information and Communication Engineering
    • /
    • v.14 no.3
    • /
    • pp.684-690
    • /
    • 2010
  • Most of the existing SoCs have shared bus architecture which always has a bottleneck state. The more IPs are in an SOC, the less performance it is of the SOC, Therefore, its performance is effected by the entire communication rather than CPU speed. In this paper, we propose cross-bar switch bus architecture for the reduction of the bottleneck state and the improvement of the performance. The cross-bar switch bus supports up to 8 masters and 16 slaves and parallel communication with architecture of multiple channel bus. Each slave has an arbiter which stores priority information about masters. So, it prevents only one master occupying one slave and supports efficient communication. We compared WISHBONE on-chip shared bus architecture with crossbar switch bus architecture of the SOC platform, which consists of an OpenRISC processor, a VGA/LCD controller, an AC97 controller, a debug interface, a memory interface, and the performance improved by 26.58% than the previous shared bus.

A VLSI Design and Implementation of a Single-Chip Encoder/Decoder with Dictionary Search Processor(DISP) using LZSS Algorithm and Entropy Coding (LZSS 알고리즘과 엔트로피 부호를 이용한 사전탐색처리장치를 갖는 부호기/복호기 단일-칩의 VLSI 설계 및 구현)

  • Kim, Jong-Seop;Jo, Sang-Bok
    • Journal of the Institute of Electronics Engineers of Korea SD
    • /
    • v.38 no.2
    • /
    • pp.103-113
    • /
    • 2001
  • This paper described a design and implementation of a single-chip encoder/decoder using the LZSS algorithm and entropy coding in 0.6${\mu}{\textrm}{m}$ CMOS technology. Dictionary storage for the dictionary search processor(DISP) used a 2K$\times$8bit on-chip memory with 50MHz clock speed. It performs compression on byte-oriented input data at a data rate of one byte per clock cycle except when one out of every 33 cycles is used to update the string window of dictionary. In result, the average compression ratio is 46% by applied entropy coding of the LZSS codeword output. This is to improved on the compression performance of 7% much more then LZSS.

  • PDF

Design and Performance Analysis of High Performance Processor-Memory Integrated Architectures (고성능 프로세서-메모리 혼합 구조의 설계 및 성능 분석)

  • Kim, Young-Sik;Kim, Shin-Dug;Han, Tack-Don
    • The Transactions of the Korea Information Processing Society
    • /
    • v.5 no.10
    • /
    • pp.2686-2703
    • /
    • 1998
  • The widening pClformnnce gap between processor and memory causes an emergence of the promising architecture, processor-memory (PM) integration In this paper, various design issues for P-M integration are studied, First, an analytical model of the DRAM access time is constructed considering both the bank conflict ratio and the DRAM page hit ratio. Then the points of both the performance improvement and the perfonnance bottle neck are found by the proposed model as designing on-chip DRAM architectures. This paper proposes the new architecture, called the delayed precharge bank architecture, to improve the perfonnance of memory system as increasing the DRAM page hit ratio. This paper also adapts an efficient bank interleaving mechanism to the proposed architecture. This architecture is verified !II he better than the hierarchical multi-bank architecture as well as the conventional bank architecture by executiun driven simulation. Eight SPEC95 benchmarks are used for simulation as changing parameters for the cache architecture, the number of DRAM banks, and the delayed time quantum.

  • PDF

Low-resistance W Bit-line Implementation with RTP Anneal & Additional ion Implantation (RTP 어닐과 추가 이온주입에 의한 저-저항 텅스텐 비트-선 구현)

  • Lee, Yong-Hui;Lee, Cheon-Hui
    • Journal of the Institute of Electronics Engineers of Korea SD
    • /
    • v.38 no.5
    • /
    • pp.375-381
    • /
    • 2001
  • As the device geometry continuously shrink down less than sub-quarter micrometer, DRAM makers are going to replace conventional tungsten-polycide bit-line with tungsten bit-line structure in order to reduce the chip size and use it as a local interconnection. In this paper we showed low resistance tungsten bit-line fabrication process with various RTP(Rapid Thermal Process) temperature and additional ion implantation. As a result we obtained that major parameters impact on tungsten bit-line process are RTP Anneal temperature and BF$_2$ ion implantation dopant. These tungsten bit-line process are promising to fabricate high density chip technology.

  • PDF

Implementation of Image Enhancement Using DSP Chip (TI DAVINCI를 이용한 영상 개선 알고리즘 구현)

  • Park, Jong-Hwa;Ahn, Tae-Ki;Jo, Byung-Mok;Park, Goo-Man
    • The Journal of the Institute of Internet, Broadcasting and Communication
    • /
    • v.11 no.6
    • /
    • pp.311-317
    • /
    • 2011
  • In this paper, we proposed realtime image enhancing method on the three noise types of input images, such as haze, low contrast and back light images. Some conventional de-hazing algorithms have good performance but need large memories and high computational burdens. We proposed the efficient algorithm which not only removes the haze but also reduces memory usage and computational complexity. We implemented the realtime system by using DM6446 DSP chip, and it showed the excellent result in these three problems; haze, low contrast and back light. We implemented the system with the processing speed at 15 frames/sec.

Exploiting Thread-Level Parallelism in Lockstep Execution by Partially Duplicating a Single Pipeline

  • Oh, Jaeg-Eun;Hwang, Seok-Joong;Nguyen, Huong Giang;Kim, A-Reum;Kim, Seon-Wook;Kim, Chul-Woo;Kim, Jong-Kook
    • ETRI Journal
    • /
    • v.30 no.4
    • /
    • pp.576-586
    • /
    • 2008
  • In most parallel loops of embedded applications, every iteration executes the exact same sequence of instructions while manipulating different data. This fact motivates a new compiler-hardware orchestrated execution framework in which all parallel threads share one fetch unit and one decode unit but have their own execution, memory, and write-back units. This resource sharing enables parallel threads to execute in lockstep with minimal hardware extension and compiler support. Our proposed architecture, called multithreaded lockstep execution processor (MLEP), is a compromise between the single-instruction multiple-data (SIMD) and symmetric multithreading/chip multiprocessor (SMT/CMP) solutions. The proposed approach is more favorable than a typical SIMD execution in terms of degree of parallelism, range of applicability, and code generation, and can save more power and chip area than the SMT/CMP approach without significant performance degradation. For the architecture verification, we extend a commercial 32-bit embedded core AE32000C and synthesize it on Xilinx FPGA. Compared to the original architecture, our approach is 13.5% faster with a 2-way MLEP and 33.7% faster with a 4-way MLEP in EEMBC benchmarks which are automatically parallelized by the Intel compiler.

  • PDF

Paratic Impedance Extraction of FC-PGA Package Pin using the Static Fast Multipole Method (Static FMM을 이용한 FC-PGA 패키지 핀에서의 기생 임피던스 추출)

  • 천정남;이정태;어수지;김형동
    • The Journal of Korean Institute of Electromagnetic Engineering and Science
    • /
    • v.12 no.7
    • /
    • pp.1076-1085
    • /
    • 2001
  • In this paper, the FMM(Fast Multipole Method) combined with GMRES(Generalized Minimal RESidual Method) matrix solver is used to extract the parasitic impedance for complicated 3-D structures in uniform dielectric materials which limit the use of MoM(Method of Moment) due to its large computation time and memory requirement. This algorithm is a fast multipole-accelerated method based on quasistatic analysis and is very efficient for computing impedance between conductors. This paper proved the accuracy and efficiency of the FMM by comparing with MoM in simple examples. Finally the parasitic impedance of FC-PGA(Flip Chip Pin Grid Array) Package pins has been extracted by this algorithm and we have considered the possibility of the EMI/EMC problem caused by the signal interference.

  • PDF

AB9: A neural processor for inference acceleration

  • Cho, Yong Cheol Peter;Chung, Jaehoon;Yang, Jeongmin;Lyuh, Chun-Gi;Kim, HyunMi;Kim, Chan;Ham, Je-seok;Choi, Minseok;Shin, Kyoungseon;Han, Jinho;Kwon, Youngsu
    • ETRI Journal
    • /
    • v.42 no.4
    • /
    • pp.491-504
    • /
    • 2020
  • We present AB9, a neural processor for inference acceleration. AB9 consists of a systolic tensor core (STC) neural network accelerator designed to accelerate artificial intelligence applications by exploiting the data reuse and parallelism characteristics inherent in neural networks while providing fast access to large on-chip memory. Complementing the hardware is an intuitive and user-friendly development environment that includes a simulator and an implementation flow that provides a high degree of programmability with a short development time. Along with a 40-TFLOP STC that includes 32k arithmetic units and over 36 MB of on-chip SRAM, our baseline implementation of AB9 consists of a 1-GHz quad-core setup with other various industry-standard peripheral intellectual properties. The acceleration performance and power efficiency were evaluated using YOLOv2, and the results show that AB9 has superior performance and power efficiency to that of a general-purpose graphics processing unit implementation. AB9 has been taped out in the TSMC 28-nm process with a chip size of 17 × 23 ㎟. Delivery is expected later this year.