• Title/Abstract/Keywords: memory based instruction

80 search results, processing time 0.029 s

A Low Power Design of H.264 Codec Based on Hardware and Software Co-design

  • Park, Seong-Mo;Lee, Suk-Ho;Shin, Kyoung-Seon;Lee, Jae-Jin;Chung, Moo-Kyoung;Lee, Jun-Young;Eum, Nak-Woong
    • 정보와 통신
    • /
    • Vol. 25, No. 12
    • /
    • pp.10-18
    • /
    • 2008
  • In this paper, we present a low-power design of an H.264 codec based on dedicated hardware and a software solution on the EMP (ETRI Multi-core Platform). The dedicated hardware scheme reduces computation through motion estimation skip and reduces memory access during motion estimation. The design reduces the data transfer load to 66% compared with the conventional method. The H.264 encoder has a gate count of about 455k and runs at 43 MHz at 30 fps for D1 (720x480) video. The software solution uses an ASIP (Application Specific Instruction Processor) with a SIMD (Single Instruction Multiple Data), dual-issue VLIW (Very Long Instruction Word) core, a dedicated register file for SIMD, internal memory and data memory access through a memory controller, a 6-stage pipeline, and a 32-bit bus width. The H.264 decoder runs at 400 MHz at 30 fps for CIF (Common Intermediate Format) video, with a gate count of about 100k per core.

Latency Hiding based Warp Scheduling Policy for High Performance GPUs

  • Kim, Gwang Bok;Kim, Jong Myon;Kim, Cheol Hong
    • 한국컴퓨터정보학회논문지
    • /
    • Vol. 24, No. 4
    • /
    • pp.1-9
    • /
    • 2019
  • The LRR (Loose Round Robin) warp scheduling policy for GPU architectures yields high warp-level parallelism and balanced loads across multiple warps. However, the traditional LRR policy makes multiple warps execute long-latency operations at the same time. When no more warps can be issued during long-latency operations, GPU throughput may degrade significantly. In this paper, we propose a new warp scheduling policy that exploits latency hiding, leading to better utilization of memory resources in high-performance GPUs. The proposed warp scheduler prioritizes memory instructions based on the GTO (Greedy Then Oldest) policy in order to reduce memory stalls. When no warp can execute a memory instruction any more, the warp scheduler selects a warp for a computation instruction in a round-robin manner. Furthermore, the proposed technique achieves high performance by using additional information about recently committed warps. According to our experimental results, the proposed technique improves GPU performance by 12.7% and 5.6% on average over LRR and GTO, respectively.
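The selection order described in this abstract (memory instructions first under GTO, computation instructions by round robin otherwise) can be illustrated with a minimal sketch. This is not the authors' implementation: the `Warp` and `MemFirstScheduler` names, the use of a lower warp id as "older", and the omission of the recently-committed-warp information are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Warp:
    wid: int            # warp id; a lower id is treated as "older" (assumption)
    next_op: str        # "mem" or "alu": type of the warp's next instruction
    ready: bool = True  # False while the warp is stalled


class MemFirstScheduler:
    """Hypothetical scheduler: GTO among memory warps, round robin otherwise."""

    def __init__(self) -> None:
        self.last_mem: Optional[int] = None  # warp that issued the last memory op
        self.rr_next = 0                     # round-robin pointer for compute warps

    def pick(self, warps: List[Warp]) -> Optional[Warp]:
        mem = [w for w in warps if w.ready and w.next_op == "mem"]
        if mem:
            # Greedy: stay on the same warp while it still has a memory op.
            for w in mem:
                if w.wid == self.last_mem:
                    return w
            # Then oldest: otherwise take the oldest ready memory warp.
            chosen = min(mem, key=lambda w: w.wid)
            self.last_mem = chosen.wid
            return chosen
        alu = [w for w in warps if w.ready and w.next_op == "alu"]
        if not alu:
            return None
        # Round robin: first ready compute warp at or after the pointer.
        n = max(w.wid for w in warps) + 1
        chosen = min(alu, key=lambda w: (w.wid - self.rr_next) % n)
        self.rr_next = (chosen.wid + 1) % n
        return chosen
```

In this sketch, compute warps are only considered once no ready warp has a memory instruction pending, which is the part of the policy the abstract states explicitly.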

Designing a Library Collaborative Instruction Using the Archives of the National Debt Redemption Movement: Focusing on the Korean History Subject in High School

  • 송미애;이지원
    • 한국기록관리학회지
    • /
    • Vol. 23, No. 4
    • /
    • pp.47-71
    • /
    • 2023
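  • Although Korea is a records powerhouse that holds the most Memory of the World inscriptions in the Asia-Pacific region, it makes too little active use of records for future generations. This study therefore sought ways to raise interest in records and to use archival records directly in school classes. The records used in the subject class are the Archives of the National Debt Redemption Movement, which are inscribed as Memory of the World heritage, and a library collaborative instruction was designed on that basis. The library collaborative instruction combines the high school Korean History subject with library information literacy instruction and was designed as three class sessions. The design, which was based on a literature review, yielded a library collaborative instruction plan, teaching and learning plans, worksheets, and related materials. Implementing the designed instruction is expected to promote interest in Memory of the World heritage and to link the curriculum with archival records in schools, and the study is meaningful in that it extends the user base of records to teachers and students.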

Comparison of Traditional Workloads and Deep Learning Workloads in Memory Read and Write Operations

  • Jeongha Lee;Hyokyung Bahn
    • International journal of advanced smart convergence
    • /
    • Vol. 12, No. 4
    • /
    • pp.164-170
    • /
    • 2023
  • With the recent advances in AI (artificial intelligence) and HPC (high-performance computing) technologies, deep learning has proliferated in various domains of the 4th industrial revolution. As the workload volume of deep learning grows, analyzing its memory reference characteristics becomes important. In this article, we analyze the memory reference traces of deep learning workloads in comparison with traditional workloads, with a special focus on read and write operations. Based on our analysis, we observe some unique characteristics of deep learning memory references that are quite different from those of traditional workloads. First, when comparing instruction and data references, instruction references account for only a small portion in deep learning workloads. Second, when comparing reads and writes, write references account for the majority of memory references, which also differs from traditional workloads. Third, although write references are dominant, they exhibit low reference skewness compared to traditional workloads. Specifically, the skew factor of write references is small compared to traditional workloads. We expect that the analysis performed in this article will be helpful in efficiently designing memory management systems for deep learning workloads.
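The three observations in this abstract correspond to simple statistics over a memory reference trace. The sketch below shows one way such statistics might be computed. The trace format (a `(kind, address)` pair with kind `"I"`, `"R"`, or `"W"`), the function names, and the top-10% share used as a skewness measure are assumptions for illustration; the abstract does not define the article's skew factor.

```python
from collections import Counter
from typing import Dict, Iterable, Tuple

# Assumed trace record: (kind, address), where kind is "I" (instruction
# fetch), "R" (data read), or "W" (data write).
Record = Tuple[str, int]


def top_share(counts: Counter, fraction: float = 0.1) -> float:
    """Share of references that go to the hottest `fraction` of addresses."""
    if not counts:
        return 0.0
    k = max(1, int(len(counts) * fraction))
    hot = sum(c for _, c in counts.most_common(k))
    return hot / sum(counts.values())


def summarize(trace: Iterable[Record]) -> Dict[str, float]:
    kinds: Counter = Counter()
    writes: Counter = Counter()
    for kind, addr in trace:
        kinds[kind] += 1
        if kind == "W":
            writes[addr] += 1
    total = sum(kinds.values())
    data = kinds["R"] + kinds["W"]
    return {
        # Fraction of all references that are instruction fetches.
        "instr_ratio": kinds["I"] / total if total else 0.0,
        # Fraction of data references that are writes.
        "write_ratio": kinds["W"] / data if data else 0.0,
        # Stand-in skewness measure: a value near 1 means writes are
        # concentrated on a few hot addresses.
        "write_top10_share": top_share(writes),
    }
```

A low `write_top10_share` together with a high `write_ratio` would correspond to the pattern the abstract reports for deep learning workloads.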

Design of a DI model-based Content Addressable Memory for Asynchronous Cache

  • Battogtokh, Jigjidsuren;Cho, Kyoung-Rok
    • International Journal of Contents
    • /
    • Vol. 5, No. 2
    • /
    • pp.53-58
    • /
    • 2009
  • This paper presents a novel approach to the design of a CAM (content addressable memory) for an asynchronous cache. The cache architecture consists mainly of four units: control logic, content addressable memory, completion signal logic units, and instruction memory. The pseudo-DCVSL is useful for generating a completion signal, which serves as a reference for handshake control. The proposed CAM is a very simple extension of the basic circuitry that generates a completion signal based on the DI (delay-insensitive) model. The cache has a 2.75KB CAM for an 8KB instruction memory. We designed and simulated the proposed asynchronous cache including the CAM. The results show that the cache hit ratio is up to 95% with a pseudo-LRU replacement policy.

Low-Power IoT Microcontroller Code Memory Interface using Binary Code Inversion Technique Based on Hot-Spot Access Region Detection

  • 박대진
    • 대한임베디드공학회논문지
    • /
    • Vol. 11, No. 2
    • /
    • pp.97-105
    • /
    • 2016
  • Microcontrollers (MCUs) for endpoint smart sensor devices in the Internet of Things (IoT) are implemented as systems-on-chip (SoCs) with on-chip instruction flash memory in which user firmware is embedded. MCUs fetch binary instructions directly through a bit-line sense amplifier (S/A) integrated with the on-chip flash memory. The S/A compares the bit cell current with a reference current to identify which data are programmed. When reading '0' (erased) cell data, the S/A consumes a large sink current, which is greater than the off-current for '1' (programmed) cell data. The main motivation of our approach is to reduce the number of accesses to erased cells through binary code level transformation. This paper proposes a built-in write/read path architecture that uses a binary code inversion method based on hot-spot region detection of instruction code accesses to reduce the sensing current in the S/A. Based on profiling results of instruction access patterns, the hot-spot region of the originally compiled binary code is conditionally inverted with the proposed bit-inversion techniques. The de-inversion hardware consumes only a small logic current instead of the analog sink current in the S/A, and it is integrated with the conventional S/A to restore the original binary instructions. The proposed techniques are applied to a fully custom-designed MCU with an ARM Cortex-M0™ using a 0.18um Magnachip Flash-embedded CMOS process, and the power consumption reduction is evaluated with the Dhrystone™ benchmark. The profiling environment for instruction code execution is implemented by extending the commercial ARM KEIL™ MDK (MCU Development Kit) with our custom-designed access analyzer.
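The idea of conditionally inverting hot-spot code so that fewer '0' cells are sensed can be sketched in software. The abstract does not give the actual inversion condition, region granularity, or flag storage, so everything below is an assumption for illustration: fixed-size regions, a per-region invert flag, a `hot_threshold` on profiled access counts, and inversion whenever access-weighted zero bits outnumber one bits.

```python
from typing import List, Tuple

WORD_BITS = 32
MASK = (1 << WORD_BITS) - 1


def zero_bits(word: int) -> int:
    """Number of '0' bits in a 32-bit word."""
    return WORD_BITS - bin(word & MASK).count("1")


def encode(words: List[int], hits: List[int], region: int = 4,
           hot_threshold: int = 100) -> Tuple[List[int], List[bool]]:
    """Return the stored image and one invert flag per region.

    hits[i] is the profiled fetch count of words[i] (hypothetical input).
    """
    stored = list(words)
    flags: List[bool] = []
    for start in range(0, len(words), region):
        chunk = range(start, min(start + region, len(words)))
        accesses = sum(hits[i] for i in chunk)
        # Access-weighted count of '0' and '1' bits that would be sensed.
        zeros = sum(zero_bits(words[i]) * hits[i] for i in chunk)
        ones = sum((WORD_BITS - zero_bits(words[i])) * hits[i] for i in chunk)
        invert = accesses >= hot_threshold and zeros > ones
        flags.append(invert)
        if invert:
            for i in chunk:
                stored[i] = ~words[i] & MASK
    return stored, flags


def fetch(stored: List[int], flags: List[bool], index: int,
          region: int = 4) -> int:
    """De-invert on read so the CPU sees the original instruction."""
    word = stored[index]
    return ~word & MASK if flags[index // region] else word
```

In this model `fetch` plays the role of the de-inversion logic: it restores the original instruction word regardless of whether the region was stored inverted.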

Scalar First Replacement Strategy for Reference Prediction Table Used in Prefetching Streaming Data

  • 임철후;전영숙;김석일;전중남
    • 정보처리학회논문지A
    • /
    • Vol. 11A, No. 3
    • /
    • pp.163-172
    • /
    • 2004
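  • Data in multimedia applications tend to be referenced in streaming patterns with a constant address stride. Applying this property to prefetching can speed up the execution of multimedia applications. Prefetching with a reference prediction table uses the history of memory reference instructions to predict memory addresses that are referenced regularly. This paper proposes an efficient way to manage the reference prediction table in hardware-based regular prefetching that uses such a table. The memory reference instructions entered into the reference prediction table consist of scalar-data reference instructions and streaming-data reference instructions. Because scalar-data reference instructions are not used for prefetching, replacing them first allows the reference prediction table to be used effectively. Compared with the conventional FIFO method, this method keeps the streaming-data reference instructions used for prefetching in the reference prediction table longer, which improves prefetching performance.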
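A minimal sketch of a reference prediction table with a scalar-first victim choice is given below. It is an illustration under assumptions, not the paper's design: the `Entry` fields, the rule that an entry counts as streaming only after the same non-zero stride is seen twice, and the FIFO fallback are all hypothetical simplifications.

```python
from collections import OrderedDict
from dataclasses import dataclass
from typing import Optional


@dataclass
class Entry:
    last_addr: int
    stride: int = 0
    confirmed: bool = False  # True once the same non-zero stride repeats


class ReferencePredictionTable:
    def __init__(self, capacity: int) -> None:
        self.capacity = capacity  # assumed to be at least 1
        # Keyed by the PC of the memory instruction; insertion order = FIFO order.
        self.entries: "OrderedDict[int, Entry]" = OrderedDict()

    def _victim(self) -> int:
        # Scalar first: evict the oldest entry without a confirmed stride.
        for pc, entry in self.entries.items():
            if not entry.confirmed:
                return pc
        # All entries are streaming: fall back to plain FIFO.
        return next(iter(self.entries))

    def access(self, pc: int, addr: int) -> Optional[int]:
        """Record one reference; return a prefetch address or None."""
        entry = self.entries.get(pc)
        if entry is None:
            if len(self.entries) >= self.capacity:
                del self.entries[self._victim()]
            self.entries[pc] = Entry(last_addr=addr)
            return None
        stride = addr - entry.last_addr
        entry.confirmed = stride != 0 and stride == entry.stride
        entry.stride, entry.last_addr = stride, addr
        return addr + stride if entry.confirmed else None
```

With a capacity of two, a confirmed streaming entry survives the arrival of a third instruction while a scalar entry is evicted, whereas plain FIFO would evict whichever entry was inserted first.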

Design of an Asynchronous Instruction Cache based on a Mixed Delay Model

  • 전광배;김석만;이제훈;오명훈;조경록
    • 한국콘텐츠학회논문지
    • /
    • Vol. 10, No. 3
    • /
    • pp.64-71
    • /
    • 2010
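  • As processor performance has increased, designs that separate the instruction cache from the data cache have become common. This paper proposes an asynchronous instruction cache architecture with a mixed delay model, applying a delay-insensitive circuit model to the data path and a bundled-delay model to the memory. As key elements, the instruction cache exchanges data with the CPU and program memory through a 4-phase handshake protocol, has an 8-KB, 4-way set-associative mapping structure, and adopts a Pseudo-LRU entry replacement algorithm. For performance analysis, the proposed instruction cache was synthesized at the gate level, and a platform was built in which it operates with a 32-bit embedded processor. Running the MI benchmark programs on this platform yielded a 99% cache hit rate and a 68% reduction in latency.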
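Pseudo-LRU replacement for a 4-way set can be sketched with the common tree-based scheme, which keeps three bits per set. The abstract does not say which pseudo-LRU variant the cache uses, so the tree form and the `TreePLRU4` class below are assumptions for illustration only.

```python
from typing import List


class TreePLRU4:
    """Tree pseudo-LRU state for one 4-way set (three bits)."""

    def __init__(self) -> None:
        # bits[0]: root; 0 -> victim is in the left pair (ways 0, 1),
        #                1 -> victim is in the right pair (ways 2, 3)
        # bits[1]: left pair; 0 -> way 0 is the victim, 1 -> way 1
        # bits[2]: right pair; 0 -> way 2 is the victim, 1 -> way 3
        self.bits: List[int] = [0, 0, 0]

    def touch(self, way: int) -> None:
        """Mark a way as just used by pointing the tree bits away from it."""
        if way < 2:
            self.bits[0] = 1          # next victim comes from the right pair
            self.bits[1] = 1 - way    # and, within the left pair, the other way
        else:
            self.bits[0] = 0          # next victim comes from the left pair
            self.bits[2] = 1 - (way - 2)

    def victim(self) -> int:
        """Follow the tree bits to the way that should be replaced."""
        if self.bits[0] == 0:
            return self.bits[1]
        return 2 + self.bits[2]
```

Three bits per set are cheaper than the full ordering that true LRU needs, at the cost of occasionally evicting a way that is not the least recently used.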

On-Demand Remote Software Code Execution Unit Using On-Chip Flash Memory Cloudification for IoT Environment Acceleration

  • Lee, Dongkyu;Seok, Moon Gi;Park, Daejin
    • Journal of Information Processing Systems
    • /
    • Vol. 17, No. 1
    • /
    • pp.191-202
    • /
    • 2021
  • In an Internet of Things (IoT)-configured system, each device executes on-chip software. Recent IoT devices require fast execution of complex services, such as analyzing a large amount of data, while maintaining low-power computation. As service complexity increases, the service requires high-performance computing and more embedded memory space. However, the low performance of IoT edge devices and their small memory size can hinder the complex and diverse operations of IoT services. In this paper, we propose a remote on-demand software code execution unit that uses the cloudification of on-chip code memory to accelerate program execution on an IoT edge device with a low-performance processor. We propose a simulation approach that distributes remote code execution between the server side and the edge side according to the program's computation and communication needs. Our on-demand remote code execution unit simulation platform, which includes an instruction set simulator based on the 16-bit ARM Thumb instruction set architecture, successfully emulates the architectural behavior of on-chip flash memory, enabling embedded devices to accelerate and execute software using remote execution code in the IoT environment.

An Implementation of a Memory Operation System Architecture for Memory Latency Penalty Reduction in SIMT Based Stream Processor

  • 이광엽
    • 전기전자학회논문지
    • /
    • Vol. 18, No. 3
    • /
    • pp.392-397
    • /
    • 2014
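  • This paper proposes a memory operation system architecture for a SIMT-architecture-based stream processor that reduces the memory latency penalty. The proposed architecture applies a non-blocking cache architecture to reduce the cache miss penalty incurred by a conventional blocking cache architecture. The performance gain of a stream processor using the proposed memory operation system architecture was verified by comparing processing speeds across various algorithms. The experiments measured the gain according to each algorithm's ratio of memory instructions and confirmed that stream processor performance improved by at least 8.2% and by up to 46.5%.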