• Title/Summary/Keyword: memory based instruction

A Low Power Design of H.264 Codec Based on Hardware and Software Co-design

  • Park, Seong-Mo; Lee, Suk-Ho; Shin, Kyoung-Seon; Lee, Jae-Jin; Chung, Moo-Kyoung; Lee, Jun-Young; Eum, Nak-Woong
    • Information and Communications Magazine / v.25 no.12 / pp.10-18 / 2008
  • In this paper, we present a low-power design of an H.264 codec based on a dedicated hardware and software solution on the EMP (ETRI Multi-core Platform). The dedicated hardware scheme reduces computation through motion estimation skip and reduces memory access for motion estimation, cutting the data transfer load to 66% of that of the conventional method. The H.264 encoder has a gate count of about 455k and achieves 30 fps at 43 MHz for D1 (720x480) video. The software solution is an ASIP (Application-Specific Instruction-set Processor): a dual-issue VLIW (Very Long Instruction Word) core with SIMD (Single Instruction Multiple Data) support, a register file specialized for SIMD, internal memory and a memory controller for data memory access, a 6-stage pipeline, and a 32-bit bus width. For the H.264 decoder, it achieves 30 fps at 400 MHz for CIF (Common Intermediate Format) video with a gate count of about 100k per core.
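
The abstract does not detail how the motion estimation skip works, but a common form of the idea is sketched below: if the co-located block in the reference frame already matches the current block closely, the full motion search is skipped. The SAD threshold and block layout are illustrative assumptions, not the paper's hardware parameters.

```python
# Hedged sketch of a motion-estimation early-skip test; threshold and the
# 16x16 flattened block format are assumptions for illustration.

def sad(cur, ref):
    """Sum of absolute differences over two equally sized pixel blocks."""
    return sum(abs(a - b) for a, b in zip(cur, ref))

def skip_motion_search(cur_block, colocated_block, threshold=512):
    """Skip the costly full search when the zero-motion match is good enough."""
    return sad(cur_block, colocated_block) < threshold

# A nearly static 16x16 block skips the search: SAD = 256 < 512.
print(skip_motion_search([100] * 256, [101] * 256))  # True
```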

Latency Hiding based Warp Scheduling Policy for High Performance GPUs

  • Kim, Gwang Bok; Kim, Jong Myon; Kim, Cheol Hong
    • Journal of the Korea Society of Computer and Information / v.24 no.4 / pp.1-9 / 2019
  • The LRR (Loose Round Robin) warp scheduling policy for GPU architectures yields high warp-level parallelism and balanced loads across multiple warps. However, the traditional LRR policy makes multiple warps execute long-latency operations at the same time; when no more warps can be issued during a long-latency operation, GPU throughput may degrade significantly. In this paper, we propose a new warp scheduling policy that exploits latency hiding, leading to better-utilized memory resources in high-performance GPUs. The proposed warp scheduler prioritizes memory instructions based on the GTO (Greedy Then Oldest) policy in order to reduce memory stalls. When no warp can issue a memory instruction, the scheduler selects a warp for a computation instruction in round-robin manner. Furthermore, the proposed technique achieves high performance by using additional information about recently committed warps. According to our experimental results, the proposed technique improves GPU performance by 12.7% and 5.6% on average over LRR and GTO, respectively.
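
A minimal sketch of the selection logic described, under stated assumptions: memory instructions are prioritized GTO-style (stay with the current warp, else pick the oldest), and when no warp has a memory instruction ready, a compute warp is chosen round-robin. The Warp fields and class names are illustrative; the paper's recently-committed-warp heuristic is omitted.

```python
from dataclasses import dataclass

@dataclass
class Warp:
    wid: int               # warp id; a lower id means an older warp
    next_is_memory: bool   # next instruction is a load/store
    ready: bool            # warp can issue this cycle

class LatencyHidingScheduler:
    def __init__(self, warps):
        self.warps = warps
        self.greedy = None   # warp currently held greedily for memory issue
        self.rr_ptr = 0      # round-robin pointer for compute instructions

    def pick(self):
        # 1) GTO over warps whose next instruction is a memory operation.
        mem_ready = [w for w in self.warps if w.ready and w.next_is_memory]
        if mem_ready:
            if self.greedy not in mem_ready:
                self.greedy = min(mem_ready, key=lambda w: w.wid)  # oldest
            return self.greedy
        # 2) No memory instruction ready: pick a compute warp round-robin.
        n = len(self.warps)
        for i in range(n):
            w = self.warps[(self.rr_ptr + i) % n]
            if w.ready:
                self.rr_ptr = (self.rr_ptr + i + 1) % n
                return w
        return None   # nothing can issue this cycle
```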

Designing a Library Collaborative Instruction Using the Archives of the National Debt Redemption Movement: Focusing on the Korean History Subject in High School

  • Miae Song; Ji-won Lee
    • Journal of Korean Society of Archives and Records Management / v.23 no.4 / pp.47-71 / 2023
  • Korea holds the largest number of Memory of the World inscriptions in the Asia-Pacific region; however, active use of these archives by future generations has been lacking. This study therefore sought to raise interest in archives and to use them directly in school lessons. The archives of the Korean National Debt Redemption Movement, inscribed as a Memory of the World, were used in the lessons, and a library collaborative instruction design was developed around them. The library collaborative instruction is a collaboration between the high school Korean history subject and the library's information literacy instruction, designed as three sessions. Through a literature-research-based design process, a library collaborative instruction plan, a teaching and learning plan, and activity sheets were derived. Implementing this design is expected to stimulate interest in the Memory of the World, link curriculum and archives at school, and significantly expand the archives' user base to teachers and students.

Comparison of Traditional Workloads and Deep Learning Workloads in Memory Read and Write Operations

  • Jeongha Lee; Hyokyung Bahn
    • International Journal of Advanced Smart Convergence / v.12 no.4 / pp.164-170 / 2023
  • With recent advances in AI (artificial intelligence) and HPC (high-performance computing) technologies, deep learning has proliferated across various domains of the 4th industrial revolution. As the workload volume of deep learning grows, analyzing its memory reference characteristics becomes important. In this article, we analyze the memory reference traces of deep learning workloads in comparison with traditional workloads, focusing on read and write operations. Based on our analysis, we observe some characteristics of deep learning memory references that are quite different from traditional workloads. First, comparing instruction and data references, instruction references account for only a small portion of references in deep learning workloads. Second, comparing reads and writes, write references account for the majority of memory references, which also differs from traditional workloads. Third, although write references are dominant, they exhibit low reference skewness compared to traditional workloads; specifically, the skew factor of write references is small. We expect this analysis to be helpful in efficiently designing memory management systems for deep learning workloads.
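
As an illustration of the kind of trace analysis described, the sketch below computes the write share and a simple skew factor (the fraction of writes hitting the top 10% most-written addresses) from a list of (op, address) references. The trace format and this particular skew definition are assumptions, not the paper's methodology.

```python
from collections import Counter

def analyze(trace):
    """trace: list of ("R" | "W", address) memory references."""
    writes = [addr for op, addr in trace if op == "W"]
    write_share = len(writes) / len(trace)

    counts = Counter(writes)
    top = max(1, len(counts) // 10)              # hottest 10% of addresses
    hot_refs = sum(c for _, c in counts.most_common(top))
    skew = hot_refs / len(writes)                # share of writes to hot set
    return write_share, skew

trace = [("W", 0x10), ("W", 0x10), ("R", 0x20), ("W", 0x30)]
print(analyze(trace))   # (0.75, 0.666...): write-heavy but mildly skewed
```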

Design of a DI model-based Content Addressable Memory for Asynchronous Cache

  • Battogtokh, Jigjidsuren; Cho, Kyoung-Rok
    • International Journal of Contents / v.5 no.2 / pp.53-58 / 2009
  • This paper presents a novel approach to the design of a CAM for an asynchronous cache. The cache architecture consists mainly of four units: control logic, content addressable memory, completion signal logic, and instruction memory. Pseudo-DCVSL is used to generate a completion signal that serves as the reference for handshake control. The proposed CAM is a simple extension of the basic circuitry that generates a completion signal based on the DI (delay-insensitive) model. The cache has a 2.75 KB CAM for an 8 KB instruction memory. We designed and simulated the proposed asynchronous cache including the CAM. The results show a cache hit ratio of up to 95% under a pseudo-LRU replacement policy.
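
Functionally, a CAM used for cache tag matching compares a search key against every stored tag in parallel and returns the matching line. The sketch below models that behavior only (sequentially, in software); the DI-model completion signaling and pseudo-DCVSL circuitry have no software equivalent here.

```python
class CAM:
    """Behavioral model of a content addressable memory for tag lookup."""

    def __init__(self, size):
        self.tags = [None] * size

    def write(self, index, tag):
        self.tags[index] = tag

    def search(self, key):
        """Return the index of the matching line, or None on a miss."""
        for i, tag in enumerate(self.tags):   # hardware does this in parallel
            if tag == key:
                return i
        return None

cam = CAM(4)
cam.write(2, 0x1A2B)
assert cam.search(0x1A2B) == 2     # hit
assert cam.search(0xFFFF) is None  # miss
```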

Low-Power IoT Microcontroller Code Memory Interface using Binary Code Inversion Technique Based on Hot-Spot Access Region Detection

  • Park, Daejin
    • IEMEK Journal of Embedded Systems and Applications / v.11 no.2 / pp.97-105 / 2016
  • Microcontrollers (MCUs) for endpoint smart sensor devices in the Internet of Things (IoT) are implemented as systems-on-chip (SoC) with on-chip instruction flash memory in which user firmware is embedded. The MCU fetches binary-coded instructions directly through the bit-line sense amplifier (S/A) integrated with the on-chip flash memory. The S/A compares the bit-cell current with a reference current to identify which data is programmed. When reading a '0' (erased) cell, the S/A consumes a large sink current, greater than the off-current for a '1' (programmed) cell. The main motivation of our approach is to reduce the number of accesses to erased cells through binary-code-level transformation. This paper proposes a built-in write/read path architecture using a binary code inversion method based on hot-spot region detection of instruction code accesses to reduce the sensing current in the S/A. Based on the profiled instruction access patterns, hot-spot regions of the original compiled binary code are conditionally inverted with the proposed bit-inversion technique. The de-inversion hardware consumes only a small logic current instead of the analog sink current in the S/A, and it is integrated with the conventional S/A to restore the original binary instructions. The proposed technique is applied to a fully custom-designed MCU with an ARM Cortex-M0 core in the 0.18 um Magnachip flash-embedded CMOS process, and the power-consumption reduction is evaluated on the Dhrystone benchmark. The instruction-execution profiling environment is implemented by extending the commercial ARM KEIL MDK (MCU Development Kit) with our custom-designed access analyzer.
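
A software sketch of the inversion idea, under stated assumptions: because reading an erased '0' cell costs more sink current than a programmed '1', words in profiled hot-spot regions are stored bit-inverted whenever that yields more '1' bits, with a per-word flag for the de-inversion logic. The 32-bit word width, flag encoding, and threshold below are illustrative, not the paper's design.

```python
MASK = 0xFFFFFFFF

def popcount(word):
    return bin(word & MASK).count("1")

def encode_region(words, is_hot):
    """Return (inverted?, stored_word) pairs; invert 0-heavy words in hot regions."""
    out = []
    for w in words:
        invert = is_hot and popcount(w) < 16   # fewer than half the bits set
        out.append((invert, (~w & MASK) if invert else w))
    return out

def decode_word(inverted, stored):
    """De-inversion read path: restore the original instruction word."""
    return (~stored & MASK) if inverted else stored

coded = encode_region([0x00000003, 0xFFFF0000], is_hot=True)
assert [decode_word(f, s) for f, s in coded] == [0x00000003, 0xFFFF0000]
```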

Scalar First Replacement Strategy for Reference Prediction Table Used in Prefetching Streaming Data (스트리밍 데이터의 선인출에 사용되는 참조예측표의 스칼라 우선 교체 전략)

  • Lim, Chul-hoo; Chon, Young-Suk; Kim, Suk-il; Jeon, Joong-nam
    • The KIPS Transactions: Part A / v.11A no.3 / pp.163-172 / 2004
  • Multimedia applications tend to access their data in streaming patterns with regular intervals. This characteristic can be exploited to prefetch multimedia data into cache memory and thus reduce execution time. The reference prediction prefetch algorithm predicts the memory address likely to be used next based on the history of memory references stored in the reference prediction table. This paper proposes a strategy for managing the reference prediction table, which holds entries for both scalar and streaming data reference instructions. We observe that scalar reference instructions do not contribute to the prefetching algorithm. Therefore, when replacing an entry in the reference prediction table, the proposed algorithm preferentially evicts scalar reference instructions before stream reference instructions, as sketched below. This keeps stream reference instructions in the table longer than the FIFO replacement policy does, ultimately improving prefetching performance.
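
A minimal sketch of the scalar-first replacement strategy, with entry fields assumed for illustration: when the table is full, the oldest scalar entry is evicted first, and plain FIFO is used only when every entry is a stream reference.

```python
from collections import deque

class ReferencePredictionTable:
    def __init__(self, capacity):
        self.capacity = capacity
        self.entries = deque()   # FIFO order: leftmost entry is the oldest

    def insert(self, pc, is_stream):
        if len(self.entries) >= self.capacity:
            # Prefer evicting the oldest scalar entry.
            for e in self.entries:
                if not e["is_stream"]:
                    self.entries.remove(e)
                    break
            else:
                self.entries.popleft()   # all entries are streams: plain FIFO
        self.entries.append({"pc": pc, "is_stream": is_stream})

rpt = ReferencePredictionTable(2)
rpt.insert(0x100, is_stream=True)
rpt.insert(0x104, is_stream=False)
rpt.insert(0x108, is_stream=True)   # evicts the scalar 0x104, not 0x100
```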

Design of an Asynchronous Instruction Cache based on a Mixed Delay Model

  • Jeon, Kwang-Bae; Kim, Seok-Man; Lee, Je-Hoon; Oh, Myeong-Hoon; Cho, Kyoung-Rok
    • The Journal of the Korea Contents Association / v.10 no.3 / pp.64-71 / 2010
  • Recently, to achieve high processor performance, the cache is physically split into two parts, one for instructions and one for data. This paper proposes an asynchronous instruction cache architecture based on a mixed delay model: a DI (delay-insensitive) model for cache hits and a bundled-delay model for cache misses. We synthesized the instruction cache at the gate level and constructed a test platform with a 32-bit embedded EISC processor to evaluate performance. The cache communicates with main memory and the CPU using a 4-phase handshake protocol. It has an 8 KB, 4-way set-associative memory that employs a pseudo-LRU replacement algorithm. As a result, the designed cache shows a 99% hit ratio and reduces latency to 68%, as tested on the platform with MiBench benchmark programs.
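
For reference, the tree-based pseudo-LRU policy mentioned can be modeled for one 4-way set with three bits, as in the standard textbook scheme below; this is a behavioral sketch, not the paper's gate-level design.

```python
class PseudoLRU4:
    """Tree pseudo-LRU for one 4-way set; each bit points at the LRU side."""

    def __init__(self):
        self.b = [0, 0, 0]   # b[0]: root, b[1]: ways 0/1, b[2]: ways 2/3

    def touch(self, way):
        if way < 2:
            self.b[0] = 1              # the right half is now the LRU side
            self.b[1] = 1 - way        # the sibling of `way` becomes LRU
        else:
            self.b[0] = 0
            self.b[2] = 1 - (way - 2)

    def victim(self):
        if self.b[0] == 0:
            return 0 if self.b[1] == 0 else 1
        return 2 if self.b[2] == 0 else 3

plru = PseudoLRU4()
plru.touch(0)
assert plru.victim() == 2   # root now points at the untouched right half
```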

On-Demand Remote Software Code Execution Unit Using On-Chip Flash Memory Cloudification for IoT Environment Acceleration

  • Lee, Dongkyu; Seok, Moon Gi; Park, Daejin
    • Journal of Information Processing Systems / v.17 no.1 / pp.191-202 / 2021
  • In an Internet of Things (IoT) system, each device executes on-chip software. Recent IoT devices require fast execution of complex services, such as analyzing large amounts of data, while maintaining low-power computation. As service complexity increases, services require high-performance computing and more memory space on the embedded device. However, the low performance of IoT edge devices and their small memory size can hinder the complex and diverse operations of IoT services. In this paper, we propose a remote on-demand software code execution unit that uses cloudification of on-chip code memory to accelerate program execution on an IoT edge device with a low-performance processor. We propose a simulation approach that distributes code between the server side and the edge side according to the program's computational and communication needs. Our on-demand remote code execution simulation platform, which includes an instruction set simulator based on the 16-bit ARM Thumb instruction set architecture, successfully emulates the architectural behavior of on-chip flash memory, enabling embedded devices to accelerate software execution using remote execution code in the IoT environment.
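
The distribution decision the simulation weighs can be illustrated as a simple cost comparison: run a code block on the slow edge core, or pay the communication cost and execute it remotely. All names and timing parameters below are assumed values for illustration.

```python
def edge_time(instructions, edge_ips=8e6):
    """Seconds to execute locally on a slow edge core."""
    return instructions / edge_ips

def remote_time(instructions, payload_bytes,
                server_ips=2e9, bandwidth_bps=1e6, rtt_s=0.05):
    """Seconds for round trip, payload transfer, and server-side execution."""
    return rtt_s + payload_bytes * 8 / bandwidth_bps + instructions / server_ips

def run_remotely(instructions, payload_bytes):
    return remote_time(instructions, payload_bytes) < edge_time(instructions)

# Compute-heavy block with a small payload: offloading wins (0.068 s vs 0.5 s).
print(run_remotely(4_000_000, payload_bytes=2_000))   # True
```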

A Program Code Compression Method with Very Fast Decoding for Mobile Devices (휴대장치를 위한 고속복원의 프로그램 코드 압축기법)

  • Kim, Yong-Kwan; Wee, Young-Cheul
    • Journal of KIISE: Software and Applications / v.37 no.11 / pp.851-858 / 2010
  • Most mobile devices use NAND flash memory as secondary storage. Firmware is stored compressed in the NAND flash of mobile devices to reduce its size and its loading time from NAND flash to main memory. For demand paging to work properly, the compressed code must be decompressed very quickly. This paper introduces a new dictionary-based compression algorithm designed for fast decompression. Unlike conventional LZ methods, the algorithm stores the exclusive-or (XOR) of the two instructions when the instruction being compressed is not identical to the referenced instruction. The paper also introduces a new compression format that minimizes bit operations in order to improve decompression speed. Experimental results show that decoding time is reduced by up to 5 times and the compression ratio improves by up to 4% compared to zlib. Moreover, the fast decoding of the proposed method speeds up boot time by 10-20% compared to booting uncompressed code.
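
The decoding step can be illustrated as follows: each token is either a literal instruction word or a back-reference paired with an XOR delta, so decompression costs one copy plus at most one XOR per word. The token format here is an assumption for illustration; the paper's bit-level format differs.

```python
def decode(tokens):
    """Decode ("LIT", word) and ("REF", distance, xor_delta) tokens."""
    out = []
    for tok in tokens:
        if tok[0] == "LIT":
            out.append(tok[1])
        else:
            _, dist, delta = tok
            out.append(out[-dist] ^ delta)   # near-match restored by one XOR
    return out

tokens = [("LIT", 0xE3A00001), ("REF", 1, 0x00000002)]
print([hex(w) for w in decode(tokens)])   # ['0xe3a00001', '0xe3a00003']
```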