• Title/Summary/Keyword: Reorder Buffer

The Performance Analysis of Distributed Reorder Buffer Superscalar Processor using Queuing Model (큐잉 모델을 이용한 분산된 리오더 버퍼 수퍼스칼라 프로세서의 성능분석)

  • Baek, Seock-Kyun;Jung, Jin-Ha;Shin, Kwang-Sik;Choi, Sang-Bang
    • Proceedings of the IEEK Conference / 2005.11a / pp.1087-1090 / 2005
  • In all contemporary superscalar processors, the result repositories are implemented as Reorder Buffer (ROB) slots. In such designs, the ROB is a large multi-ported structure. There are several approaches to reducing ROB complexity in processors. One technique relies on a distributed implementation that spreads the centralized ROB structure across the functional units (FUs). Each distributed component is sized to match its FU workload and has one write port and one read port. We use M/M/1 queuing theory to determine the number of entries in each ROB component, on which processor performance depends. Our schemes are evaluated by simulation with the CPU2000 benchmarks.

The Performance Analysis of Distributed Reorder Buffer in Superscalar Processor using Analytical Model (해석적 모델을 이용한 분산된 리오더 버퍼 슈퍼스칼라 프로세서의 성능분석)

  • Yoon, Wan-Oh;Shin, Kwang-Sik;Kim, Kyeong-Seob;Lee, Yun-Sub;Choi, Sang-Bang
    • Journal of the Institute of Electronics Engineers of Korea SD / v.45 no.12 / pp.73-82 / 2008
  • There are several approaches to reducing the complexity of the Reorder Buffer (ROB) in processors. One technique, which yields the simplest ROB ports, relies on a distributed implementation that spreads the centralized ROB structure across the functional units (FUs). The size of each distributed buffer is determined by the workload of its functional unit, and the performance of the processor depends on the sizes of the distributed ROBs. However, most previous work has relied on simulation results to decide the optimum size of the distributed ROB. In this paper, we use an analytical model based on M/M/1 queuing theory to determine the optimum size of each distributed ROB. Our schemes are evaluated by simulation with the CPU2000 benchmarks. We are able to choose an optimum distributed ROB size that achieves 99.2% of the performance of existing superscalar processors. When the ROB and the proposed distributed ROB are designed in HDL, the distributed design saves 82% of the hardware resources in ports and reduces delay by more than 30%.
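
As a rough illustration of how an M/M/1 model could size one distributed ROB component: for arrival rate λ and service rate μ with utilization ρ = λ/μ < 1, the steady-state probability that more than n entries are occupied is ρ^(n+1). The sketch below picks the smallest n whose overflow probability falls under a target. The function name, the rates, and the threshold are illustrative assumptions; this is not the exact analytical model or any value from the paper.

```python
def mm1_entries(arrival_rate, service_rate, overflow_prob):
    """Smallest n with P(N > n) = rho**(n + 1) <= overflow_prob in an M/M/1 queue.

    All three parameters are hypothetical inputs chosen for illustration.
    """
    if not 0 < overflow_prob < 1:
        raise ValueError("overflow_prob must be in (0, 1)")
    rho = arrival_rate / service_rate  # utilization of this functional unit
    if not 0 < rho < 1:
        raise ValueError("requires 0 < arrival_rate < service_rate")
    n = 0
    # Grow the buffer until the tail probability is small enough.
    while rho ** (n + 1) > overflow_prob:
        n += 1
    return n


if __name__ == "__main__":
    # A lightly loaded unit needs fewer entries than a heavily loaded one.
    print(mm1_entries(0.5, 1.0, 0.01))
    print(mm1_entries(0.8, 1.0, 0.05))
```

Under this simplification, a functional unit with higher utilization receives a larger buffer, which mirrors the idea of sizing each component by its workload.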

Design of Hardwired Variable Length Decoder for H.264/AVC (하드웨어 구조의 H.264/AVC 가변길이 복호기 설계)

  • Yu, Yong-Hoon;Lee, Chan-Ho
    • Journal of the Institute of Electronics Engineers of Korea SD / v.45 no.11 / pp.71-76 / 2008
  • H.264 (or MPEG-4 Part 10/AVC) is a widely used high-performance video coding standard. The variable length code (VLC) of the H.264 standard compresses data using the statistical distribution of values. A decoder parses the compressed bit stream and looks up decoded values in lookup tables, a process that is not easy to implement in hardware. We propose an architecture for a variable length decoder (VLD) for the H.264 baseline profile (BP) L4. The CAVLD decodes syntax elements using a combination of arithmetic units and lookup tables to optimize the hardware architecture. A barrel shifter and a first-1 detector parse the NAL bit stream and are shared by the Exp-Golomb decoder and the CAVLD. A FIFO memory between the CAVLD and the reorder unit, together with a buffer at the output of the reorder unit, eliminates the data stream bottleneck. The proposed VLD is designed in Verilog-HDL and implemented on an FPGA. Synthesis with a 0.18 µm standard CMOS technology shows a gate count of 22,604, and the decoder can process HD ($1920{\times}1080$) video at 120 MHz.
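
To make the role of the first-1 detector concrete, here is a minimal software sketch of unsigned Exp-Golomb decoding: count the leading zeros up to the first 1, then read that many suffix bits. The bit stream is modeled as a string of '0'/'1' characters and the function name is hypothetical; this is a software analogy under those assumptions, not the paper's hardware design.

```python
def decode_ue(bits, pos=0):
    """Decode one unsigned Exp-Golomb code word starting at bits[pos].

    bits is a string of '0' and '1' characters (an illustrative stand-in
    for the bit stream). Returns (value, position_after_code_word).
    """
    first_one = bits.find("1", pos)  # software analogue of a first-1 detector
    if first_one == -1:
        raise ValueError("no terminating 1 bit found")
    leading_zeros = first_one - pos
    start = first_one + 1
    suffix = bits[start:start + leading_zeros]
    if len(suffix) < leading_zeros:
        raise ValueError("bit stream ends inside a code word")
    # value = 2**leading_zeros - 1 + suffix interpreted as binary
    value = (1 << leading_zeros) - 1 + (int(suffix, 2) if suffix else 0)
    return value, start + leading_zeros


if __name__ == "__main__":
    stream = "1" + "010" + "011" + "00100"
    pos = 0
    while pos < len(stream):
        value, pos = decode_ue(stream, pos)
        print(value)
```

In hardware, the shift by the consumed code length would be performed by the barrel shifter; here it is just the updated position.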

Research on Conditional Execution Out-of-order Instruction Issue Microprocessor Using Register Renaming Method (레지스터 리네이밍 방법을 사용하는 조건부 실행 비순차적 명령어 이슈 마이크로프로세서에 관한 연구)

  • 최규백;김문경;홍인표;이용석
    • The Journal of Korean Institute of Communications and Information Sciences / v.28 no.9A / pp.763-773 / 2003
  • In this paper, we present a register renaming method for conditional execution out-of-order instruction issue microprocessors. Register renaming removes false data dependencies (write after read (WAR) and write after write (WAW)). To implement a conditional execution out-of-order instruction issue microprocessor using register renaming, we use a register file that includes both in-order state physical registers and look-ahead state physical registers, which all logical registers share. We also design an in-order state indicator, a renaming state indicator, a physical register assigning indicator, a condition prediction buffer, and a reorder buffer. With this hardware, we can perform register renaming and trace the in-order state. The proposed method uses fewer hardware resources than conventional register renaming, eliminates associative lookup, and provides a short recovery time.
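
The following sketch shows how renaming in general removes WAR and WAW dependencies: every destination gets a fresh physical register, while sources read the current mapping. It is a generic map-table scheme written for illustration, not the indicator-based method of the paper, and the function name, the instruction format, and the register counts are assumptions. Physical registers are never freed here, which a real design would have to handle.

```python
def rename(instructions, num_logical, num_physical):
    """Rename (dest, src1, src2) logical register triples to physical ones.

    Hypothetical helper: logical register i starts out in physical register i,
    and the remaining physical registers form the free list.
    """
    mapping = list(range(num_logical))
    free = list(range(num_logical, num_physical))
    renamed = []
    for dest, src1, src2 in instructions:
        # Sources are looked up before the destination is remapped,
        # so true (RAW) dependencies are preserved.
        p1, p2 = mapping[src1], mapping[src2]
        if not free:
            raise RuntimeError("out of physical registers")
        pd = free.pop(0)  # a fresh destination removes WAR and WAW hazards
        mapping[dest] = pd
        renamed.append((pd, p1, p2))
    return renamed


if __name__ == "__main__":
    program = [
        (1, 2, 3),  # r1 = r2 op r3
        (2, 4, 5),  # r2 = r4 op r5  (WAR on r2 with the first instruction)
        (1, 6, 7),  # r1 = r6 op r7  (WAW on r1 with the first instruction)
        (3, 1, 2),  # r3 = r1 op r2  (RAW on the newest r1 and r2)
    ]
    for triple in rename(program, num_logical=8, num_physical=12):
        print(triple)
```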

Performance Improvement of Reorder Buffer in Out-of-order Issue Superscalar Processors (비순차이슈 수퍼스칼라 프로세서에서 리오더버퍼의 성능개선)

  • Jang, Mun-Seok;Lee, Jeong-U;Choe, Sang-Bang
    • Journal of KIISE:Computer Systems and Theory / v.28 no.1_2 / pp.90-102 / 2001
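  • The reorder buffer is used to complete instruction execution in order in a superscalar pipeline that issues instructions out of order. In this paper, we propose a reorder buffer architecture using shelter buffers, which not only efficiently eliminates the instruction stagnation that the reorder buffer can cause but also reduces the size of the reorder buffer. Simulation results show that when the reorder buffer has between 8 and 32 entries, using only one or two shelter buffers yields a clear performance improvement. Using four shelter buffers gave no notable improvement over using two, which shows that two shelter buffers are enough to eliminate most stagnation. With no loss in execution rate, using two shelter buffers reduced the reorder buffer size by 44% for the Whetstone benchmark, 50% for FFT, 60% for FM, and 75% for Linpack. With shelter buffers, execution time also improved by 19.78% for Whetstone, 19.67% for FFT, 23.93% for FM, and 8.65% for Linpack.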
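
The basic behavior described in the first sentence, out-of-order completion with in-order retirement, can be sketched as below. The class and method names are hypothetical, and the shelter buffer proposed in the paper is not modeled, since the abstract does not describe its mechanism. The example also shows the stagnation case: a finished instruction waits behind an unfinished older one.

```python
from collections import deque


class ReorderBuffer:
    """Minimal reorder buffer sketch: allocate in order, retire in order."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.entries = deque()  # oldest instruction at the left end

    def allocate(self, name):
        """Reserve an entry at issue time; False means the pipeline must stall."""
        if len(self.entries) >= self.capacity:
            return False
        self.entries.append({"name": name, "done": False})
        return True

    def complete(self, name):
        """Mark an instruction finished; completion may happen in any order."""
        for entry in self.entries:
            if entry["name"] == name:
                entry["done"] = True
                return
        raise KeyError(name)

    def retire(self):
        """Retire finished instructions from the head only, preserving order."""
        retired = []
        while self.entries and self.entries[0]["done"]:
            retired.append(self.entries.popleft()["name"])
        return retired


if __name__ == "__main__":
    rob = ReorderBuffer(capacity=4)
    for name in ("i0", "i1", "i2"):
        rob.allocate(name)
    rob.complete("i2")
    print(rob.retire())  # i2 is done but must wait behind i0 and i1
    rob.complete("i0")
    print(rob.retire())
    rob.complete("i1")
    print(rob.retire())
```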

  • PDF

A Study for Efficient Transmission Policies using Multimedia Scenarios (멀티미디어 시나리오를 이용한 효율적인 데이터 전송 기법 연구)

  • Suh, Duk-Rok;Lee, Won-Suk
    • The Transactions of the Korea Information Processing Society / v.5 no.11 / pp.2797-2808 / 1998
  • A multimedia scenario database system is a read-only multimedia-on-demand system that transfers scenarios representing the display ordering of multimedia objects. A scenario is a graph of multimedia objects that contains spatial, temporal, and contextual information about the multimedia data. Structuring multimedia objects as a scenario makes it possible to enforce their display order based on their context. It also provides multiple display paths and allows objects to be shared between different scenarios. As a result, the multimedia scenario database system can pre-schedule multimedia objects, which makes it possible to reorder the transmission of objects in a scenario. Consequently, system resources such as data buffers and network bandwidth can be highly utilized. In this paper, we discuss the requirements for structuring a scenario in order to design a scenario database that stores and manages multimedia scenarios. Furthermore, we devise and analyze several scheduling policies based on the reordering mechanism for the objects in a scenario.

Implementation of a Scoreboard Array and a Port Arbiter for In-order SMT Processors (순차적 SMT Processor를 위한 Scoreboard Array와 포트 중재 모듈의 구현)

  • Heo, Chang-Yong;Hong, In-Pyo;Lee, Yong-Surk
    • Journal of the Institute of Electronics Engineers of Korea SD / v.41 no.6 / pp.59-70 / 2004
  • SMT (Simultaneous Multi-Threading) architecture exploits TLP (Thread Level Parallelism) to increase processor throughput by filling issue slots with instructions from multiple independent threads. Having multiple ready threads reduces the probability that a functional unit is left idle, which increases processor efficiency. To realize these advantages, the issue unit of an SMT processor must control the flow of instructions from different threads without creating conflicts among them, which makes the SMT issue logic extremely complex. The SMT architecture modeled in this paper therefore uses an in-order issue and completion scheme, and can thus use a simple issue mechanism with a scoreboard array instead of register renaming or a reorder buffer. However, an SMT scoreboarding mechanism is still more complex and costlier than that of a conventional single-threaded processor. This paper proposes an optimal implementation of a scoreboarding mechanism for an ARM-based SMT architecture.
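
A minimal sketch of scoreboard-based in-order issue for several threads follows. It assumes one row of pending bits per thread, so that an instruction is held back only by outstanding writes in its own thread; that layout, like the class and method names, is an assumption made for illustration and not the paper's implementation.

```python
class Scoreboard:
    """Per-thread pending-write bits used to gate in-order issue (a sketch)."""

    def __init__(self, num_threads, num_regs):
        # pending[t][r] is True while register r of thread t awaits a result.
        self.pending = [[False] * num_regs for _ in range(num_threads)]

    def can_issue(self, thread, dest, srcs):
        row = self.pending[thread]
        # Block on RAW (a source is pending) and WAW (the destination is pending).
        return not row[dest] and not any(row[s] for s in srcs)

    def issue(self, thread, dest, srcs):
        if not self.can_issue(thread, dest, srcs):
            return False
        self.pending[thread][dest] = True
        return True

    def writeback(self, thread, dest):
        self.pending[thread][dest] = False


if __name__ == "__main__":
    sb = Scoreboard(num_threads=2, num_regs=8)
    print(sb.issue(0, 1, [2, 3]))  # thread 0 writes r1
    print(sb.issue(0, 4, [1, 5]))  # blocked: r1 of thread 0 is pending
    print(sb.issue(1, 4, [1, 5]))  # another thread is unaffected
    sb.writeback(0, 1)
    print(sb.issue(0, 4, [1, 5]))  # now allowed
```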

Efficient Parallel IP Address Lookup Architecture with Smart Distributor (스마트 분배기를 이용한 효율적인 병렬 IP 주소 검색 구조)

  • Kim, Junghwan;Kim, Jinsoo
    • The Journal of the Korea Contents Association / v.13 no.2 / pp.44-51 / 2013
  • Routers must perform fast IP address lookup for the Internet to provide high-speed service. In this paper, we present a hybrid parallel IP address lookup structure composed of a four-stage pipeline. It achieves parallelism at low cost by using multiple SRAMs in stage 2 and partitioned TCAMs in stage 3, and improves performance through pipelining. The smart distributor in stage 1 does not forward an IP address identical to the previous one to the next stage; it reuses the result of the previous lookup instead. This improves lookup throughput through a caching effect and also reduces access conflicts on the TCAM banks in stage 3. In the last stage, the reorder buffer rearranges the completed IP addresses into input order. We evaluate the performance of our parallel pipelined IP lookup structure against a previous hybrid structure, using a real routing table and traffic distributions generated by Zipf's law.
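
The stage 1 behavior can be sketched in a few lines: an address equal to the immediately preceding one is not forwarded, and the previous result is reused. The function name and the `lookup` callable standing in for stages 2 and 3 are hypothetical, and the sketch is sequential, so the parallel banks and the final reorder buffer are not modeled.

```python
def distribute(addresses, lookup):
    """Return (results, forwarded) for a stream of addresses.

    lookup is a hypothetical stand-in for the later pipeline stages;
    forwarded counts how many addresses were actually sent to it.
    """
    results = []
    forwarded = 0
    have_prev = False
    prev_addr = None
    prev_result = None
    for addr in addresses:
        if have_prev and addr == prev_addr:
            # Identical to the previous address: reuse its result.
            result = prev_result
        else:
            result = lookup(addr)
            forwarded += 1
        results.append(result)
        have_prev, prev_addr, prev_result = True, addr, result
    return results, forwarded


if __name__ == "__main__":
    # Toy table with made-up next-hop ports, for illustration only.
    table = {"10.0.0.1": 1, "10.0.0.2": 2}
    stream = ["10.0.0.1", "10.0.0.1", "10.0.0.2", "10.0.0.1"]
    print(distribute(stream, table.__getitem__))
```

Only consecutive repeats are filtered, as the abstract describes; the last address in the example is looked up again because a different address came in between.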

Design and Simulation for Out-of-Order Execution Processor of a Fully Pipelined Scheme (완전한 파이프라인 방식의 비순차실행 프로세서의 설계 및 모의실행)

  • Lee, Jongbok
    • The Journal of the Institute of Internet, Broadcasting and Communication / v.20 no.5 / pp.143-149 / 2020
  • Multi-core processors are now the main central processing units of computer systems, and a high-performance out-of-order processor is adopted for each core to maximize system performance. Early out-of-order execution processors based on the Tomasulo algorithm targeted floating-point instructions and took several cycles to execute because of complex structures such as the reorder buffer and reservation stations. However, for a processor to properly exploit out-of-order execution and increase instruction throughput, it must operate in a fully pipelined manner. In this paper, a fully pipelined out-of-order processor with speculative execution is designed in VHDL and verified with GHDL. In simulation, a program composed of ARM instructions executes successfully.