• Title/Summary/Keyword: superscalar architecture

Search Result 39, Processing Time 0.021 seconds

Development of a High-performance DSP Coprocessor Architecture (고성능 32-bit DSP 코프로세서의 아키텍쳐 개발)

  • Yun, Seong-Cheol;Kim, Sang-Uk;Bae, Seong-Il;Gang, Seong-Ho;Kim, Yong-Cheon;Jeong, Seung-Jae;Kim, Sang-U;Mun, Sang-Hun
    • Journal of the Institute of Electronics Engineers of Korea SD
    • /
    • v.39 no.2
    • /
    • pp.72-81
    • /
    • 2002
  • A new high-performance DSP architecture is proposed, which behaves as a coprocessor of a 32bit microcontroller. Because the proposed DSP architecture is a dual MAC(Multiply and Accumulate) DSP architecture, it can process efficiently a number of SOP(sum of product) operations used in many DSP applications. In order to efficiently perform other operations such as pure additions without any restriction, a MAC is composed of a multiplier and a ALU placed in parallel. In addition, it is a 3-way superscalar architecture, which can issue 3 instructions at a time. The benchmark results with 3 thor dual MAC DSPs show that the proposed DSP has the best performance. Futhermore, it is proven that the proposed DSP is more efficient in memory usage, although the performance is comparable in some algorithms such as Viterbi decoding and FFT butterfly.

A Performance Study of Embedded Multicore Processor Architectures (임베디드 멀티코어 프로세서의 성능 연구)

  • Lee, Jongbok
    • The Journal of the Institute of Internet, Broadcasting and Communication
    • /
    • v.13 no.1
    • /
    • pp.163-169
    • /
    • 2013
  • Recently, the importance of embedded system is growing rapidly. In-order to satisfy the real-time constraints of the system, high performance embedded processor is required. Therefore, as in general purpose computer systems, embedded processor should be designed as multicore architecture as well. Using MiBench benchmarks as input, the trace-driven simulation has been performed and analyzed for the 2-core to 16-core embedded processor architectures with different types of cores from simple RISC to in-order and out-of-order superscalar processors, extensively. As a result, the achievable performance is as high as 23 times over the single core embedded RISC processor.

Performance Analysis of Multicore Processor Architectures Based On Cache Size Effects (캐쉬 용량 효과에 대한 멀티코어 프로세서의 성능 연구)

  • Lee, Jongbok
    • The Journal of the Institute of Internet, Broadcasting and Communication
    • /
    • v.12 no.6
    • /
    • pp.175-180
    • /
    • 2012
  • In order to overcome the complexity and performance limit problems of superscalar processors, the multicore architecture has been prevalent recently. The configuration and the size of instruction and data caches greatly gives effect on the performance of multicore processors. Using SPEC 2000 benchmarks as input, the trace-driven simulation has been performed for the 2-core to 16-core architectures with different sizes of caches extensively. As a result, the 2-way set associative instruction and data cache with the size of 64KB brought the best cost-effective performance.

Performance Study of Multicore Digital Signal Processor Architectures (멀티코어 디지털 신호처리 프로세서의 성능 연구)

  • Lee, Jongbok
    • The Journal of the Institute of Internet, Broadcasting and Communication
    • /
    • v.13 no.4
    • /
    • pp.171-177
    • /
    • 2013
  • Due to the demand for high speed 3D graphic rendering, video file format conversion, compression, encryption and decryption technologies, the importance of digital signal processor system is growing rapidly. In order to satisfy the real-time constraints, high performance digital signal processor is required. Therefore, as in general purpose computer systems, digital signal processor should be designed as multicore architecture as well. Using UTDSP benchmarks as input, the trace-driven simulation has been performed and analyzed for the 2 to 16-core digital signal processor architectures with the cores from simple RISC to in-order and out-of-order superscalar processors for the various window sizes, extensively.

Instruction Queue Architecture for Low Power Microprocessors (마이크로프로세서 전력소모 절감을 위한 명령어 큐 구조)

  • Choi, Min;Maeng, Seung-Ryoul
    • Journal of the Institute of Electronics Engineers of Korea SD
    • /
    • v.45 no.11
    • /
    • pp.56-62
    • /
    • 2008
  • Modern microprocessors must deliver high application performance, while the design process should not subordinate power. In terms of performance and power tradeoff, the instructions window is particularly important. This is because a large instruction window leads to achieve high performance. However, naive scaling conventional instruction window can severely affect the complexity and power consumption. This paper explores an architecture level approach to reduce power dissipation. We propose a low power issue logic with an efficient tag translation. The direct lookup table (DTL) issue logic eliminates the associative wake-up of conventional instruction window. The tag translation scheme deals with data dependencies and resource conflicts by using bit-vector based structure. Experimental results show that, for SPEC2000 benchmarks, the proposed design reduces power consumption by 24.45% on average over conventional approach.

Accelerating Symmetric and Asymmetric Cryptographic Algorithms with Register File Extension for Multi-words or Long-word Operation (다수 혹은 긴 워드 연산을 위한 레지스터 파일 확장을 통한 대칭 및 비대칭 암호화 알고리즘의 가속화)

  • Lee Sang-Hoon;Choi Lynn
    • Journal of the Institute of Electronics Engineers of Korea CI
    • /
    • v.43 no.2 s.308
    • /
    • pp.1-11
    • /
    • 2006
  • In this paper, we propose a new register file architecture called the Register File Extension for Multi-words or Long-word Operation (RFEMLO) to accelerate both symmetric and asymmetric cryptographic algorithms. Based on the idea that most of cryptographic algorithms heavily use multi-words or long-word operations, RFEMLO allows multiple contiguous registers to be specified as a single operand. Thus, a single instruction can specify a SIMD-style multi-word operation or a long-word operation. RFEMLO can be applied to general purpose processors by adding instruction set for multi-words or long-word operands and functional units for additional instruction set. To evaluate the performance of RFEMLO, we use Simplescalar/ARM 3.0 (with gcc 2.95.2) and run detailed simulations on various symmetric and asymmetric cryptographic algorithms. By applying RFEMLO, we could get maximum 62% and 70% reductions in the total instruction count of symmetric and asymmetric cryptographic algorithms respectively. Also, performance results show that a speedup of 1.4 to 2.6 can be obtained in symmetric cryptographic algorithms and a speedup of 2.5 to 3.3 can be obtained for asymmetric cryptographic algorithms when we apply RFEMLO to a processor with an in-order pipeline. We also found that RFEMLO can effectively improve the performance of these cryptographic algorithms with much less cost compared to issue-width increase available in Superscalar implementations. Moreover, the RFEMLO can also be applied to Superscalar processor, leading to additional 83% and 138% performance gain in symmetric and asymmetric cryptographic algorithms.

On Designing 4-way Superscalar Digital Signal Processor Core (4-way 수퍼 스칼라 디지털 시그널 프로세서 코어 설계)

  • 김준석;유선국;박성욱;정남훈;고우석;이근섭;윤대희
    • The Journal of Korean Institute of Communications and Information Sciences
    • /
    • v.23 no.6
    • /
    • pp.1409-1418
    • /
    • 1998
  • The recent audio CODEC(Coding/Decoding) algorithms are complex of several coding techniques, and can be divided into DSP tasks, controller tasks and mixed tasks. The traditional DSP processor has been designed for fast processing of DSP tasks only, but not for controller and mixed tasks. This paper presents a new architecture that achieves high throughput on both controller and mixed tasks of such algorithms while maintaining high performance for DSP tasks. The proposed processor, YSP-3, operates four algorithms while maintaining high performance for DSP tasks. The proposed processor, YSP-3, operates functional units (Multiplier, two ALUs, Load/Store Unit) in parallel via 4-issue super-scalar instruction structure. The performance evaluation of YSP-3 has been done through the implementation of the several DSP algorithms and the part of the AC-3 decoding algorithms.

  • PDF

Radiation-Induced Soft Error Detection Method for High Speed SRAM Instruction Cache (고속 정적 RAM 명령어 캐시를 위한 방사선 소프트오류 검출 기법)

  • Kwon, Soon-Gyu;Choi, Hyun-Suk;Park, Jong-Kang;Kim, Jong-Tae
    • The Journal of Korean Institute of Communications and Information Sciences
    • /
    • v.35 no.6B
    • /
    • pp.948-953
    • /
    • 2010
  • In this paper, we propose multi-bit soft error detection method which can use an instruction cache of superscalar CPU architecture. Proposed method is applied to high-speed static RAM for instruction cache. Using 1D parity and interleaving, it has less memory overhead and detects more multi-bit errors comparing with other methods. It only detects occurrence of soft errors in static RAM. Error correction is treated like a cache miss situation. When soft errors are occurred, it is detected by 1D parity. Instruction cache just fetch the words from lower-level memory to correct errors. This method can detect multi-bit errors in maximum 4$\times$4 window.

Performance Improvement of Single Chip Multiprocessor using Concurrent Branch Execution (분기 동시 수행을 이용한 단일 칩 멀티프로세서의 성능 개선)

  • Lee, Seung-Ryul;Kim, Jun-Shik;Choi, Jae-Hyeok;Choi, Sang-Bang
    • Journal of the Institute of Electronics Engineers of Korea SD
    • /
    • v.44 no.2
    • /
    • pp.61-71
    • /
    • 2007
  • The instruction level parallelism, which has been used to improve the performance of processors, expose its limit. The change of a control flow by a branch miss prediction is one of the obstacles that restrict the instruction level parallelism. The single chip multiprocessors have been developed to utilize the thread level parallelism. However, we could not use the maximum performance of the single chip multiprocessor in case of executing the coded programs without considering the multi-thread. In order to overcome the two performance degradation factors, in this paper, we suggest the concurrent branch execution method that applies to the multi-path execution method at a single chip multiprocessor. We executes all two flows of the conditional branch using the idle core processor. Through this, we can improve the processor's efficiency with blocking the control flow termination by the branch instruction and reducing the idle time. We analyze the effects of concurrent branch execution proposed in this paper through the simulation. As a result of that, concurrent branch execution reduces about 20% of idle time and improves the maximum 10% of the branch prediction accuracy. We show that our scheme improves the overall performance of maximum 39% compared to the normal single chip multiprocessor and maximum 27% compared to the superscalar processor.