• Title/Summary/Keyword: 고성능 프로세서

Search Result 235, Processing Time 0.03 seconds

A Study on complement techniques for an efficient instruction scheduling (효과적인 명령어 스케쥴링 보완 기술 연구)

  • Cho, Jungseok;Cho, Doosan;Jung, Yoojin;Hyun, Heasook;Kim, Dongkyu;Jung, Insang;Choi, Changmoon;Youn, Jonghee
    • Proceedings of the Korea Information Processing Society Conference
    • /
    • 2015.10a
    • /
    • pp.164-165
    • /
    • 2015
  • 고성능 복수 연산 처리 장치를 갖는 VLIW (Very Long Instruction Word)와 같은 프로세서 아키텍처는 정밀한 명령어 스케쥴링을 하드웨어가 아닌 소프트웨어가 처리해 주어야 한다. 통상 컴파일러가 하드웨어의 풍부한 자원을 충분히 활용할 수 있도록 이러한 기능을 수행하여 준다. 기존에 다양한 명령어 스케쥴링 알고리즘이 연구되었고 수 십년에 걸쳐 새로운 스케쥴링 기법들이 소개되었다. 이러한 스케쥴링 기법의 성능은 알고리즘의 효율뿐만 아니라 프로그램 코드에 내재된 의존관계 (dependence relation)의 복잡도에 따라 상당한 영향을 받는다. 본 연구에서는 의존도 완화기법으로서 레지스터 재할당 (register reallocation) 기법을 살펴보고 이를 활용하여 스케쥴링 성능 개선을 시도하여 보았다.

Video Surveillance System Design and Realization with Interframe Probability Distribution Analyzation (인터프레임 확률분포분석에 의한 비디오 감시 시스템 설계 구현)

  • Ryu, Kwang-Ryol;Kim, Ja-Hwan
    • Journal of the Korea Institute of Information and Communication Engineering
    • /
    • v.12 no.6
    • /
    • pp.1064-1069
    • /
    • 2008
  • A system design and realization for video surveillance with interframe probability distribution analyzation is presented in this paper. The system design is based on a high performance DSP professor, video surveillance is implemented by analyzing interframe probability distribution using trivariate normal distribution(weight, mean, variance) for scanning objects in a restricted area and the video analysis algorithm is decided for forming a different image from the probability distribution of several frame compressed by the standardized JPEG. The system processing time of D1$(720{\times}480)$ image per frame is 85ms and enables to process the system at 12 frames per second. An object surveillance about the restricted area by rules is extracted to 100% unless object is moved faster.

Analysis of Components Performance for Programmable Video Decoder (프로그래머블 비디오 복호화기를 위한 구성요소의 성능 분석)

  • Kim, Jaehyun;Park, Gooman
    • Journal of Broadcast Engineering
    • /
    • v.24 no.1
    • /
    • pp.182-185
    • /
    • 2019
  • This paper analyzes performances of modules in implementing a programmable multi-format video decoder. The goal of the proposed platform is the high-end Full High Definition (FHD) video decoder. The proposed multi-format video decoder consists of a reconfigurable processor, dedicated bit-stream co-processor, memory controller, cache for motion compensation, and flexible hardware accelerators. The experiments suggest performance baseline of modules for the proposed architecture operating at 300 MHz clock with capability of decoding HEVC bit-streams of FHD 30 frames per second.

Hybrid Scheme of Data Cache Design for Reducing Energy Consumption in High Performance Embedded Processor (고성능 내장형 프로세서의 에너지 소비 감소를 위한 데이타 캐쉬 통합 설계 방법)

  • Shim, Sung-Hoon;Kim, Cheol-Hong;Jhang, Seong-Tae;Jhon, Chu-Shik
    • Journal of KIISE:Computer Systems and Theory
    • /
    • v.33 no.3
    • /
    • pp.166-177
    • /
    • 2006
  • The cache size tends to grow in the embedded processor as technology scales to smaller transistors and lower supply voltages. However, larger cache size demands more energy. Accordingly, the ratio of the cache energy consumption to the total processor energy is growing. Many cache energy schemes have been proposed for reducing the cache energy consumption. However, these previous schemes are concerned with one side for reducing the cache energy consumption, dynamic cache energy only, or static cache energy only. In this paper, we propose a hybrid scheme for reducing dynamic and static cache energy, simultaneously. for this hybrid scheme, we adopt two existing techniques to reduce static cache energy consumption, drowsy cache technique, and to reduce dynamic cache energy consumption, way-prediction technique. Additionally, we propose a early wake-up technique based on program counter to reduce penalty caused by applying drowsy cache technique. We focus on level 1 data cache. The hybrid scheme can reduce static and dynamic cache energy consumption simultaneously, furthermore our early wake-up scheme can reduce extra program execution cycles caused by applying the hybrid scheme.

Design of MAHA Supercomputing System for Human Genome Analysis (대용량 유전체 분석을 위한 고성능 컴퓨팅 시스템 MAHA)

  • Kim, Young Woo;Kim, Hong-Yeon;Bae, Seungjo;Kim, Hag-Young;Woo, Young-Choon;Park, Soo-Jun;Choi, Wan
    • KIPS Transactions on Software and Data Engineering
    • /
    • v.2 no.2
    • /
    • pp.81-90
    • /
    • 2013
  • During the past decade, many changes and attempts have been tried and are continued developing new technologies in the computing area. The brick wall in computing area, especially power wall, changes computing paradigm from computing hardwares including processor and system architecture to programming environment and application usage. The high performance computing (HPC) area, especially, has been experienced catastrophic changes, and it is now considered as a key to the national competitiveness. In the late 2000's, many leading countries rushed to develop Exascale supercomputing systems, and as a results tens of PetaFLOPS system are prevalent now. In Korea, ICT is well developed and Korea is considered as a one of leading countries in the world, but not for supercomputing area. In this paper, we describe architecture design of MAHA supercomputing system which is aimed to develop 300 TeraFLOPS system for bio-informatics applications like human genome analysis and protein-protein docking. MAHA supercomputing system is consists of four major parts - computing hardware, file system, system software and bio-applications. MAHA supercomputing system is designed to utilize heterogeneous computing accelerators (co-processors like GPGPUs and MICs) to get more performance/$, performance/area, and performance/power. To provide high speed data movement and large capacity, MAHA file system is designed to have asymmetric cluster architecture, and consists of metadata server, data server, and client file system on top of SSD and MAID storage servers. MAHA system softwares are designed to provide user-friendliness and easy-to-use based on integrated system management component - like Bio Workflow management, Integrated Cluster management and Heterogeneous Resource management. MAHA supercomputing system was first installed in Dec., 2011. The theoretical performance of MAHA system was 50 TeraFLOPS and measured performance of 30.3 TeraFLOPS with 32 computing nodes. MAHA system will be upgraded to have 100 TeraFLOPS performance at Jan., 2013.

Analysis of the CPU/GPU Temperature and Energy Efficiency depending on Executed Applications (응용프로그램 실행에 따른 CPU/GPU의 온도 및 컴퓨터 시스템의 에너지 효율성 분석)

  • Choi, Hong-Jun;Kang, Seung-Gu;Kim, Jong-Myon;Kim, Cheol-Hong
    • Journal of the Korea Society of Computer and Information
    • /
    • v.17 no.5
    • /
    • pp.9-19
    • /
    • 2012
  • As the clock frequency increases, CPU performance improves continuously. However, power and thermal problems in the CPU become more serious as the clock frequency increases. For this reason, utilizing the GPU to reduce the workload of the CPU becomes one of the most popular methods in recent high-performance computer systems. The GPU is a specialized processor originally designed for graphics processing. Recently, the technologies such as CUDA which utilize the GPU resources more easily become popular, leading to the improved performance of the computer system by utilizing the CPU and GPU simultaneously in executing various kinds of applications. In this work, we analyze the temperature and the energy efficiency of the computer system where the CPU and the GPU are utilized simultaneously, to figure out the possible problems in upcoming high-performance computer systems. According to our experimentation results, the temperature of both CPU and GPU increase when the application is executed on the GPU. When the application is executed on the CPU, CPU temperature increases whereas GPU temperature remains unchanged. The computer system shows better energy efficiency by utilizing the GPU compared to the CPU, because the throughput of the GPU is much higher than that of the CPU. However, the temperature of the system tends to be increased more easily when the application is executed on the GPU, because the GPU consumes more power than the CPU.

An Energy-Delay Efficient System with Adaptive Victim Caches (선택적 희생 캐쉬를 이용한 저전력 고성능 시스템 설계 방안)

  • Kim Cheol Hong;Shim Sunghoon;Jhon Chu Shik;Jhang Seong Tae
    • Journal of KIISE:Computer Systems and Theory
    • /
    • v.32 no.11_12
    • /
    • pp.663-674
    • /
    • 2005
  • We propose a system aimed at achieving high energy-delay efficiency by using adaptive victim caches. Particularly, we investigate methods to improve the hit rates in the first level of memory hierarchy, which reduces the number of accesses to mort power consuming memory structures such as L2 cache. Victim cache is a memory element for reducing conflict misses in a direct-mapped L1 cache. We present two techniques to fill the victim cache with the blocks that have higher probability to be re-reqeusted by processor. Hit-based victim cache ks tilled with the blocks which were referenced frequently by processor. Replacement-based victim cache is filled with the blocks which were evicted from the sets where block replacements had happened frequently According to our simulations, replacement-based victim cache scheme outperforms the conventional victim cache scheme about $2\%$ on average and refutes the power consumption by up to $8\%$.

Development of RTEMS SMP Platform Based on XtratuM Virtualization Environment for Satellite Flight Software (위성비행소프트웨어를 위한 XtratuM 가상화 기반의 RTEMS SMP 플랫폼)

  • Kim, Sun-wook;Choi, Jong-Wook;Jeong, Jae-Yeop;Yoo, Bum-Soo
    • Journal of the Korean Society for Aeronautical & Space Sciences
    • /
    • v.48 no.6
    • /
    • pp.467-478
    • /
    • 2020
  • Hypervisor virtualize hardware resources to utilize them more effectively. At the same time, hypervisor's characteristics of time and space partitioning improves reliability of flight software by reducing a complexity of the flight software. Korea Aerospace Research Institute chooses one of hypervisors for space, XtratuM, and examine its applicability to the flight software. XtratuM has strong points in performance improvement with high reliability. However, it does not support SMP. Therefore, it has limitation in using it with high performance applications including satellite altitude orbit control systems. This paper proposes RTEMS XM-SMP to support SMP with RTEMS, one of real time operating systems for space. Several components are added as hypercalls, and initialization processes are modified to use several processors with inter processors communication routines. In addition, all components related to processors are updated including context switch and interrupts. The effectiveness of the developed RTEMS XM-SMP is demonstrated with a GR740 board by executing SMP benchmark functions. Performance improvements are reviewed to check the effectiveness of SMP operations.

Design and Implementation of Asynchronous Memory for Pipelined Bus (파이프라인 방식의 버스를 위한 비 동기식 주 기억장치의 설계 및 구현)

  • Hahn, Woo-Jong;Kim, Soo-Won
    • Journal of the Korean Institute of Telematics and Electronics B
    • /
    • v.31B no.11
    • /
    • pp.45-52
    • /
    • 1994
  • In recent days low cost, high performance microprocessors have led to construction of medium scale shared memory multiprocessor systems with shared bus. Such multiprocessor systems are heavily influenced by the structures of memory systems and memory systems become more important factor in design space as microprocessors are getting faster. Even though local cache memories are very common for such systems, the latency on access to the shared memory limits throughput and scalability. There have been many researches on the memory structure for multiprocessor systems. In this paper, an asynchronous memory architecture is proposed to utilize the bandwith of system bus effectively as well as to provide flexibility of implementation. The effect of the proposed architecture if shown by simulation. We choose, as our model of the shared bus is HiPi+Bus which is designed by ETRI to meet the requirements of the High-Speed Midrange Computer System. The simulation is done by using Verilog hardware decription language. With this simulation, it is explored that the proposed asynchronous memory architecture keeps the utilization of system bus low enough to provide better throughput and scalibility. The implementation trade-offs are also described in this paper. The asynchronous memory is implemented and tested under the prototype testing environment by using test program. This intensive test has validated the operation of the proposed architecture.

  • PDF

Implementation of Internet Terminal using G.729.1 Wideband Speech Codec for Next Generation Network (차세대 통신망을 위한 G.729.1 광대역 음성 코덱을 활용한 인터넷 단말 구현)

  • So, Woon-Seob;Kim, Dae-Young
    • The Journal of Korean Institute of Communications and Information Sciences
    • /
    • v.33 no.10B
    • /
    • pp.939-945
    • /
    • 2008
  • Tn this paper we described the process and the results of an implementation of Internet terminal using G.729.1 wideband speech codec for next generation network. For this purpose firstly we chose a high performance RISC application processor having DSP features for speech codec processing and enhanced Multimedia Accelerator(eMMA) function for video codec. In the implementation of this terminal, we used G.729.1 codec recently standardized in ITU-T which is a new scalable speech and audio codec that extends 0.729 speech coding standard. To adopt G.729.1 codec to this terminal we transformed most of the fixed point C codes which require more complexity into assembly codes so as to minimize processing time in the processor. As a result of this work we reduced the execution time of the original C codes about 80% and operated in real time on the terminal. For video we used H.263/MPEG-4 codec which is supported by the eMMA with hardware in the processor. In the SIP call processing test connected to real network we obtained under looms end-to-end delay and 3.8 MOS value measured with PESQ instrument. Besides this terminal operated well with commercial terminals.