• Title/Summary/Keyword: many-core processor

Search Result 54, Processing Time 0.024 seconds

Debugging of TTP(Train Tilting Processor) In Use The Embedded System (임베디드 시스템을 이용한 틸팅 제어 시스템(T.T.P)에 관한 연구)

  • Song, Yong-Soo;Shin, Seung-Kwon;Lee, Su-Gil;Han, Seong-Ho
    • Proceedings of the KIEE Conference
    • /
    • 2004.07d
    • /
    • pp.2625-2627
    • /
    • 2004
  • Recently many technology of the T.T.P.(Train Tilting Processor) has been introduced for an efficient real-time operating system. but the problems of testing increasing complex digital integrated system continue to challenge the design and test community. Design main processor part that can be used on railroad synthesis control part by ARM CORE chip.

  • PDF

Architecture Exploration of Optimal Many-Core Processors for a Vector-based Rasterization Algorithm (래스터화 알고리즘을 위한 최적의 매니코어 프로세서 구조 탐색)

  • Son, Dong-Koo;Kim, Cheol-Hong;Kim, Jong-Myon
    • IEMEK Journal of Embedded Systems and Applications
    • /
    • v.9 no.1
    • /
    • pp.17-24
    • /
    • 2014
  • In this paper, we implement and evaluate the performance of a vector-based rasterization algorithm for 3D graphics by using a SIMD (single instruction multiple data) many-core processor architecture. In addition, we evaluate the impact of a data-per-processing elements (DPE) ratio that is defined as the amount of data directly mapped to each processing element (PE) within many-core in terms of performance, energy efficiency, and area efficiency. For the experiment, we utilize seven different PE configurations by varying the DPE ratio (or the number PEs), which are implemented in the same 130 nm CMOS technology with a 500 MHz clock frequency. Experimental results indicate that the optimal PE configuration is achieved as the DPE ratio is in the range from 16,384 to 256 (or the number of PEs is in the range from 16 and 1,024), which meets the requirements of mobile devices in terms of the optimal performance and efficiency.

Analysis of Programming Techniques for Creating Optimized CUDA Software (최적화된 CUDA 소프트웨어 제작을 위한 프로그래밍 기법 분석)

  • Kim, Sung-Soo;Kim, Dong-Heon;Woo, Sang-Kyu;Ihm, In-Sung
    • Journal of KIISE:Computing Practices and Letters
    • /
    • v.16 no.7
    • /
    • pp.775-787
    • /
    • 2010
  • Unlike general-purpose CPUs, the GPUs have been specialized as many-core streaming processors, and are frequently replacing the CPUs in an increasing range of computations thanks to their outstanding parallel computing capacity. In order to respond to such trend, NVIDIA has recently issued a new parallel computing architecture called CUDA(Compute Unified Device Architecture), offering a flexible GPU programming environment for GPGPU(General Purpose GPU) computing. In general, when programmers use the CUDA API, they should clearly understand many aspects of GPU's computing architecture to produce efficient parallel software. In this article, we explain several optimization techniques for CUDA programming that we have verified through a lot of experiment and trial and error, and review how those techniques affect the performance of code execution. In particular, we use a specific problem as an example to analyze several elements that affect performances, such as effective accesses to hierarchical memory system, processor occupancy, and latency hiding. In conclusion, we present several directions that may be utilized effectively in CUDA-based parallel programming.

Design Space Exploration of Many-Core Processor for High-Speed Cluster Estimation (고속의 클러스터 추정을 위한 매니코어 프로세서의 디자인 공간 탐색)

  • Seo, Jun-Sang;Kim, Cheol-Hong;Kim, Jong-Myon
    • Journal of the Korea Society of Computer and Information
    • /
    • v.19 no.10
    • /
    • pp.1-12
    • /
    • 2014
  • This paper implements and improves the performance of high computational subtractive clustering algorithm using a single instruction, multiple data (SIMD) based many-core processor. In addition, this paper implements five different processing element (PE) architectures (PEs=16, 64, 256, 1,024, 4,096) to select an optimal PE architecture for the subtractive clustering algorithm by estimating execution time and energy efficiency. Experimental results using two different medical images and three different resolutions ($128{\times}128$, $256{\times}256$, $512{\times}512$) show that PEs=4,096 achieves the highest performance and energy efficiency for all the cases.

Multimedia Extension Instructions and Optimal Many-core Processor Architecture Exploration for Portable Ultrasonic Image Processing (휴대용 초음파 영상처리를 위한 멀티미디어 확장 명령어 및 최적의 매니코어 프로세서 구조 탐색)

  • Kang, Sung-Mo;Kim, Jong-Myon
    • Journal of the Korea Society of Computer and Information
    • /
    • v.17 no.8
    • /
    • pp.1-10
    • /
    • 2012
  • This paper proposes design space exploration methodology of many-core processors including multimedia specific instructions to support high-performance and low power ultrasound imaging for portable devices. To explore the impact of multimedia instructions, we compare programs using multimedia instructions and baseline programs with a same many-core processor in terms of execution time, energy efficiency, and area efficiency. Experimental results using a $256{\times}256$ ultrasound image indicate that programs using multimedia instructions achieve 3.16 times of execution time, 8.13 times of energy efficiency, and 3.16 times of area efficiency over the baseline programs, respectively. Likewise, programs using multimedia instructions outperform the baseline programs using a $240{\times}320$ image (2.16 times of execution time, 4.04 times of energy efficiency, 2.16 times of area efficiency) as well as using a $240{\times}400$ image (2.25 times of execution time, 4.34 times of energy efficiency, 2.25 times of area efficiency). In addition, we explore optimal PE architecture of many-core processors including multimedia instructions by varying the number of PEs and memory size.

Implementation of the Shared Memory in the Dual Core System (Dual Core 시스템에서 Shared Memory 기능 구현)

  • Jang, Seung-Ju
    • The Journal of the Korea Contents Association
    • /
    • v.8 no.9
    • /
    • pp.27-33
    • /
    • 2008
  • This paper designs Shared Memory on the Dual Core system so that it operates a general System V IPC on the Linux O.S. Shared Memory is the technique that many processes can access to identical memory area. We treat Shared Memory which is SVR in a kernel step. We design a share memory facility of Linux operating system on the Dual Core System. In this paper the suggesting of share memory facility design plan in Dual Core system is enhance the performance in existing an unity processor system as a dual core practical use. We attemp a performance enhance in each CPU for each process which uses a share memory.

The Design of the Shared Memory in the Dual Core System (Dual Core 시스템에서 Shared Memory 기능 설계)

  • Jang, Seung-Ju;Lee, Gwang-Yong;Kim, Jae-Myeong
    • Journal of the Korea Institute of Information and Communication Engineering
    • /
    • v.12 no.8
    • /
    • pp.1448-1455
    • /
    • 2008
  • This paper designs Shared Memory on the Dual Core system so that it operates a general System V IPC on the Linux O.S. Shared Memory is the technique that many processes can access to identical memory area. We treat Shared Memory in this paper among big two branches of Shared Memory which are SVR in a kernel step format. We design a share memory facility of Linux operating system on the Dual Core System. In this paper the suggesting design plan of share memory facility in Dual Core system is enhancing the performance in existing unity processor system as a dual core practical use. We attempt a performance enhance in each CPU for each process which uses a share memory.

Empirical Study on Performance and Power Consumption in Multi-Core and Multi-Threaded Smartphones (데이터 송수신이 필수적인 환경에서의 스마트폰의 멀티코어와 멀티쓰레드에 따른 성능 및 전력 분석)

  • Lee, Woonghee;Kim, Hwangnam
    • The Journal of Korean Institute of Communications and Information Sciences
    • /
    • v.39C no.8
    • /
    • pp.722-730
    • /
    • 2014
  • Due to the advance of hardware, various devices have mobility features, and many applications need the data transmission. In addition, it is essential for latest smartphones to utilize multi-cores and multi-threads because of the enhancement of Application Processor. Therefore, this paper analyzes the performance/power consumption according to transmission rate, the number of cores, and that of threads in the system that is supposed to conduct data transmission and processing simultaneously. Through the analysis, this paper provides a direction for the proper number of threads in terms of performance improvement and efficient power consumption.

Design of Fast Parallel Floating-Point Multiplier using Partial Product Re-arrangement Technique (효율적인 부분곱의 재배치를 통한 고속 병렬 Floating-Point 고속연산기의 설계)

  • 김동순;김도경;이성철;김진태;최종찬
    • Proceedings of the IEEK Conference
    • /
    • 2001.06e
    • /
    • pp.47-50
    • /
    • 2001
  • Nowadays ARM7 core is used in many fields such as PDA systems because of the low power and low cost. It is a general-purpose processor, designed for both efficient digital signal processing and controller operations. But the advent of the wireless communication creates a need for high computational performance for signal processing. And then This paper has been designed a floating-point multiplier compatible to IEEE-754 single precision format for ARMTTDMI performance improvement.

  • PDF

Computation-Communication Overlapping in AES-CCM Using Thread-Level Parallelism on a Multi-Core Processor (멀티코어 프로세서의 쓰레드-수준 병렬성을 활용한 AES-CCM 계산-통신 중첩화)

  • Lee, Eun-Ji;Lee, Sung-Ju;Chung, Yong-Wha;Lee, Myung-Ho;Min, Byoung-Ki
    • Journal of KIISE:Computing Practices and Letters
    • /
    • v.16 no.8
    • /
    • pp.863-867
    • /
    • 2010
  • Multi-core processors are becoming increasingly popular. As they are widely adopted in embedded systems as well as desktop PC's, many multimedia applications are being parallelized on multi-core platforms. However, it is difficult to parallelize applications with inherent data dependencies such as encryption algorithms for multimedia data. In order to overcome this limit, we propose a technique to overlap computation and communication using an otherwise idle core in this paper. In particular, we interpret the problem of multimedia computation and communication as a pipeline design problem at the application program level, and derive an optimal number of stages in the pipeline.