• Title/Summary/Keyword: Algorithm Instruction

Architecture Exploration of Optimal Many-Core Processors for a Vector-based Rasterization Algorithm (래스터화 알고리즘을 위한 최적의 매니코어 프로세서 구조 탐색)

  • Son, Dong-Koo;Kim, Cheol-Hong;Kim, Jong-Myon
    • IEMEK Journal of Embedded Systems and Applications / v.9 no.1 / pp.17-24 / 2014
  • In this paper, we implement and evaluate the performance of a vector-based rasterization algorithm for 3D graphics on a SIMD (single instruction, multiple data) many-core processor architecture. In addition, we evaluate the impact of the data-per-processing-element (DPE) ratio, defined as the amount of data directly mapped to each processing element (PE) within the many-core processor, on performance, energy efficiency, and area efficiency. For the experiment, we use seven different PE configurations obtained by varying the DPE ratio (or, equivalently, the number of PEs), all implemented in the same 130 nm CMOS technology with a 500 MHz clock frequency. Experimental results indicate that the optimal PE configuration is achieved when the DPE ratio is between 16,384 and 256 (that is, when the number of PEs is between 16 and 1,024), which meets the requirements of mobile devices in terms of performance and efficiency.
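The relation between the DPE ratio and the number of PEs can be illustrated with a short sketch. The two endpoints quoted in the abstract imply the same total data size (16 × 16,384 = 1,024 × 256 = 262,144 elements); the intermediate PE counts below (64 and 256) and the function name `dpe_ratio` are assumptions for illustration, not configurations confirmed by the abstract.

```python
def dpe_ratio(total_data_elements: int, num_pes: int) -> int:
    """Amount of data mapped to each PE when the data is split evenly."""
    if total_data_elements % num_pes != 0:
        raise ValueError("data does not divide evenly across the PEs")
    return total_data_elements // num_pes


# Total data size implied by the abstract's endpoints: 16 PEs x 16,384 each.
TOTAL_ELEMENTS = 16 * 16_384

# 16 and 1,024 come from the abstract; 64 and 256 are illustrative only.
for num_pes in (16, 64, 256, 1024):
    print(f"{num_pes:5d} PEs -> DPE ratio {dpe_ratio(TOTAL_ELEMENTS, num_pes)}")
```

Quadrupling the number of PEs divides the DPE ratio by four, so the two quantities describe the same design axis.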

Accelerated VPN Encryption using AES-NI (AES-NI를 이용한 VPN 암호화 가속화)

  • Jeong, Jin-Pyo;Hwang, Jun-Ho;Han, Keun-Hee;Kim, Seok-Woo
    • Journal of the Korea Institute of Information Security & Cryptology / v.24 no.6 / pp.1065-1078 / 2014
  • Considering both data safety and performance, AES can be regarded as the best-performing symmetric-key encryption algorithm for IPSec-based VPNs. However, when AES is used in an IPSec-based VPN, even with an expensive hardware encryption card such as the OCTEON card series from Cavium Networks, the VPN achieves less than half the performance of a firewall running on the same hardware. In 2008, Intel announced a set of seven AES-NI instructions to improve the performance of the AES algorithm on Intel CPUs. In this paper, we verify how much the performance of an IPSec-based VPN can be improved by using the seven AES-NI instructions of the Intel CPU.
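Before relying on AES-NI acceleration, it helps to confirm that the CPU advertises the feature. The sketch below is a minimal check, assuming a Linux host where `/proc/cpuinfo` lists the `aes` flag; the function name `has_aes_flag` and the sample text are hypothetical and are not taken from the paper.

```python
def has_aes_flag(cpuinfo_text: str) -> bool:
    """Return True if any 'flags' line in cpuinfo-style text lists 'aes'."""
    for line in cpuinfo_text.splitlines():
        key, _, value = line.partition(":")
        # Compare whole tokens so that flags such as 'vaes' do not match.
        if key.strip() == "flags" and "aes" in value.split():
            return True
    return False


# Hypothetical excerpt in the style of /proc/cpuinfo.
SAMPLE = "processor\t: 0\nflags\t\t: fpu sse2 pclmulqdq aes avx\n"
print(has_aes_flag(SAMPLE))
```

On a Linux machine the same function can be applied to the contents of `/proc/cpuinfo`, read with `open("/proc/cpuinfo").read()`.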

Rapid Data Allocation Technique for Multiple Memory Bank Architectures (다중 메모리 뱅크 구조를 위한 고속의 자료 할당 기법)

  • 조정훈;백윤홍;최준식
    • Proceedings of the Korean Information Science Society Conference / 2003.10a / pp.196-198 / 2003
  • Virtually all digital signal processors (DSPs) support multiple on-chip memory banks that allow the processor to access multiple words of data from memory in a single instruction cycle. In addition, all existing fixed-point DSPs have an irregular, heterogeneous register architecture with multiple register files that are distributed and dedicated to different sets of instructions. Although several studies have addressed the efficient assignment of data to multiple memory banks, most of them assumed processors with relatively simple, homogeneous general-purpose registers. As a result, several vendor-provided compilers for DSPs have been unable to assign data efficiently to multiple data memory banks, thereby often failing to generate highly optimized code for their machines. This paper presents an algorithm that helps the compiler to assign data efficiently to multiple memory banks. Our algorithm differs from previous work in that it assigns variables to memory banks in separate, decoupled code generation phases instead of a single, tightly coupled phase. The experimental results show that our decoupled algorithm greatly simplifies the code generation process; thus our compiler runs extremely fast, yet generates target code that is comparable in quality to the code generated by a coupled approach.
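The bank-assignment problem itself can be sketched as follows. This is not the paper's decoupled algorithm, whose details the abstract does not give; it is a simple greedy heuristic, assuming two banks and a hypothetical list of variable pairs that the code would like to load in the same cycle.

```python
from collections import defaultdict


def assign_banks(parallel_pairs, variables):
    """Greedy two-bank assignment (illustrative heuristic only).

    parallel_pairs: (a, b) tuples, one per place where a and b could be
    fetched in the same cycle if they live in different banks.
    variables: the order in which variables are placed.
    """
    weight = defaultdict(int)
    for a, b in parallel_pairs:
        weight[frozenset((a, b))] += 1

    bank = {}
    for v in variables:
        # cost[b] = number of parallel accesses lost if v is put in bank b
        cost = [0, 0]
        for u, b in bank.items():
            cost[b] += weight[frozenset((u, v))]
        bank[v] = 0 if cost[0] <= cost[1] else 1
    return bank


pairs = [("a", "b"), ("a", "b"), ("b", "c")]
print(assign_banks(pairs, ["a", "b", "c"]))
```

In this example `a` and `c` share a bank while `b` is placed in the other one, so every requested pair can be fetched in a single cycle.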

Optimal Design Space Exploration of Multi-core Architecture for Real-time Lane Detection Algorithm (실시간 차선인식 알고리즘을 위한 최적의 멀티코어 아키텍처 디자인 공간 탐색)

  • Jeong, Inkyu;Kim, Jongmyon
    • Asia-pacific Journal of Multimedia Services Convergent with Art, Humanities, and Sociology / v.7 no.3 / pp.339-349 / 2017
  • This paper proposes a four-stage algorithm for detecting lanes from a moving car. In the first stage, it extracts regions of interest from an image. In the second stage, it applies a median filter to remove noise. In the third stage, a binarization algorithm classifies the pixels of the input image into background and foreground. Finally, an image erosion algorithm is used to obtain clear lanes by removing the noise and edges that remain after binarization. However, the proposed lane detection algorithm is computationally expensive. To address this issue, this paper presents a parallel implementation of the real-time lane detection algorithm on a multi-core architecture. In addition, we implement and simulate 8 different processing element (PE) architectures to select an optimal PE architecture for the target application. Experimental results indicate that the 40×40 PE architecture shows the best performance, energy efficiency, and area efficiency.
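The four stages described above can be sketched sequentially in plain Python. This is a minimal illustration, not the authors' parallel implementation: the 3×3 window, the border clipping, the row-based region of interest, the threshold value, and all function names are assumptions.

```python
from statistics import median


def window(img, r, c):
    """3x3 neighbourhood of pixel (r, c), clipped at the image border."""
    h, w = len(img), len(img[0])
    return [img[rr][cc]
            for rr in range(max(0, r - 1), min(h, r + 2))
            for cc in range(max(0, c - 1), min(w, c + 2))]


def map_windows(img, fn):
    """Apply fn to the neighbourhood of every pixel."""
    return [[fn(window(img, r, c)) for c in range(len(img[0]))]
            for r in range(len(img))]


def detect_lanes(img, roi_top, roi_bottom, threshold):
    roi = img[roi_top:roi_bottom]                        # 1. region of interest
    smooth = map_windows(roi, lambda px: int(median(px)))  # 2. median filter
    binary = [[1 if p >= threshold else 0 for p in row]  # 3. binarization
              for row in smooth]
    return map_windows(binary, min)                      # 4. erosion


# Hypothetical 7x7 frame: two bright "sky" rows, a 3-pixel-wide lane
# stripe in columns 2-4, and one isolated noise pixel at row 4, column 0.
frame = [[200] * 7 for _ in range(2)]
frame += [[10, 10, 255, 255, 255, 10, 10] for _ in range(5)]
frame[4][0] = 255

lanes = detect_lanes(frame, roi_top=2, roi_bottom=7, threshold=128)
for row in lanes:
    print(row)
```

Each stage depends only on a small neighbourhood of every pixel, which is what makes the algorithm a natural fit for an array of processing elements.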

The Design and Simulation of Out-of-Order Execution Processor using Tomasulo Algorithm (토마술로 알고리즘을 이용하는 비순차실행 프로세서의 설계 및 모의실행)

  • Lee, Jongbok
    • The Journal of the Institute of Internet, Broadcasting and Communication / v.20 no.4 / pp.135-141 / 2020
  • Today, CPUs in general-purpose computers such as servers, desktops, and laptops, as well as in home appliances and embedded systems, consist mostly of multicore processors. To improve performance, each core needs to be an out-of-order execution processor based on the Tomasulo algorithm. An out-of-order execution processor using the Tomasulo algorithm can execute available instructions in any order and can speculate to reduce the effect of control dependencies. As a result, its performance can be significantly better than that of an in-order execution processor. In this paper, an out-of-order execution processor using the Tomasulo algorithm and the ARM instruction set is designed using VHDL record data types and simulated with GHDL. The simulated processor successfully executes programs written in ARM instructions.
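The core idea, that an instruction may start as soon as its operands are available regardless of program order, can be sketched with a small dataflow model. This is a deliberate simplification and not the paper's VHDL design: it assumes unlimited reservation stations and functional units, ignores speculation and in-order commit, and uses hypothetical latencies and generic register names.

```python
def start_cycles(program, latency):
    """Earliest cycle at which each instruction can begin executing.

    program: list of (op, dest, src1, src2) in program order.
    Register renaming is modelled by tracking only the most recent
    producer of each register, so only true (read-after-write)
    dependencies delay an instruction.
    """
    ready_at = {}   # register -> cycle at which its newest value is ready
    starts = []
    for op, dest, *srcs in program:
        start = max((ready_at.get(s, 0) for s in srcs), default=0)
        starts.append(start)
        ready_at[dest] = start + latency[op]
    return starts


program = [
    ("mul", "r1", "r2", "r3"),
    ("add", "r4", "r1", "r5"),   # must wait for the mul result
    ("sub", "r6", "r7", "r8"),   # independent, can start at once
    ("add", "r1", "r6", "r9"),   # reuses r1; renaming removes the hazard
]
latency = {"mul": 4, "add": 1, "sub": 1}   # assumed latencies
print(start_cycles(program, latency))
```

The third and fourth instructions begin before the second one, even though they come later in program order.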

A Resource-Aware Mapping Algorithm for Coarse-Grained Reconfigurable Architecture Using List Scheduling (리스트 스케줄링을 통한 Coarse-Grained 재구성 구조의 맵핑 알고리즘 개발)

  • Kim, Hyun-Jin;Hong, Hye-Jeong;Kim, Hong-Sik;Kang, Sung-Ho
    • Journal of the Institute of Electronics Engineers of Korea SD / v.46 no.6 / pp.58-64 / 2009
  • For reconfigurable computing to succeed, the algorithm that maps operations onto a coarse-grained reconfigurable architecture is very important. This paper proposes a resource-aware mapping system for coarse-grained reconfigurable architectures and its underlying heuristic algorithm. Operation assignment and routing path allocation are performed simultaneously using a cycle-accurate, time-exclusive resource model. The proposed algorithm uses the list scheduling heuristic to minimize communication resource usage and global memory accesses. The operations to be mapped are prioritized according to general properties of the data flow. Evaluations of the proposed algorithm show that performance is significantly enhanced in several benchmark applications.
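For readers unfamiliar with list scheduling, the sketch below shows the basic heuristic on a small data-flow graph. It is not the paper's resource-aware mapper (there is no routing or memory model here); it assumes unit-latency operations, identical processing units, and a priority equal to the longest path to a sink, and all names are illustrative.

```python
def list_schedule(deps, num_units):
    """Plain list scheduling of a DAG onto identical units.

    deps: op -> list of predecessor ops (every op appears as a key).
    Returns op -> cycle in which the op executes.
    """
    succs = {op: [] for op in deps}
    for op, preds in deps.items():
        for p in preds:
            succs[p].append(op)

    height = {}

    def longest_path(op):
        # Priority: number of ops on the longest path from op to a sink.
        if op not in height:
            height[op] = 1 + max((longest_path(s) for s in succs[op]),
                                 default=0)
        return height[op]

    cycle_of, cycle = {}, 0
    while len(cycle_of) < len(deps):
        # Ready ops: unscheduled, with all predecessors done in earlier cycles.
        ready = [op for op in deps
                 if op not in cycle_of and all(p in cycle_of for p in deps[op])]
        ready.sort(key=lambda op: (-longest_path(op), op))
        for op in ready[:num_units]:
            cycle_of[op] = cycle
        cycle += 1
    return cycle_of


graph = {"a": [], "b": [], "c": ["a"], "d": ["c"], "e": ["b"]}
print(list_schedule(graph, num_units=2))
```

With two units the five operations finish in three cycles; with a single unit the same priorities serialize them over five cycles.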

The PC Clustering of the SIMD Structure for a Distributed Process of On-line Contingency (온라인 선로상정사고 분산처리를 위한 SIMD 구조의 PC 클러스터링)

  • Jang, Se-Hwan;Kim, Jin-Ho;Park, June-Ho
    • The Transactions of The Korean Institute of Electrical Engineers / v.57 no.7 / pp.1150-1156 / 2008
  • This paper introduces PC clustering with a SIMD structure for the distributed processing of on-line contingency analysis, which is used to assess the static security of a power system. On-line contingency analysis of a large-scale power system requires a high-speed execution platform. We therefore built a PC cluster system using a SIMD-structured PC clustering method and applied it to a power system; the cluster offers good high-speed execution at a low price. SIMD (single instruction stream, multiple data stream) is a structure in which processes are controlled by a single signal. The PC cluster system consists of 8 PCs. Each PC has a 2 GHz Pentium 4 CPU and is connected to the others through an Ethernet switch over Fast Ethernet. We consider N-1 line contingencies, which have a realistically high probability of occurring. We propose a distributed processing algorithm with a SIMD structure to reduce the excessive execution time of on-line N-1 line contingency analysis in a large-scale power system. We verified the usefulness of the proposed algorithm and the constructed PC cluster system on the IEEE 39-bus and 118-bus systems.
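The SIMD-style distribution described above amounts to running the same analysis routine on different subsets of the contingency list. The sketch below shows only that partitioning step, assuming a round-robin split; the abstract does not state how the authors divide the cases, and the number of lines used here is hypothetical.

```python
def split_contingencies(line_ids, num_workers):
    """Round-robin split of N-1 line outage cases across the workers."""
    return [line_ids[w::num_workers] for w in range(num_workers)]


# Hypothetical system with 20 lines, distributed over the 8 PCs
# mentioned in the abstract. Each PC would run the same analysis
# routine on its own share of the outage cases.
shares = split_contingencies(list(range(20)), num_workers=8)
for pc, share in enumerate(shares):
    print(f"PC {pc}: outage cases {share}")
```

A round-robin split keeps the shares within one case of each other, which matters when every case costs roughly the same to evaluate.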

Performance Characteristics of Subband Adaptive Array Antenna using Kalman Algorithm (Kalman 알고리즘에 의한 대역분할. 합성형 어댑티브 어레이 안테나의 동작 특성)

  • 박재성;오경석;주창복;박남천;정주수
    • Journal of the Korea Institute of Information and Communication Engineering / v.3 no.3 / pp.501-507 / 1999
  • For a mobile unit to adapt to the propagation environment, the weight coefficient vector of its adaptive array antenna must be adapted very quickly. In this paper, a subband adaptive array signal processing method for a linear array antenna, using the LMS and Kalman filter algorithms, is proposed for BPSK and BFSK signals with S/I=2 and S/N=10. For a 4-element equidistant linear array antenna system, the LMS and Kalman algorithms are adopted with subband adaptive instruction principles using the subband signal processing method, and computer simulation results for constant-envelope signals such as BPSK and BFSK show that the convergence of the directional patterns and the signal-following characteristics are faster and more stable.
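For reference, the standard complex LMS weight update for an adaptive array can be written as below. This is the textbook full-band form, not the paper's subband variant; the symbols (weight vector w, input snapshot x, reference signal d, step size μ) are notation introduced here rather than taken from the abstract.

```latex
\begin{aligned}
y(n)              &= \mathbf{w}^{H}(n)\,\mathbf{x}(n) \\
e(n)              &= d(n) - y(n) \\
\mathbf{w}(n+1)   &= \mathbf{w}(n) + \mu\,\mathbf{x}(n)\,e^{*}(n)
\end{aligned}
```

For the 4-element array in the abstract, w(n) and x(n) would each have four complex entries. The step size μ controls the trade-off between convergence speed and stability.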

Hardware Design of VLIW coprocessor for Computer Vision Application (컴퓨터 비전 응용을 위한 VLIW 보조프로세서의 하드웨어 설계)

  • Choi, Byeong-Yoon
    • Journal of the Korea Institute of Information and Communication Engineering / v.18 no.9 / pp.2189-2196 / 2014
  • In this paper, a VLIW (Very Long Instruction Word) vision coprocessor that can efficiently accelerate computer vision algorithms for automotive applications is designed. The VLIW coprocessor executes four instructions per clock cycle in an 8-stage pipeline and has 36 integer and floating-point instructions to accelerate computer vision algorithms for pedestrian detection. The processor has an operating frequency of about 300 MHz and about 210,900 gates in 45 nm CMOS technology, and its estimated performance is 1.2 GOPS (Giga Operations Per Second). A vision system composed of a vision primitive engine and eight VLIW coprocessors can execute pedestrian detection at 25 to 29 frames per second (FPS). Because the VLIW coprocessor has a high detection rate and a loosely coupled interface with the host processor, it can be applied efficiently to a wide range of vision applications.
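The 1.2 GOPS estimate is consistent with the issue width and clock frequency quoted in the abstract, as the short calculation below shows. It assumes that each issued instruction counts as one operation, which the abstract does not state explicitly; the function name is illustrative.

```python
def peak_gops(instructions_per_cycle: int, clock_mhz: int) -> float:
    """Peak throughput in giga-operations per second.

    Assumes one operation per issued instruction.
    """
    return instructions_per_cycle * clock_mhz / 1000


# Four instructions per cycle at about 300 MHz, as quoted in the abstract.
print(peak_gops(4, 300))
```

This is a peak figure: sustained throughput depends on how often all four issue slots can actually be filled.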

Suggestion of CPA Attack and Countermeasure for Super-Light Block Cryptographic CHAM (초경량 블록 암호 CHAM에 대한 CPA 공격과 대응기법 제안)

  • Kim, Hyun-Jun;Kim, Kyung-Ho;Kwon, Hyeok-Dong;Seo, Hwa-Jeong
    • KIPS Transactions on Computer and Communication Systems / v.9 no.5 / pp.107-112 / 2020
  • The ultra-lightweight block cipher CHAM is an algorithm built from addition, rotation, and XOR operations that run efficiently on resource-constrained devices. CHAM shows high computational performance, especially on IoT platforms. However, lightweight block encryption algorithms used on the Internet of Things may be vulnerable to side-channel analysis. In this paper, we demonstrate the vulnerability to side-channel attack by attempting a first power analysis attack against CHAM. In addition, a safe algorithm is proposed and implemented by applying a masking technique to defend against the attack. The implementation provides an efficient and secure CHAM block cipher using the instruction set of an 8-bit AVR processor.
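The masking idea can be illustrated for the two linear operations of an ARX cipher, XOR and rotation. The sketch below uses first-order Boolean masking on 16-bit words; the word size, the function names, and the fixed example masks are assumptions, and this is not the authors' AVR implementation. Modular addition is not linear over XOR and needs a separate technique that the abstract does not describe, so it is left out.

```python
import secrets

WORD_MASK = 0xFFFF  # 16-bit words (an assumption for this sketch)


def rol16(x: int, r: int) -> int:
    """Rotate a 16-bit word left by r bits."""
    r %= 16
    return ((x << r) | (x >> (16 - r))) & WORD_MASK


def mask(value: int, rand: int) -> tuple:
    """Split value into two shares whose XOR equals value."""
    return (value ^ rand, rand)


def unmask(shares: tuple) -> int:
    return shares[0] ^ shares[1]


def masked_xor(a: tuple, b: tuple) -> tuple:
    # XOR is applied share by share; the secret is never recombined.
    return (a[0] ^ b[0], a[1] ^ b[1])


def masked_rol(a: tuple, r: int) -> tuple:
    # Rotation is linear over XOR, so rotating both shares is enough.
    return (rol16(a[0], r), rol16(a[1], r))


x, y = 0x1234, 0xBEEF
mx = mask(x, secrets.randbits(16))   # fresh random mask per value
my = mask(y, secrets.randbits(16))
print(hex(unmask(masked_xor(mx, my))), hex(unmask(masked_rol(mx, 3))))
```

Because each intermediate share is XORed with a fresh random value, the power consumed while processing a single share is not correlated with the unmasked secret.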