• Title/Summary/Keyword: Parallel processor

Search Result 485, Processing Time 0.024 seconds

An Efficient Hardware Implementation of Lightweight Block Cipher LEA-128/192/256 for IoT Security Applications (IoT 보안 응용을 위한 경량 블록암호 LEA-128/192/256의 효율적인 하드웨어 구현)

  • Sung, Mi-Ji;Shin, Kyung-Wook
    • Journal of the Korea Institute of Information and Communication Engineering
    • /
    • v.19 no.7
    • /
    • pp.1608-1616
    • /
    • 2015
  • This paper describes an efficient hardware implementation of lightweight encryption algorithm LEA-128/192/256 which supports for three master key lengths of 128/192/256-bit. To achieve area-efficient and low-power implementation of LEA crypto- processor, the key scheduler block is optimized to share hardware resources for encryption/decryption key scheduling of three master key lengths. In addition, a parallel register structure and novel operating scheme for key scheduler is devised to reduce clock cycles required for key scheduling, which results in an increase of encryption/decryption speed by 20~30%. The designed LEA crypto-processor has been verified by FPGA implementation. The estimated performances according to master key lengths of 128/192/256-bit are 181/162/109 Mbps, respectively, at 113 MHz clock frequency.

Technology of Non-destructive Stress Measurement in Spot Welded Joint using ESPI Method (ESPI법에 의한 스폿 용접부의 비파괴적 응력측정 기술)

  • 김덕중;국정한;오세용;김봉중;유원일;김영호
    • Journal of the Korea Academia-Industrial cooperation Society
    • /
    • v.1 no.1
    • /
    • pp.23-26
    • /
    • 2000
  • In spot welded joint. Electronic Speckle Pattern Interferometry(ESPI) method using the Model 95 Ar laser a video system and an image processor was applied to measure the stress Unlike traditional strain gauges or Moire method, ESPI method has no special surface preparation or attachments and can be measured in-plane displacement with non-contact and real time. In this experiment, specimens are loaded in parallel with a load cell. The specimens are made of the cold rolled steel sheet with 1mm thickness, are attached strain. gauges. This study Provides an example of how ESPI has been used to measure stress and strain inspecimen. The results measured by ESPI are compared with the data which was measured by strain gauge method under tensile testing.

  • PDF

A Scalable ECC Processor for Elliptic Curve based Public-Key Cryptosystem (타원곡선 기반 공개키 암호 시스템 구현을 위한 Scalable ECC 프로세서)

  • Choi, Jun-Baek;Shin, Kyung-Wook
    • Journal of the Korea Institute of Information and Communication Engineering
    • /
    • v.25 no.8
    • /
    • pp.1095-1102
    • /
    • 2021
  • A scalable ECC architecture with high scalability and flexibility between performance and hardware complexity is proposed. For architectural scalability, a modular arithmetic unit based on a one-dimensional array of processing element (PE) that performs finite field operations on 32-bit words in parallel was implemented, and the number of PEs used can be determined in the range of 1 to 8 for circuit synthesis. A scalable algorithms for word-based Montgomery multiplication and Montgomery inversion were adopted. As a result of implementing scalable ECC processor (sECCP) using 180-nm CMOS technology, it was implemented with 100 kGEs and 8.8 kbits of RAM when NPE=1, and with 203 kGEs and 12.8 kbits of RAM when NPE=8. The performance of sECCP with NPE=1 and NPE=8 was analyzed to be 110 PSMs/sec and 610 PSMs/sec, respectively, on P256R elliptic curve when operating at 100 MHz clock.

Implementation of Massive FDTD Simulation Computing Model Based on MPI Cluster for Semi-conductor Process (반도체 검증을 위한 MPI 기반 클러스터에서의 대용량 FDTD 시뮬레이션 연산환경 구축)

  • Lee, Seung-Il;Kim, Yeon-Il;Lee, Sang-Gil;Lee, Cheol-Hoon
    • The Journal of the Korea Contents Association
    • /
    • v.15 no.9
    • /
    • pp.21-28
    • /
    • 2015
  • In the semi-conductor process, a simulation process is performed to detect defects by analyzing the behavior of the impurity through the physical quantity calculation of the inner element. In order to perform the simulation, Finite-Difference Time-Domain(FDTD) algorithm is used. The improvement of semiconductor which is composed of nanoscale elements, the size of simulation is getting bigger. Problems that a processor such as CPU or GPU cannot perform the simulation due to the massive size of matrix or a computer consist of multiple processors cannot handle a massive FDTD may come up. For those problems, studies are performed with parallel/distributed computing. However, in the past, only single type of processor was used. In GPU's case, it performs fast, but at the same time, it has limited memory. On the other hand, in CPU, it performs slower than that of GPU. To solve the problem, we implemented a computing model that can handle any FDTD simulation regardless of size on the cluster which consist of heterogeneous processors. We tested the simulation on processors using MPI libraries which is based on 'point to point' communication and verified that it operates correctly regardless of the number of node and type. Also, we analyzed the performance by measuring the total execution time and specific time for the simulation on each test.

Parallel SystemC Cosimulation using Virtual Synchronization (가상 동기화 기법을 이용한 SystemC 통합시뮬레이션의 병렬 수행)

  • Yi, Young-Min;Kwon, Seong-Nam;Ha, Soon-Hoi
    • Journal of KIISE:Computer Systems and Theory
    • /
    • v.33 no.12
    • /
    • pp.867-879
    • /
    • 2006
  • This paper concerns fast and time accurate HW/SW cosimulation for MPSoC(Multi-Processor System-on-chip) architecture where multiple software and/or hardware components exist. It is becoming more and more common to use MPSoC architecture to design complex embedded systems. In cosimulation of such architecture, as the number of the component simulators participating in the cosimulation increases, the time synchronization overhead among simulators increases, thereby resulting in low overall cosimulation performance. Although SystemC cosimulation frameworks show high cosimulation performance, it is in inverse proportion to the number of simulators. In this paper, we extend the novel technique, called virtual synchronization, which boosts cosimulation speed by reducing time synchronization overhead: (1) SystemC simulation is supported seamlessly in the virtual synchronization framework without requiring the modification on SystemC kernel (2) Parallel execution of component simulators with virtual synchronization is supported. We compared the performance and accuracy of the proposed parallel SystemC cosimulation framework with MaxSim, a well-known commercial SystemC cosimulation framework, and the proposed one showed 11 times faster performance for H.263 decoder example, while the accuracy was maintained below 5%.

8.1 Gbps High-Throughput and Multi-Mode QC-LDPC Decoder based on Fully Parallel Structure (전 병렬구조 기반 8.1 Gbps 고속 및 다중 모드 QC-LDPC 복호기)

  • Jung, Yongmin;Jung, Yunho;Lee, Seongjoo;Kim, Jaeseok
    • Journal of the Institute of Electronics and Information Engineers
    • /
    • v.50 no.11
    • /
    • pp.78-89
    • /
    • 2013
  • This paper proposes a high-throughput and multi-mode quasi-cyclic (QC) low-density parity-check (LDPC) decoder based on a fully parallel structure. The proposed QC-LDPC decoder employs the fully parallel structure to provide very high throughput. The high interconnection complexity, which is the general problem in the fully parallel structure, is solved by using a broadcasting-based sum-product algorithm and proposing a low-complexity cyclic shift network. The high complexity problem, which is caused by using a large amount of check node processors and variable node processors, is solved by proposing a combined check and variable node processor (CCVP). The proposed QC-LDPC decoder can support the multi-mode decoding by proposing a routing-based interconnection network, the flexible CCVP and the flexible cyclic shift network. The proposed QC-LDPC decoder is operated at 100 MHz clock frequency. The proposed QC-LDPC decoder supports multi-mode decoding and provides 8.1 Gbps throughput for a (1944, 1620) QC-LDPC code.

Real-Time Power-Saving Scheduling Based on Genetic Algorithms in Multi-core Hybrid Memory Environments (멀티코어 이기종메모리 환경에서의 유전 알고리즘 기반 실시간 전력 절감 스케줄링)

  • Yoo, Suhyeon;Jo, Yewon;Cho, Kyung-Woon;Bahn, Hyokyung
    • The Journal of the Institute of Internet, Broadcasting and Communication
    • /
    • v.20 no.1
    • /
    • pp.135-140
    • /
    • 2020
  • Recently, due to the rapid diffusion of intelligent systems and IoT technologies, power saving techniques in real-time embedded systems has become important. In this paper, we propose P-GA (Parallel Genetic Algorithm), a scheduling algorithm aims at reducing the power consumption of real-time systems in multi-core hybrid memory environments. P-GA improves the Proportional-Fairness (PF) algorithm devised for multi-core environments by combining the dynamic voltage/frequency scaling of the processor with the nonvolatile memory technologies. Specifically, P-GA applies genetic algorithms for optimizing the voltage and frequency modes of processors and the memory types, thereby minimizing the power consumptions of the task set. Simulation experiments show that the power consumption of P-GA is reduced by 2.85 times compared to the conventional schemes.

Determination of a Grain Size for Reducing Cache Miss Rate of Direct-Mapped Caches (직접 사상 캐쉬의 캐쉬 실패율을 감소시키기 위한 성김도 정책)

  • Jung, In-Bum;Kong, Ki-Sok;Lee, Joon-Won
    • Journal of KIISE:Computer Systems and Theory
    • /
    • v.27 no.7
    • /
    • pp.665-674
    • /
    • 2000
  • In data parallel programs incurring high cache locality, the choice of grain sizes affects cache performance. Though the grain sizes chosen provide fair load balance among processors, the grain sizes that ignore underlying caching effect result in address interferences between grains allocated to a processor. These address interferences appear to have a negative impact on the cache locality, since they result in cache conflict misses. To address this problem, we propose a best grain size driven from a cache size and the number of processors based on direct mapped cache's characteristic. Since the proposed method does not map the grains to the same location in the cache, cache conflict misses are reduced. Simulation results show that the proposed best grain size substantially improves the performance of tested data parallel programs through the reduction of cache misses on direct-mapped caches.

  • PDF

Peducing the Overhead of Virtual Address Translation Process (가상주소 변환 과정에 대한 부담의 줄임)

  • U, Jong-Jeong
    • The Transactions of the Korea Information Processing Society
    • /
    • v.3 no.1
    • /
    • pp.118-126
    • /
    • 1996
  • Memory hierarchy is a useful mechanism for improving the memory access speed and making the program space larger by layering the memories and separating program spaces from memory spaces. However, it needs at least two memory accesses for each data reference : a TLB(Translation Lookaside Buffer) access for the address translation and a data cache access for the desired data. If the cache size increases to the multiplication of page size and the cache associativity, it is difficult to access the TLB with the cache in parallel, thereby making longer the critical timing path in the processor. To achieve such parallel accesses, we present the hybrid mapped TLB which combines a direct mapped TLB with a very small fully-associative mapped TLB. The former can reduce the TLB access time. while the latter removes the conflict misses from the former. The trace-driven simulation shows that under given workloads the proposed TLB is effective even when a fully-associative mapped TLB with only four entries is added because the effects of its increased misses are offset by its speed benefits.

  • PDF

DSP Implementation and Open Sea Test of Underwater Image Transmission System Using QPSK Scheme (QPSK 방식을 이용한 수중영상 정보전송 시스템의 DSP구현 및 실해역 실험 연구)

  • 박종원;고학림;이덕환;최영철;김시문;김승근;임용곤
    • The Journal of the Acoustical Society of Korea
    • /
    • v.23 no.2
    • /
    • pp.117-124
    • /
    • 2004
  • In this paper, we have been implemented the QPSK-based underwater transmission systems using DSP in order to transmit the underwater image data. We have adopted a BDPA (Block Data Parallel Architecture) to control multiple DSPs used in the transmitter and receiver in order to transmit the image data in real-time. We also have developed GUI software in order to drive and to debug the implemanted system in real-time. We have executed open sea tests in order to analyze the performance of the implemented system at East Sea near Kosung in Kangwon-Do. As a result of these experiments, it has been demonstrated that 10 kbps image data can be received without errors at 30m and 80m depth points, while the distance between the transmitter and the receiver is up to 20m.