• Title/Summary/Keyword: Parallel processor

Search Results: 482

Optimization Study of Toom-Cook Algorithm in NIST PQC SABER Utilizing ARM/NEON Processor (ARM/NEON 프로세서를 활용한 NIST PQC SABER에서 Toom-Cook 알고리즘 최적화 구현 연구)

  • Song, JinGyo;Kim, YoungBeom;Seo, Seog Chung
    • Journal of the Korea Institute of Information Security & Cryptology, v.31 no.3, pp.463-471, 2021
  • Since 2016, the National Institute of Standards and Technology (NIST) has been conducting a post-quantum cryptography standardization project in preparation for the quantum computing era. The third round is currently in progress, and most of the candidates (5 of 7) are lattice-based. Lattice-based post-quantum cryptography is considered applicable even in resource-constrained embedded environments because it offers efficient operations and moderate key lengths. Among the candidates, SABER KEM adopts an efficient (power-of-two) modulus and the Toom-Cook algorithm to handle its computation-intensive polynomial multiplication. In this paper, we present an optimized implementation of the evaluation and interpolation steps of SABER's Toom-Cook algorithm using ARM/NEON on the ARMv8-A platform. For the evaluation step, we propose an efficient ARM/NEON interleaving method; for the interpolation step, we introduce an optimized implementation methodology applicable to various embedded environments. As a result, the proposed implementation is 3.5 times faster in the evaluation step and 5 times faster in the interpolation step than the previous reference implementation.
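
As a rough illustration of the evaluation step that the paper vectorizes with ARM/NEON, the sketch below evaluates a 256-coefficient polynomial, split into four 64-coefficient limbs, at the seven Toom-Cook-4 points {0, 1, -1, 2, -2, 3, ∞} in plain scalar C. The limb size, point set, and use of uint16_t arithmetic (which wraps modulo 2^16 and is compatible with SABER's power-of-two modulus) are common conventions assumed here; the paper's NEON interleaving is not reproduced.

```c
#include <stdint.h>

#define LIMB 64  /* 256-coefficient polynomial split into 4 limbs of 64 */

/* Scalar Toom-Cook-4 evaluation: given the limbs of A(y) = a0 + a1*y + a2*y^2 + a3*y^3,
 * compute the seven point evaluations w[0..6] at y = 0, 1, -1, 2, -2, 3, infinity.
 * All arithmetic is modulo 2^16 via uint16_t wrap-around. */
static void toom4_evaluate(const uint16_t a0[LIMB], const uint16_t a1[LIMB],
                           const uint16_t a2[LIMB], const uint16_t a3[LIMB],
                           uint16_t w[7][LIMB])
{
    for (int i = 0; i < LIMB; i++) {
        uint16_t even  = (uint16_t)(a0[i] + a2[i]);          /* a0 + a2   */
        uint16_t odd   = (uint16_t)(a1[i] + a3[i]);          /* a1 + a3   */
        uint16_t even2 = (uint16_t)(a0[i] + 4 * a2[i]);      /* a0 + 4a2  */
        uint16_t odd2  = (uint16_t)(2 * a1[i] + 8 * a3[i]);  /* 2a1 + 8a3 */

        w[0][i] = a0[i];                          /* y = 0        */
        w[1][i] = (uint16_t)(even + odd);         /* y = 1        */
        w[2][i] = (uint16_t)(even - odd);         /* y = -1       */
        w[3][i] = (uint16_t)(even2 + odd2);       /* y = 2        */
        w[4][i] = (uint16_t)(even2 - odd2);       /* y = -2       */
        w[5][i] = (uint16_t)(a0[i] + 3 * a1[i] + 9 * a2[i] + 27 * a3[i]); /* y = 3 */
        w[6][i] = a3[i];                          /* y = infinity */
    }
}
```

A NEON version would perform these per-coefficient additions on whole vectors of coefficients at a time; the interleaving strategy for doing so efficiently is the paper's contribution.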

Semantic Depth Data Transmission Reduction Techniques using Frame-to-Frame Masking Method for Light-weighted LiDAR Signal Processing Platform (LiDAR 신호처리 플랫폼을 위한 프레임 간 마스킹 기법 기반 유효 데이터 전송량 경량화 기법)

  • Chong, Taewon;Park, Daejin
    • Journal of the Korea Institute of Information and Communication Engineering, v.25 no.12, pp.1859-1867, 2021
  • Multiple LiDAR sensors are being mounted on autonomous vehicles, so a system that processes data from multiple LiDAR sensors is required. When the sensor data are transmitted to and processed by the main processor, the huge data volume places a load on the transport network and on data processing. To minimize this load on the LiDAR sensor processors, only semantic (changed) data are transmitted, based on a frame-to-frame comparison of the LiDAR data. Data from four LiDAR sensors were processed in a static environment with no moving objects and in a dynamic environment in which a person moves within the sensors' field of view. In the static environment, the transmitted data were reduced by 89.5%, from 232,104 to 26,110 bytes; in the dynamic environment, they were reduced by 88.1%, to 29,179 bytes.
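
The frame-to-frame masking idea lends itself to a very small sketch: keep the previous depth frame, and transmit only the points whose value changed beyond a threshold. The flat per-frame depth array, the 16-bit depth type, and the threshold are assumptions for illustration, not the paper's exact data format.

```c
#include <stddef.h>
#include <stdint.h>

/* One changed measurement: point index plus its new depth value. */
typedef struct {
    uint32_t index;
    uint16_t depth;
} delta_t;

/* Frame-to-frame masking sketch: compare the current depth frame with the
 * previous one and collect only the points whose depth changed by more than
 * `threshold`. Returns the number of deltas written to `out` (capacity n);
 * only these (index, depth) pairs would be transmitted to the main processor. */
static size_t mask_frame(const uint16_t *prev, const uint16_t *curr,
                         size_t n, uint16_t threshold, delta_t *out)
{
    size_t count = 0;
    for (size_t i = 0; i < n; i++) {
        uint16_t diff = (prev[i] > curr[i]) ? (uint16_t)(prev[i] - curr[i])
                                            : (uint16_t)(curr[i] - prev[i]);
        if (diff > threshold) {
            out[count].index = (uint32_t)i;
            out[count].depth = curr[i];
            count++;
        }
    }
    return count;
}
```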

A Fast Processor Architecture and 2-D Data Scheduling Method to Implement the Lifting Scheme 2-D Discrete Wavelet Transform (리프팅 스킴의 2차원 이산 웨이브릿 변환 하드웨어 구현을 위한 고속 프로세서 구조 및 2차원 데이터 스케줄링 방법)

  • Kim Jong Woog;Chong Jong Wha
    • Journal of the Institute of Electronics Engineers of Korea SD, v.42 no.4 s.334, pp.19-28, 2005
  • In this paper, we propose a fast parallel 2-D discrete wavelet transform hardware architecture based on the lifting scheme. The proposed architecture improves 2-D processing speed and reduces the internal buffer memory size. Previous lifting-based parallel 2-D wavelet transform architectures consisted of row-direction and column-direction modules, each a pair of prediction and update filter modules. In the 2-D wavelet transform, column-direction processing uses the row-direction results, which are generated in row order rather than column order, so most hardware architectures need an internal buffer memory. The proposed architecture therefore focuses on reducing both the internal buffer memory size and the total calculation time. To reduce the total calculation time, we propose a 4-way data-flow scheduling and a memory-based parallel hardware architecture. The 4-way data-flow scheduling increases row-direction parallelism and reduces the initial latency of starting the row-direction calculation. In this architecture, the internal buffer memory is not used to store the results of the row-direction calculation; instead, it holds intermediate values of the column-direction calculation. This is very effective for column-direction processing, because the column-direction input data are not generated in column order. The proposed architecture was implemented in VHDL on an Altera Stratix device. The implementation results show that the overall calculation time is reduced from $N^2/2+\alpha$ to $N^2/4+\beta$ and that the internal buffer memory is about $50\%$ smaller than in previous works.
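
For readers unfamiliar with the lifting scheme, the sketch below shows one level of the integer LeGall 5/3 lifting transform on a 1-D signal, with its predict and update steps; a 2-D transform applies this kernel to every row and then to every column, which is exactly where the row-to-column ordering problem described above arises. The 5/3 filter choice and boundary handling are illustrative assumptions; the paper's 4-way scheduling and hardware mapping are not modeled here.

```c
#include <stddef.h>

/* One level of the integer LeGall 5/3 lifting transform on a 1-D signal of
 * even length n. Low-pass ("update") outputs go to s[0..n/2-1], high-pass
 * ("predict") outputs to d[0..n/2-1]. The >> operators implement floor
 * division by 2 and 4 (assuming the usual arithmetic right shift). */
static void lift53_forward(const int *x, size_t n, int *s, int *d)
{
    size_t half = n / 2;

    /* Predict: d[i] = x_odd[i] - floor((x_even[i] + x_even[i+1]) / 2) */
    for (size_t i = 0; i < half; i++) {
        int left  = x[2 * i];
        int right = (2 * i + 2 < n) ? x[2 * i + 2] : x[2 * i];  /* mirror edge */
        d[i] = x[2 * i + 1] - ((left + right) >> 1);
    }

    /* Update: s[i] = x_even[i] + floor((d[i-1] + d[i] + 2) / 4) */
    for (size_t i = 0; i < half; i++) {
        int dl = (i > 0) ? d[i - 1] : d[0];                     /* mirror edge */
        s[i] = x[2 * i] + ((dl + d[i] + 2) >> 2);
    }
}
```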

Design and Implementation of the Central Queue Based Loop Scheduling Method (중앙 큐 기반의 루프 스케쥴링 기법의 설계 및 구현)

  • Kim, Hyun-Chul;Kim, Hyo-Cheol;Yoo, Kee-Young
    • Journal of the Institute of Electronics Engineers of Korea CI, v.38 no.5, pp.16-26, 2001
  • In this paper, we present a new scheduling method called CDSS (Carried-Dependence Self-Scheduling) for efficiently executing loops with dependences between iterations, based on a central queue, and implement it on a shared-memory system in Java. We also study modifications that convert existing central-task-queue self-scheduling methods for parallel loops into the same form, applied to loops with loop-carried dependences. The proposed method is a self-scheduling scheme that assigns loop iterations at three levels, taking into account the synchronization points determined by the dependence distance of the loop. To adapt the proposed scheme and the modified methods to various platforms, including uniprocessor systems, we implement them with threads. Compared with other assignment algorithms under various changes of application and system parameters, CDSS is more efficient in overall execution time, including scheduling overhead. CDSS improves performance over modified SS, Factoring, GSS and CSS by about 0.02, 40.5, 46.1 and 53.6%, respectively. In CDSS, the best performance across the varying application programs is achieved with a small number of threads equal to the dependence distance.

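As a point of reference for the scheduling schemes compared above, the sketch below shows plain central-queue self-scheduling with POSIX threads: every worker repeatedly grabs the next chunk of iterations from a shared counter. It is written in C with pthreads rather than the paper's Java, the chunk size and thread count are arbitrary, and the extra synchronization that CDSS adds for loop-carried dependences is not modeled.

```c
#include <pthread.h>
#include <stdio.h>

#define N_ITER   1000   /* total loop iterations          */
#define CHUNK    16     /* iterations handed out per grab */
#define N_THREAD 4

static long next_iter = 0;   /* central queue: next unassigned iteration */
static pthread_mutex_t q_lock = PTHREAD_MUTEX_INITIALIZER;
static double result[N_ITER];

/* Worker: repeatedly take the next chunk from the central queue and run it. */
static void *worker(void *arg)
{
    (void)arg;
    for (;;) {
        pthread_mutex_lock(&q_lock);
        long start = next_iter;
        next_iter += CHUNK;
        pthread_mutex_unlock(&q_lock);

        if (start >= N_ITER)
            break;
        long end = (start + CHUNK < N_ITER) ? start + CHUNK : N_ITER;
        for (long i = start; i < end; i++)
            result[i] = i * 0.5;            /* stand-in for the loop body */
    }
    return NULL;
}

int main(void)
{
    pthread_t tid[N_THREAD];
    for (int t = 0; t < N_THREAD; t++)
        pthread_create(&tid[t], NULL, worker, NULL);
    for (int t = 0; t < N_THREAD; t++)
        pthread_join(tid[t], NULL);
    printf("result[%d] = %.1f\n", N_ITER - 1, result[N_ITER - 1]);
    return 0;
}
```

With a loop-carried dependence of distance k, iteration i must wait for iteration i-k, which is consistent with the paper's observation that the best performance is obtained when the number of threads equals the dependence distance.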

A Study on the Parallel Routing in Hybrid Optical Networks-on-Chip (하이브리드 광학 네트워크-온-칩에서 병렬 라우팅에 관한 연구)

  • Seo, Jung-Tack;Hwang, Yong-Joong;Han, Tae-Hee
    • Journal of the Institute of Electronics Engineers of Korea SD, v.48 no.8, pp.25-32, 2011
  • Networks-on-chip (NoC) are emerging as a key technology for overcoming severe bus traffic in increasingly complex multiprocessor systems-on-chip (MPSoC); however, traditional NoC architectures based on electrical interconnects will face bandwidth and power-consumption limits in the near future. To cope with these problems, hybrid optical NoC architectures that use electrical and optical interconnects together have been widely investigated. In hybrid optical NoCs, wormhole switching and simple deterministic X-Y routing are used on the electrical interconnects, which are responsible for setting up the routing path and the optical routers that transmit data through the optical interconnects. The optical NoC uses circuit switching to send payload data over the preset paths and routers. However, conventional hybrid optical NoCs do not allow concurrent transmissions, so the achievable performance improvement is limited. In this paper, we propose a new routing algorithm that uses circuit switching together with an adaptive algorithm on the electrical interconnects to transmit data over multiple paths simultaneously. We also propose an efficient method to prevent livelock problems. Experimental results show up to 60% higher throughput compared to a hybrid optical NoC and 65% lower power consumption compared to an electrical NoC.
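
To make the contrast with deterministic X-Y routing concrete, the sketch below shows a minimal adaptive routing decision in a 2-D mesh: when both minimal directions make progress, the router picks whichever output link is currently free. The enum names and the link-availability callback are illustrative assumptions; the paper's path-setup protocol and livelock-prevention method are not reproduced.

```c
#include <stdbool.h>

typedef enum { DIR_EAST, DIR_WEST, DIR_NORTH, DIR_SOUTH, DIR_LOCAL } dir_t;

/* Link-availability query supplied by the router model (assumed interface). */
typedef bool (*link_free_fn)(int x, int y, dir_t d);

/* Adaptive minimal routing in a 2-D mesh: if both the X and the Y minimal
 * directions make progress toward (dst_x, dst_y), prefer whichever output
 * link is free. Deterministic X-Y routing would always exhaust X first. */
static dir_t route_adaptive(int x, int y, int dst_x, int dst_y,
                            link_free_fn link_free)
{
    if (x == dst_x && y == dst_y)
        return DIR_LOCAL;

    dir_t xdir = (dst_x > x) ? DIR_EAST  : DIR_WEST;
    dir_t ydir = (dst_y > y) ? DIR_NORTH : DIR_SOUTH;

    if (x == dst_x) return ydir;   /* only Y progress remains */
    if (y == dst_y) return xdir;   /* only X progress remains */

    if (link_free(x, y, xdir)) return xdir;   /* both minimal: pick a free link */
    if (link_free(x, y, ydir)) return ydir;
    return xdir;                              /* both busy: wait on the X port */
}
```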

A Study on Welding Deformation of thin plate block in PCTC (PCTC 박판 블록 용접 변형에 관한 연구)

  • Kang, Serng-Ku;Yang, Jong-Su;Kim, Ho-Kyeong
    • Proceedings of the KWS Conference, 2009.11a, pp.97-97, 2009
  • The use of thin plates is increasing due to the need for lighter weight in large ships. Thin plates are easily distorted and retain residual stress from welding heat, so they should be joined carefully to minimize welding deformation, which costs time and money to repair. As one effort to reduce welding deformation, it is very useful to predict the deformation before welding is carried out. There are two methods for analyzing welding deformation: simple linear analysis and nonlinear analysis. Simple linear analysis is an elastic analysis using the equivalent load method or the inherent strain method derived from welding experiments. Nonlinear analysis is a thermo-elastic analysis that accounts for material nonlinearity depending on temperature and time, as well as welding current, voltage, speed, sequence, and constraints. In this study, welding deformation is analyzed with the thermo-elastic method for a PCTC (Pure Car and Truck Carrier), which carries cars and trucks. The PCTC uses thin plates of 6 mm thickness, which are susceptible to welding heat. The analyzed block measures 19,200 mm (length) × 13,825 mm (width) × 376 mm (height). MARC and MENTAT are used as the solver and the pre/post-processor. The boundary conditions are based on the actual situation in the shipyard, and the simulations include convection and gravity. The material of the thin block is mild steel with a yield strength of $235N/mm^2$; the nonlinearity of its conductivity, specific heat, Young's modulus, and yield strength is applied in the simulations. Welding is done in two passes: the first pass lasts 2,100 seconds, followed by a 900-second rest; the second pass lasts 2,100 seconds, followed by a 20,000-second rest. The displacement at 0 seconds is caused by the block's own weight and reaches a maximum of 19 mm at the free side. The welding line expands and shrinks during welding and finally ends up shrunk, resulting in angular distortion of the thin block. The final maximum displacement of 17 mm occurs around the welding line. The maximum residual stress occurs at the welding line, where the stress exceeds the yield strength. The maximum equivalent plastic strain also occurs at the welding line, and the plastic strain of the first pass is larger than that of the second. The flatness of the plate in the longitudinal direction is calculated parallel to the girder direction and compared with the deformation standard of ${\pm}15mm$; the calculated value is within the standard range. The flatness of the plate in the transverse direction is calculated perpendicular to the girder direction and compared with the deformation standard of ${\pm}6mm$; it also satisfies the standard. Plate buckling is calculated between the longitudinals and compared with the deformation standard; all buckling values are within the standard range of ${\pm}6mm$.


Efficient RMESH Algorithms for Computing the Intersection and the Union of Two Visibility Polygons (두 가시성 다각형의 교집합과 합집합을 구하는 효율적인 RMESH 알고리즘)

  • Kim, Soo-Hwan
    • Journal of the Korea Institute of Information and Communication Engineering, v.20 no.2, pp.401-407, 2016
  • We can consider the following problems for two given points p and q in a simple polygon P: (1) compute the set of points of P that are visible from both p and q; (2) compute the set of points of P that are visible from either p or q. These correspond to computing the intersection and the union of two visibility polygons. In this paper, we consider algorithms for solving these problems on a reconfigurable mesh (RMESH for short). The algorithm in [1] can compute the intersection of two general polygons in constant time on an RMESH of size O($n^3$), where n is the total number of vertices of the two polygons. In this paper, we construct the planar subdivision graph as a preprocessing step in constant time on an RMESH of size O($n^2$), using properties of visibility polygons. We then present O($\log^2 n$)-time algorithms for computing both the union and the intersection of two visibility polygons, which improve the processor-time product from O($n^3$) to O($n^2\log^2 n$).
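
The RMESH constructions themselves do not translate directly into sequential code, but the underlying visibility notion does. The sketch below is a naive O(n) CPU check of whether a point x inside a simple polygon P is visible from p, so that "visible from both p and q" is just the conjunction of two such tests; degenerate cases where the sight line grazes a vertex are ignored in this sketch.

```c
#include <stdbool.h>

typedef struct { double x, y; } pt;

/* Cross product (b-a) x (c-a): >0 left turn, <0 right turn, 0 collinear. */
static double cross(pt a, pt b, pt c)
{
    return (b.x - a.x) * (c.y - a.y) - (b.y - a.y) * (c.x - a.x);
}

/* True if segments ab and cd properly cross (strictly, not just touch). */
static bool proper_cross(pt a, pt b, pt c, pt d)
{
    double d1 = cross(a, b, c), d2 = cross(a, b, d);
    double d3 = cross(c, d, a), d4 = cross(c, d, b);
    return d1 * d2 < 0.0 && d3 * d4 < 0.0;
}

/* Naive visibility test: a point x inside simple polygon P (vertices poly[0..n-1]
 * in order) is visible from p if the sight segment p-x properly crosses no edge. */
static bool visible_from(pt p, pt x, const pt *poly, int n)
{
    for (int i = 0; i < n; i++) {
        if (proper_cross(p, x, poly[i], poly[(i + 1) % n]))
            return false;
    }
    return true;
}

/* Membership in the intersection of the two visibility polygons. */
static bool visible_from_both(pt p, pt q, pt x, const pt *poly, int n)
{
    return visible_from(p, x, poly, n) && visible_from(q, x, poly, n);
}
```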

Design and Analysis of a Digit-Serial $AB^{2}$ Systolic Arrays in $GF(2^{m})$ ($GF(2^{m})$ 상에서 새로운 디지트 시리얼 $AB^{2}$ 시스톨릭 어레이 설계 및 분석)

  • Kim Nam-Yeun;Yoo Kee-Young
    • Journal of KIISE: Computer Systems and Theory, v.32 no.4, pp.160-167, 2005
  • Among finite field arithmetic operations, division/inversion is a basic operation for public-key cryptosystems over $GF(2^{m})$, and it can be computed by repeated $AB^{2}$ multiplication. This paper presents a digit-serial-in-serial-out systolic architecture for performing the $AB^{2}$ operation in $GF(2^{m})$. To obtain an L×L digit-serial-in-serial-out architecture, a new $AB^{2}$ algorithm is proposed, together with partitioning, index transformation, and cell merging of the architecture derived from the algorithm. In terms of the area-time product, when the digit size L of the digit-serial architecture is chosen to be less than about m, the proposed digit-serial architecture is more efficient than the bit-parallel architecture, and when L is chosen to be less than about $(1/5)\log_{2}(m+1)$, it is more efficient than the bit-serial architecture. In addition, the area-time product of the pipelined digit-serial $AB^{2}$ systolic architecture is approximately $10.9\%$ lower than that of the non-pipelined one, assuming m=160 and L=8. Additionally, the proposed architecture can serve as a basic building block of a crypto-processor and is well suited to VLSI implementation because of its simplicity, regularity, and pipelinability.
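
A plain software reference of the AB² operation helps to see what the systolic array computes: square B and multiply the result by A, both by MSB-first shift-and-add with reduction modulo an irreducible polynomial. The field size m = 8 and the polynomial below are illustrative choices, and this bit-level loop corresponds to a bit-serial rather than digit-serial formulation.

```c
#include <stdint.h>
#include <stdio.h>

#define M    8       /* illustrative field: GF(2^8)                     */
#define POLY 0x11B   /* x^8 + x^4 + x^3 + x + 1, irreducible over GF(2) */

/* Multiply two field elements by MSB-first shift-and-add, reducing by POLY. */
static uint16_t gf2m_mul(uint16_t a, uint16_t b)
{
    uint16_t r = 0;
    for (int i = M - 1; i >= 0; i--) {
        r <<= 1;                    /* r = r * x                    */
        if (r & (1u << M))
            r ^= POLY;              /* reduce when degree reaches m */
        if (b & (1u << i))
            r ^= a;                 /* add a for this bit of b      */
    }
    return r;
}

/* AB^2 in GF(2^m): square B, then multiply the result by A. */
static uint16_t gf2m_ab2(uint16_t a, uint16_t b)
{
    return gf2m_mul(a, gf2m_mul(b, b));
}

int main(void)
{
    printf("AB^2 = 0x%02X\n", (unsigned)gf2m_ab2(0x57, 0x83));  /* example operands */
    return 0;
}
```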

Parallel Range Query processing on R-tree with Graphics Processing Units (GPU를 이용한 R-tree에서의 범위 질의의 병렬 처리)

  • Yu, Bo-Seon;Kim, Hyun-Duk;Choi, Won-Ik;Kwon, Dong-Seop
    • Journal of Korea Multimedia Society, v.14 no.5, pp.669-680, 2011
  • R-trees are widely used in areas such as geographic information systems, CAD systems, and spatial databases to efficiently index multi-dimensional data. As the data sets used in these areas grow in size and complexity, however, range query operations on R-trees need to become even faster to meet area-specific constraints. To address this problem, there have been various research efforts to accelerate query processing on R-trees by using buffer mechanisms or by parallelizing query processing across multiple disks and processors. As part of these strategies, approaches that parallelize R-tree query processing on graphics processing units (GPUs) have been explored. The use of GPUs can improve performance through faster calculations and fewer disk accesses, but it may add overhead caused by high memory access latency and the low data exchange rate between the GPU and the CPU. In this paper, to address these overheads and use GPUs efficiently, we propose a novel approach that uses the GPU as a buffer to parallelize query processing on the R-tree. The buffering algorithm improves performance by reducing the number of disk accesses and maximizing coalesced memory accesses, thereby minimizing GPU memory access latency. In extensive performance studies, we observed that the proposed approach achieves up to 5 times higher query performance than the original CPU-based R-tree.
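
A CPU-side reference of the range query that the paper offloads makes the parallelization target clear: the work is dominated by MBR-overlap tests over many node entries, which is the kind of work a GPU can evaluate in bulk. The node layout and fan-out below are illustrative assumptions, and the paper's GPU buffering scheme is not shown.

```c
#include <stdbool.h>
#include <stdio.h>

#define FANOUT 4   /* illustrative maximum number of entries per node */

typedef struct { float xmin, ymin, xmax, ymax; } mbr_t;

typedef struct rnode {
    int           count;          /* number of used entries            */
    bool          is_leaf;        /* leaf entries carry object ids     */
    mbr_t         box[FANOUT];    /* bounding rectangle of each entry  */
    struct rnode *child[FANOUT];  /* child pointers (internal nodes)   */
    int           object[FANOUT]; /* object ids (leaf nodes)           */
} rnode_t;

/* Two rectangles overlap iff they overlap on both axes. */
static bool overlaps(const mbr_t *a, const mbr_t *b)
{
    return a->xmin <= b->xmax && b->xmin <= a->xmax &&
           a->ymin <= b->ymax && b->ymin <= a->ymax;
}

/* Recursive range query: report every object whose MBR overlaps `query`. */
static void range_query(const rnode_t *node, const mbr_t *query)
{
    for (int i = 0; i < node->count; i++) {
        if (!overlaps(&node->box[i], query))
            continue;
        if (node->is_leaf)
            printf("hit: object %d\n", node->object[i]);
        else
            range_query(node->child[i], query);
    }
}
```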

Efficient Multiple Joins using the Synchronization of Page Execution Time in Limited Processors Environments (한정된 프로세서 환경에서 체이지 실행시간 동기화를 이용한 효율적인 다중 결합)

  • Lee, Kyu-Ock;Weon, Young-Sun;Hong, Man-Pyo
    • Journal of KIISE: Databases, v.28 no.4, pp.732-741, 2001
  • In relational database systems, the join is one of the most time-consuming query operations, and many parallel join algorithms have been developed to reduce its execution time. The multiple hash join algorithm using an allocation tree is one of the most efficient. However, it may suffer a delay in processing each node of the allocation tree, which occurs in the tuple-probing phase due to the difference between the time to read one page of the outer relation and the time to process the page already read. This delay problem was solved by the concept of page-execution-time synchronization that we proposed previously. In this paper, the performance improvement at each node of the allocation tree is extended to the whole allocation tree, and its performance is evaluated. In addition, we propose an efficient algorithm for multiple hash joins in environments with a limited number of processors, based on the relationship between the number of input relations in the allocation tree and the number of processors allocated to the tree. Finally, we analyze the performance by building an analytical cost model and verify its validity through various performance comparisons with the previous method.

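To fix the build/probe terminology used above, the sketch below is a minimal in-memory hash join between two relations on an integer key: build a hash table on the inner relation, then probe it with the outer relation. The tuple layout and bucket count are illustrative; the paper's allocation tree, page reading, and page-execution-time synchronization are not modeled.

```c
#include <stdio.h>

#define BUCKETS 1024

typedef struct tuple {
    int key;
    int payload;
    struct tuple *next;   /* chaining inside a hash bucket */
} tuple_t;

static unsigned hash_key(int key) { return (unsigned)key % BUCKETS; }

/* Build phase: insert every tuple of the inner relation into the hash table. */
static void build(tuple_t *table[BUCKETS], tuple_t *inner, int n)
{
    for (int i = 0; i < n; i++) {
        unsigned h = hash_key(inner[i].key);
        inner[i].next = table[h];
        table[h] = &inner[i];
    }
}

/* Probe phase: for each outer tuple, scan its bucket and emit matching pairs.
 * In the paper's setting the outer relation is read page by page from disk,
 * and probing a page overlaps with reading the next one, which is where the
 * page-execution-time synchronization comes in. */
static void probe(tuple_t *table[BUCKETS], const tuple_t *outer, int n)
{
    for (int i = 0; i < n; i++) {
        for (tuple_t *t = table[hash_key(outer[i].key)]; t != NULL; t = t->next)
            if (t->key == outer[i].key)
                printf("join: key=%d payloads=(%d, %d)\n",
                       t->key, t->payload, outer[i].payload);
    }
}
```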