• Title/Summary/Keyword: 파이프라이닝 병렬처리

Search Result 18, Processing Time 0.037 seconds

Programmable Multimedia Platform for Video Processing of UHD TV (UHD TV 영상신호처리를 위한 프로그래머블 멀티미디어 플랫폼)

  • Kim, Jaehyun;Park, Goo-man
    • Journal of Broadcast Engineering
    • /
    • v.20 no.5
    • /
    • pp.774-777
    • /
    • 2015
  • This paper introduces the world's first programmable video-processing platform for the enhancement of the video quality of the 8K(7680x4320) UHD(Ultra High Definition) TV operating up to 60 frames per second. In order to support required computing capacity and memory bandwidth, the proposed platform implemented several key features such as symmetric multi-cluster architecture for parallel data processing, a ring-data path between the clusters for data pipelining and hardware accelerators for computing filter operations. The proposed platform based on RP(Reconfigurable Processor) processes video quality enhancement algorithms and handles effectively new UHD broadcasting standards and display panels.

Radix-2 16 Points FFT Algorithm Accelerator Implementation Using FPGA (FPGA를 사용한 radix-2 16 points FFT 알고리즘 가속기 구현)

  • Gyu Sup Lee;Seong-Min Cho;Seung-Hyun Seo
    • Journal of the Korea Institute of Information Security & Cryptology
    • /
    • v.34 no.1
    • /
    • pp.11-19
    • /
    • 2024
  • The increased utilization of the FFT in signal processing, cryptography, and various other fields has highlighted the importance of optimization. In this paper, we propose the implementation of an accelerator that processes the radix-2 16 points FFT algorithm more rapidly and efficiently than FFT implementation of existing studies, using FPGA(Field Programmable Gate Array) hardware. Leveraging the hardware advantages of FPGA, such as parallel processing and pipelining, we design and implement the FFT logic in the PL (Programmable Logic) part using the Verilog language. We implement the FFT using only the Zynq processor in the PS (Processing System) part, and compare the computation times of the implementation in the PL and PS part. Additionally, we demonstrate the efficiency of our implementation in terms of computation time and resource usage, in comparison with related works.

A Study on the Technology and the way of Speed Up on the Pentium Processor (Pentium 프로세서에 적용된 기술과 속도 향상기법 연구)

  • Kim Soo Hong
    • Proceedings of the KAIS Fall Conference
    • /
    • 2004.11a
    • /
    • pp.206-209
    • /
    • 2004
  • Pentium4의 가장 큰 특징은 병렬처리의 최적화이다. 인텔사의 최신 마이크로 프로세서 Pentium4에 적용된 기술들은 SSE2, Intel NetBurst Micro-Architecture, Hyper-Threading Technology 등이다. CPU 속도의 향상 기법은 크게 클럭 속도의 증가, IPC의 증가, 파이프 라이닝의 길이를 길게 하고, 트랜지스터 집적도를 높이는 것 등이다. 인텔이 Pentium4에 적용한 기술들은 구조론적인 관점에 입각해서 원칙을 잘 지켰다고 할 수 있다. 메모리 차원에서의 속도 향상 기법은 보다 큰 메모리를 사용하고, 넓은 데이터 전송 대역폭을 가지게 하고, 그리고 전송속도를 빠르게 하는 방법이 있다. 각 방법은 물리학적인 법칙에서 빛의 속도 보다 빨라 질 수 없다. 그러므로 속도 증가에는 한계가 있다. 이것을 최소화하기 위한 방책으로는 멀티프로세서와 분산처리로 다소 얼마간의 속도 차를 해결할 수 있을 것이다.

  • PDF

Preprocessing Methods for Effective Modulo Scheduling on High Performance DSPs (고성능 디지털 신호 처리 프로세서상에서 효율적인 모듈로 스케쥴링을 위한 전처리 기법)

  • Cho, Doo-San;Paek, Yun-Heung
    • Journal of KIISE:Software and Applications
    • /
    • v.34 no.5
    • /
    • pp.487-501
    • /
    • 2007
  • To achieve high resource utilization for multi-issue DSPs, production compiler commonly includes variants of iterative modulo scheduling algorithm. However, excessive cyclic data dependences, which exist in communication and media processing loops, unduly restrict modulo scheduling freedom. As a result, replicated functional units in multi-issue DSPs are often under-utilized. To address this resource under-utilization problem, our paper describes a novel compiler preprocessing strategy for effective modulo scheduling. The preprocessing strategy proposed capitalizes on two new transformations, which are referred to as cloning and dismantling. Our preprocessing strategy has been validated by an implementation for StarCore SC140 DSP compiler.

Implementing Swing Modulo Scheduler for VLIW Processor (VLIW 프로세서를 위한 Swing Modulo Scheduler 구현)

  • Shin, Jangseop;Han, Sangjun;Jung, Hyungyun;Ahn, Minwook;Youn, Jonghee M.;Paek, Yunheung
    • Proceedings of the Korea Information Processing Society Conference
    • /
    • 2014.04a
    • /
    • pp.12-14
    • /
    • 2014
  • 하드웨어가 해저드(hazard) 검출을 지원하지 않는 멀티이슈 VLIW 프로세서의 성능을 높이기 위해서는 컴파일러가 명령어 의존성과 하드웨어 자원의 제약을 지키는 범위 안에서 최대한 명령어수준 병렬성(ILP)을 활용하는 것이 중요하다. 기본 블록(basic block) 스케쥴링은 Branch 등 제어 흐름(control flow)의 경계를 넘어선 스케쥴링을 행하지 않아 그 효과가 제한적이다. 소프트웨어 파이프라이닝(software pipelining)은 루프(loop)의 경계를 허물어 여러반본(iteration)의 명령어가 동시에 수행되도록 하는 것으로 모듈로 스케쥴링(modulo scheduling)은 그 중에 한 범주의 스케쥴링 기법들을 일컫는다. 본 연구에서는 그 중 한가지인 스윙 모듈로 스케쥴러(swing modulo scheduler)[1]를 구현하여 그 효과를 알아보고자 한다.

High-Throughput QC-LDPC Decoder Architecture for Multi-Gigabit WPAN Systems (멀티-기가비트 WPAN 시스템을 위한 고속 QC-LDPC 복호기 구조)

  • Lee, Hanho;Ajaz, Sabooh
    • Journal of the Institute of Electronics and Information Engineers
    • /
    • v.50 no.2
    • /
    • pp.104-113
    • /
    • 2013
  • A high-throughput Quasi-Cyclic Low-Density Parity-Check (QC-LDPC) decoder architecture is proposed for 60GHz multi-gigabit wireless personal area network (WPAN) applications. Two novel techniques which can apply to our selected QC-LDPC code are proposed, including a four block-parallel layered decoding technique and fixed wire network. Two-stage pipelining and four block-parallel layered decoding techniques are used to improve the clock speed and decoding throughput. Also, the fixed wire network is proposed to simplify the switch network. A 672-bit, rate-1/2 QC-LDPC decoder architecture has been designed and implemented using 90-nm CMOS standard cell technology. Synthesis results show that the proposed QC-LDPC decoder requires a 794K gate and can operate at 290 MHz to achieve a data throughput of 3.9 Gbps with a maximum of 12 iterations, which meet the requirement of 60 GHz WPAN applications.

Priority Data Handling in Pipeline-based Workflow (파이프라인 기반 워크플로우의 우선 데이터 처리 방안)

  • Jeon, Wonpyo;Heo, Daeyoung;Hwang, Suntae
    • KIISE Transactions on Computing Practices
    • /
    • v.23 no.12
    • /
    • pp.691-697
    • /
    • 2017
  • Volcanic ash has been predicted to be the main source of damage caused by a potential volcanic disaster around Mount Baekdu and the regions of the Korean peninsula. Computer simulations to predict the diffusion of volcanic ash should be performed according to prevalent meteorological situations within a predetermined time. Therefore, a workflow using pipelining is proposed to parallelize the software used for this computation. Due to the nature of volcanic calamities, the simulations need to be carried out for various plausible conditions given that the parameters cannot be precisely determined during the simulations, even at the time of a volcanic eruption. Among the given conditions, computations need to be first performed for the condition with the highest probability so that a response to the volcanic disaster can be provided using these results. Further action can then be performed later based on subsequent results. The computations need to be performed using a volcanic disaster damage prediction system on a computing server with limited computing performance. Hence, an optimal distribution of the computing resources is required. We propose a method through which specific data can be provided first to the proposed pipeline-based workflow.

Implementation of High-radix Modular Exponentiator for RSA using CRT (CRT를 이용한 하이래딕스 RSA 모듈로 멱승 처리기의 구현)

  • 이석용;김성두;정용진
    • Journal of the Korea Institute of Information Security & Cryptology
    • /
    • v.10 no.4
    • /
    • pp.81-93
    • /
    • 2000
  • In a methodological approach to improve the processing performance of modulo exponentiation which is the primary arithmetic in RSA crypto algorithm, we present a new RSA hardware architecture based on high-radix modulo multiplication and CRT(Chinese Remainder Theorem). By implementing the modulo multiplier using radix-16 arithmetic, we reduced the number of PE(Processing Element)s by quarter comparing to the binary arithmetic scheme. This leads to having the number of clock cycles and the delay of pipelining flip-flops be reduced by quarter respectively. Because the receiver knows p and q, factors of N, it is possible to apply the CRT to the decryption process. To use CRT, we made two s/2-bit multipliers operating in parallel at decryption, which accomplished 4 times faster performance than when not using the CRT. In encryption phase, the two s/2-bit multipliers can be connected to make a s-bit linear multiplier for the s-bit arithmetic operation. We limited the encryption exponent size up to 17-bit to maintain high speed, We implemented a linear array modulo multiplier by projecting horizontally the DG of Montgomery algorithm. The H/W proposed here performs encryption with 15Mbps bit-rate and decryption with 1.22Mbps, when estimated with reference to Samsung 0.5um CMOS Standard Cell Library, which is the fastest among the publications at present.