• Title/Summary/Keyword: Pipelined scheduling

Design and Implementation of An I/O System for Irregular Application under Parallel System Environments (병렬 시스템 환경하에서 비정형 응용 프로그램을 위한 입출력 시스템의 설계 및 구현)

  • No, Jae-Chun;Park, Seong-Sun;Gwon, O-Yeong
    • Journal of KIISE: Computer Systems and Theory / v.26 no.11 / pp.1318-1332 / 1999
  • In this paper we present the design, implementation, and evaluation of a runtime system based on collective I/O techniques for irregular applications. We present two designs, namely "Collective I/O" and "Pipelined Collective I/O". In the first scheme, all processors participate in the I/O simultaneously, making scheduling of I/O requests simpler but creating the possibility of contention at the I/O nodes. In the second approach, processors are grouped into several groups, so that only one group performs I/O at a time while the next group performs communication to rearrange data; this entire process is pipelined to reduce I/O node contention dynamically. In other words, the design provides support for dynamic contention management. We then present a software caching method using collective I/O to reduce I/O cost by reusing data already present in the memory of other nodes. Finally, chunking and on-line compression mechanisms are included in both models. We demonstrate that these techniques yield significantly higher I/O performance than was previously possible. The performance results are presented on an Intel Paragon and on the ASCI/Red teraflops machine. Application-level I/O bandwidth of up to 55% of the peak is observed.
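
To make the pipelined schedule above concrete, here is a minimal Python sketch of how processor groups can alternate between I/O and data rearrangement so the two phases overlap. The group sizes, function names, and round-based model are illustrative assumptions, not the authors' runtime system or its API.

```python
# Sketch of a pipelined collective I/O schedule: processors are split into
# groups; while group g writes its (already rearranged) data, group g+1
# rearranges data via communication, so I/O and communication overlap.
# All names and sizes here are illustrative assumptions, not the paper's API.

def pipelined_collective_io(num_procs: int, group_size: int) -> list[tuple]:
    groups = [list(range(i, min(i + group_size, num_procs)))
              for i in range(0, num_procs, group_size)]
    schedule = []  # (time_step, io_group, comm_group)
    # Step 0: the first group only communicates (fills the pipeline).
    # Steps 1..n: group g-1 does I/O while group g communicates.
    for step in range(len(groups) + 1):
        io_group = groups[step - 1] if step >= 1 else None
        comm_group = groups[step] if step < len(groups) else None
        schedule.append((step, io_group, comm_group))
    return schedule

if __name__ == "__main__":
    for step, io_g, comm_g in pipelined_collective_io(num_procs=8, group_size=2):
        print(f"step {step}: I/O by {io_g}, data rearrangement by {comm_g}")
```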

Verification Platform with ARM- and DSP-Based Multiprocessor Architecture for DVB-T Baseband Receivers

  • Cho, Koon-Shik;Chang, June-Young;Cho, Han-Jin;Cho, Jun-Dong
    • ETRI Journal / v.30 no.1 / pp.141-151 / 2008
  • In this paper, we introduce a new verification platform with an ARM- and DSP-based multiprocessor architecture. Its simple communication interface with a crossbar switch architecture is suitable for a heterogeneous multiprocessor platform. The platform is used to verify the function and performance of a DVB-T baseband receiver using hardware/software partitioning techniques with a seamless hardware/software co-verification tool. We first present a dual-processor platform with an ARM926 and a Teak DSP, but it cannot satisfy the ETSI EN 300 744 DVB-T standard. Therefore, we propose a new multiprocessor strategy with an ARM926 and three Teak DSPs synchronized at 166 MHz to meet the required DVB-T specification.

Investigation of Digital Filter Design using Improved Simulated-Annealing Technique (개선된 시뮬레이티드어닐링 기법에 의한 디지탈필터 설계의 고찰)

  • Song, Nag-Un;Yun, Bok-Sik
    • The Transactions of the Korea Information Processing Society / v.2 no.1 / pp.106-118 / 1995
  • In this work, an optimized design methodology for scheduling and hardware allocation in high-level synthesis is developed using an effectively modified simulated-annealing technique. Applying this method to digital filter design, the tradeoff between speed and hardware cost is investigated for both pipelined and array digital filters. It is confirmed that the suggested method reaches improved cost-function values faster and can be used in the design of complicated digital filters.
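
As a rough illustration of the simulated-annealing approach the abstract describes, the following Python sketch anneals a toy schedule/allocation assignment toward a lower cost that mixes latency and hardware usage. The cost weights, move generator, and cooling schedule are illustrative assumptions, not the paper's modified algorithm.

```python
import math
import random

# Toy simulated annealing for a schedule/allocation tradeoff: each operation is
# assigned a control step, and the cost mixes latency (speed) with the number
# of functional units needed (hardware). All weights and moves are illustrative.

def cost(assignment, alpha=1.0, beta=2.0):
    latency = max(assignment.values()) + 1
    per_step = {}                              # hardware estimate: max ops per step
    for step in assignment.values():
        per_step[step] = per_step.get(step, 0) + 1
    units = max(per_step.values())
    return alpha * latency + beta * units

def anneal(ops, max_step=7, t0=10.0, cooling=0.95, iters=2000, seed=0):
    rng = random.Random(seed)
    current = {op: rng.randrange(max_step) for op in ops}
    best, best_cost, t = dict(current), cost(current), t0
    for _ in range(iters):
        cand = dict(current)
        cand[rng.choice(ops)] = rng.randrange(max_step)   # random re-assignment move
        delta = cost(cand) - cost(current)
        if delta < 0 or rng.random() < math.exp(-delta / t):
            current = cand
            if cost(current) < best_cost:
                best, best_cost = dict(current), cost(current)
        t *= cooling                            # geometric cooling schedule
    return best, best_cost

if __name__ == "__main__":
    print(anneal(ops=[f"op{i}" for i in range(10)]))
```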

Hardware Implementation for SEED Cipher Processor of Pipeline Architecture (Pipeline 구조의 SEED 암호화 프로세서 구현 및 설계)

  • 채봉수;김기용;조용범
    • Proceedings of the IEEK Conference / 2002.06b / pp.125-128 / 2002
  • This paper presents the design of a cipher processor based on the SEED algorithm, a wholly domestic (Korean) technique. The processor is implemented using the SEED cipher algorithm and a pipelined scheduling architecture. The cipher has a 16-round Feistel structure; for brevity, only a reduced-round version is shown in this paper, and the full 16-round result can be obtained simply by adding a control part. Because of the pipelined architecture, the processor encrypts large amounts of data faster than non-pipelined designs. The scheduling function allows two encryption operations to proceed simultaneously, as if two cipher processors were used.
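
The following Python sketch models, very loosely, how a pipelined Feistel datapath can keep two data blocks in flight at once, which is the effect the abstract's scheduling function aims for. The round function F is a placeholder rather than the real SEED F-function, and the round interleaving is an assumption for illustration only.

```python
# Two blocks sharing one Feistel datapath: rounds of both blocks are advanced
# together, a simple software model of two encryptions in a pipelined core.
# The round function F is a stand-in, NOT the SEED F-function.

def F(half: int, round_key: int) -> int:
    return ((half * 2654435761) ^ round_key) & 0xFFFFFFFF  # placeholder mixing

def feistel_round(left: int, right: int, round_key: int) -> tuple[int, int]:
    return right, left ^ F(right, round_key)

def encrypt_pipelined(blocks, round_keys):
    """Advance several (left, right) blocks round by round, so two cipher jobs
    progress through the same round hardware in an interleaved fashion."""
    state = list(blocks)
    for rk in round_keys:                      # one pass of the round stage per key
        state = [feistel_round(l, r, rk) for (l, r) in state]
    return state

if __name__ == "__main__":
    keys = [0x0F0F0F0F + i for i in range(16)]          # 16 illustrative round keys
    print(encrypt_pipelined([(0x01234567, 0x89ABCDEF),
                             (0xDEADBEEF, 0xFEEDFACE)], keys))
```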

Estimation of scheduling algorithm's performance for the synthesis of pipelined data path (파이프라인 데이터패스 합성을 위한 스케쥴링 알고리즘의 성능평가)

  • 오주영;박도순
    • Proceedings of the Korean Information Science Society Conference / 1999.10c / pp.30-32 / 1999
  • In this paper, the execution times and scheduling results of scheduling algorithms developed for the synthesis of data paths capable of pipelined execution under resource constraints are compared in tabular form. The algorithms evaluated are those of papers [1], [2], and [3], which are distinguished by when their proposed scheduling functions are computed and by the functions' roles and application methods: paper [1] defines a priority function, computed from each operation's mobility and the change in the number of collisions when an operation located in a collision-causing partition is delayed to the next partition, and uses it to order the scheduling sequence; paper [2] proposes a resource-allocation feasibility function and schedules operations while progressively narrowing the assignable range based on it; and paper [3] reduces the time imposed by paper [2]'s feasibility test by quantifying the current scheduling state when selecting operations, thereby reducing the overall execution time. Benchmark performance evaluations and comparisons of algorithm execution times are carried out for these algorithms, and directions for future research are presented.
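
For readers unfamiliar with the mobility-based priorities these papers compare, here is a condensed Python sketch of resource-constrained list scheduling in which operations with the least mobility (smallest slack between their ASAP and ALAP steps) are placed first. The example graph and resource limit are illustrative assumptions, not the benchmarks or formulas of papers [1]-[3].

```python
# List scheduling with a mobility-based priority: operations with the least
# mobility (smallest ALAP - ASAP slack) are scheduled first, subject to a
# per-step resource limit. Purely illustrative values.

def list_schedule(asap, alap, units_per_step=2):
    mobility = {op: alap[op] - asap[op] for op in asap}
    # Lowest mobility = most urgent; ties broken by earliest ASAP step.
    order = sorted(asap, key=lambda op: (mobility[op], asap[op]))
    used = {}          # control step -> number of units already taken
    schedule = {}
    for op in order:
        for step in range(asap[op], alap[op] + 1):
            if used.get(step, 0) < units_per_step:
                schedule[op] = step
                used[step] = used.get(step, 0) + 1
                break
        else:
            raise ValueError(f"{op} cannot be placed within its mobility range")
    return schedule

if __name__ == "__main__":
    asap = {"a": 0, "b": 0, "c": 1, "d": 1, "e": 2}
    alap = {"a": 1, "b": 0, "c": 2, "d": 1, "e": 2}
    print(list_schedule(asap, alap))
```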

A Scheduling algorithm for pipelined data path synthesis with variable initiation intervals under resource constraints (자원 제약하에서 가변 데이터 입력의 파이프라인 데이터 패스 함성을 위한 스케줄링 알고리즘)

  • 오주영;박도순
    • Proceedings of the Korean Information Science Society Conference / 2001.10c / pp.34-36 / 2001
  • In high-level synthesis, scheduling is the process of assigning the operations that express the hardware behavior to optimal control steps while satisfying the given constraints, and the scheduling result strongly affects the area and execution speed of the target hardware. Pipelining overlaps the execution of successive data inputs to increase execution speed and resource utilization at the same time. Most existing scheduling algorithms for synthesizing pipelined data paths at the high level assume a fixed data initiation interval; for variable initiation intervals, only a time-constrained resource-minimization algorithm [5] has been proposed. In this paper, we propose a resource-constrained execution-time minimization algorithm that supports variable data initiation intervals. To this end, the time-constrained scheduling algorithm [5], in which the stage index of each operation is fixed initially, is adapted so that the stage index can change as control steps are added during resource-constrained scheduling, and operations are scheduled by progressive mobility reduction. Experimental results show that the proposed algorithm produces effective schedules that satisfy the constraints for various resource constraints and initiation-interval sequences.
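
A brief Python sketch of the collision check that underlies scheduling with an initiation interval: under an initiation interval II, two uses of the same resource collide when their control steps are congruent modulo II. The resource model and example schedule are illustrative assumptions, not the proposed algorithm.

```python
# For a pipelined data path with initiation interval II, successive data sets
# enter every II control steps, so two uses of the same resource collide when
# their control steps are equal modulo II (a modulo reservation table check).
# Resource names and the example schedule are illustrative.

def has_collision(schedule, initiation_interval):
    """schedule maps op -> (resource, control_step)."""
    reserved = set()
    for resource, step in schedule.values():
        slot = (resource, step % initiation_interval)
        if slot in reserved:
            return True
        reserved.add(slot)
    return False

def smallest_feasible_ii(schedule, max_ii=16):
    """Smallest initiation interval at which the given schedule is collision-free."""
    for ii in range(1, max_ii + 1):
        if not has_collision(schedule, ii):
            return ii
    return None

if __name__ == "__main__":
    sched = {"mul1": ("MUL", 0), "mul2": ("MUL", 3), "add1": ("ADD", 1), "add2": ("ADD", 2)}
    print(smallest_feasible_ii(sched))   # -> 2: the two MUL uses at steps 0 and 3 differ mod 2
```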

Topology of High Speed System Emulator and Its Software (초고속 시스템 에뮬레이터의 구조와 이를 위한 소프트웨어)

  • Kim, Nam-Do;Yang, Se-Yang
    • The KIPS Transactions: Part A / v.8A no.4 / pp.479-488 / 2001
  • As the complexity of SoC designs constantly increases, simulation using their software models simply takes too much time. To solve this problem, FPGA-based logic emulators have been developed and are commonly used in industry. However, FPGA-based logic emulators face two problems: very low FPGA resource usage due to the limited number of pins on an FPGA, and emulation speed that drops drastically as design complexity increases. In this paper, we propose a new emulation architecture and its software that achieve a high FPGA resource usage rate and very fast emulation. The proposed emulation system overcomes the FPGA pin limitation with a pipelined ring that transfers multiple logic signals through a single physical pin, and its intelligent ring topology makes it possible to use a high-speed system clock. In this topology, all signal-transfer channels among FPGAs are completely separated from the user logic so that a high-speed system clock can be used, and the depth of combinational paths is kept as shallow as possible; both contribute to high-speed emulation. For pipelined signal transfer among FPGAs, we adopt a few heuristic scheduling algorithms with low computational complexity. Experimental results with a 12-bit microcontroller show that high-speed emulation is possible even with these simple heuristic scheduling algorithms.
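
A small Python sketch of the pin-multiplexing idea behind the pipelined ring: several logical inter-FPGA signals share one physical pin by being serialized over consecutive system-clock cycles. The frame layout and signal names are illustrative assumptions, not the emulator's actual protocol.

```python
# Time-multiplexing several logical inter-FPGA signals over one physical pin:
# each "frame" of N system-clock cycles carries N one-bit signals in a fixed
# slot order, so the single pin behaves like N virtual wires (at 1/N the rate).
# Slot assignment and frame size are illustrative, not the emulator's protocol.

def serialize(signal_values, slot_order):
    """Produce the per-cycle bit stream sent on the shared pin for one frame."""
    return [signal_values[name] for name in slot_order]

def deserialize(bits, slot_order):
    """Rebuild the logical signals on the receiving FPGA from one frame."""
    return dict(zip(slot_order, bits))

if __name__ == "__main__":
    slots = ["irq", "ack", "data0", "data1"]          # 4 signals share one pin
    values = {"irq": 1, "ack": 0, "data0": 1, "data1": 1}
    frame = serialize(values, slots)                  # bits sent over 4 cycles
    print(frame, deserialize(frame, slots))
```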

High-throughput and low-area implementation of orthogonal matching pursuit algorithm for compressive sensing reconstruction

  • Nguyen, Vu Quan;Son, Woo Hyun;Parfieniuk, Marek;Trung, Luong Tran Nhat;Park, Sang Yoon
    • ETRI Journal / v.42 no.3 / pp.376-387 / 2020
  • The massive computation required by reconstruction algorithms for compressive sensing (CS) has been a major concern for their real-time application. In this paper, we propose a novel high-speed architecture for the orthogonal matching pursuit (OMP) algorithm, which is the algorithm most frequently used to reconstruct compressively sensed signals. The proposed design offers very high throughput and includes an innovative pipeline architecture and scheduling algorithm. Least-squares problem solving, which requires a huge amount of computation in the OMP, is implemented using systolic arrays with four new processing elements. In addition, a distributed-arithmetic-based circuit for matrix multiplication is proposed to counterbalance the area overhead caused by the multi-stage pipelining. The results of logic synthesis show that the proposed design reconstructs signals nearly 19 times faster while occupying only a 1.06 times larger area than existing designs for N = 256, M = 64, and m = 16, where N is the number of original samples, M is the length of the measurement vector, and m is the sparsity level of the signal.
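
For reference, a compact NumPy sketch of the textbook OMP iteration that such hardware accelerates: greedy atom selection followed by a least-squares refit. This is the standard algorithm, not the paper's systolic-array or distributed-arithmetic implementation; the random test data merely reuses the sizes quoted in the abstract.

```python
import numpy as np

def omp(A, y, sparsity):
    """Textbook orthogonal matching pursuit: at each step pick the column of A
    most correlated with the residual, then re-fit all chosen columns by
    least squares. A is M x N, y has length M, sparsity = m."""
    residual = y.copy()
    support = []
    x = np.zeros(A.shape[1])
    for _ in range(sparsity):
        idx = int(np.argmax(np.abs(A.T @ residual)))   # best matching atom
        if idx not in support:
            support.append(idx)
        coef, *_ = np.linalg.lstsq(A[:, support], y, rcond=None)
        residual = y - A[:, support] @ coef            # update the residual
    x[support] = coef
    return x

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    N, M, m = 256, 64, 16                  # sizes quoted in the abstract
    A = rng.standard_normal((M, N)) / np.sqrt(M)
    true_x = np.zeros(N)
    true_x[rng.choice(N, m, replace=False)] = rng.standard_normal(m)
    x_hat = omp(A, A @ true_x, m)
    print("max reconstruction error:", np.max(np.abs(x_hat - true_x)))
```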

Bounding Worst-Case DRAM Performance on Multicore Processors

  • Ding, Yiqiang;Wu, Lan;Zhang, Wei
    • Journal of Computing Science and Engineering / v.7 no.1 / pp.53-66 / 2013
  • Bounding the worst-case DRAM performance for a real-time application is a challenging problem that is critical for computing worst-case execution time (WCET), especially for multicore processors, where the DRAM memory is usually shared by all of the cores. Typically, DRAM commands from consecutive DRAM accesses can be pipelined on DRAM devices according to the spatial locality of the data fetched by them. By considering the effect of DRAM command pipelining, we propose a basic approach to bounding the worst-case DRAM performance. An enhanced approach is proposed to reduce the overestimation from the invalid DRAM access sequences by checking the timing order of the co-running applications on a dual-core processor. Compared with the conservative approach, which assumes that no DRAM command pipelining exists, our experimental results show that the basic approach can bound the WCET more tightly, by 15.73% on average. The experimental results also indicate that the enhanced approach can further improve the tightness of WCET by 4.23% on average as compared to the basic approach.
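
As a toy illustration of why accounting for overlap between consecutive DRAM accesses tightens the bound, the Python sketch below compares a conservative bound that charges every access the full command sequence with a bound that lets accesses to an already-open row skip the activate/precharge cost. The timing numbers and access trace are illustrative assumptions, not the paper's DRAM model or analysis.

```python
# Toy bound on worst-case latency for a sequence of DRAM accesses.
# Conservative bound: every access pays activate + read + precharge in full.
# Overlap-aware bound: an access that hits the currently open row only pays
# the read latency, because its other commands overlap the previous access's.
# The cycle counts and the access trace are illustrative, not real DRAM timing.

T_ACT, T_RD, T_PRE = 15, 15, 15   # illustrative command latencies (cycles)

def conservative_bound(accesses):
    return len(accesses) * (T_ACT + T_RD + T_PRE)

def overlap_aware_bound(accesses):
    total, open_row = 0, None
    for row in accesses:
        if row == open_row:                 # row hit: commands overlap
            total += T_RD
        else:                               # row miss: full command sequence
            total += T_ACT + T_RD + T_PRE
            open_row = row
    return total

if __name__ == "__main__":
    trace = [3, 3, 3, 7, 7, 1]              # row addresses of consecutive accesses
    print(conservative_bound(trace), overlap_aware_bound(trace))
```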

Hardware Design of Bilateral Filter Based on Window Division (윈도우 분할 기반 양방향 필터의 하드웨어 설계)

  • Hyun, Yongho;Park, Taegeun
    • The Journal of Korean Institute of Communications and Information Sciences / v.41 no.12 / pp.1844-1850 / 2016
  • The bilateral filter can reduce noise while preserving details by computing the filtering output at each pixel as a weighted average of its neighboring pixels. In this paper, we propose a real-time system based on window division. Overall performance is increased by a parallel architecture that computes five rows of the kernel window simultaneously with pipelined scheduling. We consider the tradeoff between filter performance and hardware cost, and the bit allocation is determined by PSNR analysis. The proposed architecture is designed in Verilog HDL and synthesized using the Dongbu Hitek 110 nm standard cell library. It achieves a throughput of 416 Mpixels/s (397 fps) at an operating frequency of 416 MHz with 132K gates.
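
For context, a short NumPy sketch of the bilateral filter computation that the hardware parallelizes; here it is computed serially, and the window size, sigma values, and test image are illustrative assumptions (the paper's row-parallel, pipelined datapath is not modeled).

```python
import numpy as np

def bilateral_filter(img, radius=2, sigma_s=2.0, sigma_r=25.0):
    """Bilateral filter on a grayscale image: each output pixel is a weighted
    average of its (2*radius+1)^2 neighbors, weighted by spatial distance and
    intensity difference. Parameters are illustrative (radius=2 gives a 5x5 window)."""
    h, w = img.shape
    pad = np.pad(img.astype(np.float64), radius, mode="edge")
    # Precompute the spatial (domain) kernel once.
    ax = np.arange(-radius, radius + 1)
    yy, xx = np.meshgrid(ax, ax, indexing="ij")
    spatial = np.exp(-(xx**2 + yy**2) / (2 * sigma_s**2))
    out = np.empty_like(img, dtype=np.float64)
    for i in range(h):
        for j in range(w):
            window = pad[i:i + 2 * radius + 1, j:j + 2 * radius + 1]
            # Range (photometric) kernel relative to the center pixel.
            rng_w = np.exp(-((window - pad[i + radius, j + radius])**2) / (2 * sigma_r**2))
            weights = spatial * rng_w
            out[i, j] = np.sum(weights * window) / np.sum(weights)
    return out

if __name__ == "__main__":
    noisy = np.random.default_rng(0).normal(128, 20, (32, 32)).clip(0, 255)
    print(bilateral_filter(noisy).shape)
```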