• 제목/요약/키워드: parallel architecture

검색결과 885건 처리시간 0.038초

Lifting scheme을 이용한 고속 병렬 2D-DWT 하드웨어 구조 (A High Speed 2D-DWT Parallel Hardware Architecture Using the Lifting Scheme)

  • 김종욱;정정화
    • 대한전자공학회논문지SD
    • /
    • 제40권7호
    • /
    • pp.518-525
    • /
    • 2003
  • 본 논문은 리프팅 스킴(lifting scheme)의 분할 방법을 개선하여 고속 병렬 처리가 가능한 2차원 DWT(Discrete Wavelet Transform) 하드웨어 구조를 제안한다. 2차원 DWT 변환은 2차원 입력 데이터 전체에 대하여 연산이 수행되고 순차적으로 2차원 처리가 됨에 따라서 초기 및 전체 지연시간(latency)이 많이 걸린다. 본 논문에서는 처리속도와 지연 시간을 향상시키기 위해 개선된 분할 방법과 새로운 자원 공유 하드웨어 구조를 제안한다. 상호 연관성이 없는 데이터들을 4 개의 데이터 집합으로 분할하여 병렬 처리에 적합하도록 새로운 분할 방법을 제안하였다. 병렬처리 하드웨어 구조는 하드웨어의 자원 공유가 가능하도록 하기 위해 필터연산의 중간 값을 메모리에 저장할 수 있는 파이프라인 구조를 갖도록 설계하였다. 제안된 구조를 효율적으로 동작시킬 수 있도록 하드웨어 자원의 공유를 스케쥴링하여 초기지연과 전체지연 시간을 줄였다. 제안하는 구조는 기존의 병렬 처리 구조에 비해 초기 지연 및 전체 지연 시간을 각각 50%와 66%감소시키는 결과를 얻을 수 있었다.

100Gb/s급 광통신시스템을 위한 3-병렬 Reed-Solomon 기반 FEC 구조 설계 (Three-Parallel Reed-Solomon based Forward Error Correction Architecture for 100Gb/s Optical Communications)

  • 최창석;이한호
    • 대한전자공학회논문지SD
    • /
    • 제46권11호
    • /
    • pp.48-55
    • /
    • 2009
  • 본 논문에서는 차세대 100-Gb/s급 광통신 시스템을 위한 3-병렬 Reed-Solomon (RS) 디코더 기반의 고속 Forward Error Correction (FEC) 구조를 제안한다. 제안된 16채널 RS기반 FEC 구조는 4개의 신드롬 계산 블록이 1개의 Key Equation Solver (KES) 블록을 공유하는 3-병렬 4채널 RS 기반 FEC 구조 4개로 구성되어 있다. 제안하는 100-Gb/s RS 기반 FEC는 1.2V의 공급전압의 $0.13{\mu}m$ CMOS 공정을 이용하여 구현하였다. 구현 결과 제안된 RS기반 FEC 구조는 300MHz의 동작 주파수에서 115-Gb/s 의 데이터 처리율을 가지며, 기존의 RS 기반 FEC 구조에 비해 높은 데이터 처리율과 낮은 하드웨어 복잡도를 보여주고 있다.

액티브 엔터프라이즈 워크플로우 그리드 아키텍처 (An Active Enactment Architecture for Enterprise Workflow Grid)

  • 백수기
    • 한국IT서비스학회지
    • /
    • 제7권4호
    • /
    • pp.167-178
    • /
    • 2008
  • This paper addresses the issue of workflow on Grid and P2P, and proposes a layered workflow architecture and its related workflow models that are used for not only distributing workflows' information onto Grid or P2P resources but also scheduling the enactment of workflows. Especially, the most critical rationale of this paper is on the fact that the nature of Grid computing environment is fitted very well into building a platform for the maximally parallel and very large scale workflows that are frequently found in very large scale enterprises. The layered architecture proposed in this paper, which we call Enterprise Workflow Grid Architecture, is targeting on maximizing the usability of computing facilities in the enterprise as well as the scalability of its underlined workflow management system in coping with massively parallel and very large scale workflow applications.

A dynamic analysis algorithm for RC frames using parallel GPU strategies

  • Li, Hongyu;Li, Zuohua;Teng, Jun
    • Computers and Concrete
    • /
    • 제18권5호
    • /
    • pp.1019-1039
    • /
    • 2016
  • In this paper, a parallel algorithm of nonlinear dynamic analysis of three-dimensional (3D) reinforced concrete (RC) frame structures based on the platform of graphics processing unit (GPU) is proposed. Time integration is performed using Newmark method for nonlinear implicit dynamic analysis and parallelization strategies are presented. Correspondingly, a parallel Preconditioned Conjugate Gradients (PCG) solver on GPU is introduced for repeating solution of the equilibrium equations for each time step. The RC frames were simulated using fiber beam model to capture nonlinear behaviors of concrete and reinforcing bars. The parallel finite element program is developed utilizing Compute Unified Device Architecture (CUDA). The accuracy of the GPU-based parallel program including single precision and double precision was verified in comparison with ABAQUS. The numerical results demonstrated that the proposed algorithm can take full advantage of the parallel architecture of the GPU, and achieve the goal of speeding up the computation compared with CPU.

Numerical discrepancy between serial and MPI parallel computations

  • Lee, Sang Bong
    • International Journal of Naval Architecture and Ocean Engineering
    • /
    • 제8권5호
    • /
    • pp.434-441
    • /
    • 2016
  • Numerical simulations of 1D Burgers equation and 2D sloshing problem were carried out to study numerical discrepancy between serial and parallel computations. The numerical domain was decomposed into 2 and 4 subdomains for parallel computations with message passing interface. The numerical solution of Burgers equation disclosed that fully explicit boundary conditions used on subdomains of parallel computation was responsible for the numerical discrepancy of transient solution between serial and parallel computations. Two dimensional sloshing problems in a rectangular domain were solved using OpenFOAM. After a lapse of initial transient time sloshing patterns of water were significantly different in serial and parallel computations although the same numerical conditions were given. Based on the histograms of pressure measured at two points near the wall the statistical characteristics of numerical solution was not affected by the number of subdomains as much as the transient solution was dependent on the number of subdomains.

An FPGA-based Parallel Hardware Architecture for Real-time Eye Detection

  • Kim, Dong-Kyun;Jung, Jun-Hee;Nguyen, Thuy Tuong;Kim, Dai-Jin;Kim, Mun-Sang;Kwon, Key-Ho;Jeon, Jae-Wook
    • JSTS:Journal of Semiconductor Technology and Science
    • /
    • 제12권2호
    • /
    • pp.150-161
    • /
    • 2012
  • Eye detection is widely used in applications, such as face recognition, driver behavior analysis, and human-computer interaction. However, it is difficult to achieve real-time performance with software-based eye detection in an embedded environment. In this paper, we propose a parallel hardware architecture for real-time eye detection. We use the AdaBoost algorithm with modified census transform(MCT) to detect eyes on a face image. We parallelize part of the algorithm to speed up processing. Several downscaled pyramid images of the eye candidate region are generated in parallel using the input face image. We can detect the left and the right eye simultaneously using these downscaled images. The sequential data processing bottleneck caused by repetitive operation is removed by employing a pipelined parallel architecture. The proposed architecture is designed using Verilog HDL and implemented on a Virtex-5 FPGA for prototyping and evaluation. The proposed system can detect eyes within 0.15 ms in a VGA image.

High-Performance Low-Power FFT Cores

  • Han, Wei;Erdogan, Ahmet T.;Arslan, Tughrul;Hasan, Mohd.
    • ETRI Journal
    • /
    • 제30권3호
    • /
    • pp.451-460
    • /
    • 2008
  • Recently, the power consumption of integrated circuits has been attracting increasing attention. Many techniques have been studied to improve the power efficiency of digital signal processing units such as fast Fourier transform (FFT) processors, which are popularly employed in both traditional research fields, such as satellite communications, and thriving consumer electronics, such as wireless communications. This paper presents solutions based on parallel architectures for high throughput and power efficient FFT cores. Different combinations of hybrid low-power techniques are exploited to reduce power consumption, such as multiplierless units which replace the complex multipliers in FFTs, low-power commutators based on an advanced interconnection, and parallel-pipelined architectures. A number of FFT cores are implemented and evaluated for their power/area performance. The results show that up to 38% and 55% power savings can be achieved by the proposed pipelined FFTs and parallel-pipelined FFTs respectively, compared to the conventional pipelined FFT processor architectures.

  • PDF

David II: 효과적인 메모리 시스템을 가지는 병렬 렌더링 프로세서 (David II: A new architecture for parallel rendering processors with effective memory system)

  • 이길환;박우찬;김일산;한탁돈
    • 한국정보처리학회:학술대회논문집
    • /
    • 한국정보처리학회 2004년도 춘계학술발표대회
    • /
    • pp.1655-1658
    • /
    • 2004
  • Current rendering processors are organized mainly to process a triangle as fast as possible and recently parallel 3D rendering processors, which can process multiple triangles in parallel with multiple rasterizers, begin to appear. For high performance in processing triangles, it is desirable for each rasterizer have its own local pixel cache. However, the consistency problem may occur in accessing the data at the same address simultaneously by more than one rasterizer. In this paper, we propose a parallel rendering processor architecture, called DAVID II, resolving such consistency problem effectively. Moreover, the proposed architecture reduces the latency due to a pixel cache miss significantly. The experimental results show that DAVID II achieves almost linear speedup at best case even in sixteen rasterizers.

  • PDF

비공유 병렬구조를 이용한 선형적 재귀규칙의 병렬평가 (Parallel Evaluation of Linearly Recursive Rules using a Shared-Nothing Paralled Architecture)

  • 조우현;김항준
    • 한국정보처리학회논문지
    • /
    • 제4권12호
    • /
    • pp.3069-3077
    • /
    • 1997
  • 이 논문에서는 비공유 병렬구조에서 이행적 종속성을 갖는 선형적 재귀규칙의 병렬평가에 대한 패러다임을 제안한다. 병렬평가를 위해 우리는 모든 노드가 메시지 교환을 위해 연결망만을 공유하는 비공유 병렬구조를 고려한다. 여기서 정규화된 규칙의 평가는 그 규칙의 중명-이론적 의미의 계산이다. 이행적 종속성올 갖는 정규 화된 선형적 재귀규칙을 정의하고, 그 규칙이 등가의 표현식으로 변환될 수 있음을 보이고, 등가의 표현식을 근거로 결합, 분할, 이행성폐포 연산을 이용하여 정규화된 규칙에 대한 병렬평가를 위한 패러다임을 제안하고 시간 복잡도를 분석하였다.

  • PDF

Parallel Implementation of the Recursive Least Square for Hyperspectral Image Compression on GPUs

  • Li, Changguo
    • KSII Transactions on Internet and Information Systems (TIIS)
    • /
    • 제11권7호
    • /
    • pp.3543-3557
    • /
    • 2017
  • Compression is a very important technique for remotely sensed hyperspectral images. The lossless compression based on the recursive least square (RLS), which eliminates hyperspectral images' redundancy using both spatial and spectral correlations, is an extremely powerful tool for this purpose, but the relatively high computational complexity limits its application to time-critical scenarios. In order to improve the computational efficiency of the algorithm, we optimize its serial version and develop a new parallel implementation on graphics processing units (GPUs). Namely, an optimized recursive least square based on optimal number of prediction bands is introduced firstly. Then we use this approach as a case study to illustrate the advantages and potential challenges of applying GPU parallel optimization principles to the considered problem. The proposed parallel method properly exploits the low-level architecture of GPUs and has been carried out using the compute unified device architecture (CUDA). The GPU parallel implementation is compared with the serial implementation on CPU. Experimental results indicate remarkable acceleration factors and real-time performance, while retaining exactly the same bit rate with regard to the serial version of the compressor.