• Title/Summary/Keyword: Parallel processor

Search Result 482, Processing Time 0.026 seconds

A architecture for parallel rendering processor with by effective memory organization (효과적인 메모리 구조를 갖는 병렬 렌더링 프로세서 구조)

  • Kim, Kyung-Su;Yoon, Duk-Ki;Kim, Il-San;Park, Woo-Chan
    • Journal of Korea Game Society
    • /
    • v.5 no.3
    • /
    • pp.39-47
    • /
    • 2005
  • Current rendering processors are organized mainly to process a triangle as fast as possible and recently parallel 3D rendering processors, which can process multiple triangles in parallel with multiple rasterizers, begin to appear. For high performance in processing triangles, it is desirable for each rasterizer have its own local pixel cache. However, the consistency problem may occur in accessing the data at the same address simulaneously by more than one rasterizer. In this paper, we propose a parallel rendering processor architecture resolving such consistency problem effectively. Moreover, the proposed architecture reduces the latency due to a pixel cache miss significantly. The experimental results show that proposed architecture achieves almost linear speedup at best case even in sixteen rasterizer

  • PDF

Design of a Parallel Rendering Processor Architecture with Effective Memory System (효과적인 메모리 구조를 갖는 병렬 렌더링 프로세서 설계)

  • Park Woo-Chan;Yoon Duk-Ki;Kim Kyoung-Su
    • The KIPS Transactions:PartA
    • /
    • v.13A no.4 s.101
    • /
    • pp.305-316
    • /
    • 2006
  • Current rendering processors are organized mainly to process a triangle as fast as possible and recently parallel 3D rendering processors, which can process multiple triangles in parallel with multiple rasterizers, begin to appear. For high performance in processing triangles, it is desirable for each rasterizer have its own local pixel cache. However, the consistency problem may occur in accessing the data at the same address simultaneously by more than one rasterizer. In this paper, we propose a parallel rendering processor architecture resolving such consistency problem effectively. Moreover, the proposed architecture reduces the latency due to a pixel cache miss significantly. For the above two goals, effective memory organizations including a new pixel cache architecture are presented. The experimental results show that the proposed architecture achieves almost linear speedup at best case even in sixteen rasterizers.

On The Parallel Inplementation of a Static/Explicit FEM Program for Sheet Metal Forming (판금형 해석을 위한 정적/외연적 유한요소 프로그램의 병령화에 관한 연구)

  • ;;G.P.Nikishikov
    • Proceedings of the Korean Society of Precision Engineering Conference
    • /
    • 1995.10a
    • /
    • pp.625-628
    • /
    • 1995
  • A static/implicit finite element code for sheet forming (ITAS3D) is parallelized on IBM SP 6000 multi-processor computer. Computing-load-balanced domain decomposition method and the direct solution method at each subdomain (and interface) equation are developed. The system of equations for each subdomain are constructed by condensation and calculated on each processor. Approximated operation counts are calculated to set up the nonlinear equation system for balancing the compute load on each subdomain. Th esquare cup tests with several numbers of elements are used in demonstrating the performance of this parallel implementation. This procedure are proved to be efficient for moderate number of processors, especially for large number of elements.

  • PDF

A Parallel Algorithm For Rectilinear Steiner Tree Using Associative Processor (연합 처리기를 이용한 직교선형 스타이너 트리의 병렬 알고리즘)

  • Taegeun Park
    • Journal of the Korean Institute of Telematics and Electronics B
    • /
    • v.32B no.8
    • /
    • pp.1057-1063
    • /
    • 1995
  • This paper describes an approach for constucting a Rectilinear Steiner Tree (RST) derivable from a Minimum Spanning Tree (MST), using Associative Processor (AP). We propose a fast parallel algorithm using AP's basic algorithms which can be realized by the processing capability of rudimentary logic and the selective matching capability of Content- Addressable Memory (CAM). The main idea behind the proposed algorithm is to maximize the overlaps between the consecutive edges in MST, thus minimizing the cost of a RST. An efficient parallel linear algorithm with O(n) complexity to construct a RST is proposed using an algorithm to find a MST, where n is the number of nodes. A node insertion method is introduced to allow the Z-type layout. The routing process which only depends on the neighbor edges and the no-rerouting strategy both help to speed up finding a RST.

  • PDF

Study on Real-time Parallel Processing Simulator for Performance Analysis of Missiles (유도탄 성능분석을 위한 실시간 병렬처리 시뮬레이터 연구)

  • Kim Byeong-Moon;Jung Soon-Key
    • Journal of Institute of Control, Robotics and Systems
    • /
    • v.11 no.1
    • /
    • pp.84-91
    • /
    • 2005
  • In this paper, we describe the real-time parallel processing simulator developed for the use of performance analysis of rolling missiles. The real-time parallel processing simulator developed here consists of seeker emulator generating infrared image signal on aircraft, real-time computer, host computer, system unit, and actual equipments such as auto-pilot processor and seeker processor. Software is developed from mathematic models, 6 degree-of-freedom module, aerodynamic module which are resided in real-time computer, and graphic user interface program resided in host computer. The real-time computer consists of six TIC-40 processors connected in parallel. The seeker emulator is designed by using analog circuits coupled with mechanical equipments. The system unit provides interface function to match impedance between the components and processes very small electrical signals. Also real launch unit of missiles is interfaced to simulator through system unit. In order to apply the real-time parallel processing simulator to performance analysis equipment of rolling missiles it is essential to perform the performance verification test of simulator.

Parallel Deblocking Filter Based on Modified Order of Accessing the Coding Tree Units for HEVC on Multicore Processor

  • Lei, Haiwei;Liu, Wenyi;Wang, Anhong
    • KSII Transactions on Internet and Information Systems (TIIS)
    • /
    • v.11 no.3
    • /
    • pp.1684-1699
    • /
    • 2017
  • The deblocking filter (DF) reduces blocking artifacts in encoded video sequences, and thereby significantly improves the subjective and objective quality of videos. Statistics show that the DF accounts for 5-18% of the total decoding time in high-efficiency video coding. Therefore, speeding up the DF will improve codec performance, especially for the decoder. In view of the rapid development of multicore technology, we propose a parallel DF scheme based on a modified order of accessing the coding tree units (CTUs) by analyzing the data dependencies between adjacent CTUs. This enables the DF to run in parallel, providing accelerated performance and more flexibility in the degree of parallelism, as well as finer parallel granularity. We additionally solve the problems of variable privatization and thread synchronization in the parallelization of the DF. Finally, the DF module is parallelized based on the HM16.1 reference software using OpenMP technology. The acceleration performance is experimentally tested under various numbers of cores, and the results show that the proposed scheme is very effective at speeding up the DF.

3-Way 32 bit VLIW Multimedia Signal Processor

  • Park, Jaebok;Jaehee You
    • Proceedings of the IEEK Conference
    • /
    • 2001.06b
    • /
    • pp.97-100
    • /
    • 2001
  • A 3-way VLIW multimedia signal processor capable of efficient repeated operations as well as both load/store and type transformations for various data types is presented. It is composed of a 32-bit execution unit that can execute two instructions in parallel, an independent load/store unit and a control unit. The processor is implemented with 0.6${\mu}{\textrm}{m}$ gate array and the results are discussed.

  • PDF

Design of an HIGHT Processor Employing LFSR Architecture Allowing Parallel Outputs (병렬 출력을 갖는 LFSR 구조를 적용한 HIGHT 프로세서 설계)

  • Lee, Je-Hoon;Kim, Sang-Choon
    • Convergence Security Journal
    • /
    • v.15 no.2
    • /
    • pp.81-89
    • /
    • 2015
  • HIGHT is an 64-bit block cipher, which is suitable for low power and ultra-light implementation that are used in the network that needs the consideration of security aspects. This paper presents a key scheduler that employs the presented LFSR and reverse LFSR that can generate four outputs simultaneously. In addition, we construct new key scheduler that generates 4 subkey bytes at a clock since each round block requires 4 subkey bytes at a time. Thus, the entire HIGHT processor can be controlled by single system clock with regular control mechanism. We synthesize the HIGHT processor using the VHDL. From the synthesis results, the logic size of the presented key scheduler can be reduced as 9% compared to the counterpart that is employed in the conventional HIGHT processor.

A Study On Improving the Performance of One Dimensional Systolic Array Processor for Matrix.Vector Operation using Sub-Matrix (부분행렬을 사용한 행렬.벡터 연산용 1차원 시스톨릭 어레이 프로세서 설계에 관한 연구)

  • Kim, Yong-Sung
    • The Journal of Information Technology
    • /
    • v.10 no.3
    • /
    • pp.33-45
    • /
    • 2007
  • Systolic Array Processor is used for designing the special purpose processor in Digital Signal Processing, Computer Graphics, Neural Network Applications etc., since it has the characteristic of parallelism, pipeline processing and architecture of regularity. But, in case of using general design method, it has intial waiting period as large as No. of PE-1. And if the connected system needs parallel and simultaneous outputs, processor has some problems of the performance, since it generates only one output at each clock in output state. So in this paper, one dimensional Systolic Array Processor that is designed according to the dependance of data and operations using the partitioned sub-matrix is proposed for the purpose of improving the performance. 1-D Systolic Array using 4 partitioned sub-matrix has efficient method in case of considering those two problems.

  • PDF

A Fast SIFT Implementation Based on Integer Gaussian and Reconfigurable Processor

  • Su, Le Tran;Lee, Jong Soo
    • The Journal of Korea Institute of Information, Electronics, and Communication Technology
    • /
    • v.2 no.3
    • /
    • pp.39-52
    • /
    • 2009
  • Scale Invariant Feature Transform (SIFT) is an effective algorithm in object recognition, panorama stitching, and image matching, however, due to its complexity, real time processing is difficult to achieve with software approaches. This paper proposes using a reconfigurable hardware processor with integer half kernel. The integer half kernel Gaussian reduces the Gaussian pyramid complexity in about half [] and the reconfigurable processor carries out a parallel implementation of a full search Fast SIFT algorithm. We use a low memory, fine grain single instruction stream multiple data stream (SIMD) pixel processor that is currently being developed. This implementation fully exposes the available parallelism of the SIFT algorithm process and exploits the processing and I/O capabilities of the processor which results in a system that can perform real time image and video compression. We apply this novel implementation to images and measure the effectiveness. Experimental simulation results indicate that the proposed implementation is capable of real time applications.

  • PDF