• Title/Summary/Keyword: Parallel processor

A Ray-Tracing Algorithm Based On Processor Farm Model (프로세서 farm 모델을 이용한 광추적 알고리듬)

  • Lee, Hyo Jong
    • Journal of the Korea Computer Graphics Society / v.2 no.1 / pp.24-30 / 1996
  • Ray tracing, one of many photorealistic rendering techniques, requires heavy computation to synthesize images, and parallel processing can reduce that computation time. A parallel ray-tracing algorithm has been implemented and executed for various images on transputer systems. To make the algorithm scalable, a processor farming technique is used. Since each image is divided and distributed among the farming processors, scalability and load balancing are achieved naturally in the proposed algorithm. The algorithm reaches an efficiency of up to 95% on nine processors. However, the best task size is much larger for simple images because each pixel requires less computation. Efficiency degrades for large-granularity tasks because large tasks cause load imbalance. Overall, transputer systems behave as a good scalable parallel processing system with respect to the cost-performance ratio.
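
The trade-off between task size and efficiency described in this abstract can be sketched with a small simulation. This is only an illustration under stated assumptions, not the paper's implementation: the row costs are made up, `overhead` is a hypothetical fixed cost per distributed task, and the names `farm_makespan` and `farm_efficiency` are invented for this sketch.

```python
import heapq


def farm_makespan(task_costs, n_workers):
    """Simulate a farm: whichever worker becomes idle first takes the next task."""
    free_at = [0.0] * n_workers          # time at which each worker is next idle
    heapq.heapify(free_at)
    for cost in task_costs:
        start = heapq.heappop(free_at)   # earliest idle worker
        heapq.heappush(free_at, start + cost)
    return max(free_at)


def farm_efficiency(row_costs, rows_per_task, n_workers, overhead=0.0):
    """Efficiency = serial time / (workers * parallel time).

    `overhead` is a hypothetical fixed cost per task (distribution, result return).
    """
    tasks = [
        sum(row_costs[i:i + rows_per_task]) + overhead
        for i in range(0, len(row_costs), rows_per_task)
    ]
    serial_time = sum(row_costs)
    return serial_time / (n_workers * farm_makespan(tasks, n_workers))


if __name__ == "__main__":
    rows = [1.0] * 72                    # made-up uniform cost per image row
    for size in (1, 8, 24):
        print(size, round(farm_efficiency(rows, size, 9, overhead=0.5), 3))
```

In this toy model, very small tasks lose efficiency to per-task overhead, while very large tasks leave most of the nine workers idle, which mirrors the load imbalance the abstract reports for large-granularity tasks.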

Design to Chip with Multi-Access Memory System and Parallel Processor for 16 Processing Elements of Image Processing Purpose (영상처리용 16개의 처리기를 위한 다중접근기억장치 및 병렬처리기의 칩 설계)

  • Lim, Jae-Ho;Park, Seong-Mi;Park, Jong-Won
    • Journal of Korea Multimedia Society / v.14 no.11 / pp.1401-1408 / 2011
  • This dissertation presents a chip with a Multi-Access Memory System (MAMS) and a parallel processor with 16 processing elements for image processing. MAMS is a parallel-access memory system that can simultaneously access pixel data at arbitrary locations using eight access types. The interval between the accessed pixels can also be set. A parallel processor with built-in MAMS was realized in 2003, but its performance fell short of real-time processing for high-definition images. I designed an improved parallel processing system by adding and expanding the memory modules and processing elements of the previous design. It performs morphological closing at 3 times the speed of the previous system and 6 times the speed of a serial system.
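
For readers unfamiliar with the benchmark operation, morphological closing is a dilation followed by an erosion. The following is a minimal serial sketch, assuming a 3x3 square structuring element and ignoring out-of-bounds neighbours at the image border; neither choice is stated in the abstract, and the function names are hypothetical.

```python
def _morph(image, pick):
    """Apply `pick` (max or min) over each pixel's in-bounds 3x3 neighbourhood."""
    height, width = len(image), len(image[0])
    out = []
    for r in range(height):
        row = []
        for c in range(width):
            window = [
                image[rr][cc]
                for rr in range(max(r - 1, 0), min(r + 2, height))
                for cc in range(max(c - 1, 0), min(c + 2, width))
            ]
            row.append(pick(window))
        out.append(row)
    return out


def dilate(image):
    """Dilation: each pixel becomes the maximum of its neighbourhood."""
    return _morph(image, max)


def erode(image):
    """Erosion: each pixel becomes the minimum of its neighbourhood."""
    return _morph(image, min)


def closing(image):
    """Closing = erosion of the dilation; fills small dark holes."""
    return erode(dilate(image))
```

Every output pixel depends only on a small window of input pixels, which is why the operation maps well onto an array of processing elements fed by a memory system that can fetch several pixels at once.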

High Speed 8-Parallel FFT/IFFT Processor using Efficient Pipeline Architecture and Scheduling Scheme (효율적인 파이프라인 구조와 스케줄링 기법을 적용한 고속 8-병렬 FFT/IFFT 프로세서)

  • Kim, Eun-Ji;SunWoo, Myung-Hoon
    • The Journal of Korean Institute of Communications and Information Sciences / v.36 no.3C / pp.175-182 / 2011
  • This paper presents a novel eight-parallel 128/256-point mixed-radix multi-path delay commutator (MRMDC) FFT/IFFT processor for orthogonal frequency-division multiplexing (OFDM) systems. The proposed FFT architecture provides a high throughput rate and low hardware complexity by using an eight-parallel data-path scheme, a modified mixed-radix multi-path delay commutator structure, and an efficient scheduling scheme for complex multiplications. The scheduling scheme reduces the number of complex multipliers at the second stage from 88 to 40. The proposed FFT/IFFT processor has been designed and implemented in 90 nm CMOS technology. The proposed eight-parallel FFT/IFFT processor provides a throughput rate of up to 27.5 Gsample/s at 430 MHz.
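
As a software reference point for what such a processor computes, here is a plain recursive radix-2 FFT with its inverse. It is deliberately not the MRMDC architecture of the paper: it is the textbook decimation-in-time algorithm, and the function names are hypothetical.

```python
import cmath


def fft_radix2(x):
    """Recursive radix-2 decimation-in-time FFT; len(x) must be a power of two."""
    n = len(x)
    if n < 1 or (n > 1 and n % 2):
        raise ValueError("length must be a power of two")
    if n == 1:
        return [complex(x[0])]
    even = fft_radix2(x[0::2])
    odd = fft_radix2(x[1::2])
    out = [0j] * n
    for k in range(n // 2):
        # Twiddle factor times the odd-indexed half (one complex multiplication).
        t = cmath.exp(-2j * cmath.pi * k / n) * odd[k]
        out[k] = even[k] + t
        out[k + n // 2] = even[k] - t
    return out


def ifft_radix2(spectrum):
    """Inverse FFT via the conjugate trick: ifft(X) = conj(fft(conj(X))) / n."""
    n = len(spectrum)
    forward = fft_radix2([complex(v).conjugate() for v in spectrum])
    return [v.conjugate() / n for v in forward]
```

The twiddle-factor products in the inner loop are the complex multiplications that a hardware design tries to schedule onto as few multipliers as possible.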

A Parallel Loop Scheduling Algorithm on Multiprocessor System Environments (다중프로세서 시스템 환경에서 병렬 루프 스케쥴링 알고리즘)

  • 이영규;박두순
    • Journal of Korea Multimedia Society / v.3 no.3 / pp.309-319 / 2000
  • The purpose of parallel scheduling in a multiprocessor environment is to schedule a parallel application program with minimum synchronization overhead while balancing the load. Each processor calculates a chunk of iterations and is allocated that chunk to execute in parallel. In doing so, the processors frequently access mutually exclusive global memory, which imposes considerable scheduling overhead and creates a bottleneck. In addition, when the work differs among the chunks allocated to the processors, the differing execution times of the chunks cause load imbalance and degrade overall scheduling performance. In this paper, we investigate the problems of conventional algorithms with respect to minimizing scheduling overhead and balancing load. We then present a new parallel loop scheduling algorithm that considers data locality and processor affinity.
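
To make the chunk calculation concrete, the sketch below computes chunk sizes for guided self-scheduling, one of the conventional schemes this kind of work is usually compared against. It is not the algorithm proposed in the paper, whose details the abstract does not give; the function name is hypothetical.

```python
def guided_chunks(n_iterations, n_processors):
    """Chunk sizes under guided self-scheduling.

    Each request takes ceil(remaining / n_processors) iterations, so chunks
    start large and shrink as the loop nears completion.
    """
    if n_processors < 1:
        raise ValueError("need at least one processor")
    chunks = []
    remaining = n_iterations
    while remaining > 0:
        size = (remaining + n_processors - 1) // n_processors  # integer ceiling
        chunks.append(size)
        remaining -= size
    return chunks


if __name__ == "__main__":
    sizes = guided_chunks(100, 4)
    # Each chunk costs one access to the shared loop counter.
    print(len(sizes), "requests instead of 100 with one iteration per request")
```

Every chunk request is one synchronized access to the shared loop index, so fewer and larger chunks reduce scheduling overhead, while the small chunks at the end help even out the load.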

Implementation of the SIMT based Image Signal Processor for the Image Processing (영상처리를 위한 SIMT 기반 Image Signal Processor 구현)

  • Hwang, Yun-Seop;Jeon, Hee-Kyeong;Lee, Kwan-ho;Lee, Kwang-yeob
    • Journal of IKEEE / v.20 no.1 / pp.89-93 / 2016
  • In this paper, we propose a SIMT-based Image Signal Processor (ISP) that can apply various image preprocessing algorithms and allows parallel processing of application programs such as image recognition. A conventional ISP has hard-wired image enhancement algorithms whose processing speed is high, but it is difficult to optimize its performance for different image processing algorithms. The proposed ISP improves processing time by applying the SIMT architecture and, as an instruction-based processor, runs a variety of image processing algorithms. We used a Xilinx Virtex-7 board; compared with a cell multicore processor, an ARM Cortex-A9, and an ARM Cortex-A15, the processing time was reduced by about 71 percent, 63 percent, and 33 percent, respectively.

An Echo Processor for Medical Ultrasound Imaging Using a GPU with Massively Parallel Processing Architecture (병렬 처리 구조의 GPU를 이용한 의료 초음파 영상용 에코 신호 처리기)

  • Seo, Sin-Hyeok;Sohn, Hak-Yeol;Song, Tai-Kyong
    • Proceedings of the IEEK Conference / 2008.06a / pp.871-872 / 2008
  • The method and results of a software implementation of an echo processor for medical ultrasound imaging using a GPU (NVIDIA G80) are presented. The echo signal processing functions are restructured in a SIMD manner suited to the GPU's massively parallel architecture so that the GPU's 128 ALUs are utilized at nearly 100%. Preliminary results for an image frame composed of 128 scan lines, each with 10240 16-bit samples, show that the echo processor, implemented in C, can run at a high rate of 30 frames per second, which is close to that of optimized assembly code running on TI's TMS320C6416 DSP.
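
The figures in this abstract imply a sustained input rate that is easy to work out. The arithmetic below uses only the numbers quoted above; the even split across ALUs is an idealization added here for illustration.

```python
SCAN_LINES = 128          # scan lines per frame (from the abstract)
SAMPLES_PER_LINE = 10240  # samples per scan line (from the abstract)
BYTES_PER_SAMPLE = 2      # 16-bit samples
FRAMES_PER_SECOND = 30
ALUS = 128                # ALUs on the GPU (from the abstract)

samples_per_frame = SCAN_LINES * SAMPLES_PER_LINE
samples_per_second = samples_per_frame * FRAMES_PER_SECOND
bytes_per_second = samples_per_second * BYTES_PER_SAMPLE
# Idealized assumption: the samples are divided evenly among all ALUs.
samples_per_alu_per_second = samples_per_second // ALUS

if __name__ == "__main__":
    print(samples_per_frame, samples_per_second, bytes_per_second,
          samples_per_alu_per_second)
```

At 30 frames per second this is about 39.3 million samples, or roughly 78.6 MB of raw echo data, per second.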

A Memory Intensive Real-time 3x3 Neighborhood processor for Image Processing (Memory Intensive 실시간 영상신호처리용 3 $\times$ 3 Neighborhood VLSI 처리기)

  • 김진홍;남철우;우성일;김용태
    • Journal of the Korean Institute of Telematics and Electronics / v.27 no.6 / pp.963-971 / 1990
  • This paper proposes a memory-intensive VLSI architecture for realizing a real-time 3x3 neighborhood processor based on distributed arithmetic. The proposed architecture is characterized by bit-serial, multi-kernel parallel processing that exploits pixel-kernel parallelism and concurrency. The chip implements 8 neighborhood processing elements in parallel, with efficient input and output modules that operate concurrently. Besides the architectural design of the neighborhood processor, a design methodology based on the module generator concept has been considered, and MOGOT (MOdule Generator Oriented VLSI design Tool) has been built on a workstation. Using the MOGOT design environment, it has been shown that the main part of the suggested architecture can be designed efficiently in 2 μm double-metal CMOS technology. The design includes an input delay and data conversion module, a look-up table for the inner product operation, a carry-save accumulator, an output data converter and delay module, and a control module.
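
Distributed arithmetic replaces the multipliers of an inner product with a look-up table addressed by one bit from each input, processed bit-serially. The sketch below shows the idea for a 3x3 kernel, assuming unsigned 8-bit pixels and integer coefficients; these assumptions and the function names are illustrative and are not taken from the paper.

```python
def build_lut(coeffs):
    """Table entry `addr` holds the sum of the coefficients whose bit is set in `addr`."""
    n = len(coeffs)
    return [
        sum(c for k, c in enumerate(coeffs) if (addr >> k) & 1)
        for addr in range(1 << n)
    ]


def da_inner_product(pixels, lut, bits=8):
    """Compute sum(coeffs[k] * pixels[k]) bit-serially using table look-ups.

    Assumes every pixel is an unsigned integer below 2**bits.
    """
    acc = 0
    for b in range(bits):
        addr = 0
        for k, x in enumerate(pixels):
            addr |= ((x >> b) & 1) << k   # gather bit b of every pixel
        acc += lut[addr] << b             # shift-and-accumulate
    return acc


if __name__ == "__main__":
    laplacian = [0, -1, 0, -1, 4, -1, 0, -1, 0]   # example 3x3 kernel, row-major
    window = [10, 20, 30, 40, 200, 60, 70, 80, 90]
    print(da_inner_product(window, build_lut(laplacian)))
```

For nine inputs the table has 512 entries, which is the memory-for-logic trade that makes the approach "memory intensive": one look-up and one shifted addition per bit plane replace nine multiplications.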

Implementation of Micro-Web Server Using ARM Processor and Linux (ARM 프로세서와 LINUX를 이용한 마이크로 웹서버 구현)

  • Lee, Dong-Hoon;Han, Kyong-Ho
    • Proceedings of the KIPE Conference / 2002.07a / pp.697-700 / 2002
  • In this paper, we propose an implementation of a micro web server on a StrongARM processor running embedded Linux. The parallel port connected to parallel I/O is controlled via the HTTP protocol and a web browser. HTTP is ported to Linux, and the micro web server program and the port control program are installed in on-board memory and accessed through CGI from a web browser such as Internet Explorer or Netscape. 8-bit LEDs and DIP switches are connected to the processor port; the switch input status is monitored and the LED output is controlled from remote hosts via the Internet. The proposed embedded micro web server can be used in automation systems and in remote distributed control over the Internet using a web browser.

Reevaluating the overhead of data preparation for asymmetric multicore system on graphics processing

  • Pei, Songwen;Zhang, Junge;Jiang, Linhua;Kim, Myoung-Seo;Gaudiot, Jean-Luc
    • KSII Transactions on Internet and Information Systems (TIIS) / v.10 no.7 / pp.3231-3244 / 2016
  • As processor design transitions from homogeneous to heterogeneous multicore processors, the traditional Amdahl's law cannot meet the new challenges posed by asymmetric multicore systems. To further investigate the factors behind the Overhead of Data Preparation (ODP) in asymmetric multicore systems, we evaluate an asymmetric CPU-GPU multicore system by measuring the overheads of memory transfer, kernel computation, cache misses, and synchronization. This paper demonstrates that decreasing the overhead of data preparation is a promising approach to improving the overall performance of heterogeneous systems.
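
For reference, the traditional Amdahl's law bounds speedup by the serial fraction of a program. The sketch below implements that classic formula and adds an optional `overhead_fraction` term purely to illustrate how extra data-preparation time would lower the bound; that term is a hypothetical simplification introduced here, not the model developed in the paper.

```python
def amdahl_speedup(parallel_fraction, n_cores, overhead_fraction=0.0):
    """Speedup = 1 / ((1 - f) + f / n + overhead).

    `parallel_fraction` (f) is the share of run time that can be parallelized.
    `overhead_fraction` is a hypothetical extra cost, as a share of the original
    run time, standing in for data preparation such as memory transfers.
    """
    if not 0.0 <= parallel_fraction <= 1.0:
        raise ValueError("parallel_fraction must be between 0 and 1")
    if n_cores < 1:
        raise ValueError("n_cores must be at least 1")
    serial_fraction = 1.0 - parallel_fraction
    return 1.0 / (serial_fraction + parallel_fraction / n_cores + overhead_fraction)


if __name__ == "__main__":
    print(amdahl_speedup(0.9, 9))          # classic bound
    print(amdahl_speedup(0.9, 9, 0.05))    # with an assumed 5% preparation cost
```

In this toy setting, an overhead equal to 5% of the original run time lowers the achievable speedup from about 5 to about 4, which shows why reducing preparation overhead can matter as much as adding cores.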

Analysis of Programming Techniques for Creating Optimized CUDA Software (최적화된 CUDA 소프트웨어 제작을 위한 프로그래밍 기법 분석)

  • Kim, Sung-Soo;Kim, Dong-Heon;Woo, Sang-Kyu;Ihm, In-Sung
    • Journal of KIISE:Computing Practices and Letters / v.16 no.7 / pp.775-787 / 2010
  • Unlike general-purpose CPUs, GPUs have been specialized as many-core streaming processors and, thanks to their outstanding parallel computing capacity, are replacing CPUs in an increasing range of computations. In response to this trend, NVIDIA has recently released a new parallel computing architecture called CUDA (Compute Unified Device Architecture), which offers a flexible GPU programming environment for GPGPU (General Purpose GPU) computing. In general, programmers who use the CUDA API need a clear understanding of many aspects of the GPU's computing architecture to produce efficient parallel software. In this article, we explain several optimization techniques for CUDA programming that we have verified through extensive experimentation and trial and error, and review how those techniques affect code execution performance. In particular, we use a specific problem as an example to analyze several elements that affect performance, such as effective access to the hierarchical memory system, processor occupancy, and latency hiding. In conclusion, we present several directions that may be applied effectively in CUDA-based parallel programming.