• Title/Summary/Keyword: parallel architecture

Search Result 885, Processing Time 0.029 seconds

Low Complexity Digit-Parallel/Bit-Serial Polynomial Basis Multiplier (저복잡도 디지트병렬/비트직렬 다항식기저 곱셈기)

  • Cho, Yong-Suk
    • The Journal of Korean Institute of Communications and Information Sciences
    • /
    • v.35 no.4C
    • /
    • pp.337-342
    • /
    • 2010
  • In this paper, a new architecture for digit-parallel/bit-serial GF($2^m$) multiplier with low complexity is proposed. The proposed multiplier operates in polynomial basis of GF($2^m$) and produces multiplication results at a rate of one per D clock cycles, where D is the selected digit size. The digit-parallel/bit-serial multiplier is faster than bit-serial ones but with lower area complexity than bit-parallel ones. The most significant feature of the digit-parallel/bit-serial architecture is that a trade-off between hardware complexity and delay time can be achieved. But the traditional digit-parallel/bit-serial multiplier needs extra hardware for high speed. In this paper a new low complexity efficient digit-parallel/bit-serial multiplier is presented.

A Pipelined Parallel Optimized Design for Convolution-based Non-Cascaded Architecture of JPEG2000 DWT (JPEG2000 이산웨이블릿변환의 컨볼루션기반 non-cascaded 아키텍처를 위한 pipelined parallel 최적화 설계)

  • Lee, Seung-Kwon;Kong, Jin-Hyeung
    • Journal of the Institute of Electronics Engineers of Korea SD
    • /
    • v.46 no.7
    • /
    • pp.29-38
    • /
    • 2009
  • In this paper, a high performance pipelined computing design of parallel multiplier-temporal buffer-parallel accumulator is present for the convolution-based non-cascaded architecture aiming at the real time Discrete Wavelet Transform(DWT) processing. The convolved multiplication of DWT would be reduced upto 1/4 by utilizing the filter coefficients symmetry and the up/down sampling; and it could be dealt with 3-5 times faster computation by LUT-based DA multiplication of multiple filter coefficients parallelized for product terms with an image data. Further, the reutilization of computed product terms could be achieved by storing in the temporal buffer, which yields the saving of computation as well as dynamic power by 50%. The convolved product terms of image data and filter coefficients are realigned and stored in the temporal buffer for the accumulated addition. Then, the buffer management of parallel aligned storage is carried out for the high speed sequential retrieval of parallel accumulations. The convolved computation is pipelined with parallel multiplier-temporal buffer-parallel accumulation in which the parallelization of temporal buffer and accumulator is optimize, with respect to the performance of parallel DA multiplier, to improve the pipelining performance. The proposed architecture is back-end designed with 0.18um library, which verifies the 30fps throughput of SVGA(800$\times$600) images at 90MHz.

A 4-parallel Scheduling Architecture for High-performance H.264/AVC Deblocking Filter (고성능 H.264/AVC 디블로킹 필터를 위한 4-병렬 스케줄링 아키텍처)

  • Ko, Byung-Soo;Kong, Jin-Hyeung
    • Journal of the Institute of Electronics Engineers of Korea SD
    • /
    • v.49 no.8
    • /
    • pp.63-72
    • /
    • 2012
  • In this paper, we proposed a parallel architecture of line & block edge filter for high-performance H.264/AVC deblocking filter for Quad Full High Definition(Quad FHD) video real time processing. To improve throughput, we designed 4-parallel block edge filter with 16 line edge filter. To reduce internal buffer size and processing cycle, we scheduled 4-parallel zig-zag scan order as deblocking filtering order. To avoid data conflicts we placed 1 delay cycle between block edge filtering. We implemented interleaving buffer, as internal buffer of block edge filter, to sharing buffer for reducing buffer size. The proposed architecture was simulated in 0.18um standard cell library. The maximum operation frequency is 108MHz. The gate count is 140.16Kgates. The proposed H.264/AVC deblocking filter can support Quad FHD at 113.17 frames per second by running at 90MHz.

Analysis of Programming Techniques for Creating Optimized CUDA Software (최적화된 CUDA 소프트웨어 제작을 위한 프로그래밍 기법 분석)

  • Kim, Sung-Soo;Kim, Dong-Heon;Woo, Sang-Kyu;Ihm, In-Sung
    • Journal of KIISE:Computing Practices and Letters
    • /
    • v.16 no.7
    • /
    • pp.775-787
    • /
    • 2010
  • Unlike general-purpose CPUs, the GPUs have been specialized as many-core streaming processors, and are frequently replacing the CPUs in an increasing range of computations thanks to their outstanding parallel computing capacity. In order to respond to such trend, NVIDIA has recently issued a new parallel computing architecture called CUDA(Compute Unified Device Architecture), offering a flexible GPU programming environment for GPGPU(General Purpose GPU) computing. In general, when programmers use the CUDA API, they should clearly understand many aspects of GPU's computing architecture to produce efficient parallel software. In this article, we explain several optimization techniques for CUDA programming that we have verified through a lot of experiment and trial and error, and review how those techniques affect the performance of code execution. In particular, we use a specific problem as an example to analyze several elements that affect performances, such as effective accesses to hierarchical memory system, processor occupancy, and latency hiding. In conclusion, we present several directions that may be utilized effectively in CUDA-based parallel programming.

A architecture for parallel rendering processor with by effective memory organization (효과적인 메모리 구조를 갖는 병렬 렌더링 프로세서 구조)

  • Kim, Kyung-Su;Yoon, Duk-Ki;Kim, Il-San;Park, Woo-Chan
    • Journal of Korea Game Society
    • /
    • v.5 no.3
    • /
    • pp.39-47
    • /
    • 2005
  • Current rendering processors are organized mainly to process a triangle as fast as possible and recently parallel 3D rendering processors, which can process multiple triangles in parallel with multiple rasterizers, begin to appear. For high performance in processing triangles, it is desirable for each rasterizer have its own local pixel cache. However, the consistency problem may occur in accessing the data at the same address simulaneously by more than one rasterizer. In this paper, we propose a parallel rendering processor architecture resolving such consistency problem effectively. Moreover, the proposed architecture reduces the latency due to a pixel cache miss significantly. The experimental results show that proposed architecture achieves almost linear speedup at best case even in sixteen rasterizer

  • PDF

A Study on the Parallel Drive of Cold Cathode Fluorescent Lamp (CCFL) (냉음극 형광램프의 병렬구동에 관한 연구)

  • Kim, Cherl-Jin;Park, Hyun-Cherl;Park, Jung-Oh
    • Proceedings of the KIEE Conference
    • /
    • 2008.04c
    • /
    • pp.149-151
    • /
    • 2008
  • This paper presents an architecture for driving multiple parallel cold cathode fluorescent lamps (CCFLs) for back lighting applications. The key to the architecture is a proposed capacitive coupling approach for lamp ignition. This system is consist of a flyback converter, a single inverter to drive multiple lamps and conductive floating reflector. The topology is capable of driving a number of parallel lamps with independent accurate lamp, current regulation and improving cost effectiveness with significant reduction in size and weight, compared to typical high frequency ac ballast. Experimental demonstration results for ten of parallel CCFLs with simultaneous ignition.

  • PDF

Integer-Pel Motion Estimation for HEVC on Compute Unified Device Architecture (CUDA)

  • Lee, Dongkyu;Sim, Donggyu;Oh, Seoung-Jun
    • IEIE Transactions on Smart Processing and Computing
    • /
    • v.3 no.6
    • /
    • pp.397-403
    • /
    • 2014
  • A new video compression standard called High Efficiency Video Coding (HEVC) has recently been released onto the market. HEVC provides higher coding performance compared to previous standards, but at the cost of a significant increase in encoding complexity, particularly in motion estimation (ME). At the same time, the computing capabilities of Graphics Processing Units (GPUs) have become more powerful. This paper proposes a parallel integer-pel ME (IME) algorithm for HEVC on GPU using the Compute Unified Device Architecture (CUDA). In the proposed IME, concurrent parallel reduction (CPR) is introduced. CPR performs several parallel reduction (PR) operations concurrently to solve two problems in conventional PR; low thread utilization and high thread synchronization latency. The proposed encoder reduces the portion of IME in the encoder to almost zero with a 2.3% increase in bitrate. In terms of IME, the proposed IME is up to 172.6 times faster than the IME in the HEVC reference model.

Pipelined Parallel Processing System for Image Processing (영상처리를 위한 Pipelined 병렬처리 시스템)

  • Lee, Hyung;Kim, Jong-Bae;Choi, Sung-Hyk;Park, Jong-Won
    • Journal of IKEEE
    • /
    • v.4 no.2 s.7
    • /
    • pp.212-224
    • /
    • 2000
  • In this paper, a parallel processing system is proposed for improving the processing speed of image related applications. The proposed parallel processing system is fully synchronous SIMD computer with pipelined architecture and consists of processing elements and a multi-access memory system. The multi-access memory system is made up of memory modules and a memory controller, which consists of memory module selection module, data routing module, and address calculating and routing module, to perform parallel memory accesses with the variety of types: block, horizontal, and vertical access way. Morphological filter had been applied to verify the parallel processing system and resulted in faithful processing speed.

  • PDF

Parallel Processing of the Fuzzy Fingerprint Vault based on Geometric Hashing

  • Chae, Seung-Hoon;Lim, Sung-Jin;Bae, Sang-Hyun;Chung, Yong-Wha;Pan, Sung-Bum
    • KSII Transactions on Internet and Information Systems (TIIS)
    • /
    • v.4 no.6
    • /
    • pp.1294-1310
    • /
    • 2010
  • User authentication using fingerprint information provides convenience as well as strong security. However, serious problems may occur if fingerprint information stored for user authentication is used illegally by a different person since it cannot be changed freely as a password due to a limited number of fingers. Recently, research in fuzzy fingerprint vault system has been carried out actively to safely protect fingerprint information in a fingerprint authentication system. In addition, research to solve the fingerprint alignment problem by applying a geometric hashing technique has also been carried out. In this paper, we propose the hardware architecture for a geometric hashing based fuzzy fingerprint vault system that consists of the software module and hardware module. The hardware module performs the matching for the transformed minutiae in the enrollment hash table and verification hash table. On the other hand, the software module is responsible for hardware feature extraction. We also propose the hardware architecture which parallel processing technique is applied for high speed processing. Based on the experimental results, we confirmed that execution time for the proposed hardware architecture was 0.24 second when number of real minutiae was 36 and number of chaff minutiae was 200, whereas that of the software solution was 1.13 second. For the same condition, execution time of the hardware architecture which parallel processing technique was applied was 0.01 second. Note that the proposed hardware architecture can achieve a speed-up of close to 100 times compared to a software based solution.

Code Size Reduction and Execution performance Improvement with Instruction Set Architecture Design based on Non-homogeneous Register Partition (코드감소와 성능향상을 위한 이질 레지스터 분할 및 명령어 구조 설계)

  • Kwon, Young-Jun;Lee, Hyuk-Jae
    • The Transactions of the Korean Institute of Electrical Engineers A
    • /
    • v.48 no.12
    • /
    • pp.1575-1579
    • /
    • 1999
  • Embedded processors often accommodate two instruction sets, a standard instruction set and a compressed instruction set. With the compressed instruction set, code size can be reduced while instruction count (and consequently execution time) can be increased. To achieve code size reduction without significant increase of execution time, this paper proposes a new compressed instruction set architecture, called TOE (Two Operations Execution). The proposed instruction set format includes the parallel bit that indicates an instruction can be executed simultaneously with the next instruction. To add the parallel bit, TOE instruction format reduces the destination register field. The reduction of the register field limits the number of registers that are accessible by an instruction. To overcome the limited accessibility of registers, TOE adapts non-homogeneous register partition in which registers are divided into multiple subsets, each of which are accessed by different groups of instructions. With non-homogeneous registers, each instruction can access only a limited number of registers, but an entire program can access all available registers. With efficient non-homogeneous register allocator, all registers can be used in a balanced manner. As a result, the increase of code size due to register spills is negligible. Experimental results show that more than 30% of TOE instructions can be executed in parallel without significant increase of code size when compared to existing Thumb instruction set.

  • PDF