• Title/Summary/Keyword: Parallel pipeline


FPGA-Based Low-Power and Low-Cost Portable Beamformer Design (FPGA 기반 저전력 및 저비용 휴대용 빔포머 설계)

  • Jeong, GabJoong;Park, CheolYoung
    • Journal of Korea Society of Industrial Information Systems / v.24 no.1 / pp.31-38 / 2019
  • In this paper, we develop a beamforming front-end platform with a pipelined circuit configuration that can serve various clinical diagnostic applications of ultrasound imaging. The hardware design targets compression applications as well as scalable applications where power, integration level, and replicability are important. The firmware design achieves an optimal level of FPGA parallel processing by building new IP in a system-oriented design environment, using the next-generation high-level synthesis tool Vivado HLS to maximize design productivity. The beamformer supports high-speed management of scan data, so the image area can be defined arbitrarily and appropriately corrected and supplemented when system specifications are reconfigured or changed in the future.
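A minimal delay-and-sum sketch of the beamforming front-end computation described above, written in Python/NumPy. The channel count, sample length, and delay profile are illustrative assumptions, not parameters from the paper; the point is only to show the per-channel delay followed by summation that the pipelined hardware would perform.

```python
import numpy as np

def delay_and_sum(rf_data, delays_samples):
    """rf_data: (channels, samples) echo data; delays_samples: integer focusing delay per channel."""
    channels, samples = rf_data.shape
    out = np.zeros(samples)
    for ch in range(channels):
        d = delays_samples[ch]
        # Shift each channel by its focusing delay, then accumulate into the beam sum.
        out[d:] += rf_data[ch, :samples - d]
    return out

# Toy example: 8 channels, 1024 samples of synthetic data, linear delay profile.
rng = np.random.default_rng(0)
rf = rng.standard_normal((8, 1024))
focused = delay_and_sum(rf, np.arange(8))
```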

An Architecture of a high efficient ALU for 3D Graphics Shader Processor (3D 그래픽 쉐이더 프로세서를 위한 고효율 연산기 구조)

  • Kim, Woo-Young;Lee, Bo-Haeng;Lee, Kwang-Yeob;Park, Tae-ryung
    • Proceedings of the Korean Institute of Information and Communication Sciences Conference / 2009.05a / pp.229-232 / 2009
  • In this paper, we propose a new programmable shader architecture based on an efficient ALU design. Today's mobile devices need programmable shader processors for three-dimensional (3D) graphics. Programmable shader processors require a larger ALU than the fixed-pipeline ALUs used previously. The proposed ALU architecture can execute two different arithmetic operations at the same time: two instructions that need mutually exclusive ALU operations are fed to the instruction decoders in parallel. Experimental results show that the number of instruction cycles can be reduced by up to 40%.
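A toy software model of the co-issue idea in the abstract: two instructions that occupy mutually exclusive ALU sub-units (here, an adder and a multiplier) are allowed to share a cycle. The opcode names and the two-slot decode are illustrative assumptions, not the paper's ISA or datapath.

```python
ADD_UNIT = {"add", "sub"}   # operations handled by the adder sub-unit
MUL_UNIT = {"mul"}          # operations handled by the multiplier sub-unit

def unit_needed(op):
    return "add" if op in ADD_UNIT else "mul"

def can_co_issue(instr_a, instr_b):
    """Two instructions may share a cycle when they need different sub-units."""
    return unit_needed(instr_a[0]) != unit_needed(instr_b[0])

def execute(op, x, y):
    return {"add": x + y, "sub": x - y, "mul": x * y}[op]

# Example: an add and a mul use different sub-units, so both complete in one "cycle".
a, b = ("add", 3, 4), ("mul", 5, 6)
if can_co_issue(a, b):
    results = (execute(*a), execute(*b))
```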


Energy Efficient and Low-Cost Server Architecture for Hadoop Storage Appliance

  • Choi, Do Young;Oh, Jung Hwan;Kim, Ji Kwang;Lee, Seung Eun
    • KSII Transactions on Internet and Information Systems (TIIS) / v.14 no.12 / pp.4648-4663 / 2020
  • This paper proposes a Lempel-Ziv 4 (LZ4) compression accelerator optimized for scale-out servers in data centers. To reduce the CPU load caused by compression, we propose an accelerator solution and implement it on a Field Programmable Gate Array (FPGA) as heterogeneous computing. The LZ4 compression hardware accelerator is a fully pipelined architecture that applies 16 dictionaries to increase parallelism for high-throughput compression. It is based on a 20-stage pipeline and a dictionary architecture highly customized to the LZ4 compression algorithm and to parallel hardware implementation. The proposed dictionary architecture achieves high throughput by comparing input sequences against multiple dictionaries simultaneously rather than against a single dictionary. The experimental results show high throughput when intensively optimized on the FPGA. Additionally, we compare our implementation with CPU implementations of LZ4 to provide insights for FPGA-based data centers. The proposed accelerator achieves a compression throughput of 639 MB/s with fine-grained parallelism suitable for deployment in scale-out servers. This approach enables a low-power Intel Atom processor to realize Hadoop storage alongside the compression accelerator.
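A small software sketch of the multi-dictionary match search the abstract describes: the current 4-byte sequence is hashed and all 16 dictionaries are probed for an earlier occurrence, which in hardware happens in parallel. The hash function, dictionary size, and indexing scheme are simplified assumptions, not the accelerator's exact design.

```python
NUM_DICTS = 16

def hash4(seq: bytes) -> int:
    return (int.from_bytes(seq, "little") * 2654435761) & 0xFFFFFFFF

def find_match(data: bytes, pos: int, dicts):
    """Probe all dictionaries for an earlier position holding the same 4 bytes."""
    key = data[pos:pos + 4]
    h = hash4(key)
    best = None
    for d in range(NUM_DICTS):
        slot = (h >> d) % 4096            # toy per-dictionary indexing
        cand = dicts[d].get(slot)
        if cand is not None and data[cand:cand + 4] == key:
            best = cand                   # a previous occurrence was found
        dicts[d][slot] = pos              # record the current position
    return best

dicts = [dict() for _ in range(NUM_DICTS)]
data = b"abcdabcdabcd"
matches = [find_match(data, p, dicts) for p in range(len(data) - 4)]
```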

Fully parallel low-density parity-check code-based polar decoder architecture for 5G wireless communications

  • Dinesh Kumar Devadoss;Shantha Selvakumari Ramapackiam
    • ETRI Journal / v.46 no.3 / pp.485-500 / 2024
  • A hardware architecture is presented to decode (N, K) polar codes based on a low-density parity-check code-like decoding method. By applying suitable pruning techniques to the dense graph of the polar code, the decoder architectures are optimized to use fewer check nodes (CNs) and variable nodes (VNs). Pipelining is introduced in the CN and VN architectures, reducing the critical path delay. Latency is reduced further by a fully parallelized, single-stage architecture, compared with the log N stages of the conventional belief propagation (BP) decoder. The designed decoder for short-to-intermediate code lengths was implemented on a Virtex-7 field-programmable gate array (FPGA). It achieved a throughput of 2.44 Gbps, which is four times and 1.4 times higher than those of the fast-simplified successive cancellation and combinational decoders, respectively. The proposed decoder for the (1024, 512) polar code yielded a negligible bit error rate of 10⁻⁴ at an Eb/N0 of 2.7 dB. It converged faster than the BP decoding scheme on a dense parity-check matrix. Moreover, the proposed decoder was also implemented on a Xilinx UltraScale FPGA and verified against the fifth-generation (5G) new radio physical downlink control channel specification. The superior error-correcting performance and better hardware efficiency make our decoder a suitable alternative to the successive cancellation list decoders used in 5G wireless communication.
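For reference, a NumPy sketch of the generic min-sum form of the check-node and variable-node message updates that LDPC-style belief-propagation decoders compute; this is the computational core that the abstract pipelines in hardware. The paper's pruned polar-code graph and scheduling are not modeled here, and the min-sum approximation itself is an assumption on my part.

```python
import numpy as np

def check_node(msgs):
    """Min-sum CN update: sign product and minimum magnitude over the other edges."""
    total_sign = np.prod(np.sign(msgs))
    out = np.empty_like(msgs)
    for i in range(len(msgs)):
        others = np.delete(msgs, i)
        out[i] = total_sign * np.sign(msgs[i]) * np.min(np.abs(others))
    return out

def variable_node(channel_llr, msgs):
    """VN update: channel LLR plus all incoming CN messages except the edge's own."""
    total = channel_llr + np.sum(msgs)
    return total - msgs

llrs = np.array([1.2, -0.4, 2.0])
cn_out = check_node(llrs)
vn_out = variable_node(0.8, cn_out)
```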

A Study on Machine Learning Compiler and Modulo Scheduler (머신러닝 컴파일러와 모듈로 스케쥴러에 관한 연구)

  • Doosan Cho
    • Journal of the Korean Society of Industry Convergence / v.27 no.1 / pp.87-95 / 2024
  • This study concerns modulo scheduling algorithms for multicore processors in machine learning applications. Machine learning algorithms perform large numbers of vector and matrix operations in order to process large data streams quickly. To support such computation volumes, processor architectures for applications such as artificial intelligence, neural networks, and machine learning are designed as parallel processors, typically multicore. To utilize these multicore hardware resources effectively, various compiler techniques are used and studied. Among these techniques, this study analyzes the modulo scheduler, which is especially important for a single core's computation pipeline. The paper examines and compares the iterative modulo scheduler and the swing modulo scheduler, the most widely used and studied variants. Both schedulers provided similar performance results, and when register pressure was measured as an indicator, the swing modulo scheduler performed slightly better. The study also proposes a technique that splits recurrence edges to improve the minimum initiation interval of the modulo schedulers.
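As a minimal sketch of the quantity the abstract targets, the code below computes the two standard lower bounds a modulo scheduler starts from: the resource-constrained MII (ResMII) and the recurrence-constrained MII (RecMII). The loop body, latencies, and unit counts are illustrative assumptions, not measurements from the paper; splitting a recurrence edge, as the paper proposes, would shrink the (latency, distance) entries and thus RecMII.

```python
import math

def res_mii(op_counts, unit_counts):
    """ResMII = max over resources of ceil(ops using the resource / units of that kind)."""
    return max(math.ceil(op_counts[r] / unit_counts[r]) for r in op_counts)

def rec_mii(cycles):
    """RecMII = max over recurrence cycles of ceil(latency sum / dependence distance)."""
    return max(math.ceil(lat / dist) for lat, dist in cycles)

ops = {"alu": 6, "mem": 3}          # operations per loop iteration (assumed)
units = {"alu": 2, "mem": 1}        # functional units of each kind (assumed)
recurrences = [(4, 1), (6, 2)]      # (latency sum, distance) per recurrence cycle (assumed)

mii = max(res_mii(ops, units), rec_mii(recurrences))   # the scheduler starts at this II
```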

Parallel Inverse Transform and Small-sized Inverse Quantization Architectures Design of H.264/AVC Decoder (H.264/AVC 복호기의 병렬 역변환 구조 및 저면적 역양자화 구조 설계)

  • Jung, Hong-Kyun;Cha, Ki-Jong;Park, Seung-Yong;Kim, Jin-Young;Ryoo, Kwang-Ki
    • Proceedings of the Korean Institute of Information and Communication Sciences Conference / 2011.10a / pp.444-447 / 2011
  • In this paper, a parallel IT (inverse transform) architecture and an IQ (inverse quantization) architecture with a common operation unit are proposed for the H.264/AVC decoder. By using the common operation unit, the area cost and computational complexity of the IQ are reduced. So that the IT takes only four execution cycles, the proposed IT architecture is parallel, with one horizontal DCT unit and four vertical DCT units. Furthermore, the execution time of the proposed architecture is reduced to five cycles by applying a two-stage pipeline. The proposed architecture is implemented as a single chip in MagnaChip 0.18 µm CMOS technology. The gate count of the proposed architecture is 14.3k at a clock frequency of 13 MHz, and the area of the proposed IQ is reduced by 39.6% compared with the previous design. The experimental results show that the execution-cycle performance of the proposed architecture is about 49.09% higher than that of the previous design.
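A small behavioral reference for the operation being parallelized: the standard H.264/AVC 4x4 inverse integer-transform butterfly, applied first as a horizontal pass over the rows and then as vertical passes over the columns, mirroring the one-horizontal/four-vertical split in the abstract. The final rounding shift (>> 6) defined by the standard is omitted here for brevity; this is a software sketch, not the paper's datapath.

```python
def it_1d(w0, w1, w2, w3):
    """H.264 4x4 inverse core transform, one row or column."""
    e0, e1 = w0 + w2, w0 - w2
    e2, e3 = (w1 >> 1) - w3, w1 + (w3 >> 1)
    return e0 + e3, e1 + e2, e1 - e2, e0 - e3

def inverse_transform_4x4(block):
    """block: 4x4 coefficient lists; returns the 4x4 residual before final rounding."""
    rows = [list(it_1d(*row)) for row in block]                             # horizontal pass
    cols = [it_1d(*[rows[r][c] for r in range(4)]) for c in range(4)]       # vertical passes
    return [[cols[c][r] for c in range(4)] for r in range(4)]

# DC-only block as a quick check: every output sample gets the same value.
residual = inverse_transform_4x4([[64, 0, 0, 0]] + [[0, 0, 0, 0]] * 3)
```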


Cross-Lingual Style-Based Title Generation Using Multiple Adapters (다중 어댑터를 이용한 교차 언어 및 스타일 기반의 제목 생성)

  • Yo-Han Park;Yong-Seok Choi;Kong Joo Lee
    • KIPS Transactions on Software and Data Engineering / v.12 no.8 / pp.341-354 / 2023
  • The title of a document is a brief summary of the document. Readers can understand a document easily if we provide its title in their preferred style and language. In this research, we propose a cross-lingual and style-based title generation model using multiple adapters. To train the model, we would need a parallel corpus in several languages with different styles. It is quite difficult to construct such a parallel corpus; however, a monolingual title generation corpus of a single style can be built easily. Therefore, we apply a zero-shot strategy to generate a title in a different language and with a different style for an input document. The baseline model is a Transformer consisting of an encoder and a decoder, pre-trained on several languages. The model is then equipped with multiple adapters for translation, languages, and styles. After the model learns a translation task from the parallel corpus, it learns a title generation task from the monolingual title generation corpus. When training the model on a task, we activate only the adapter that corresponds to that task. When generating a cross-lingual and style-based title, we activate only the adapters that correspond to the target language and the target style. Experimental results show that our proposed model performs comparably to a pipeline model that first translates into the target language and then generates a title. There have been significant changes in natural language generation due to the emergence of large-scale language models; however, research on improving natural language generation with limited resources and limited data needs to continue, and this study explores the significance of such research.
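A toy sketch of the adapter-switching idea: a shared representation passes only through the adapters selected for the current task (at generation time, a language adapter and a style adapter). The bottleneck shape, adapter names, and residual form are illustrative assumptions, not the paper's actual model.

```python
import numpy as np

def make_adapter(dim=16, bottleneck=4, seed=0):
    rng = np.random.default_rng(seed)
    down = rng.standard_normal((dim, bottleneck))
    up = rng.standard_normal((bottleneck, dim))
    return lambda h: h + np.maximum(h @ down, 0) @ up   # residual bottleneck adapter

adapters = {
    "lang:ko": make_adapter(seed=1),
    "lang:en": make_adapter(seed=2),
    "style:headline": make_adapter(seed=3),
}

def forward(hidden, active):
    """Apply only the adapters named in `active`; all others stay unused (frozen)."""
    for name in active:
        hidden = adapters[name](hidden)
    return hidden

h = np.zeros(16)
title_repr = forward(h, active=["lang:en", "style:headline"])  # zero-shot combination
```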

New VLSI Architecture of Parallel Multiplier-Accumulator Based on Radix-2 Modified Booth Algorithm (Radix-2 MBA 기반 병렬 MAC의 VLSI 구조)

  • Seo, Young-Ho;Kim, Dong-Wook
    • Journal of the Institute of Electronics Engineers of Korea SD / v.45 no.4 / pp.94-104 / 2008
  • In this paper, we propose a new multiplier-and-accumulator (MAC) architecture for high-speed multiply-accumulate arithmetic. Performance is improved by combining multiplication with accumulation and devising a hybrid carry-save adder (CSA). Since the accumulator, which has the largest delay in a MAC, is removed and its function is absorbed into the CSA, overall performance is raised. The proposed CSA tree uses a 1's-complement-based radix-2 modified Booth algorithm (MBA) and a modified sign-extension array to increase the bit density of the operands. The CSA propagates the carries generated by the least significant bits of the partial products and produces the least significant result bits in advance, reducing the number of input bits of the final adder. The proposed MAC also accumulates intermediate results as sum and carry bits rather than as the output of the final adder, improving performance by optimizing the pipeline. The proposed architecture was designed and then synthesized with 250 nm, 180 nm, 130 nm, and 90 nm standard CMOS libraries. We analyzed the results in terms of hardware resources, delay, and pipelining, based on both theoretical and experimental estimation, using Sakurai's alpha-power law for delay modeling. The proposed MAC is superior to the standard design in many respects, and its performance is about twice that of previous research at a similar clock frequency.
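A short software sketch of the two ingredients combined above: Booth recoding of the multiplier into signed digits, and carry-save accumulation of the partial products together with the running MAC value, with a single carry-propagate add at the end. This is plain radix-2 Booth recoding in Python; the paper's 1's-complement trick, sign-extension array, and final-adder optimizations are not modeled.

```python
def booth_digits(multiplier, bits):
    """Radix-2 Booth recoding: digit_i = b_(i-1) - b_i, with b_(-1) = 0."""
    prev = 0
    for i in range(bits):
        b = (multiplier >> i) & 1
        yield prev - b            # digit in {-1, 0, +1}
        prev = b

def carry_save_add(s, c, x):
    """One 3:2 compressor step on whole words: sum and carry stay unresolved."""
    return s ^ c ^ x, ((s & c) | (s & x) | (c & x)) << 1

def mac(acc, a, b, bits=8):
    s, c = acc, 0
    for i, d in enumerate(booth_digits(b, bits)):
        s, c = carry_save_add(s, c, (d * a) << i)   # add partial product d*a*2^i
    return s + c                                    # final carry-propagate addition

assert mac(10, 6, 7) == 10 + 6 * 7
```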

The Hardware Design of Effective Deblocking Filter for HEVC Encoder (HEVC 부호기를 위한 효율적인 디블록킹 하드웨어 설계)

  • Park, Jae-Ha;Park, Seung-yong;Ryoo, Kwang-ki
    • Proceedings of the Korean Institute of Information and Communication Sciences Conference / 2014.10a / pp.755-758 / 2014
  • In this paper, we propose an efficient deblocking filter hardware architecture for a High Efficiency Video Coding (HEVC) encoder. The proposed deblocking filter features shorter processing time, a filter ordering for low-area design, an effective memory architecture, and a four-stage pipeline for a high-performance HEVC encoder. The proposed filter ordering reduces delay through preprocessing, supports real-time single-port SRAM reads and writes, and enables parallel processing with two filters. Using 10 memories is effective for resolving the hazard caused by single-port SRAM. The proposed filter also suits low-power design by applying clock gating within the four-stage pipeline. The proposed deblocking filter architecture is designed in Verilog HDL and implemented with 100k logic gates in a TSMC 0.18 µm process. At 150 MHz, the proposed deblocking filter can support 4K Ultra HD video encoding at 30 fps, and it can operate at a maximum speed of 200 MHz.
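For orientation, a compact sketch of the per-edge decision that drives an HEVC-style deblocking filter: a boundary strength (BS) is derived for each block edge, and only edges with BS > 0 and enough local activity are filtered. The rules below follow the standard's outline in simplified form; the paper's filter ordering, pipeline, and memory layout are not modeled.

```python
def boundary_strength(p_intra, q_intra, p_has_coeff, q_has_coeff, mv_differ):
    """Simplified HEVC-style BS: 2 for intra edges, 1 for coded/moving edges, else 0."""
    if p_intra or q_intra:
        return 2
    if p_has_coeff or q_has_coeff or mv_differ:
        return 1
    return 0

def should_filter(p, q, beta):
    """Local activity check across the edge; p and q are sample columns on each side."""
    d = abs(p[0] - 2 * p[1] + p[2]) + abs(q[0] - 2 * q[1] + q[2])
    return d < beta

bs = boundary_strength(False, False, True, False, False)            # -> 1, candidate edge
filter_it = bs > 0 and should_filter([60, 62, 63, 64], [70, 71, 71, 72], beta=24)
```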


Highly Efficient and Low Power FIR Filter Chip for PRML Read Channel (PRML Read Channel용 고효율, 저전력 FIR 필터 칩)

  • Kang, Jin Yong;Jo, Byung Gak;Sunwoo, Myung Hoon
    • Journal of the Institute of Electronics Engineers of Korea SD / v.41 no.9 / pp.115-124 / 2004
  • This paper proposes a highly efficient, low-power FIR filter chip for partial-response maximum-likelihood (PRML) disk drive read channels; it is a 6-bit, 8-tap digital FIR filter. The proposed filter employs a parallel processing architecture and consists of four pipeline stages. It uses the modified Booth algorithm for multiplication and compressor logic for addition. CMOS pass-transistor logic is used for low power consumption, and single-rail logic is used to reduce the chip area. The proposed filter was actually implemented: the chip dissipates 120 mW at 100 MHz, uses a 3.3 V power supply, and occupies 1.88 × 1.38 mm². The implemented filter requires approximately 11.7% less power than existing architectures built in similar technology.
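A short behavioral sketch of the 8-tap FIR computation such a chip performs. The staged pairwise additions below only mirror the idea of splitting the work across pipeline stages (multiply, two compression steps, final add); the tap values are illustrative assumptions and the gate-level design is not modeled.

```python
def fir_8tap(samples, coeffs):
    """Direct-form 8-tap FIR: y[n] = sum_k c[k] * x[n-k]."""
    out = []
    history = [0] * len(coeffs)
    for x in samples:
        history = [x] + history[:-1]                                   # shift register of past inputs
        products = [c * h for c, h in zip(coeffs, history)]            # stage 1: 8 multipliers
        partial = [products[i] + products[i + 1] for i in range(0, 8, 2)]  # stage 2: 4 adders
        partial = [partial[0] + partial[1], partial[2] + partial[3]]       # stage 3: 2 adders
        out.append(partial[0] + partial[1])                                # stage 4: final adder
    return out

coeffs = [1, 2, 3, 4, 4, 3, 2, 1]                       # toy symmetric taps
y = fir_8tap([0, 1, 0, 0, 0, 0, 0, 0, 0], coeffs)       # impulse response reproduces the taps
```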