• Title/Summary/Keyword: Hardware Architecture

Search Result 1,324, Processing Time 0.027 seconds

Digit-serial VLSI Architecture for Lifting-based Discrete Wavelet Transform (리프팅 기반 이산 웨이블렛 변환의 디지트 시리얼 VLSI 구조)

  • Ryu, Donghoon;Park, Taegeun
    • Journal of the Institute of Electronics and Information Engineers
    • /
    • v.50 no.1
    • /
    • pp.157-165
    • /
    • 2013
  • In this paper, efficient digit-serial VLSI architecture for 1D (9,7) lifting-based discrete wavelet transform (DWT) filter has been proposed. The proposed architecture computes the DWT in digit basis, so that the required hardware is reduced. Also, the multiplication is replaced with the shift and add operation to minimize the hardware requirement. Bit allocation for input, output, and the internal data has been determined by analyzing the PSNR. We have carefully designed the data feedback latency not to degrade the performance in the recursive folded scheduling. The proposed digit-serial architecture requires small amount of hardware but achieve 100% of hardware utilization, so we try to optimize the tradeoffs between the hardware cost and the performance. The proposed architecture has been designed and verified by VerilogHDL and synthesized by Synopsys Design Compiler with a DongbuHitek $0.18{\mu}m$ STD cell library. The maximum operating frequency is 330MHz with 3,770 gates in equivalent two input NAND gates.

A study on the Cost-effective Architecture Design of High-speed Soft-decision Viterbi Decoder for Multi-band OFDM Systems (Multi-band OFDM 시스템용 고속 연판정 비터비 디코더의 효율적인 하드웨어 구조 설계에 관한 연구)

  • Lee, Seong-Joo
    • Journal of the Institute of Electronics Engineers of Korea SD
    • /
    • v.43 no.11 s.353
    • /
    • pp.90-97
    • /
    • 2006
  • In this paper, we present a cost-effective architecture of high-speed soft-decision Viterbi decoder for Multi-band OFDM(MB-OFDM) systems. In the design of modem for MB-OFDM systems, a parallel processing architecture is general]y used for the reliable hardware implementation, because the systems should support a very high-speed data rate of at most 480Mbps. A Viterbi decoder also should be designed by using a parallel processing structure and support a very high-speed data rate. Therefore, we present a optimized hardware architecture for 4-way parallel processing Viterbi decoder in this paper. In order to optimize the hardware of Viterbi decoder, we compare and analyze various ACS architectures and find the optimal one among them with respect to hardware complexity and operating frequency The Viterbi decoder with a optimal hardware architecture is designed and verified by using Verilog HDL, and synthesized into gate-level circuits with TSMC 0.13um library. In the synthesis results, we find that the Viterbi decoder contains about 280K gates and works properly at the speed required in MB-OFDM systems.

An Intra Prediction Hardware Design for High Performance HEVC Encoder (고성능 HEVC 부호기를 위한 화면내 예측 하드웨어 설계)

  • Park, Seung-yong;Guard, Kanda;Ryoo, Kwang-ki
    • Proceedings of the Korean Institute of Information and Commucation Sciences Conference
    • /
    • 2015.10a
    • /
    • pp.875-878
    • /
    • 2015
  • In this paper, we propose an intra prediction hardware architecture with less processing time, computations and reduced hardware area for a high performance HEVC encoder. The proposed intra prediction hardware architecture uses common operation units to reduce computational complexity and uses $4{\times}4$ block unit to reduce hardware area. In order to reduce operation time, common operation unit uses one operation unit to generate predicted pixels and filtered pixels in all prediction modes. Intra prediction hardware architecture introduces the $4{\times}4$ PU design processing to reduce the hardware area and uses intemal registers to support $32{\times}32$ PU processmg. The proposed hardware architecture uses ten common operation units which can reduce execution cycles of intra prediction. The proposed Intra prediction hardware architecture is designed using Verilog HDL(Hardware Description Language), and has a total of 41.5k gates in TSMC $0.13{\mu}m$ CMOS standard cell library. At 150MHz, it can support 4K UHD video encoding at 30fps in real time, and operates at a maximum of 200MHz.

  • PDF

Separating VNF and Network Control for Hardware-Acceleration of SDN/NFV Architecture

  • Duan, Tong;Lan, Julong;Hu, Yuxiang;Sun, Penghao
    • ETRI Journal
    • /
    • v.39 no.4
    • /
    • pp.525-534
    • /
    • 2017
  • A hardware-acceleration architecture that separates virtual network functions (VNFs) and network control (called HSN) is proposed to solve the mismatch between the simple flow steering requirements and strong packet processing abilities of software-defined networking (SDN) forwarding elements (FEs) in SDN/network function virtualization (NFV) architecture, while improving the efficiency of NFV infrastructure and the performance of network-intensive functions. HSN makes full use of FEs and accelerates VNFs through two mechanisms: (1) separation of traffic steering and packet processing in the FEs; (2) separation of SDN and NFV control in the FEs. Our HSN prototype, built on NetFPGA-10G, demonstrates that the processing performance can be greatly improved with only a small modification of the traditional SDN/NFV architecture.

Design and Implementation of Unified Hardware for 128-Bit Block Ciphers ARIA and AES

  • Koo, Bon-Seok;Ryu, Gwon-Ho;Chang, Tae-Joo;Lee, Sang-Jin
    • ETRI Journal
    • /
    • v.29 no.6
    • /
    • pp.820-822
    • /
    • 2007
  • ARIA and the Advanced Encryption Standard (AES) are next generation standard block cipher algorithms of Korea and the US, respectively. This letter presents an area-efficient unified hardware architecture of ARIA and AES. Both algorithms have 128-bit substitution permutation network (SPN) structures, and their substitution and permutation layers could be efficiently merged. Therefore, we propose a 128-bit processor architecture with resource sharing, which is capable of processing ARIA and AES. This is the first architecture which supports both algorithms. Furthermore, it requires only 19,056 logic gates and encrypts data at 720 Mbps and 1,047 Mbps for ARIA and AES, respectively.

  • PDF

Efficient Algorithm and Architecture for Elliptic Curve Cryptographic Processor

  • Nguyen, Tuy Tan;Lee, Hanho
    • JSTS:Journal of Semiconductor Technology and Science
    • /
    • v.16 no.1
    • /
    • pp.118-125
    • /
    • 2016
  • This paper presents a new high-efficient algorithm and architecture for an elliptic curve cryptographic processor. To reduce the computational complexity, novel modified Lopez-Dahab scalar point multiplication and left-to-right algorithms are proposed for point multiplication operation. Moreover, bit-serial Galois-field multiplication is used in order to decrease hardware complexity. The field multiplication operations are performed in parallel to improve system latency. As a result, our approach can reduce hardware costs, while the total time required for point multiplication is kept to a reasonable amount. The results on a Xilinx Virtex-5, Virtex-7 FPGAs and VLSI implementation show that the proposed architecture has less hardware complexity, number of clock cycles and higher efficiency than the previous works.

KAIST ARM의 고속동작제어를 위한 하드웨어 좌표변환기의 개발

  • 박서욱;오준호
    • Proceedings of the Korean Society of Precision Engineering Conference
    • /
    • 1992.04a
    • /
    • pp.127-132
    • /
    • 1992
  • To relize the future intelligent robot the development of a special-purpose processor for a coordinate transformation is evidently challenging task. In this case the complexity of a hardware architecture strongly depends on the adopted algorithm. In this paper we have used an inverse kinemetics algorithm based on incremental unit computation method. This method considers the 3-axis articulated robot as the combination of two types of a 2-axis robot: polar robot and 2-axis planar articulated one. For each robot incremental units in the joint and Cartesian spaces are defined. With this approach the calculation of the inverse Jacobian matrix can be realized through a simple combinational logic gate. Futhermore, the incremental computation of the DDA integrator can be used to solve the direct kinematics. We have also designed a hardware architecture to implement the proposed algorithm. The architecture consists of serveral simple unitsl. The operative unit comprises several basic operators and simple data path with a small bit-length. The hardware architecture is realized byusing the EPLD. For the straight-line motion of the KAIST arm we have obtained maximum end effector's speed of 12.6 m/sec by adopting system clock of 8 MHz.

Balancing Speed, Precision, and Flexibility

  • Tanaka, Yoke
    • Proceedings of the Korean Institute of Intelligent Systems Conference
    • /
    • 1993.06a
    • /
    • pp.937-940
    • /
    • 1993
  • A new hardware architecture achieves high speed, high precision fuzzy inference capabilities while maintaining Flexibility on par with software approaches. This flexibility allows unmodified, uncompromised porting of fuzzy system designs into hardware. The architecture is also scalable and offers data resolutions from 8 bits to 32 bits.

  • PDF

Energy Efficient and Low-Cost Server Architecture for Hadoop Storage Appliance

  • Choi, Do Young;Oh, Jung Hwan;Kim, Ji Kwang;Lee, Seung Eun
    • KSII Transactions on Internet and Information Systems (TIIS)
    • /
    • v.14 no.12
    • /
    • pp.4648-4663
    • /
    • 2020
  • This paper proposes the Lempel-Ziv 4(LZ4) compression accelerator optimized for scale-out servers in data centers. In order to reduce CPU loads caused by compression, we propose an accelerator solution and implement the accelerator on an Field Programmable Gate Array(FPGA) as heterogeneous computing. The LZ4 compression hardware accelerator is a fully pipelined architecture and applies 16 dictionaries to enhance the parallelism for high throughput compressor. Our hardware accelerator is based on the 20-stage pipeline and dictionary architecture, highly customized to LZ4 compression algorithm and parallel hardware implementation. Proposing dictionary architecture allows achieving high throughput by comparing input sequences in multiple dictionaries simultaneously compared to a single dictionary. The experimental results provide the high throughput with intensively optimized in the FPGA. Additionally, we compare our implementation to CPU implementation results of LZ4 to provide insights on FPGA-based data centers. The proposed accelerator achieves the compression throughput of 639MB/s with fine parallelism to be deployed into scale-out servers. This approach enables the low power Intel Atom processor to realize the Hadoop storage along with the compression accelerator.

Training-Free Hardware-Aware Neural Architecture Search with Reinforcement Learning

  • Tran, Linh Tam;Bae, Sung-Ho
    • Journal of Broadcast Engineering
    • /
    • v.26 no.7
    • /
    • pp.855-861
    • /
    • 2021
  • Neural Architecture Search (NAS) is cutting-edge technology in the machine learning community. NAS Without Training (NASWOT) recently has been proposed to tackle the high demand of computational resources in NAS by leveraging some indicators to predict the performance of architectures before training. The advantage of these indicators is that they do not require any training. Thus, NASWOT reduces the searching time and computational cost significantly. However, NASWOT only considers high-performing networks which does not guarantee a fast inference speed on hardware devices. In this paper, we propose a multi objectives reward function, which considers the network's latency and the predicted performance, and incorporate it into the Reinforcement Learning approach to search for the best networks with low latency. Unlike other methods, which use FLOPs to measure the latency that does not reflect the actual latency, we obtain the network's latency from the hardware NAS bench. We conduct extensive experiments on NAS-Bench-201 using CIFAR-10, CIFAR-100, and ImageNet-16-120 datasets, and show that the proposed method is capable of generating the best network under latency constrained without training subnetworks.