Search | Korea Science

SIMD MAC Unit Design for Multimedia Data Processing (멀티미디어 데이터 처리에 적합한 SIMD MAC 연산기의 설계)

Hong, In-Pyo;Jeong, Woo-Kyong;Jeong Jae-Won;Lee Yong-Surk
- Journal of the Institute of Electronics Engineers of Korea SD
- /
- v.38 no.12
- /
- pp.44-55
- /
- 2001
MAC(Multiply and ACcumulate) is the core operation of multimedia data processing. Because MAC units implemented on traditional DSP units or embedded processors have latency of three cycles and cannot operate on multiple data simultaneously, then, performances are seriously limited. Many high end general purpose microprocessors have SIMD MAC unit as a functional unit. But these high end MAC units must support pipeline structure for various operation modes and high clock frequency, which makes control logic complex and increases chip area. In this paper, a 64bit SIMD MAC unit for embedded processors is designed. It is implemented to have a latency of one clock cycle to remove pipeline control logics and a minimal area overhead for SIMD support is added to existing Booth multipliers.
PDF

A Efficient Architecture of MBA-based Parallel MAC for High-Speed Digital Signal Processing (고속 디지털 신호처리를 위한 MBA기반 병렬 MAC의 효율적인 구조)

서영호;김동욱
- Journal of the Institute of Electronics Engineers of Korea SD
- /
- v.41 no.7
- /
- pp.53-61
- /
- 2004
In this paper, we proposed a new architecture of MAC(Multiplier-Accumulator) to operate high-speed multiplication-accumulation. We used the MBA(Modified radix-4 Booth Algorithm) which is based on the 1's complement number system, and CSA(Carry Save Adder) for addition of the partial products. During the addition of the partial product, the signed numbers with the 1's complement type after Booth encoding are converted in the 2's complement signed number in the CSA tree. Since 2-bit CLA(Carry Look-ahead Adder) was used in adding the lower bits of the partial product, the input bit width of the final adder and whole delay of the critical path were reduced. The proposed MAC was applied into the DWT(Discrete Wavelet Transform) filtering operation for JPEG2000, and it showed the possibility for the practical application. Finally we identified the improved performance according to the comparison with the previous architecture in the aspect of hardware resource and delay.
PDF KSCI

Design and Implementation of a Power Aware Scalable Pipelined Booth Multiply & Accumulate Unit (소비전력 인지형 곱셈 연산 누적기의 설계 및 구현)

Shin, Min-Hyuk;Lee, Han-Ho
- Proceedings of the IEEK Conference
- /
- 2006.06a
- /
- pp.573-574
- /
- 2006
A low-power power-aware scalable pipelined Booth recoded multiply & Accumulate unit (PA-MAC) detects the input operands for their dynamic range and accordingly implements a 16-bit, 8-bit or 4-bit multiplication and accumulation operation. The multiplication mode is determined by the dynamic - range detection unit. For the computations, although an area of the proposed PA-MAC is lager than a non-scalable MAC respectively, the proposed PA-MAC proves to be globally more power efficient than a non-scalable MAC.
PDF

Design of a Floating Point Multiplier for IEEE 754 Single-Precision Operations (IEEE 754 단정도 부동 소수점 연산용 곱셈기 설계)

Lee, Ju-Hun;Chung, Tae-Sang
- Proceedings of the KIEE Conference
- /
- 1999.11c
- /
- pp.778-780
- /
- 1999
Arithmetic unit speed depends strongly on the algorithms employed to realize the basic arithmetic operations.(add, subtract multiply, and divide) and on the logic design. Recent advances in VLSI have increased the feasibility of hardware implementation of floating point arithmetic units and microprocessors require a powerful floating-point processing unit as a standard option. This paper describes the design of floating-point multiplier for IEEE 754-1985 Single-Precision operation. Booth encoding algorithm method to reduce partial products and a Wallace tree of 4-2 CSA is adopted in fraction multiplication part to generate the $32{\times}32$ single-precision product. New scheme of rounding and sticky-bit generation is adopted to reduce area and timing. Also there is a true sign generator in this design. This multiplier have been implemented in a ALTERA FLEX EPF10K70RC240-4.
PDF

A new scheme for VLSI implementation of fast parallel multiplier using 2x2 submultipliers and ture 4:2 compressors with no carry propagation (부분곱의 재정렬과 4:2 변환기법을 이용한 VLSI 고속 병렬 곱셈기의 새로운 구현 방법)

이상구;전영숙
- Journal of the Korean Institute of Telematics and Electronics C
- /
- v.34C no.10
- /
- pp.27-35
- /
- 1997
In this paper, we propose a new scheme for the generation of partial products for VLSI fast parallel multiplier. It adopts a new encoding method which halves the number of partial products using 2x2 submultipliers and rearrangement of primitive partial products. The true 4-input CSA can be achieved with appropriate rearrangement of primitive partial products out of 2x2 submultipliers using the newly proposed theorem on binary number system. A 16bit x 16bit multiplier has been desinged using the proposed method and simulated to prove that the method has comparable speed and area compared to booth's encoding method. Much smaller and faster multiplier could be obtained with far optimization. The proposed scheme can be easily extended to multipliers with inputs of higher resolutions.
PDF

17$\times$17-b Multiplier for 32-bit RISC/DSP Processors (32 비트 RISC/DSP 프로세서를 위한 17 비트 $\times$ 17 비트 곱셈기의 설계)

박종환;문상국;홍종욱;문병인;이용석
- Proceedings of the IEEK Conference
- /
- 1999.06a
- /
- pp.914-917
- /
- 1999
The paper describes a 17 $\times$ 17-b multiplier using the Radix-4 Booth’s algorithm. which is suitable for 32-bit RISC/DSP microprocessors. To minimize design area and achieve improved speed, a 2-stage pipeline structure is adopted to achieve high clock frequency. Each part of circuit is modeled and optimized at the transistor level, verification of functionality and timing is performed using HSPICE simulations. After modeling and validating the circuit at transistor level, we lay it out in a 0.35 ${\mu}{\textrm}{m}$ 1-poly 4-metal CMOS technology and perform LVS test to compare the layout with the schematic. The simulation results show that maximum frequency is 330MHz under worst operating conditions at 55$^{\circ}C$ , 3V, The post simulation after layout results shows 187MHz under worst case conditions. It contains 9, 115 transistors and the area of layout is 0.72mm by 0.97mm.
PDF

An Architecture for $32{\times}32$ bit high speed parallel multiplier ($32{\times}32 $ 비트 고속 병렬 곱셈기 구조)

김영민;조진호
- Journal of the Korean Institute of Telematics and Electronics B
- /
- v.31B no.10
- /
- pp.67-72
- /
- 1994
In this paper we suggest a 32 bit high speed parallel multiplier which plays an important role in digital signal processing. We employ a bit-pair recoding Booth algoritham that gurantees n/2 partial product terms, which uniformly handles the signed-operand case. While partial product terms are generated, a special method is suggested to reduce time delay by employing 1's complement instead of 2's complement. Later when partial products are added, the additional 1 bit's are packed in a single partial product term and added to in the parallel counter. Then 16 partial product terms are reduced to two summands by using successive parallel counters. Final multiplication value is obtained by a BLC adder. When this multiplier is simulated under 0.8$\mu$CMOS standard cell we obtain 30ns multiplier speed.
PDF

Time-Multiplexed FIR Filter Design Using Group CSD(GCSD) Multipliers (Group CSD(GCSD) 곱셈기를 이용한 Time-Multiplexed FIR 필터 설계)

Jeon, Chang-Ha;Seo, Dong-Hyun;Chung, Jin-Gyun;Kim, Yong-Eun;Lee, Chul-Dong
- The Transactions of The Korean Institute of Electrical Engineers
- /
- v.59 no.2
- /
- pp.452-456
- /
- 2010
Multiplication is a fundamental arithmetic operation in many digital signal processing (DSP) and communication algorithms. The group CSD (GCSD) multiplier was recently proposed based on the variation of canonical signed digit (CSD) encoding and partial product sharing. This multiplier provides an efficient design when the multiplications are performed only with a few predetermined coefficients (e.g., FFT). In this paper, it is shown that, by exploiting the characteristics of the filter coefficients, GCSD multipliers can be used for the efficient implementation of time-multiplexed FIR filters.
https://doi.org/10.5370/KIEE.2010.59.2.452 인용 PDF KSCI

New VLSI Architecture of Parallel Multiplier-Accumulator Based on Radix-2 Modified Booth Algorithm (Radix-2 MBA 기반 병렬 MAC의 VLSI 구조)

Seo, Young-Ho;Kim, Dong-Wook
- Journal of the Institute of Electronics Engineers of Korea SD
- /
- v.45 no.4
- /
- pp.94-104
- /
- 2008
In this paper, we propose a new architecture of multiplier-and-accumulator (MAC) for high speed multiplication and accumulation arithmetic. By combining multiplication with accumulation and devising a hybrid type of carry save adder (CSA), the performance was improved. Since the accumulator which has the largest delay in MAC was removed and its function was included into CSA, the overall performance becomes to be elevated. The proposed CSA tree uses 1's complement-based radix-2 modified booth algorithm (MBA) and has the modified array for the sign extension in order to increase the bit density of operands. The CSA propagates the carries by the least significant bits of the partial products and generates the least significant bits in advance for decreasing the number of the input bits of the final adder. Also, the proposed MAC accumulates the intermediate results in the type of sum and carry bits not the output of the final adder for improving the performance by optimizing the efficiency of pipeline scheme. The proposed architecture was synthesized with $250{\mu}m,\;180{\mu}m,\;130{\mu}m$ and 90nm standard CMOS library after designing it. We analyzed the results such as hardware resource, delay, and pipeline which are based on the theoretical and experimental estimation. We used Sakurai's alpha power low for the delay modeling. The proposed MAC has the superior properties to the standard design in many ways and its performance is twice as much than the previous research in the similar clock frequency.
PDF KSCI

JPEG2000 IP Design and Implementation for SoC Design (SoC를 위한 JPEG2000 IP 설계 및 구현)

정재형;한상균;홍성훈;김영철
- Proceedings of the Korean Society of Broadcast Engineers Conference
- /
- 2002.11a
- /
- pp.63-68
- /
- 2002
JPEG2000은 기존의 정지영상압축부호화 방식에 비해 우수한 비트율-왜곡(Rate-Distortion)특성과 향상된 주관적 화질을 제공하며 인터넷, 디지털 영상카메라, 이동단말기, 의학영상 등 다양한 분야에서 적용될 수 있는 새로운 정지영상압축 표준이다. 본 논문에서는 SoC(System on a Chip)설계를 고려한 JPEG2000 인코더의 구조를 제안하고 IP(Intellectual Property)를 설계 및 검증하였다. 구현된 JPEG2000 IP는 DWT(Discrete Wavelet Transform)블록, 스칼라양자화블록, EBCOT(Embedded Block Coding with Optimized Truncation)블록으로 구성되어 있다. IP는 모의실험을 통해 구현 구조에 대한 타당성을 검증하였고, 반도체설계자산연구센터에서 제시한 'RTL Coding Guideline'에 따라 HDL을 설계하였다. 특히, DWT블록은 구현시 많은 연산과 메모리 용량이 필요하므로 영상을 저장할 외부 메모리를 사용하였고, 빠른 곱셈과 덧셈연산을 위한 3단 파이프라인 부스곱셈기(3-state pipeline booth multiplier)와 캐리예측 덧셈기(carry lookahead adder)를 사용하였다. 설계된 JPEG2000 IP들은 삼성 0.35$\mu\textrm{m}$ 라이브러리를 이용하여 Synopsys사 Design Analyzer 틀을 통해 논리 합성하였으며, Xillinx 100만 게이트 FPGA칩에 구현하여 그 동작을 검증하였다. 또한, Hard IP 설계를 위해 Avanti사의 Apollo툴을 이용하여 Layout을 수행하였다.
PDF

Search Result 44, Processing Time 0.029 seconds

이메일무단수집거부

이용약관

제 1 장 총칙

제 2 장 이용계약의 체결

제 3 장 계약 당사자의 의무

제 4 장 서비스의 이용

제 5 장 계약 해지 및 이용 제한

제 6 장 손해배상 및 기타사항

Detail Search

Image Search (β)