# A Study of Modern FPGA Characteristics Using Adder and MIPS CPU Case Studies

## 이보선<sup>†</sup> • 서태원<sup>††</sup>

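#### Summary

FPGA-based emulation is an essential step for design verification in ASIC design. To emulate a model designed for an ASIC at the highest possible operating frequency, the characteristics of the FPGA must be understood. This paper ports various adders and a MIPS CPU to several devices from the major FPGA vendors, Xilinx and Altera, and studies the characteristics of modern FPGAs according to design complexity. The experiments show that, contrary to common belief, an RCA designed from 1-bit adders fails to use the carry-chain inside the FPGA and consequently performs worse than the other adder types. The study also confirms clear differences in FPGA characteristics between Xilinx and Altera. Specifically, prefix adders designed for speed showed a poor operating frequency when ported to Xilinx devices, but performance similar to the IP core on Altera devices. This suggests that, on Altera devices, a design optimized for ASIC can be used as-is without affecting emulation performance, provided the FPGA area permits. The experiments with the MIPS CPU support this.

**Keywords**: Emulation, FPGA, Xilinx, Altera, ASIC, MIPS CPU, Adders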

## Towards Characterization of Modern FPGAs: A Case Study with Adders and MIPS CPU

Boseon Lee<sup>†</sup> · Taewon Suh<sup>††</sup>

#### ABSTRACT

FPGA-based emulation is an essential step in ASIC design validation. To emulate at the maximal frequency, it is crucial to understand the FPGA characteristics. This paper analyzes the performance characteristics of modern FPGAs from two renowned vendors, Xilinx and Altera, with a case study using various adders and a MIPS CPU. Contrary to common wisdom, a ripple-carry adder (RCA) does not utilize the inherent carry-chain inside FPGAs when it is structurally designed from 1-bit adders. Thus, such an RCA shows inferior performance to the other types of adders on FPGAs. Our study also reveals that FPGAs from Xilinx exhibit different characteristics from those from Altera. That is, the prefix adder, which is optimized for speed in ASIC design, shows poor performance on Xilinx devices, whereas it provides a speed comparable to the IP core on Altera devices. This suggests that error-prone manual changes to the original design can be avoided on Altera devices if area permits. Experiments with a MIPS CPU confirm these arguments.

**Keywords**: Emulation, FPGA, Xilinx, Altera, ASIC, MIPS CPU, Adders

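<sup>†</sup> Regular member: Integrated M.S./Ph.D. program, Department of Computer Education, Korea University

<sup>††</sup> Life member: Associate Professor, Department of Computer Education, Korea University (corresponding author)

Received: February 18, 2013; Review completed: May 5, 2013; Accepted: May 27, 2013

\* This research was supported by a 2013 special research grant from the College of Education, Korea University.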

#### 1. Introduction

FPGA-based emulation is an essential step for verification in the hardware design process. It allows a much bigger window in time compared to RTL simulations because the design is ported and downloaded to physical devices. Nevertheless, the operating frequency of emulation is still far lower than that of custom ICs such as Application-Specific Integrated Circuits (ASICs) [1, 2]. For example, a recent work [1] reports that Intel's Nehalem core operates at 520 kHz when ported to FPGAs, whereas the off-the-shelf Nehalem core runs at higher than 2 GHz. For a design of modest complexity, the FPGA-synthesizable design usually reaches tens of MHz with proper optimization and partitioning, making it possible to use it as a software development vehicle before silicon becomes available [1]. The inferior operating frequency is the inevitable consequence of regular logic at fixed locations inside FPGAs, and it forces the target hardware system to be scaled down in the verification phase, which incurs a separate engineering effort.

To enhance the emulation performance, FPGA vendors provide IP cores, which are considered to deliver the best performance since they are optimized for the vendors' devices. However, using them often requires error-prone manual changes and replacements in the original design due to incompatible I/O ports and/or an insufficient number of ports. For example, the reorder buffer in out-of-order machines typically demands highly ported memories, whereas BRAM on Xilinx FPGAs provides at most two read and write ports. Meeting the required number of input and output ports therefore requires manual duplication of BRAMs with specialized logic in the design [1].

To make full use of the FPGAs' maximal performance, it is crucial to understand the devices' performance characteristics depending on the coding diversity of the target hardware design. This paper attempts to address this issue with a case study of adders and a CPU on a range of devices from renowned FPGA vendors, Xilinx and Altera. Contrary to the common wisdom that the ripple-carry adder (RCA) provides high performance on FPGAs due to the inherent carry-chain utilization inside FPGAs [3], the structural description of the RCA does not take advantage of the carry-chain, falling far short of the expected performance. We have also discovered that Xilinx FPGAs exhibit a completely different characteristic from Altera devices: on Altera devices, an ASIC design optimized for speed provided performance comparable to the IP core, which is not the case on Xilinx devices. It implies that error-prone manual modification of the original design can be avoided on Altera devices if area permits.

#### 2. Related Works

There are prior works [3–6] on the performance evaluation of adders on FPGAs. However, most of them [3–5] reported the adders' performance only on specific Xilinx FPGA devices. They also lack experiments on, and discussion of, how the performance depends on the way the RCA is described.

Hoe et al. [3] studied the prefix adders' performance on a Xilinx Spartan-3E FPGA. Using both timing simulation and actual measurement, the study reported that parallel prefix adders are not as effective as the simple RCA on the FPGA, because the fast carry-chain inside FPGAs optimizes the RCA performance. Xing and Yu [4] focused on the now-outdated Xilinx 4000 FPGAs and proposed timing models and optimization schemes for carry-skip and carry-select adders. Vitoroulis and Al-Khalili [5] also studied parallel prefix adders' performance, focusing on a Xilinx Virtex 5 FPGA. The work reported the area requirements and critical path delay on the Virtex 5 for a variety of prefix adders, and showed that the algorithmic superiority of certain adders was lost due to the software tool optimizations. In particular, the study reported that the simple RCA performs faster than the prefix adders by using specialized resources in FPGAs.

<Fig. 1> 5-stage MIPS pipeline

This paper extends our previous work [6], and differs from the related works in that we report diverse adders' performance on a range of low-end to high-end FPGAs from Xilinx and Altera. It also reports a new finding on the RCA's performance on FPGAs according to its description. We discuss its implications for the FPGA-based emulation, and the experiment with MIPS CPU confirms our arguments.

#### 3. Experiments and Evaluation

We have experimented with three types of adders and a MIPS CPU on FPGA devices. Table 1 shows the experimented FPGAs and their capacities. The three types of adders are the RCA, the 4-bit based carry-lookahead adder (CLA), and prefix adders. The widths of the experimented adders are 32-bit, 64-bit and 128-bit. The RCA was designed in two ways, denoted as RCA+ and RCA-s. RCA+ refers to a design with a + operator in Verilog. RCA-s is a structural design built from instantiations of 1-bit adders. The CLA and prefix adders were structurally described in the design according to their algorithms. The experimented prefix adders include the Brent-Kung (BK), Kogge-Stone (KS), Han-Carlson, Ladner-Fischer and Sklansky adders. This paper only reports BK and KS results because the other prefix adders show similar characteristics on the experimented FPGAs. We have also used a MIPS CPU for experiments with a larger-scale design. The original MIPS design is based on a 32-bit 5-stage pipeline, as shown in Fig. 1. To quantify the impact of adders of various widths in MIPS, the EX stage in Fig. 1 was implemented in 32-bit, 64-bit, and 128-bit versions using the aforementioned adders. Xilinx ISE 13.4 and Altera Quartus-II 12.0 were used to synthesize, place and route the design, and to measure the critical path delay. All the experiments were performed with speed options turned on in the tools.
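The difference between the two RCA descriptions can be sketched with a small behavioral model. The following is only an illustration in Python, not the Verilog used in the experiments: the function names `full_adder`, `rca_structural` and `rca_operator` are hypothetical, and the operands are assumed to be non-negative integers that fit in `width` bits. Both functions compute the same result; they differ only in how the addition is expressed, which is the property that the synthesis tools react to.

```python
def full_adder(a, b, cin):
    """1-bit full adder: returns (sum_bit, carry_out)."""
    s = a ^ b ^ cin
    cout = (a & b) | (cin & (a ^ b))
    return s, cout


def rca_structural(a, b, width):
    """RCA-s style: chain `width` 1-bit full adders; the carry ripples from LSB to MSB."""
    carry = 0
    result = 0
    for i in range(width):
        s, carry = full_adder((a >> i) & 1, (b >> i) & 1, carry)
        result |= s << i
    return result, carry


def rca_operator(a, b, width):
    """RCA+ style: one + expression, split into a `width`-bit sum and a carry-out."""
    total = a + b
    return total & ((1 << width) - 1), total >> width


if __name__ == "__main__":
    for width in (32, 64, 128):
        a = (1 << width) - 1  # all ones
        b = 1
        assert rca_structural(a, b, width) == rca_operator(a, b, width)
    print("RCA-s and RCA+ models agree")
```

Functionally the two descriptions are interchangeable, so any performance gap between them comes from how each one is mapped onto the device, not from the arithmetic itself.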

| Altera device [7] | Semiconductor technology | LEs (Logic Elements) | Total RAM bits | Xilinx device [8] | Semiconductor technology | CLBs (Configurable Logic Blocks) | Block RAM bits |
|-------------------|--------------------------|----------------------|----------------|-------------------|--------------------------|----------------------------------|----------------|
| Arria II GZ       | 40nm                     | 348,500              | 20,772 k       | Kintex 7          | 28nm                     | 10,250                           | 4,860 k        |
| Stratix III       | 60nm                     | 337,500              | 18,381 k       | Virtex 5          | 65nm                     | 3,120                            | 936 k          |
| Cyclone II        | 90nm                     | 33,216               | 484 k          | Spartan 3E        | 90nm                     | 2,168                            | 504 k          |

<Table 1> Experimented FPGAs from Xilinx and Altera

Fig. 2 shows the critical path delay of the experimented adders according to the bit widths when synthesized with the TSMC 90nm technology. As anticipated, the RCA shows the worst performance in ASIC due to the critical path delay of the carry-chain described in RTL. The CLA follows the RCA, and the prefix adders provide the best performance.
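The ASIC ordering can be motivated with a rough count of logic levels on the carry path. The sketch below assumes an idealized unit-delay model (every full-adder stage or prefix level costs one unit, with wiring and fan-out ignored), which is an assumption for illustration and not the timing model behind Fig. 2. The function names are hypothetical; the level counts used are the standard ones for these structures, namely n ripple stages for an n-bit RCA, log2(n) prefix levels for Kogge-Stone, and 2·log2(n) − 1 for Brent-Kung.

```python
import math


def rca_carry_stages(width):
    """Carry ripples through one full-adder stage per bit."""
    return width


def kogge_stone_levels(width):
    """Kogge-Stone prefix tree depth: log2(width) levels."""
    return math.ceil(math.log2(width))


def brent_kung_levels(width):
    """Brent-Kung prefix tree depth: 2*log2(width) - 1 levels."""
    return 2 * math.ceil(math.log2(width)) - 1


def depth_table(widths):
    """Return (width, RCA stages, BK levels, KS levels) for each width."""
    return [
        (w, rca_carry_stages(w), brent_kung_levels(w), kogge_stone_levels(w))
        for w in widths
    ]


if __name__ == "__main__":
    for width, rca, bk, ks in depth_table((32, 64, 128)):
        print(f"{width:>3}-bit: RCA={rca:>3}  BK={bk:>2}  KS={ks}")
```

Under this model the RCA depth grows linearly with the width while the prefix adders grow logarithmically. The model says nothing about FPGAs, where dedicated carry logic and routing change the picture.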

Fig. 3 shows the experiment results on FPGA devices. RCA+ and RCA-s clearly differ in performance: a simple description with a + operator utilizes the carry-chain inside FPGAs, as shown in Fig. 4(a). However, the structural design based on 1-bit adders does not utilize the inherent carry-chain and incurs the worst delay, especially on Altera devices. The trend is most clearly visible with 128-bit adders. It implies that FPGA-based emulation favors a simpler description over the structural model. Given that ASIC design favors the structural model for reusability and regularity, it could involve error-prone manual modification of the original design if high emulation performance is demanded.

Prefix adders are optimized for speed and widely used when performance is required in ASIC design. Nonetheless, they are known to provide the worst performance among various adders on FPGAs, since the regular structure of the FPGA fabric does not favor their algorithmic implementation [1, 3]. That is, prefix adders are not designed to utilize the inherent carry-chain on FPGAs. Our experiments show that, contrary to common wisdom, prefix adders provide performance comparable to RCA+ and the IP core on Altera devices. Xilinx devices exhibit a completely different trend from Altera devices, as shown in Fig. 3: prefix adders report the worst performance among the three types of adders on the Xilinx devices. It hints that a speed-optimized design for ASIC is fairly well accommodated on Altera devices, whereas it should be modified on Xilinx devices if the emulation performance matters.
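To make the prefix structure concrete, the following is a minimal bit-parallel sketch of a Kogge-Stone adder in Python. It is an illustrative model, not the structural Verilog used in the experiments; the function name `kogge_stone_add` is hypothetical, and the operands are assumed to be non-negative integers that fit in `width` bits, with no carry-in. Each loop iteration corresponds to one prefix level that combines generate and propagate signals over twice the previous span, so the carries are formed by a tree of logic and never pass through a bit-by-bit chain.

```python
def kogge_stone_add(a, b, width):
    """Kogge-Stone parallel prefix addition: returns (sum, carry_out)."""
    mask = (1 << width) - 1
    g = a & b & mask           # per-bit generate
    p = (a ^ b) & mask         # per-bit propagate
    half_sum = p               # kept for the final sum computation

    dist = 1
    while dist < width:
        # One prefix level: combine each group with the group `dist` bits below.
        g = (g | (p & (g << dist))) & mask
        p = (p & (p << dist)) & mask
        dist <<= 1

    # After the last level, bit i of g is the carry out of bits [i:0],
    # so the carry into bit i is bit i-1 of g.
    carries = (g << 1) & mask
    total = (half_sum ^ carries) & mask
    carry_out = (g >> (width - 1)) & 1
    return total, carry_out


if __name__ == "__main__":
    width = 32
    a, b = 0x7FFFFFFF, 0x00000001
    total, carry_out = kogge_stone_add(a, b, width)
    assert total == (a + b) & ((1 << width) - 1)
    print(hex(total), carry_out)
```

The wide fan-in of each level is what makes the structure fast in ASIC, and it is also why the structure has no natural mapping onto a dedicated carry-chain.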

<Fig. 3> Adder delays on FPGAs according to adder widths

<Fig. 4> Implementation difference between RCA+ and RCA-s on Xilinx Spartan3E. The white line shows the critical path.

IP cores created with CoreGen and Megafunction from the FPGA vendors utilize the carry-chain inside FPGAs and provide superior performance to the other types of adders, as shown in Fig. 3. However, IP cores tend to provide only the basic functionality, and the insufficient number and/or lack of input and output ports in IP cores requires error-prone manual modification and validation of the ASIC-targeted original design [1]. For example, the IP adder created with CoreGen from Xilinx does not provide the overflow output that is used to detect signed underflow and overflow conditions in the ALU design. Thus, a structural description of the adder with the overflow output should be considered instead when targeting Xilinx FPGAs.
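The overflow output mentioned above can be derived from the operand and result sign bits. The sketch below is a behavioral Python model under stated assumptions: the function name `add_with_overflow` is hypothetical, operands are passed as unsigned `width`-bit patterns interpreted as two's complement, and the rule used is the standard one (signed overflow occurs when both operands have the same sign and the result has the opposite sign). It is not the ALU logic of the MIPS design in this paper.

```python
def add_with_overflow(a, b, width):
    """Two's-complement addition: returns (result, carry_out, signed_overflow)."""
    mask = (1 << width) - 1
    sign = 1 << (width - 1)
    a &= mask
    b &= mask

    total = a + b
    result = total & mask
    carry_out = total >> width

    # Overflow: operand signs are equal, and the result sign differs from them.
    overflow = bool(~(a ^ b) & (a ^ result) & sign)
    return result, carry_out, overflow


if __name__ == "__main__":
    # Largest positive 32-bit value plus one wraps to the most negative value.
    print(add_with_overflow(0x7FFFFFFF, 0x00000001, 32))
    # -1 + 1 produces a carry-out but no signed overflow.
    print(add_with_overflow(0xFFFFFFFF, 0x00000001, 32))
```

The second case shows why the carry-out alone is not a substitute for a dedicated overflow signal: the two flags can disagree.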

Fig. 5 and Fig. 6 report the results of the experiments performed with the 32-bit and 128-bit MIPS CPU, respectively. The right side of Fig. 5 and Fig. 6 shows the EX stage delay of the MIPS pipeline according to the adder types. The EX stage is composed of various components such as multiplexers and the ALU. The adder is a central building block in the ALU and accounts for most of the critical path delay. The EX stage delay shows similar trends to the adder delays reported on the left side of Fig. 5 and Fig. 6. It confirms that the different adder descriptions have a similar influence on the performance of a larger-scale design on FPGAs. The fastest adder, the IP adder, accounts for 63% of the EX stage delay on Altera devices. The slowest adder, RCA-s, accounts for 96% of the EX stage delay. On Xilinx devices, the IP adder and RCA-s account for 45% to 77% of the EX stage delay.
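The adder's share of the EX stage delay bounds how much a faster adder can help the stage as a whole. The following Python sketch applies a simple proportional model: the function name `stage_delay_after_swap` is hypothetical, the share values are the ones quoted above, and the speedup factor of 2 is an arbitrary illustrative number, not a measured result. The model assumes that the rest of the EX stage delay is unchanged when the adder is replaced.

```python
def stage_delay_after_swap(adder_share, adder_speedup):
    """Relative EX stage delay (original = 1.0) after the adder becomes faster.

    adder_share:   fraction of the stage delay spent in the adder (0..1)
    adder_speedup: factor by which the adder delay is reduced (> 0)
    """
    if not 0.0 <= adder_share <= 1.0:
        raise ValueError("adder_share must be between 0 and 1")
    if adder_speedup <= 0.0:
        raise ValueError("adder_speedup must be positive")
    return (1.0 - adder_share) + adder_share / adder_speedup


if __name__ == "__main__":
    # Shares reported for Altera devices; the 2x speedup is hypothetical.
    for label, share in (("RCA-s", 0.96), ("IP adder", 0.63)):
        new_delay = stage_delay_after_swap(share, 2.0)
        print(f"{label}: stage delay becomes {new_delay:.2f} of the original")
```

The larger the adder's share, the more directly an adder improvement carries over to the stage delay, which is consistent with the EX stage following the adder trends.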



<Fig. 5> 32-bit Adders and EX-stage delay in MIPS



<Fig. 6> 128-bit Adders and EX-stage delay in MIPS

#### 4. Discussion

This study makes the following contributions and suggestions. First, the performance of RCAs shows a stark contrast on FPGAs depending on the description (RCA+ and RCA-s). It implies that FPGA-based emulation favors a simpler description over the structural model. Thus, manual change and/or modification of the original design is required if the emulation performance matters. Second, FPGAs from Xilinx and Altera exhibit different characteristics, which should be taken into account when validating an ASIC design with FPGAs. The Altera devices accommodate the speed-optimized design fairly well, whereas the Xilinx devices favor a design that takes advantage of their regular fabric structure.

#### 5. Conclusion

To quantify the performance characteristics of modern FPGAs, our case study used various adders and a MIPS CPU with a range of devices from Xilinx and Altera. Contrary to common wisdom, the RCA does not always guarantee the best performance among various adders on FPGAs. RCA-s does not utilize the inherent carry-chain inside the FPGA fabric, resulting in the worst performance, especially on Altera devices. We also found that the devices from Xilinx and Altera showed completely different characteristics with the prefix adders. On the Altera devices, the prefix adders surprisingly show performance comparable to the IP core, whereas they report the worst performance among the experimented adders on Xilinx devices. The experiment with the MIPS CPU reports a similar trend, confirming that the different adder descriptions have a similar influence on the performance of a larger and more complex design on FPGAs.

Since the manufacturing cost of an ASIC hardly allows a flaw in the hardware design, FPGA-based emulation is an inevitable step for system-level validation. However, the regular FPGA fabric often demands error-prone modification and/or change of the original ASIC design, due in part to the inferior operating frequency and insufficient input/output ports. Our study attempted to analyze the performance characteristics of modern FPGAs depending on the coding diversity. It also hints that error-prone replacement and design change with IP cores can be avoided on modern Altera FPGAs if area permits.

#### References

- [1] Schelle, G., Collins, J., Schuchman, E., Wang, P., Zou, X., Chinya, G., Plate, R., Mattner, T., Olbrich, F., Hammarlund, P., Singhal, R., Brayton, J., Steibl, S., & Wang, H. (2010). Intel Nehalem Processor Core Made FPGA Synthesizable. Proc. 18th ACM/SIGDA Int. Sym. on Field Programmable Gate Arrays, Monterey, California, USA, 3-12.
- [2] Kim, M., Kong, J., Suh, T., & Chung, S. W. (2011). Latch-based FPGA Emulation Method for Design Verification: A Case Study with a Microprocessor. IET Electronics Letters, 532-533.
- [3] Hoe, D., Martinez, C., & Vundavalli, S. J. (2011). Design and Characterization of Parallel Prefix Adders using FPGAs. Proc. IEEE 43rd Southeastern Symposium on System Theory, 168-172.
- [4] Xing, S., & Yu, W. (1998). FPGA Adders: Performance Evaluation and Optimal Design. IEEE Design & Test of Computers, 15(1), 24-29.
- [5] Vitoroulis, K., & Al-Khalili, A. J. (2007). Performance of Parallel Prefix Adders Implemented with FPGA Technology. Proc. IEEE Northeast Workshop on Circuits and Systems, 498-501.
- [6] Lee, S., Lee, B., & Suh, T. (2011). FPGA Performance Evaluation According to HDL Coding Style. Proc. 36th Conf. of the KIPS, 62-65.
- [7] Altera Corporation, http://www.altera.com/
- [8] Xilinx Corporation, http://www.xilinx.com/



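#### Boseon Lee (이보선)

- 2011: B.S., Department of Electronic Engineering, Chonbuk National University
- 2011~present: Integrated M.S./Ph.D. program, Department of Computer Education, Korea University
- Research interests: embedded systems, computer architecture

E-Mail: l2bs@korea.ac.kr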



#### Taewon Suh (서태원)

- […]: B.S., Department of Electrical Engineering
- 1995: M.S., Department of Electronic Engineering, Seoul National University
- 1995~1998: Research Engineer, LG Corporate Institute of Technology
- 1998~2001: Senior Research Engineer, Hynix Semiconductor
- 2004: Research Intern, Intel Corporation, CA, USA
- 2005~2006: Research Intern, Intel Corporation, OR, USA
- 2006: Ph.D., Computer Engineering, Georgia Institute of Technology
- 2007~2008: Systems Engineer, Intel Corporation, OR, USA
- 2008~present: Associate Professor, Department of Computer Education, Korea University
- Research interests: embedded systems, computer architecture, multiprocessors, computer education

E-Mail: suhtw@korea.ac.kr