# Low Power High Frequency Design for Data Transfer for RISC and CISC Architecture

Ankur Agarwal\* · A. S. Pandya\* · Young-Uhg Lho\*\*

#### **ABSTRACT**

This paper presents a low-power, high-frequency design of instructions using ad-hoc techniques at the transistor level for full-custom and semi-custom ASIC (Application Specific Integrated Circuit) designs. The proposed design was verified at a high level using Verilog-HDL and simulated in ModelSim for logical correctness. It was then examined at the layout level using LASI with a 0.25 µm technology, and its timing characteristics were analyzed in the Win-spice simulation environment. The results show a reduction of up to 35% in the power consumed by a general-purpose processor such as a RISC or CISC processor. A significant reduction in propagation delay is also observed, which increases the frequency of the fetch and execute cycles of the CPU and thus the overall frequency of operation.

#### Keywords

low power, high speed data transfer, RISC, CISC, SOC, AND-gate, OR-gate

#### I. INTRODUCTION

A processor consumes far less energy running tasks that require only a low supply voltage than it does executing high-performance tasks at a higher frequency. Effective voltage scheduling techniques take advantage of this by using software to vary the supply voltage dynamically, thereby minimizing energy consumption while still meeting timing constraints.
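As a rough illustration of voltage scheduling, the sketch below picks, for each task, the lowest supply voltage whose clock frequency still meets the task's deadline. It assumes that energy per cycle scales with the square of the supply voltage (a common first-order CMOS model). The operating points, cycle count and function names are hypothetical and are not taken from this paper.

```python
# Hypothetical operating points: (supply voltage in V, clock frequency in Hz).
# These values are illustrative only and do not come from the paper.
OPERATING_POINTS = [(1.0, 200e6), (1.8, 500e6), (2.5, 800e6)]


def pick_operating_point(cycles, deadline_s, points=OPERATING_POINTS):
    """Return the lowest-voltage (vdd, freq) pair that runs `cycles` within the deadline."""
    for vdd, freq in sorted(points):
        if cycles / freq <= deadline_s:
            return vdd, freq
    raise ValueError("no operating point meets the deadline")


def relative_energy(cycles, vdd):
    """Energy in arbitrary units, assuming energy per cycle scales with vdd squared."""
    return cycles * vdd ** 2


cycles = 1_000_000
tight = pick_operating_point(cycles, deadline_s=0.0015)  # tight real-time constraint
loose = pick_operating_point(cycles, deadline_s=0.01)    # loose timing constraint

print("tight deadline:", tight, relative_energy(cycles, tight[0]))
print("loose deadline:", loose, relative_energy(cycles, loose[0]))
```

Under these assumed numbers, the task with the loose deadline runs at the lowest voltage and needs only a fraction of the energy that the tight-deadline task consumes at the highest voltage.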
Advances in deep-submicron technologies have enabled system-on-chip (SOC) designs in which a system's entire functionality rests on a single chip. SOCs are embedded in various electronic products, such as portable information terminals, digital audio systems and automobiles. Most of these products are real-time systems with timing constraints. An important consideration in SOC design is minimizing power consumption. Heat due to high power consumption often prevents the realization of high-performance SOCs with high transistor density. Moreover, portable systems require small batteries. Thus the design technology for high-performance SOCs with low energy consumption is an important research issue in real-time system design.

\* Dept. of Computer Science and Engineering, Florida Atlantic University (FAU)
\*\* Dept. of Computer Education, Silla University

The problem is realizing both high-speed computation and low energy consumption. A processor is an integral part of any SOC design, so it can be argued that increasing the performance of the processor yields a higher-performance SOC design. Employing a high-performance processor core may satisfy timing constraints, but it will probably not foster low energy consumption. With a variable-voltage processor, tasks with severe real-time constraints can execute at high supply voltages, and therefore at high execution speeds, while tasks with loose timing constraints can execute at low supply voltages. Reducing the supply voltage leads to a drastic energy reduction because energy consumption in CMOS circuits typically increases quadratically with supply voltage. Energy consumption is the integral of power consumption over time.

This paper presents a unique design for high-speed data transfer. The proposed circuit works at a lower supply voltage without sacrificing the frequency of operation, which addresses the problem of power consumption [1]. Specifically, we propose a high-speed data transfer among the various registers with minimal power consumption.
This is one of the basic operations of any CPU, including RISC and CISC architectures [2-4]. Thus optimizing this basic instruction can improve the performance of the CPU to a great extent. The new data transfer design is implemented with complementary pass transistors [8] at the ASIC level.

A design can be specified at various levels, such as the architectural, layout, process technology and circuit design levels [5]. With a proper choice of technology at each level, the design can be made to work more efficiently [6]. At the circuit design level, this can be achieved by choosing the kind of combinational logic that employs the fewest gates, because all the important parameters governing power dissipation are strongly influenced by the chosen logic style [7]. At the layout level, wiring complexity is an important factor contributing to efficiency [8]. In this paper we have optimized the design at different levels to obtain higher performance in terms of power consumption and frequency of operation, which is a paramount concern [9].

At the architecture level, we have designed a one-hot encoded state machine for the data transfer operation. The resulting architecture is described in Verilog-HDL at the architecture (behavioral) level, which yields a circuit-level diagram. The FPGA implementation of the circuit is optimized for delay at the circuit level. To achieve higher speed and lower power consumption, the circuit-level design is then implemented at the transistor level. At the layout level, a novel transfer gate technology [7] is employed to reduce the number of transistors. It can be shown that power consumption depends directly on the number of transistors in the circuit [1,6,10,11]. The proposed ad-hoc transistor-level design has a higher frequency of operation than the conventional CMOS design of the equivalent system.
At the layout level, the reduction in power consumption is achieved without sacrificing a high frequency of operation.

#### II. PROPOSED DESIGN

The data transfer circuitry has been implemented at several levels. Figure 1 shows the design at the high level.

Fig 1. High Level Diagram of Data Transfer Circuit

#### 2.1 Architecture Level Implementation

Figure 2 shows the architecture-level implementation of the proposed design, which is realized as a Moore machine. The inputs are 'w' and 'clk', along with reset. A high on 'w' signals the start of the data transfer cycle. Once the transfer has started, the value of 'w' does not matter until the end of the cycle, so 'w' is a don't-care thereafter. A high on 'DONE' signals the completion of the data transfer. It takes three clock cycles to transfer data between two registers, and the clock rate can be as high as 800 MHz for the FPGA implementation. The data transfer can be stopped in any state by asserting reset (active-high). Table 1 gives the state table of the state machine in Figure 2.

#### 2.2 Circuit Level Implementation

Figure 3 shows the circuit-level implementation of the Moore machine. This circuit is the synthesis result from Leonardo Spectrum; the synthesis report is included in the appendix. Here 'w' and 'clk' are the same input signals described under the architecture-level implementation. As the figure shows, sixty-eight transistors (6\*6+16\*2=68) are needed for its implementation. Synthesis of the state machine gives an output frequency of 830 MHz. A Virtex-II FPGA was used as the target.

Fig 2. Moore State Diagram for Proposed Circuit

Table 1. State Table for State Machine in Fig.
2

| Present state (y2y1) | Next state Y2Y1 (w=0) | Next state Y2Y1 (w=1) | R1out | R1in | R2out | R2in | R3out | R3in | Done |
|----------------------|-----------------------|-----------------------|-------|------|-------|------|-------|------|------|
| 00 | 00 | 01 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 01 | 10 | 10 | 0 | 0 | 1 | 0 | 0 | 1 | 0 |
| 10 | 11 | 11 | 1 | 0 | 0 | 1 | 0 | 0 | 0 |
| 11 | 00 | 00 | 0 | 1 | 0 | 0 | 1 | 0 | 1 |

Fig 3. Circuit Level Diagram for Moore Machine for the Proposed Circuit

#### 2.3 Transistor Level Implementation

Figures 4 and 5 show the layouts for the transistor-level implementation of the AND and OR gates, respectively; the D-FF is implemented at the same level. The layouts show that these circuits are based on transfer gate technology [12-14]. The behavior of the AND and OR gates is shown in Figures 6 and 7, respectively. The Winspice analysis confirms the behavior of the complete circuit. Note that there is a sudden spike in the output waveform, as shown in the Winspice result in Figure 8. The spike arises because logic 1 is first passed through the nMOS transistor and the path is then switched immediately to the pMOS transistor. It can be avoided by connecting a buffer at the output, which restores the logic level [15,16].

Fig 4. Layout of Proposed Design of AND-gate

Fig 5. Layout of Proposed Design of OR-gate

Fig 6. Winspice Analysis of Proposed Design of AND-gate

Fig 7. Winspice Analysis of Proposed Design of OR-gate

#### III. ANALYSIS OF POWER CONSUMED IN THE PROPOSED DESIGNS

The main reason for implementing the majority of contemporary high-complexity designs in static CMOS is the almost complete absence of power consumption in steady-state mode. According to equation (1) [1,17,18], the average power dissipation $P_v$ in digital CMOS circuits includes four components.
$$P_{v} = P_{\text{short circuit}} + P_{\text{switching}} + P_{\text{leakage}} + P_{\text{static}}$$ (1)

It follows that the dynamic power dissipation of digital CMOS circuits depends on the clock frequency, the transition activity, the node capacitance, the short-circuit current and the supply voltage $V_{DD}$ [18,19].

#### 3.1 Dynamic Consumption Due to Load Capacitance

Nodes in a digital circuit toggle between the two logic states, '0' and '1'. During the transition from one state to the other, node capacitances must be charged and discharged. The current passing through either the p-channel or the n-channel transistor while charging or discharging the node capacitances causes the capacitive component $P_{cap}$ of the total power consumed. $P_{cap}$ is given by equation (2) [12].

$$P_{\text{switching}} = P_{\text{cap}} = \alpha \cdot C_L \cdot V_{DD}^2 \cdot f_{\text{clk}}$$ (2)

Here $f_{clk}$ is the clock frequency, $V_{DD}$ is the supply voltage, $C_L$ is the load capacitance and $\alpha$ is the transition activity. The more state transitions occur, the higher the power dissipation. In the proposed design, the reduced transistor count lowers the transition activity by 60%, which accounts for the reduction in power consumed.

The capacitive load originates from the capacitance between the gate and diffusion and from the interconnecting metal and polysilicon layers in the layouts shown. It can be reduced substantially by employing fewer transistors in the design and by reducing the transistor size, i.e. the length and width of the channel, to the minimum possible. As noted above, the capacitive switching power also has a square-law dependence on the supply voltage. Therefore, reducing the supply voltage is an effective means of reducing $P_{cap}$.
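To make the dependencies in equation (2) concrete, the following sketch evaluates the switching power of a single node. The numeric values (transition activity, load capacitance, supply voltage and clock frequency) are hypothetical and chosen only for illustration; they are not measurements from the proposed design.

```python
def switching_power(alpha, c_load, vdd, f_clk):
    """Capacitive switching power per equation (2): alpha * C_L * VDD^2 * f_clk, in watts."""
    return alpha * c_load * vdd ** 2 * f_clk


# Hypothetical node parameters, chosen only for illustration.
ALPHA = 0.5        # transition activity
C_LOAD = 10e-15    # load capacitance in farads (10 fF)
VDD = 2.5          # supply voltage in volts
F_CLK = 800e6      # clock frequency in hertz

baseline = switching_power(ALPHA, C_LOAD, VDD, F_CLK)
half_vdd = switching_power(ALPHA, C_LOAD, VDD / 2, F_CLK)
less_activity = switching_power(ALPHA * 0.4, C_LOAD, VDD, F_CLK)  # 60% lower activity

print(f"baseline:            {baseline * 1e6:.2f} uW")
print(f"half supply voltage: {half_vdd / baseline:.2f} x baseline")
print(f"60% less activity:   {less_activity / baseline:.2f} x baseline")
```

With the other parameters held fixed, halving the supply voltage cuts the switching power to one quarter, whereas a 60% reduction in transition activity cuts it to 40% of the baseline. This reflects the quadratic and linear terms of equation (2).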
In many cases, supply voltage reduction and speed-up design techniques go along with a reduction of the clock frequency $f_{clk}$, which reduces the capacitive switching power even further.

#### 3.2 Leakage Power

Ideally, digital CMOS circuits should not exhibit any static power consumption at all. However, due to the non-ideal subthreshold behavior of real MOSFETs, a leakage current $I_{leak}$ flows from the positive power supply to ground even in the static case. This results in the leakage power $P_{leak}$, which is given by equation (3) [14].

$$P_{leak} = I_{leak} \cdot V_{DD}$$ (3)

Here, reducing the required supply voltage and the number of transistors contributes significantly to reducing the power requirement.

#### 3.3 Short Circuit Power

The short circuit power $P_{short}$ is expressed by equation (4) [18]:

$$P_{\text{short}} = (1/2) \cdot (t_r \cdot I_{\text{short,max,r}} + t_f \cdot I_{\text{short,max,f}}) \cdot V_{DD} \cdot f_{\text{clk}}$$ (4)

where $I_{\text{short,max,r}}$ and $I_{\text{short,max,f}}$ are the peaks of the currents flowing from the positive power supply to ground when the n- and p-channel transistors conduct simultaneously for a very short moment during a node transition, and $t_r$ and $t_f$ are the rise and fall times of the node voltages. The short circuit power decreases with decreasing switching activity $\alpha$ and decreasing clock frequency $f_{clk}$. However, the clock frequency is usually regarded as constant in order to fulfill architectural requirements, so a reduction in short circuit power is achieved by reducing the number of transistors and thereby the switching activity.

In all cases the power consumed depends directly on $V_{DD}$, i.e. the supply voltage. Because the proposed design has no explicit power supply connection, the dynamic power consumed is theoretically negligible, and static power consumption accounts for the total power consumed by the device.

#### IV. CONCLUSION

According to equations (1), (2), (3) and (4), power consumption depends mainly on $V_{DD}$, the number of transistors and the switching frequency. When there is no explicit $V_{DD}$ connection, the short circuit power consumption and the leakage power consumption are reduced substantially. There is still static power consumption, but the switching power consumption is negligible. The proposed data transfer circuitry also employs fewer transistors than the standard design. The synthesis results show that the data transfer rate is about 830 MHz after specifying the constraints; this could be increased further by the use of local and global constraints. The Winspice simulation results confirm the validity of the proposed design at the transistor level. Further, the data transfer rate at the ASIC level is about 970 MHz. It can be concluded that power is saved, at a high speed of operation, because of the smaller number of transistors employed and the absence of an explicit power supply. This makes the proposed design well suited to devices that require low power.

#### REFERENCES

- [1] A. P. Chandrakasan, S. Sheng and R. W. Brodersen, "Low Power CMOS Digital Design," IEEE Journal of Solid State Circuits, Vol. 27, No. 4, pp. 473-483, April 1992.
- [2] Chip Weems, "http://www.cs.umass.edu/~weems/research\_talks/Real-Time\_RISC/index.htm," Technical Research Presentation, September 1998.
- [3] www.arm.com, "Technical Reference Manual," pp. 35-122, April 2001.
- [4] Charles Severance, http://www.netfact.com/~crs/faculty/ann1996.html, "Beyond RISC - The Post RISC Architecture," MIT Lincoln Labs, May 20, 1996.
- [5] Shin'ichiro Mutoh, Takakuni Douseki, Yasuyuki Matsuya, Takahiro Aoki, Satoshi Shigematsu, Junzo Yamada, "1-V Power Supply High Speed Digital Circuit Technology With Multithreshold Voltage CMOS," IEEE Journal of Solid State Circuits, Vol. 30, No.
8, pp. 847-854, August 1995.
- [6] Ankur Agarwal, Abhijit Pandya, Andres Folleco, "A Novel Low Power Design of an ALU," CCCIT Conference of Microprocessors, July 2003.
- [7] K. Yano et al., "A 3.8 ns CMOS 16 \* 16 Multiplier Using Complementary Pass Transistor Logic," IEEE Journal of Solid State Circuits, Vol. 25, pp. 388-395, April 1990.
- [8] J. Wang, S. Fang and W. Feng, "New Efficient Designs for XOR and XNOR Functions on Transistor Level," IEEE Journal of Solid State Circuits, Vol. 29, No. 7, pp. 780-786, July 1994.
- [9] R. Shalem, E. John, L. K. John, "A Novel Low Power Energy Recovery Full Adder Cell," Proceedings of the IEEE Great Lakes Symposium of VLSI, pp. 380-383, February 1999.
- [10] Nagendra, M. J. Irwin and R. M. Owens, "Area-Time-Power Tradeoff in Parallel Adders," IEEE Circuits and Systems II, Vol. 43, No. 10, pp. 689-702, 1996.
- [11] A. Agarwal, "Low Power Design of an ALU," MS Thesis, Florida Atlantic University, August 2003.
- [12] N. Weste and K. Eshraghian, Principles of CMOS VLSI Design, A System Perspective, MA: Addison-Wesley, 1993.
- [13] K. Yano, "Top Down Pass Transistor Logic Design," IEEE Journal of Solid State Circuits, Vol. 32, No. 7, pp. 1079-1089, 1997.
- [14] R. Zimmermann and W. Fichtner, "Low Power Logic Styles: CMOS Versus Pass-Transistor Logic," IEEE Journal of Solid State Circuits, Vol. 32, pp. 1079-1089, 1997.
- [15] Abhijit Pandya, Ankur Agarwal, P. K. Kim, "Low Power Design of a Neuro processor," Knowledge-Based Intelligent Information and Engineering Systems, Eds. V. Palade, R. J. Howlett and L. Jain, Springer, Berlin, Vol. 2, pp. 856-862, 2003.
- [16] L. J. M. Veendrick, "Short Circuit Dissipation of CMOS Circuitry and its Impact on the Design of the Buffer Circuits," IEEE Journal of Solid State Circuits, Vol. SC-19, pp. 468-473, August 1984.
- [17] A. Al-Sheraidah, "Novel Multiplexer-Based Architecture for Full Adder Design," MS Thesis, Florida Atlantic University, August 2000.
- [18] J. M.
Rabaey, Digital Integrated Circuits, A Design Perspective, Prentice Hall, 1995.
- [19] A. P. Chandrakasan, S. Sheng and R. W. Brodersen, "Low Power CMOS Digital Design," IEEE Journal of Solid State Circuits, Vol. 27, No. 4, pp. 473-483, April 1992.

#### **APPENDIX**

| FPGA Implementation Report | |
|------------------------------------|-----|
| Number of ports: | 10 |
| Number of nets: | 21 |
| Number of instances: | 18 |
| Number of references to this view: | 0 |
| Number of BUFGP: | 1 |
| Number of D-ffs or Latches: | 2 |
| Number of Function Generators: | 5 |
| Number of IBUF: | 2 |
| Number of OBUF: | 7 |
| Number of accumulated instances: | 18 |

#### Clock Frequency Report

| Clock | Frequency |
|-------|------------|
| clk | 2791.9 MHz |

#### Critical Path Report

Critical path #1, (unconstrained path)

| Gate | Delay | Arrival | Load |
|------|-------|---------|------|
| | 0.00 | 0.00 up | 0.10 |
| IBUF | 1.06 | 1.06 up | 0.30 |
| LUT3 | 0.75 | 1.81 up | 0.30 |
| OBUF | 2.29 | 4.10 up | 0.10 |
| | 0.00 | 4.10 up | 0.00 |
| Arrival time | | 4.10 | |

## About the Authors

## Ankur Agarwal

Ankur Agarwal is a Visiting Instructor and a Ph.D. student at the Computer Science and Engineering Department, Florida Atlantic University. He received his M.S. in computer engineering from Florida Atlantic University in 2003. He also holds two postgraduate diplomas, in VLSI design and in real-time embedded system design. He earned his Bachelor of Engineering from Pune University, India, in 2000.

Research Areas: concurrency modeling, system level design, network-on-chip, real-time operating systems and VLSI design.

## A. S. Pandya

A. S. Pandya is a professor at the Computer Science and Engineering Department, Florida Atlantic University.
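He received his undergraduate education at the Indian Institute of Technology, Bombay. He earned his M.S. and Ph.D. in Computer Science from Syracuse University, New York. He has worked as a visiting professor in various countries, including Japan, Korea and India.

Research Areas: VLSI implementable algorithms, applications of AI and image analysis in medicine, financial forecasting using neural networks.

## Young-Uhg Lho (노영욱)

- Feb. 1985: B.S., Dept. of Computer Science and Statistics, Pusan National University
- Feb. 1989: M.S., Dept. of Computer Science and Statistics, Pusan National University
- Feb. 1995: Ph.D., Dept. of Computer Science, Pusan National University
- 1989 to 1996: Researcher, Electronics and Telecommunications Research Institute (ETRI)
- 1996 to present: Professor, Silla University

Research Areas: embedded systems, multimedia systems, parallel and distributed systems, intelligent systems, real-time operating systems, computer education.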