IEIE Transactions on Smart Processing and Computing

# A 23.52µW / 0.7V Multi-stage Flip-flop Architecture Steered by a LECTOR-based Gated Clock

## Pritam Bhattacharjee, Alak Majumder, and Bipasha Nath

VLSI Design Laboratory, Department of Electronics & Computer Engineering, National Institute of Technology, Arunachal Pradesh, Yupia, District- Papumpare, Arunachal Pradesh 791112, India

majumder.alak@gmail.com

\* Corresponding Author: Alak Majumder

Received February 8, 2017; Revised March 13, 2017; Accepted April 12, 2017; Published June 30, 2017

- \* Regular Paper
- \* Extended from a Conference: Preliminary results of this paper were presented at the ICEIC 2017. This paper has been accepted by the editorial board through the regular review process that confirms the original contribution.

**Abstract:** Technology development is leading to the invention of more sophisticated electronic appliances that require long battery life. Therefore, saving power is a major concern in current-day scenarios. A notable source of power dissipation in the sequential structures of integrated circuits is the continuous switching of high-frequency clock signals, which do not carry any information; this switching is therefore eliminated by a method called clock gating. In this paper, we have incorporated a recent clock-gating style named Leakage Control Transistor (LECTOR)-based clock gating to drive multi-stage sequential architectures, and we focus on its performance under three different process corners (fast-fast, slow-slow, typical-typical) through Monte Carlo simulation at an 18 GHz clock with 90 nm technology. This gating is found to be one of the best gated approaches for multi-stage architectures in terms of total power consumption.

Keywords: Clock gating techniques, LECTOR, Single and multi-stage D Flip-flop, D Latch, Power

## 1. Introduction

Nowadays, there is a strong inclination amongst integrated circuit users for low-power high-speed applications.
Therefore, the need for intelligent power reduction techniques is acute in today's research. The major components of power consumption in chip design are static power, due to current leaking through active/inactive devices, and dynamic power, from logic swings in active devices. The trend of static and dynamic power dissipation is depicted in Fig. 1 as a function of process technology. Although control of static power dissipation is the predominant concern, the current that leaks from the power supply ($V_{DD}$) to ground during logic transitions at the intermediate nodes of a design is unavoidable. This situation is commonly referred to as the crow-barred condition in complementary metal-oxide semiconductor (CMOS) circuits [1]. One way to address this issue is to reduce $V_{DD}$ or to cut it off when it is not required. Reducing $V_{DD}$ lowers the operational speed, which opposes the target of achieving high speed. Also, lowering $V_{DD}$ will increase the subthreshold leakage current. On the other hand, if the current drawn from $V_{DD}$ is controlled with sleep transistors, which is popularly referred to as power gating [3], there is a serious risk of degrading the logic voltage levels at the output. To solve this issue, Hanchate et al. [4] reported that an efficacious stacking of positive-channel metal oxide semiconductor (PMOS) and negative-channel metal oxide semiconductor (NMOS) transistors between the power lines ($V_{DD}$ and ground), called Leakage Control Transistor (LECTOR), impedes current leaks between the power lines at a small penalty in delay.

Fig. 1. Static and dynamic power consumption as a function of process technology [2].

The logic swing at the output of any sequential architecture is controlled by the clock nets. However, the clock signal, which does not contain any information, is triggered continuously, leading to unnecessary power loss.
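The trade-off between supply voltage and clock activity can be made concrete with the usual first-order CMOS power model. The expression below is a textbook approximation rather than a formula taken from this paper; the symbols $\alpha$ (activity factor), $C_L$ (switched load capacitance), $f_{clk}$ (clock frequency), and $I_{leak}$ (total leakage current) are introduced here as assumptions, and the short-circuit (crow-bar) contribution is left out for brevity.

$$P_{total} \approx \underbrace{\alpha \, C_L \, V_{DD}^{2} \, f_{clk}}_{\text{dynamic}} + \underbrace{V_{DD} \, I_{leak}}_{\text{static}}$$

Under this model, lowering $V_{DD}$ reduces the dynamic term quadratically, whereas suppressing unnecessary clock toggling lowers $\alpha$ for the clock net without touching $V_{DD}$.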
Therefore, the clock nets are managed by keeping them shut off when they are not contributing to necessary computation. This concept is often referred to as clock gating (CG) [5]. In this way, the toggling rate of the clock nets relative to the operational frequency of a design, known as the activity factor, can be curbed to a great extent, which substantially reduces dynamic power dissipation. Our approach has been to incorporate LECTOR in clock gating to optimize both static and dynamic power dissipation concurrently for sequential designs [6]. That effort showed that LECTOR-based clock gating (LB-CG) outperforms a variety of prevalent clock gating styles. There, a delay flip-flop (D-FF) served as the test circuit used to validate the gating style. Even so, a question remains about how LB-CG performs in multi-stage applications, which matters for large circuit topologies where the common practice is to cascade as many stages as possible. Therefore, in this work, our focus is on the performance of LB-CG for multi-stage operation under various process corners through Monte Carlo simulation, and we observe its superiority over a recent clock gating technique.

The rest of this paper is organized as follows. Section 2 is devoted to a quick discussion of the merits and liabilities of various clock-gating styles. In Section 3, the necessity for gated multi-stage architectures is highlighted in the form of a two-stage sequential circuit, which is a register. In Section 4, we discuss the results observed from this work and offer various parametric analyses. Finally, we conclude this paper in Section 5 with an overview based on our observations.

## 2. Prior Art (Clock Gating)

Clock gating is a well-recognized power optimization technique in very large-scale integration, and it has been practiced since the development of Intel Pentium IV processors [6].
Gating is mostly applied to sequential elements like flip-flops, as they are the prime contributors to dynamic power dissipation. The most primitive clock gating style reported is latch-free gating [5], which consists of a simple AND/OR gate, chosen according to the type of edge-triggering employed in the flip-flop. The inputs to the AND/OR gate are an enable signal and the system clock. The system clock is propagated to the gated clock only when the enable signal is active, and thereby unnecessary clock triggering is avoided when the flip-flop is not functioning. The issue is the high dependency on the enable signal: the signal is always prone to noise, which directly affects the gated clock. Moreover, if the enable signal switches frequently, the gated clock may switch multiple times as well, which defeats the sole purpose of clock gating. This problem could be overcome by adding a latch to control the enable signal, a methodology known as latch-based clock gating [7]. However, introducing the latch does not solve the dependency on the enable signal. Latches are level-triggered and are transparent for half a clock period. While the latch is transparent, the enable signal remains open to noise, which is transferred directly to the gated clock. In addition, even in the opaque half of the latch, if the enable happens to be active, the gated clock will remain low, leading to data loss. These issues make the latch-free and latch-based gating styles unattractive. A probable solution is to incorporate the enable in such a manner that it does not affect the gated clock directly. In 2000, Strollo et al. suggested for the first time comparing the data output of the flip-flop serially with its input data stream and using the result of that comparison as the enable [8].
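The weakness of latch-free gating described above can be illustrated with a small behavioral model. The following Python sketch is not taken from the cited works: it assumes an AND-type gate, samples all signals on a discrete time grid (six samples per clock period, an arbitrary choice), and uses the hypothetical helper names `and_gate_clock` and `rising_edges`.

```python
def rising_edges(signal):
    """Count 0 -> 1 transitions in a list of 0/1 samples."""
    return sum(1 for prev, cur in zip(signal, signal[1:]) if prev == 0 and cur == 1)


def and_gate_clock(clk, enable):
    """Latch-free gating (assumed AND type): gated clock = clk AND enable."""
    return [c & e for c, e in zip(clk, enable)]


# Four clock periods; each period is low for three samples, then high for three.
clk = [0, 0, 0, 1, 1, 1] * 4

# A clean enable that is active during the first two periods only.
clean_enable = [1] * 12 + [0] * 12

# The same enable with a one-sample glitch in the middle of the first high phase.
noisy_enable = clean_enable.copy()
noisy_enable[4] = 0

clean_gated = and_gate_clock(clk, clean_enable)
noisy_gated = and_gate_clock(clk, noisy_enable)

print("system clock edges:", rising_edges(clk))
print("gated clock edges, clean enable:", rising_edges(clean_gated))
print("gated clock edges, noisy enable:", rising_edges(noisy_gated))
```

In this toy model the glitch on the enable splits one clock pulse into two, so the gated clock sees an extra rising edge that the flip-flop would treat as a real trigger. An enable derived from a comparison of the flip-flop's input and output data is not driven directly by such an external signal.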
Deriving the enable from such a comparison eliminates the direct effect of the enable on the gated clock. Moreover, it also offers a way to stop the continuous triggering of the clock in a flip-flop when the same data are repeated. In this context, Strollo et al. proposed two new gating styles, known as double-gated clock gating (DG-CG) and NC<sup>2</sup>MOS clock gating [11]. The proposed DG-CG design uses separate comparison and gating units for the 'master' and 'slave' latches of the flip-flop. It was also reported that even though the gating overhead is doubled here, a significant power reduction occurs when the switching activity ($\alpha$) is low. An increase in $\alpha$ increases the total power dissipation. For this reason, they introduced NC<sup>2</sup>MOS clock gating with dynamic CMOS logic. Although it did not manifest any timing problems, the output node (being dynamic in nature) makes it more likely that the gated clock will be contaminated. In 2016, Bhattacharjee et al. presented LB-CG, an improved form of DG-CG, to address the above issues and also to curb the power consumed due to the crow-barred condition, along with the dynamic power consumption [6]. They proposed that, as the 'master' latch output flows into the 'slave' latch input during its transparency, the gating overhead can be reduced by eliminating the comparison unit in the 'slave' latch. The gating circuit overhead therefore increases less than in DG-CG. In addition, they replaced the static AND with a LECTOR AND to stop the current leak through the power lines.

In Fig. 2, the incorporation of LB-CG on a master-slave D-flip-flop is shown. Both the 'master' and 'slave' latches are synchronized using the gated clock, where the polarity of the gated clock differs between the 'master' and the 'slave' latch. Therefore, either the 'master' or the 'slave' latch remains active at a particular instant. As the flip-flop is triggered by the gated clock, unnecessary switching activity is avoided when data D does not change. As per the architecture shown in Fig. 2, when T1 and T4 are active (if ckg=1), the current data pass through the 'master' latch and the previous data are stored in the 'slave' latch. If the current data differ from the previous data, as compared by the XOR gate, ckg changes; otherwise, it conserves the previous state. The pros and cons of existing clock gating styles are summarized in Table 1, which clearly delineates LB-CG as the appropriate gating style. Moreover, it is found that LB-CG is competent to drive benchmark circuits like serial adders [12].

Fig. 2. LB-CG implemented on a master-slave D-flip-flop [6].

## 3. Incorporating Clock Gating in Multi-stage Architecture

The cascading of flip-flops in multiple stages (MS-FF) is a neat application of multi-stage architectures, which have become an integral part of present-day heavyweight circuits. A major problem in these sorts of architectures is that the delay, power, and area of the individual stages add up, making it a cumulative process. Therefore, to provide a solution to the problem, we have incorporated LB-CG into a multi-stage circuit representing a two-bit register, which comprises two flip-flops cascaded in consecutive stages, as shown in Fig. 3 [13]. The register is made of LB-CG-based D-flip-flops [6]. The data bits D1 and D2 of Register 1 are sent to Register 2, comprising Q1 and Q2. In the process, the system clock is inserted as the global clock pulse, common to both flip-flops, but clock gating is conducted individually. Therefore, ckg and ckg\_1 in Fig. 3 will have different operating frequencies. We could instead have used a common gated clock in the architecture, but that may have introduced a logical error: D1 and D2 change continually, and a gated clock changes accordingly, as it is produced by the XOR comparison output of the 'master' latch.
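The reason for gating each flip-flop individually can be seen in a purely behavioral sketch. The Python model below is an illustration under simplifying assumptions, not the transistor-level LB-CG circuit: it ignores the LECTOR gate, the master/slave timing, and all delays, and the class `GatedDFlipFlop` and function `run_register` are hypothetical names. It only captures the idea that a gated clock pulse is issued when the XOR of the incoming data and the stored data is 1.

```python
class GatedDFlipFlop:
    """Behavioral D flip-flop clocked only when D differs from the stored Q."""

    def __init__(self):
        self.q = 0
        self.gated_clock_pulses = 0

    def clock(self, d):
        """Apply one system clock cycle with data input d and return Q."""
        enable = d ^ self.q  # XOR comparison of new data and stored data
        if enable:
            self.gated_clock_pulses += 1  # the gated clock toggles this cycle
            self.q = d
        return self.q


def run_register(d1_stream, d2_stream):
    """Drive a two-bit register whose bits are gated independently."""
    ff1, ff2 = GatedDFlipFlop(), GatedDFlipFlop()
    outputs = [(ff1.clock(d1), ff2.clock(d2)) for d1, d2 in zip(d1_stream, d2_stream)]
    return outputs, ff1.gated_clock_pulses, ff2.gated_clock_pulses


# Example data streams (arbitrary): D1 changes rarely, D2 changes every cycle.
d1 = [0, 1, 1, 1, 0, 0, 1, 1]
d2 = [1, 0, 1, 0, 1, 0, 1, 0]

outputs, ckg_pulses, ckg_1_pulses = run_register(d1, d2)
print("system clock cycles:", len(d1))
print("gated clock pulses for bit 1 (ckg):", ckg_pulses)
print("gated clock pulses for bit 2 (ckg_1):", ckg_1_pulses)
```

With these example streams, the two gated clocks toggle at different rates while both outputs still follow their inputs. A single shared gated clock would have to fire whenever either bit changes, which removes the saving for the quieter bit.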
Now, to verify the architecture, we simulated it using both the 65 nm Predictive Technology Model (PTM) [14] and the 90 nm PTM [15] at an operating frequency of 18 GHz. The transient response of the simulation is displayed in Fig. 4.

Fig. 3. Two-stage sequential circuit: a register.

## 4. Results and Discussions

The incorporation of LB-CG in a two-bit register produced mixed results. It brought savings in average power consumption of 3.58% and 7.78% in comparison to its implementation with DG-CG and a non-gated approach, respectively, when simulated in 90 nm PTM. From the power consumption point of view, a gated two-bit register consumes less power than the non-gated approach, even though it introduces extra circuit overhead. This is supported by Table 2 and by the switching activity ($\alpha$) versus average power plot given in Fig. 5. However, we see that for a switching activity of 6.25% or less, the average power consumption of the DG-CG approach is lower than that of the LB-CG approach.

Table 1. Comparing the performance of predominantly popular CG styles with LB-CG.

| Existing Works | Technology (nm) | Delay (ns) | Average Power (μW) | Transistor Count | Remarks |
|---|---|---|---|---|---|
| DG-CG [8] | 90 | 1.5287 | 58.53 | 50 | Circuit overhead is pretty high. Possibility to dissipate more power increases with increase in activity factor. |
| NC<sup>2</sup>MOS CG [8] | 90 | 0.2588 | 20.54 | | Circuit overhead is less. But as the constructing element is dynamic CMOS, there is high probability of distorted logic. |
| CG reported in [9] | 90 | - | 8.00 | 25 | Circuit overhead is minimal. Power dissipation is appreciably low. But it is made of hybrid logic with current-based design. Therefore, its logical stability is a big concern. |
| RTL-based CG [10] | 40 | | 65000 | | Power dissipation is extremely high. Thereby, contradicts the intention of gating. |
| LB-CG [6] | 90 | 0.084 | 19.06 | 40 | Simultaneously suppresses both static & dynamic power. Gating circuit overhead is moderate. Offers less |

Table 2. Performance analysis of single-stage versus multi-stage gated architectures using CG styles.

| Parameters | 65 nm LB-CG Approach | 65 nm DG-CG Approach | 65 nm No Gating Approach | 90 nm LB-CG Approach | 90 nm DG-CG Approach | 90 nm No Gating Approach |
|---|---|---|---|---|---|---|
| Delay (ns) | 1.3096 | 1.3463 | 1.0817 | 1.4920 | 1.5287 | 1.1641 |
| Setup Time (ns) | 0.0327 | 0.0271 | 0.0084 | 0.0477 | 0.0421 | 0.0234 |
| Hold Time (ns) | -0.0241 | -0.0192 | -0.0094 | -0.0391 | -0.0342 | -0.0244 |
| Latency (ns) | 1.3423 | 1.3734 | 1.0901 | 1.5397 | 1.5708 | 1.1875 |
| Avg. Power (μW) | 58.58 | 60.55 | 62.30 | 56.43 | 58.53 | 61.19 |
| Transistor Count | 40 | 50 | 18 | 40 | 50 | 18 |
| Clock (GHz) | 18 | 18 | 18 | 18 | 18 | 18 |

On the other hand, in order to estimate the timing parameters, we incorporated a test circuit, shown in Fig. 6. The time interval measured between ckg and Q\_flip\_flop is defined as the propagation delay of the clock-gated D-flip-flop in the multi-stage architecture. The same methodology applies to the two-bit register. The setup and hold times are estimated as the time interval between the positive edges of ckg\_bar and $D_b$, where ckg\_bar is the complementary counterpart of ckg. Setup time is defined as the time required for data $D_b$ to stabilize before ckg\_bar arrives at the 'slave' latch.
Similarly, hold time is defined as the time span for which data $D_b$ remains unchanged until ckg\_bar toggles. The sum of the setup time and the propagation delay is taken as the latency. The propagation delay and latency in the LB-CG approach are 2.4% and 1.98% less, respectively, than in the DG-CG approach, but both are worse than in the non-gated approach. This is expected, as the transistor count in both the LB-CG and DG-CG approaches is considerably higher than in the non-gated approach. Even then, we can consider the LB-CG approach to be optimum for multi-stage architectures, as it provides a good reduction in power consumption.

Fig. 4. Transient response of LB-CG approach of multi-stage flip-flop architecture in 90 nm PTM.

Fig. 5. Average power in 90 nm PTM @switching activity.

Fig. 6. Test circuit for timing analysis [6].

To determine the reliability of the LB-CG approach for a two-bit register, we performed a process variation analysis and observed the delay and average power behavior as a function of the power supply voltage ($V_{DD}$), shown in Figs. 7(a) and (b), respectively. As $V_{DD}$ is scaled down, the architecture is found to work well with a power supply as small as 0.7 V, leading to a power dissipation of only 23.52 µW. Though there is no significant change in delay for different temperatures, the average power increases by a tiny margin. Therefore, the LB-CG approach for a multi-stage sequential circuit functions properly across the tested conditions.

Fig. 7. (a) Delay, (b) average power as a function of $V_{DD}$ @ 18 GHz.

Table 3. Performance of Single-stage from Multi-stage Gated Architecture through Monte-Carlo in 90 nm PTM.

| Process Corner | Gated MS-FF | No skew: Average Power (µW) | No skew: Delay (ns) | No skew: PDP (fJ) | 5% skew: Average Power (µW), X | 5% skew: Average Power (µW), σ | 5% skew: Delay (ns), X | 5% skew: Delay (ns), σ | 5% skew: PDP (fJ), X | 5% skew: $\Delta V_{OH}$ (mV) | 5% skew: $\Delta V_{OL}$ (mV) |
|---|---|---|---|---|---|---|---|---|---|---|---|
| FF (0°C) | LB-CG MS-FF | 55.64 | 1.4925 | 83.043 | 55.773 | 0.795 | 1.486 | 0.0167 | 82.87 | 0.041 | 0.0381 |
| FF (0°C) | DG-CG MS-FF | 60.43 | 1.5381 | 92.947 | 61.143 | 0.633 | 1.533 | 0.0128 | 93.73 | 0.044 | 0.0391 |
| TT (27°C) | LB-CG MS-FF | 57.21 | 1.4902 | 85.524 | 57.284 | 0.525 | 1.483 | 0.0137 | 84.95 | 0.035 | 0.0382 |
| TT (27°C) | DG-CG MS-FF | 62.01 | 1.5269 | 94.683 | 62.038 | 0.429 | 1.519 | 0.0096 | 94.25 | 0.0412 | 0.0392 |
| SS (90°C) | LB-CG MS-FF | 58.78 | 1.4881 | 87.470 | 58.724 | 0.337 | 1.489 | 0.0024 | 87.46 | 0.037 | 0.031 |
| SS (90°C) | DG-CG MS-FF | 63.59 | 1.5248 | 96.962 | 63.851 | 0.242 | 1.520 | 0.0112 | 97.09 | 0.038 | 0.032 |

(X) → Mean; (σ) → Standard Deviation

In order to check the robustness of the LB-CG approach for a two-bit register, we analyzed different process corners (fast-fast, typical-typical, and slow-slow) through 1000 runs of a Monte Carlo simulation in 90 nm PTM and found it robust, as the architecture shows a very small deviation in average power, delay, and power-delay product (PDP) under a 5% process skew, as depicted in Table 3. It is also evident from Table 3 that the process skew has hardly affected the logic levels ($\Delta V_{OH}$ and $\Delta V_{OL}$) of operation. The LB-CG approach offers less delay and average power consumption than the DG-CG approach, as shown in Figs. 8(a) and (b), respectively.
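The derived quantities in Table 3 can be cross-checked with a few lines of arithmetic. The Python sketch below assumes that the PDP is simply average power multiplied by delay, and that a relative saving is measured against the DG-CG value. The numbers are transcribed from Table 3, while the helper names `pdp_fj` and `percent_less` are hypothetical.

```python
# Values transcribed from Table 3: (average power in µW, delay in ns).
NO_SKEW = {
    "FF": {"LB-CG": (55.64, 1.4925), "DG-CG": (60.43, 1.5381)},
    "TT": {"LB-CG": (57.21, 1.4902), "DG-CG": (62.01, 1.5269)},
    "SS": {"LB-CG": (58.78, 1.4881), "DG-CG": (63.59, 1.5248)},
}
# Mean values (X) under the 5% process skew.
SKEW_MEAN = {
    "FF": {"LB-CG": (55.773, 1.486), "DG-CG": (61.143, 1.533)},
    "TT": {"LB-CG": (57.284, 1.483), "DG-CG": (62.038, 1.519)},
    "SS": {"LB-CG": (58.724, 1.489), "DG-CG": (63.851, 1.520)},
}


def pdp_fj(power_uw, delay_ns):
    """Power-delay product: µW x ns = 1e-6 W x 1e-9 s = 1e-15 J = 1 fJ."""
    return power_uw * delay_ns


def percent_less(candidate, baseline):
    """How much smaller candidate is than baseline, in percent of baseline."""
    return 100.0 * (baseline - candidate) / baseline


for corner in ("FF", "TT", "SS"):
    lb_power, lb_delay = SKEW_MEAN[corner]["LB-CG"]
    dg_power, dg_delay = SKEW_MEAN[corner]["DG-CG"]
    lb_pdp = pdp_fj(*NO_SKEW[corner]["LB-CG"])
    print(
        f"{corner}: no-skew LB-CG PDP = {lb_pdp:.3f} fJ, "
        f"power saving = {percent_less(lb_power, dg_power):.2f}%, "
        f"delay saving = {percent_less(lb_delay, dg_delay):.2f}%"
    )
```

Under these assumptions, the skewed mean values give LB-CG power savings of about 8.78%, 7.66%, and 8.03% at the FF, TT, and SS corners. For the TT corner, the no-skew product of 57.21 µW and 1.4902 ns evaluates to about 85.25 fJ, whereas Table 3 prints 85.524 fJ.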
The average power consumption in the LB-CG approach is 8.78%, 7.66%, and 8.03% lower than that of the DG-CG approach at the FF, TT, and SS corners, respectively. The LB-CG approach also gives 3.06%, 2.37%, and 2.04% less delay than the DG-CG approach in the same order of process corners.

Fig. 8. Performance comparison of LB-CG and DG-CG approaches in different process corners for (a) delay, (b) average power.

## 5. Conclusion

In this paper, we have explored one of the most recent clock gating styles, called LB-CG, and incorporated it into a multi-stage architecture, i.e. a two-bit register. The architecture was analyzed in both 65 nm and 90 nm PTM. According to Table 2, the 65 nm implementation has 12.22% less delay than the 90 nm one, while the 90 nm implementation consumes 3.67% less average power than the 65 nm one. The LB-CG based multi-stage architecture offers savings of 5.97% and 7.78% in average power with respect to its non-gated approach in 65 nm and 90 nm PTM, respectively, at a moderate penalty in transistor count. The various process corner analyses confirm its robustness and its capability to operate even under the worst conditions. Moreover, as the LB-CG approach for a multi-stage architecture operates at an 18 GHz clock frequency, it may be possible to employ it for power reduction in various high-frequency electronic systems.

## Acknowledgement

This research was supported under the Visvesvaraya PhD Scheme & SMDP-C2SD project grant funded by Ministry of Electronics & Information Technology, Government of India.

## References

- [1] Uyemura, John P. "CMOS Logic Circuit Design." (1999).
- [2] Max Maxfield. "Achronix announces new 22nm Speedster22i FPGAs" Achronix Semiconductors Corporation.
- [3] Hu, Zhigang, et al. "Microarchitectural techniques for power gating of execution units." Proceedings of the 2004 international symposium on Low power electronics and design. ACM, 2004.
- [4] Hanchate, Narender, and Nagarajan Ranganathan. "LECTOR: A technique for leakage reduction in CMOS circuits." IEEE Transactions on Very Large Scale Integration (VLSI) Systems 12.2 (2004): 196-205.
- [5] Shinde, Jitesh, and S. S. Salankar. "Clock gating—A power optimizing technique for VLSI circuits." India Conference (INDICON), 2011 Annual IEEE. IEEE, 2011.
- [6] P. Bhattacharjee, A. Majumder and T.D. Das, "A 90 nm Leakage Control Transistor Based Clock Gating for Low Power Flip Flop Applications." In IEEE 59th International Midwest Symposium on Circuits and Systems (MWSCAS), pp. 381-384. IEEE 2016.
- [7] Sharma, Dushyant Kumar. "Effects of different clock gating techinques on design." International Journal of Scientific & Engineering Research 3.5 (2012): 1.
- [8] Strollo, A. G. M., E. Napoli, and D. De Caro. "Low-power flip-flops with reliable clock gating." Microelectronics journal 32.1 (2001): 21-28.
- [9] Shaker, Mohamed, and Magdy Bayoumi. "Novel clock gating techniques for low power flip-flops and its applications." Circuits and Systems (MWSCAS), 2013 IEEE 56th International Midwest Symposium on. IEEE, 2013.
- [10] Dev, Mahendra Pratap, et al. "Clock gated low power sequential circuit design." Information & Communication Technologies (ICT), 2013 IEEE Conference on. IEEE, 2013.
- [11] Strollo, Antonio GM, Ettore Napoli, and Davide De Caro. "New clock-gating techniques for low-power flip-flops." Proceedings of the 2000 international symposium on Low power electronics and design. ACM, 2000.
- [12] P. Bhattacharjee, B. Nath and A. Majumder. "LECTOR Based Gated Clock Approach to Design Low Power FSM for Serial Adder" In Nanoelectronic and Information Systems (iNIS), 2016 IEEE International Symposium on, pp. 250-254.
- [13] P. Bhattacharjee, B. Nath and A. Majumder. "LECTOR Based Clock Gating for Low Power Multi-Stage Flip Flop Applications" In 16<sup>th</sup> International Conference on Electronics, Information, and Communication (ICEIC 2017), pp. 106-109.
- [14] 65nm Predictive Technology Model, Arizona State University
- [15] 90nm Predictive Technology Model, Arizona State University

**Pritam Bhattacharjee** is a PhD Scholar in the Department of Electronics & Computer Engineering, National Institute of Technology, Arunachal Pradesh, India. He received his B.Tech. in Electronics & Communication Engineering and an M.Tech. in Microelectronics & VLSI from Maulana Abul Kalam Azad University of Technology, West Bengal, India, in 2011 and 2013, respectively. He has worked as a Senior Lecturer in the Department of Hardware & Networking Technology at Bidyanidhi Institute of Technology & Management (Durgapur Chapter), West Bengal, India. His research interests are Back-end VLSI design, Clock Distribution Networks, Circuit Modeling in ternary Quantum Dot Cellular Automata, and Low-Power VLSI Design. He is also a Member, R10, Asia Pacific for IEEE Young Professionals, Electron Device Society, Big Data and Cloud Computing Community. He has served as a reviewer for journals like IEEE Transactions on Nanotechnology, Ain Shams Engineering Journal (Elsevier), and IETE Journal of Research, and for eminent conferences/symposiums like MWSCAS, iNIS, IEEE-NANO and ISCAS.

**Alak Majumder** is working as an Assistant Professor in the Department of Electronics and Communication Engineering at the National Institute of Technology Arunachal Pradesh, India. Before joining the department in September 2013, he served as Assistant Professor in the Department of Electronics and Communication Engineering at ICFAI University, Agartala, India.
His current research interests include Analog and Digital VLSI and High-speed Signaling. He is a Member of the IEEE, IAENG, and IACSIT. He has three filed Indian Patents and one U.S. Provisional Patent to his credit. He has served as a reviewer for many IEEE Transactions and Elsevier journals.

**Bipasha Nath** is currently in the final year of her M.Tech. in Mobile Communication & Computing in the Department of Electronics and Computer Engineering, National Institute of Technology, Arunachal Pradesh, India. She earned her B.Tech. in Electronics and Telecommunication Engineering from Tripura Institute of Technology, Agartala, in 2015. She is an author/co-author of many papers in several journals and conferences. Her recent areas of interest include Digital VLSI.