# Design of Self-Timed Standard Library and Interface Circuit

Hwi-Sung Jung, Moon-Key Lee

VLSI & CAD Lab., Dept. of Electronic Engineering Yonsei University, Seoul 120-749, Korea Tel: +82-2-361-3524, Fax: +82-2-364-8162 E-mail: hsjung@spark.yonsei.ac.kr

#### Abstract

We designed a self-timed interface circuit for efficient communication in IP (Intellectual Property)-based systems, together with a high-speed self-timed FIFO and a self-timed event logic library, in 0.25um CMOS technology. Optimized self-timed standard cell layouts and Verilog models were generated to support a top-down design methodology. A method for relieving the design bottleneck caused by clock-skew tolerance is described. Using this clock control method and the FIFO, we implemented a high-speed 32-bit interface chip for self-timed systems, with a maximum generated system clock of 2.2GHz. The size of the core is about 1.1mm x 1.1mm.

#### I. INTRODUCTION

The distribution of the clock signal and system-level timing in a VDSM (Very Deep Sub-Micron) synchronous system has become a critical problem because the complexity and difficulty of routing the clock lines have increased dramatically [1]. In recent years, interest in low-power and self-timed systems has grown considerably. To eliminate the timing and clock distribution problems in a large system, the external interface of each block can be made fully self-timed. In a self-timed circuit, which has no global clock, system timing is performed by the elements themselves [2]. Next-generation chips will consist of multiple independently clocked synchronous modules and self-timed modules for higher performance. A reliable communication scheme is needed for efficient communication between synchronous modules operating at different clock frequencies, or between synchronous and asynchronous modules. In this paper, we use the PCC (Pausible Clocking Control) circuit, one such scheme, to communicate with synchronous modules via self-timed FIFO (First-In First-Out) communication channels [1]. A synchronization failure is a timing violation that occurs when the arrival times of an external signal transition and a sampling edge of the clock cannot be distinguished by the sampling latch at the module boundary. The PCC scheme avoids such failures at the module interface by adjusting the local clock.

To improve the compatibility and usability of the circuit design, we designed a self-timed event logic library that performs the handshake protocol effectively. The library includes physical layout elements and Verilog timing models for an efficient top-down design methodology.

This work was supported by a grant from the Korea Research Foundation.

We fixed the height of every self-timed library cell at 8.8um so that cells can be placed and routed with an automatic P&R tool. During library generation, we used a layout compaction technique to minimize cell area, with particular attention to VDSM design methodology [6,7]. Using this library, we implemented a high-speed self-timed FIFO and an interface chip in a 0.25um CMOS ASIC process. The FIFO operates at 1.8GHz, and two separate modules can communicate with each other through it.

The rest of this paper is organized as follows. Section 2 reviews the self-timed library, whose cells are the key components of self-timed circuit design. Section 3 describes the design and implementation of the FIFO and the interface circuit unit for heterogeneous systems. Section 4 describes the simulation results. Section 5 concludes the paper with some remarks on the results.

#### II. SELF-TIMED STANDARD LIBRARY

We generated the set of self-timed standard cells used in AMULET [3] under 0.25um CMOS process design rules. This library is essential for implementing a handshake protocol with request/acknowledge signals. It contains Muller C-gate, toggle, select, mutual exclusion, call, arbiter, and transparent latch elements, among others [3]. For back-end and front-end design, respectively, the library provides optimized cell layouts with a height of 8.8um and Verilog timing models. The cell height is 10 tracks multiplied by 0.88um, which is one route pitch. One route pitch is the minimum metal2 route pitch in the dog-bone structure (parallel metal2 lines with vias inside); metal2 sets the pitch for place and route because the cells use only metal1 for internal routing. All logic cells have the same height, which allows effective placement and routing when building large blocks. Furthermore, because signal port locations and boundary conditions follow a standard structure, the physical library can be ported to automatic place-and-route tools such as Silicon Ensemble<sup>TM</sup> and Apollo<sup>TM</sup> to generate optimized layouts. Fig.1 shows the standard cell layout structure for the Muller C-gate. The characteristics of the physical library, designed under 0.25um CMOS process design rules, are as follows.

- This library is intended to operate with a 2.5V supply.
- This library is designed to support flipping and abutting cell rows.
- The transistor width P/N ratio is normalized to 1.3:1 for this library.
- The horizontal track 1 and track 10 are coincident with the cell AB (Abutment Box) bottom and top, respectively.
- It is preferred to place all input and output ports along one "Central Horizontal Grid" for all cells.
- All input and output ports are located on the valid positions.
- Tap connections placed at the cell ends (left and/or right) are best placed directly beneath their power supply bus metals.



Fig.1. Layout structure for muller c-gate.
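The event-logic elements listed at the start of this section can be illustrated behaviourally. The following is a minimal Python sketch, not the library's Verilog model: it assumes a two-input Muller C-gate and a toggle element whose outputs are labelled `dot` and `blank` for illustration only. The class and method names are hypothetical.

```python
class MullerC:
    """Two-input Muller C-gate: the output follows the inputs when they
    agree and holds its previous value when they differ."""

    def __init__(self, initial=0):
        self.out = initial

    def update(self, a, b):
        if a == b:
            self.out = a
        return self.out


class Toggle:
    """Toggle element: steers successive input events alternately to two
    outputs, starting with the output labelled 'dot' (labels are assumed)."""

    def __init__(self):
        self.count = 0

    def event(self):
        target = "dot" if self.count % 2 == 0 else "blank"
        self.count += 1
        return target


# Drive the C-gate with a short input sequence and record its output.
c_gate = MullerC()
trace = [c_gate.update(a, b) for a, b in [(0, 0), (1, 0), (1, 1), (0, 1), (0, 0)]]
print(trace)

toggle = Toggle()
print([toggle.event() for _ in range(3)])
```

The C-gate output changes only on the third and fifth input pairs, where both inputs agree, which is the property that makes it useful for joining request and acknowledge events.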

In VDSM (Very Deep Sub-Micron) CMOS technology, the minimum spacing between P+ and N+ diffusions has been scaled down considerably to reduce layout area. With this reduced spacing, the p-n-p-n path in the internal core of a CMOS IC becomes more sensitive to latchup [5]. The cross-sectional view of such a latchup-sensitive path is illustrated in Fig.2. The minimum spacing between the VDD-connected P+ diffusion in the N-well and the VSS-connected N+ diffusion in the p-substrate is much reduced in VDSM CMOS technology. Table 1 shows the minimum spacing of the scaled-down p-n-p-n path in typical CMOS technologies. To check the spacing of the sensitive path, we used the DRC and ERC functions of DRACULA<sup>TM</sup> and then replaced sensitive structures with latchup-insensitive layout styles.

Fig.2. Cross-sectional view of the latchup-sensitive path.

| Process | X (um) | Y (um) | X+Y (um) |
|---------|--------|--------|----------|
| 0.50um  | 0.7    | 1.4    | 2.1      |
| 0.35um  | 0.6    | 1.2    | 1.8      |
| 0.25um  | 0.5    | 0.7    | 1.2      |
| 0.18um  | 0.4    | 0.4    | 0.8      |

Table 1. The minimum space of p-n-p-n path.

To verify functionality in front-end design, we generated Verilog timing models for Verilog-XL<sup>TM</sup>. The Verilog timing model for the transparent latch in the self-timed library is shown in Fig.3.



Fig.3. Verilog<sup>TM</sup> timing model of transparent latch.

We used the following methods to optimize for low power and high speed. For low power, ring oscillator simulation shows that power consumption falls as transistor size decreases, because the parasitic capacitance becomes smaller; our approach, however, is to optimize power consumption while also taking the logic threshold voltage of the gate into account. For high speed, repeated SPICE simulation determines the PMOS and NMOS transistor sizes that minimize gate delay. The gate delay is calculated as the average of the rise delay and the fall delay, obtained from a simulation in which three gates are connected in series with a standard load placed after each gate. The standard load is four inverters plus routing capacitance. Table 2 shows the optimized layout area and gate delay for some self-timed library cells.

| Cell       | Cell size (um x um) | Rise prop. delay | Fall prop. delay | Rise time (Tr) | Fall time (Tf) |
|------------|---------------------|------------------|------------------|----------------|----------------|
| Muller-C   | 8.8 x 8.8           | 0.142ns          | 0.110ns          | 0.122ns        | 0.083ns        |
| Translatch | 10.56 x 8.8         | 0.130ns          | 0.203ns          | 0.201ns        | 0.095ns        |
| Mutex      | 15.84 x 8.8         | 0.130ns          | 0.216ns          | 0.152ns        | 0.009ns        |
| Toggle     | 29.92 x 8.8         | 0.143ns          | 0.216ns          | 0.196ns        | 0.109ns        |
| Select     | 47.52 x 8.8         | 0.235ns          | 0.248ns          | 0.176ns        | 0.114ns        |

Table 2. Layout area and gate delay for self-timed cells.
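As a rough illustration of the averaging rule described above, the Python sketch below computes the mean of the rise and fall propagation delays listed in Table 2. It also checks whether each cell width is a whole number of 0.88um route pitches; that the horizontal placement grid equals the route pitch is an assumption of this sketch, not a statement from the library specification. All function and variable names are hypothetical.

```python
# Propagation delays copied from Table 2, in picoseconds: (rise, fall).
PROP_DELAY_PS = {
    "Muller-C": (142, 110),
    "Translatch": (130, 203),
    "Mutex": (130, 216),
    "Toggle": (143, 216),
    "Select": (235, 248),
}

# Cell widths copied from Table 2, in nanometres so the arithmetic is exact.
CELL_WIDTH_NM = {
    "Muller-C": 8800,
    "Translatch": 10560,
    "Mutex": 15840,
    "Toggle": 29920,
    "Select": 47520,
}

ROUTE_PITCH_NM = 880  # 0.88um route pitch quoted in the text


def average_delay_ps(rise_ps, fall_ps):
    """Gate delay taken as the mean of the rise and fall delays."""
    return (rise_ps + fall_ps) / 2


def width_in_pitches(width_nm, pitch_nm=ROUTE_PITCH_NM):
    """Return the width as a pitch count; reject widths that are off-grid.

    Assumes the horizontal grid equals the route pitch (an assumption).
    """
    pitches, remainder = divmod(width_nm, pitch_nm)
    if remainder:
        raise ValueError("width is not a whole number of route pitches")
    return pitches


for cell, (rise, fall) in PROP_DELAY_PS.items():
    avg = average_delay_ps(rise, fall)
    pitches = width_in_pitches(CELL_WIDTH_NM[cell])
    print(f"{cell}: average delay {avg} ps, width {pitches} pitches")
```

Under these assumptions every width in Table 2 falls on the pitch grid, which is consistent with cells intended for automatic placement.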

#### III. SELF-TIMED INTERFACE DESIGN

The SOC (System on a Chip) design approach is increasingly adopted, supported by IP (Intellectual Property) blocks, to meet time-to-market demands. In this paper, we consider a globally synchronous, locally self-timed system such as the one in Fig.4. A peripheral block is an IP block whose clock frequency differs from the system clock frequency [4]. With this design approach, the timing and clock distribution problems of a large system consisting of many IP blocks can be eliminated.



Fig.4. System configuration.

Communication in a self-timed system such as the one in Fig.4 requires a self-timed high-speed FIFO channel for burst transmission. We implemented a 4-stage micropipeline [4] FIFO, a self-timed circuit that uses a two-phase or four-phase communication protocol and a bundled data format. Micropipelines fall into two types: the transparent latch style and the capture-pass latch style [4]. Using the self-timed logic library, we designed and optimized both types of micropipeline FIFO and compared them in terms of layout area and delay time. Using transparent latches in the micropipeline stage greatly simplifies the required control circuit compared with the capture-pass micropipeline stage.

HSPICE™ simulation (BSIM3V3 model, nominal-nominal corner) was performed on the layout extracted from the implemented design for the nominal case (Vdd=2.5V, 25°C). The transparent latch micropipeline is faster than the capture-pass micropipeline. Once valid data is presented at the latch input, it propagates to the latch output in 0.62ns for the transparent latch FIFO operating at 1.6GHz.
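The control behaviour of a 4-stage micropipeline FIFO can be sketched in a few lines of Python. This is a behavioural model only, assuming two-phase (transition) signalling and one C-gate per stage with an inverted acknowledge input; it carries no timing information and is not the implemented circuit. The class `MicropipelineFifo` and its methods are hypothetical names.

```python
class MicropipelineFifo:
    """Behavioural two-phase micropipeline: each stage has one C-gate whose
    output is both the request to the next stage and the acknowledge to the
    previous one."""

    def __init__(self, stages=4):
        self.n = stages
        self.c = [0] * stages        # C-gate outputs
        self.data = [None] * stages  # bundled data held in each stage latch
        self.req_in = 0              # request wire driven by the sender
        self.ack_out = 0             # acknowledge wire driven by the receiver
        self.data_in = None

    def _ack(self, i):
        """Acknowledge seen by stage i (from its successor or the receiver)."""
        return self.ack_out if i == self.n - 1 else self.c[i + 1]

    def _settle(self):
        """Fire C-gates until no further transition is possible."""
        changed = True
        while changed:
            changed = False
            for i in range(self.n):
                req = self.req_in if i == 0 else self.c[i - 1]
                # A stage fires when a new request is pending and its
                # successor has acknowledged the previous event.
                if req != self.c[i] and self._ack(i) == self.c[i]:
                    self.data[i] = self.data_in if i == 0 else self.data[i - 1]
                    self.c[i] = req
                    changed = True

    def can_push(self):
        return self.req_in == self.c[0] and self._ack(0) == self.c[0]

    def push(self, value):
        if not self.can_push():
            return False
        self.data_in = value
        self.req_in ^= 1  # a transition on the request wire is one event
        self._settle()
        return True

    def can_pop(self):
        return self.c[-1] != self.ack_out

    def pop(self):
        if not self.can_pop():
            return None
        value = self.data[-1]
        self.ack_out ^= 1  # a transition on the acknowledge wire is one event
        self._settle()
        return value


fifo = MicropipelineFifo(stages=4)
accepted = [fifo.push(v) for v in (1, 2, 3, 4, 5)]
drained = [fifo.pop() for _ in range(5)]
print(accepted, drained)
```

In this model a fifth push is refused while four items are stored, and items leave in arrival order, which is the burst-buffering behaviour the FIFO channel is meant to provide.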

The clock control circuit is, besides the FIFO, the other key element of a self-timed interface system that transfers data between two modules with different clock frequencies. We used the pausible clocking scheme, in which the local clock is paused or stretched to ensure that the handshaking signal satisfies the setup and hold time constraints with respect to the local clock, to implement the interface circuit. In this scheme, communication between a module and the FIFO uses a request/acknowledge handshake protocol. The method adjusts an individual synchronous module's local clock, when necessary, to avoid synchronization failure. We implemented the pausible clocking scheme in the PCC (Pausible Clocking Control) [1] circuit style. Fig.5 shows the block diagram of the proposed scheme.

Fig.5. Block diagram for interface using FIFO and PCC.

In this IP block, the clock generator block is composed of an inverter chain that generates sysclk for the IP module, so the clock frequency depends on the number of inverter gates. The system operates as follows. First, the request signal from the CPU is passed to the self-timed finite state machine, and a control signal is generated between the finite state machine and the mutual exclusion block. Second, the rclk signal is stretched and passed to the clock generator block, and a request signal is issued to announce the data transfer. Finally, the sysclk signal is generated and the synchronous block operates with this system clock.
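The dependence of the clock frequency on the number of inverters can be made concrete with the standard first-order ring oscillator relation f = 1 / (2 N t_d), where N is the (odd) number of inverting stages and t_d is the delay per stage. The Python sketch below applies that relation; the stage count and stage delay in the example are hypothetical values, not figures from the implemented clock generator, and the model ignores the extra delay of any control gates in the loop.

```python
def ring_oscillator_frequency_hz(num_inverters, stage_delay_s):
    """First-order estimate: one period is two passes around the ring."""
    if num_inverters < 1 or num_inverters % 2 == 0:
        raise ValueError("a ring oscillator needs an odd number of inverting stages")
    return 1.0 / (2 * num_inverters * stage_delay_s)


def stage_delay_for_frequency_s(num_inverters, frequency_hz):
    """Stage delay needed to reach a target frequency with a given ring length."""
    return 1.0 / (2 * num_inverters * frequency_hz)


# Hypothetical example: 5 stages with 50 ps per stage.
example_hz = ring_oscillator_frequency_hz(5, 50e-12)
print(f"{example_hz / 1e9:.2f} GHz")

# Stage delay a hypothetical 3-stage ring would need for a 2.2GHz clock.
needed_s = stage_delay_for_frequency_s(3, 2.2e9)
print(f"{needed_s * 1e12:.1f} ps per stage")
```

Adding inverters lengthens the loop and lowers the frequency, which is why the inverter count sets sysclk for each IP module.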

Using this self-timed interface method, many IP blocks can be controlled in a self-timed manner, independently of any clock signal from the CPU. Communication between distinct IP blocks in a self-timed IP-based system can also be performed via self-timed communication channels. To minimize power dissipation, the clock generator block and the mutual exclusion block should be reset when the IP block is idle or waiting for an input event.

#### IV. EXPERIMENTAL RESULTS

We used the Virtuoso<sup>TM</sup> layout editor for full-custom design and simulated with HSPICE<sup>TM</sup>, including parasitic resistance and capacitance, because interconnect delay is more critical than gate delay in VDSM technology.

Fig.6. HSPICE<sup>TM</sup> result on self-timed clock generator system.

We designed a 32-bit interface chip for self-timed system integration, as shown in Fig.7. We described the proposed system in HDL with timing models for functional verification and performed extensive simulation on the layout extracted from the design, which follows 0.25um CMOS process design rules. The size of the core is about 1.1mm x 1.1mm.

Fig.7. The chip layout.

The timing trace in Fig.6 shows a simulation result including handshake and data signals. The result clearly indicates that the clock is stretched. The first event on *Req* (a rising transition) is acknowledged with *sysclk* paused, and the second event (a falling transition) then causes *sysclk* to be paused for about 0.2ns. As a result, the module operates with a *sysclk* of 2.2GHz.
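The effect of a pause on the clock edges can be sketched numerically. The Python snippet below generates rising-edge times for a free-running 2.2GHz clock and inserts a 0.2ns pause before one edge; which edge is stretched, and the function name `rising_edges_ns`, are hypothetical choices for illustration rather than values read from Fig.6.

```python
def rising_edges_ns(period_ns, num_edges, pauses=None):
    """Return rising-edge times (ns) of a pausible clock.

    `pauses` maps an edge index to the extra delay in ns inserted before
    that edge; edges without an entry follow the free-running period.
    """
    pauses = pauses or {}
    t = 0.0
    edges = []
    for k in range(num_edges):
        t += pauses.get(k, 0.0)  # stretch the clock before this edge
        edges.append(round(t, 4))
        t += period_ns
    return edges


PERIOD_NS = 1.0 / 2.2  # 2.2GHz sysclk from the text

free_running = rising_edges_ns(PERIOD_NS, 4)
stretched = rising_edges_ns(PERIOD_NS, 4, pauses={2: 0.2})  # edge index is assumed
print(free_running)
print(stretched)
```

Only the interval containing the pause grows; the edges after it keep the nominal period, shifted by the pause length.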

#### V. CONCLUSION

In this paper, we presented a globally synchronous, locally self-timed IP system configuration, designed an optimized self-timed logic library, and implemented a self-timed interface chip that communicates with IP blocks whose clock frequency differs from the CPU system clock frequency.

The micropipeline FIFO operates at 1.6GHz for data transmission. The self-timed communication scheme, based on the pausible clocking scheme, is implemented with the FIFO and the PCC circuit in 0.25um CMOS. The resulting system functions up to a local clock frequency of 2.2GHz, which is limited by the ring oscillator. The self-timed interface system is a viable approach to overcoming the synchronization and power consumption problems that arise in globally clocked systems.

#### REFERENCES

- [1] K.Y. Yun and A.E. Dooply, "Pausible Clocking-Based Heterogeneous Systems", *IEEE Transactions on VLSI Systems*, Vol.7, No.4, Dec. 1999.
- [2] M.R. Greenstreet, "Implementing a STARI chip", *Proc. Int. Conf. Computer Design*, pp.38-43, 1995.
- [3] N.C. Paver, "The design and implementation of an asynchronous microprocessor", Ph.D. thesis, University of Manchester, 1994.
- [4] Shyh-Jye and I-Yao Chuang, "Low-power Globally Asynchronous Locally Synchronous Design Using Self-Timed Circuit Technology", *IEEE International Symposium on Circuits and Systems*, pp.1808-1811, June 1994.
- [5] Ming-Dou Ker and Jeng-Jie Peng, "Layout Design and Verification for Cell Library to Improve ESD/Latchup Reliability in Deep-Submicron CMOS Technology", *IEEE Custom Integrated Circuits Conference*, pp.537-540, May 1998.
- [6] Dake Liu and Christer Svensson, "Trading Speed for Low Power by Choice of Supply and Threshold Voltages", *IEEE Journal of Solid-State Circuits*, Vol.28, No.1, pp.10-17, Jan. 1993.
- [7] Richard X. Gu and Mohamed I. Elmasry, "Power Dissipation Analysis and Optimization of Deep Submicron CMOS Digital Circuits", *IEEE Journal of Solid-State Circuits*, Vol.31, No.5, May 1996.