# THE EFFECT OF NUMBER OF VIRTUAL CHANNELS ON NOC EDP

M. NADI, M. H. GHADIRY\* AND M. KHALILY DERMANY

ABSTRACT. The low scalability and power efficiency of the shared bus in SoCs is a motivation to use on-chip networks instead of traditional buses. In this paper we modify the Orion power model to obtain an analytical model that estimates the average message energy in K-Ary n-Cubes, with a focus on the number of virtual channels. Then, using this power model together with the performance model proposed in [11], the effect of the number of virtual channels on the energy-delay product (EDP) is analyzed. In addition, a cycle-accurate power and performance simulator has been implemented in VHDL to verify the results.

AMS Mathematics Subject Classification: 94C05

Key words and phrases: Energy model, NoC (Network on Chip), power, performance.

Received January 18, 2008. Revised June 5, 2008. Accepted December 4, 2009. \*Corresponding author. © 2010 Korean SIGCAM and KSCAM.

#### 1. Introduction

As VLSI feature size shrinks, the density of transistors increases and it becomes possible to place many IPs on a single chip. The most important problem of early on-chip systems is the interconnection of the IPs. The scalability problem of current interconnects is expected to be solved by NoCs. In this approach switches are used to connect the IPs instead of shared buses. Standard interfaces can be defined between the network (the collection of links and switching elements) and the IPs, so the design of the IPs becomes independent of the network. On the other hand, a NoC improves bandwidth through the use of concurrent connections and decreases power consumption by removing long interconnection wires [1].

Wormhole switching (also widely known as wormhole routing) has been very popular in practical multi-computers because it makes latency almost independent of the message distance in the absence of blocking. In wormhole routing a message is
broken into flits (a few bytes each) for transmission and flow control. The header flit (containing routing information) governs the route, and the remaining data flits follow in a pipelined fashion. If the header is blocked, the data flits are blocked in situ. Dally and Seitz [13] used the concept of virtual channels to develop deadlock-free deterministic routing. A virtual channel has its own queue, but shares the bandwidth of the physical channel in a time-multiplexed fashion.

Power efficiency is one of the most important issues in early system design. For current process technologies, dynamic power is the dominant component of the power consumed in CMOS circuits. The power is formulated as $P = E f_{clk}$, and the energy as $E = 0.5\alpha C V_{DD}^2$, with clock frequency $f_{clk}$, switching activity $\alpha$, total switched capacitance $C$, and supply voltage $V_{DD}$.

Many analytical performance models for interconnection networks have been proposed so far, but power consumption still requires more effort. Wang et al. [3] proposed a power and performance interconnection network simulator that provides detailed power characteristics, in addition to performance characteristics, to enable power-performance trade-offs at the architectural level. They proposed an architectural-level parameterized power model as a part of that effort. Two routers have been modeled in [12] using the model proposed in [3]. The authors of [8] introduce a framework to estimate the power consumption of switch fabrics in network routers. They propose different modeling methodologies for node switches, internal buffers and interconnection wires inside switch fabric architectures. A power model for the Nostrum NoC has been proposed in [15]. For this purpose an empirical power model of links and switches was formulated and validated with the Synopsys Power Compiler. An architectural power model for interconnection networks is proposed in [5].
In [2] the WK-Recursive and mesh topologies are compared in terms of power and latency. The authors also propose a novel approach to high-level power modeling based on latency for these topologies and show that the power consumption of the WK-Recursive topology is less than that of its equivalent mesh on a chip. In [9] power and performance have been studied for various NoC topologies, such as BFT, folded torus and mesh, and a guideline is proposed for selecting the best topology for a specific application. A high-level power analysis for on-chip networks is proposed in [4]. That analysis is based on link utilization as the unit of abstraction for network power, with contention among message flows modeled through propagation of overflow areas in link utilization functions. In [10] and [7] several routing algorithms are modeled in VHDL and compared in terms of power and performance using simulation.

Note that this work is based on the Orion model [3]. Its authors proposed models for most of the capacitances of a router, but they used simulation to obtain the switching activity values. As one part of our work, we calculate the switching activities analytically, in a limited and averaged situation and under some assumptions; as the other part, we use the results to analyze the effect of virtual channels on the EDP of a NoC. Although this model is derived for K-Ary N-Cubes, it can be used for other topologies by using the related performance model and changing a few parameters in the formulas accordingly.

# 2. Energy and performance measures

We use two measures to calculate the energy-delay product: the energy and the latency of a packet.

**2.1. Energy.** When flits travel on the interconnection network, both the inter-switch wires and the logic gates in the switches toggle, which results in energy dissipation. Here, we are concerned with the dynamic energy dissipation caused by the communication process in the network.
The flits from the source nodes need to traverse multiple hops, consisting of switches and wires, to reach the destination. Consequently the energy dissipated per flit per hop is given by equation 1.

$$E_{hop} = E_{switch} + E_{interconnect} \tag{1}$$

where $E_{switch}$ and $E_{interconnect}$ depend on the total capacitances and signal activities of the switch and of each section of interconnect wire, respectively. They are determined as follows:

$$E_{switch} = 0.5\alpha_{switch}C_{switch}V^2 \tag{2}$$

$$E_{interconnect} = 0.5\alpha_{interconnect}C_{interconnect}V^2 \tag{3}$$

Here $\alpha$ is a parameter between 0 and 1 that represents the switching activity, $C$ is the total switching capacitance and $V$ is the operating voltage. The energy dissipated in transferring a packet of $n$ flits over $h$ hops can be calculated as stated in equation 4.

$$E_{packet} = n \sum_{j=1}^{h} E_{hop,j} \tag{4}$$

**2.2. Latency.** Message latency is defined as the time (in clock cycles) that elapses between the injection of a message header into the network at the source node and the reception of the tail flit at the destination node. We simply refer to this as latency from here on. To reach the destination node from a source node, flits must travel through a path consisting of a set of switches and interconnects, called stages. Depending on the source/destination pair and the routing algorithm, each message may have a different latency. There is also some overhead at the source and destination that contributes to the overall latency. Therefore, for a given message $i$, the latency $L_i$ is:

$$L_i = \text{sender overhead} + \text{transport latency} + \text{receiver overhead} \tag{5}$$

We use the average latency as a performance metric. Let $P$ be the total number of messages reaching their destination; the average latency, $L_{avg}$, is then calculated according to equation 6.

$$L_{avg} = \frac{\sum_{i=1}^{P} L_i}{P} \tag{6}$$

FIGURE 1. A simple wormhole router modeled in Orion [3].

**2.3. 
EDP (Energy/delay product).** The EDP is obtained as the product of the latency and the energy.

$$EDP_{avg} = L_{avg} \cdot E_{avg} \tag{7}$$

In on-chip networks a low EDP is desired, since it indicates both low latency and low energy. These two parameters conflict, however: decreasing the energy tends to increase the latency.

# 3. Energy of a packet crossing a wormhole router [3]

Fig. 1 sketches the module representation of a wormhole router and its neighboring links. The source module injects a header flit into the write port of the input buffer module, and $E_{wrt}$ is dissipated. When the flit emerges at the head of the FIFO buffer, it is checked via the read port of the buffer module, its route is read, and a request is sent to the arbiter module for the desired output port. The arbiter performs the required arbitration, so $E_{arb}$ is dissipated. Assuming the request is granted, the arbitration result is sent to the config port of the crossbar module. A grant signal is also sent to the grant port of the input buffer, so the read port of the buffer is activated and $E_{read}$ is dissipated. The flit then traverses the crossbar module and dissipates $E_{xb}$. Finally, the flit leaves the router, enters and traverses the link, and dissipates $E_{link}$. The total energy this header flit consumes at this node and its outgoing link is given by equation 8.

$$E_{flit} = E_{wrt} + E_{arb} + E_{read} + E_{xb} + E_{link} \tag{8}$$

# 4. Proposed model to calculate the average packet energy

In this section we present the equations needed to calculate the average packet energy when a packet crosses a router and its outgoing link. In this model we assume a K-Ary n-Cubes topology, uniform traffic (each node sends packets to all other nodes with the same probability), random data in each packet such that the total number of 1s is almost equal to the number of 0s, and Duato's fully adaptive routing algorithm.
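As a small numerical illustration of equations 7 and 8, the following Python sketch sums the per-module energies of one flit at one hop and forms the energy-delay product from an average latency. The names `FlitEnergy`, `average_latency` and `energy_delay_product`, and all numeric inputs, are illustrative assumptions and are not values taken from the paper or from Orion.

```python
from dataclasses import dataclass


@dataclass
class FlitEnergy:
    """Per-module energies (in joules) of one header flit at one hop."""
    e_wrt: float   # input buffer write
    e_arb: float   # arbitration
    e_read: float  # input buffer read
    e_xb: float    # crossbar traversal
    e_link: float  # outgoing link traversal

    def total(self) -> float:
        """Equation 8: sum of the module energies."""
        return self.e_wrt + self.e_arb + self.e_read + self.e_xb + self.e_link


def average_latency(latencies):
    """Equation 6: mean latency (in cycles) over all delivered messages."""
    if not latencies:
        raise ValueError("at least one message latency is required")
    return sum(latencies) / len(latencies)


def energy_delay_product(avg_latency, avg_energy):
    """Equation 7: EDP as the product of average latency and average energy."""
    return avg_latency * avg_energy


if __name__ == "__main__":
    # Placeholder numbers chosen only to exercise the functions.
    flit = FlitEnergy(e_wrt=1.0, e_arb=2.0, e_read=3.0, e_xb=4.0, e_link=5.0)
    l_avg = average_latency([10, 20, 30])
    print(flit.total(), l_avg, energy_delay_product(l_avg, flit.total()))
```

Keeping the module energies as separate fields mirrors the structure of equation 8 and makes it easy to see which module dominates once real capacitance-based values are substituted.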
It should be noted that the energy for processing the routing algorithm is not covered and has been neglected. It is possible, however, to implement the desired routing algorithm in a hardware description language such as VHDL, obtain its average energy using power simulators (e.g. Power Compiler or XPower), and add it to the values derived from the model [7].

In the following equations $E_x = 0.5\alpha C_x V_{DD}^2$ (\*), where $x$ can be replaced by the desired module and $C_x$ is the total capacitance of that module, calculated as described in the appendix. We refer to this equation as \* in the rest of the paper. Note that to evaluate it we count each transition from 0 to 1 and from 1 to 0 to obtain the switching activity ($\alpha$).

The average energy dissipated when a packet crosses a switch (router) consists of the header-flit and non-header-flit energies. In wormhole switching only the header is arbitrated, and the other flits follow the header along the same route. Here we assume that the header size is one flit and the average packet size is $L_p$ flits. Let $\bar{E}_{packet\_hop}$ and $\bar{E}_{packet\_link}$ be the average energy of a packet dissipated in crossing a hop and a link, respectively. The packet energy is then given by equations 9 and 10.

$$\bar{E}_{packet} = \bar{D}\bar{E}_{packet\_hop} + (\bar{D} - 1)\bar{E}_{packet\_link} \tag{9}$$

$$\bar{E}_{packet\_hop} = (L_p - 1)\bar{E}_{body\_flit} + \bar{E}_{header\_flit} \tag{10}$$

$\bar{E}_{header\_flit}$ is the one-hop energy dissipation of a header flit, given in equation 11; the corresponding energy of a body flit is given in equation 12.

$$\bar{E}_{header\_flit} = \bar{E}_{write} + \bar{E}_{arb} + \bar{E}_{read} + \bar{E}_{xbar\_header} \tag{11}$$

$$\bar{E}_{body\_flit} = \bar{E}_{write} + \bar{E}_{read} + \bar{E}_{xbar\_body} \tag{12}$$

In the above equations $\bar{D}$ is the average distance from source to destination for a given packet. In K-Ary N-Cubes $\bar{D}$ is determined by equation 13 [11].
$$\bar{D} = N \frac{(k-1)}{2} \tag{13}$$

Let $W$ be the data width of a link, equal to the crossbar port bandwidth. With random data, the average number of bit flips on a link is therefore $\frac{W}{2}$. The link energy is then calculated according to equation 14.

$$\bar{E}_{packet\_link} = \frac{1}{2} \left(L_p C_{link\_init} \frac{W}{2}\right) V_{DD}^2 \tag{14}$$

Let $F$ be the flit size in bits; then the average read and write energies are given by equations 15 and 16.

$$\bar{E}_{read} = E_{wl} + F(E_{br} + 2E_{chg} + E_{amp}) \tag{15}$$

$$\bar{E}_{write} = E_{wl} + \frac{F}{2}(E_{bw} + E_{cell}) \tag{16}$$

where $E_{amp}$ is the sense amplifier energy, calculated from [6], and the remaining energies can be calculated using equation \* and the appendix.

A matrix crossbar switch is used as the switching element. The crossbar energy for a header traversal is the sum of the energies of the selected input and output lines and of the control of the switches that connect input lines to output lines. The switch configuration remains fixed until the end of the packet transfer; therefore, when non-header flits traverse the crossbar switch, the control energy is omitted and we have:

$$\bar{E}_{xbar\_header} = \bar{E}_{xb\_in} + \bar{E}_{xb\_out} + \bar{E}_{xb\_ctr} \tag{17}$$

$$\bar{E}_{xbar\_body} = \bar{E}_{xb\_in} + \bar{E}_{xb\_out} \tag{18}$$

$$\bar{E}_{xb\_in} = 0.5V_{DD}^{2}(C_{in\_sw}2NVW + C_{a}(T_{id}) + C_{Line\_unit}2NVWh_{t}) \tag{19}$$

$$\bar{E}_{xb\_out} = 0.5V_{DD}^2(C_{out\_sw}2NVW + C_a(T_{od}) + C_{Line\_unit}2NVWh_t) \tag{20}$$

$$\bar{E}_{xb\_ctr} = 0.5V_{DD}^2\left(WC_{ctr\_sw} + \frac{C_{Line\_unit}2NVWw_t}{2}\right) \tag{21}$$

In the above equations $V$ is the number of virtual channels per physical channel and $W$ is the bandwidth of each link. $C_{Line\_unit}$ is the capacitance per unit length of the crossbar lines. $h_t$ and $w_t$ are the vertical and horizontal line distances. $T_{id}$ and $T_{od}$ are the input and output drivers respectively, as shown in Table 3 of the appendix.
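Equations 9 to 14 can be chained into a short calculation. The Python sketch below is one way to do this, assuming the per-flit write, read, arbitration and crossbar energies have already been obtained separately. The function names and every numeric input in the usage lines are hypothetical placeholders, not values from the paper.

```python
def average_distance(n_dims, k):
    """Equation 13: average source-to-destination distance in hops."""
    return n_dims * (k - 1) / 2


def header_flit_energy(e_write, e_arb, e_read, e_xbar_header):
    """Equation 11: one-hop energy of the header flit."""
    return e_write + e_arb + e_read + e_xbar_header


def body_flit_energy(e_write, e_read, e_xbar_body):
    """Equation 12: one-hop energy of a body flit (no arbitration term)."""
    return e_write + e_read + e_xbar_body


def packet_hop_energy(l_p, e_header_flit, e_body_flit):
    """Equation 10: one header flit plus (l_p - 1) body flits."""
    return (l_p - 1) * e_body_flit + e_header_flit


def packet_link_energy(l_p, c_link, width, v_dd):
    """Equation 14: link energy with width/2 average bit flips per flit."""
    return 0.5 * (l_p * c_link * width / 2) * v_dd ** 2


def packet_energy(d_avg, e_packet_hop, e_packet_link):
    """Equation 9: d_avg hop crossings and (d_avg - 1) link crossings."""
    return d_avg * e_packet_hop + (d_avg - 1) * e_packet_link


if __name__ == "__main__":
    # Placeholder inputs in arbitrary units, chosen only for illustration.
    d_avg = average_distance(n_dims=2, k=8)
    e_head = header_flit_energy(1.0, 2.0, 3.0, 4.0)
    e_body = body_flit_energy(1.0, 3.0, 4.0)
    e_hop = packet_hop_energy(32, e_head, e_body)
    e_link = packet_link_energy(32, c_link=1.0, width=4, v_dd=2.0)
    print(d_avg, e_hop, e_link, packet_energy(d_avg, e_hop, e_link))
```

Splitting the calculation by equation keeps each term traceable to the model, so a different topology only requires replacing `average_distance`.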
Let $E_{req}$ be the energy of the header request signal for an outgoing link, $E_{pri}$ the energy to store grant priorities, $E_{int}$ the energy dissipated in internal nodes, $E_{clk}$ the flip-flop clocking energy and $E_{gnt}$ the grant signal energy of the arbiter. The arbitration energy is then given by equation 22 [3].

$$\bar{E}_{arb} = \bar{E}_{req} + \bar{E}_{pri} + \bar{E}_{int} + \bar{E}_{gnt} + \bar{E}_{clk} \tag{22}$$

With proper substitution of parameters, the average energy is calculated as described in equation 23:

$$\bar{E}_{arb} = \bar{E}_{req} + \frac{(2N-1)V - 1}{2}E_{pri} + \frac{(2N-1)V((2N-1)V - 1)}{2}E_{int} + E_{gnt} + \bar{E}_{clk} \tag{23}$$

If we assume there is no U-turn in a packet's path, there are in total $\frac{(2N-1)V((2N-1)V-1)}{2}$ flip-flops that store priorities in the arbiter, and they are clocked each time the arbiter is clocked. The average energy of clocking the arbiter once is therefore calculated as described in equation 24.

$$E_{clk} = \frac{1}{2} \left( \frac{(2N-1)V((2N-1)V-1)}{4} C_{FF\_clock} \right) V_{DD}^2 \tag{24}$$

In equation 24 $C_{FF\_clock}$ is the flip-flop clock capacitance. This energy relates to more than one packet existing in that hop. The average clocking energy dissipated per packet is obtained by dividing the total clock energy needed for all packets to reach their destinations by the total number of packets. Let $\bar{N}_{clk}$ be the average packet latency, $\lambda_g$ the packet generation rate per node per cycle, and $N_n$ and $N_p$ the total number of nodes ($K^N$) and the total number of packets generated, respectively. $N_n$ packets reach their destinations after $\bar{N}_{clk}$ clocks and, similarly, $2N_n$ packets after $\bar{N}_{clk} + \lambda_g$ clocks. Finally, $N_p$ packets reach their destinations after the number of clocks given in equation 25.

$$\left(\frac{N_p}{N_n} - 1\right)\lambda_g + \bar{N}_{clk} \tag{25}$$

Thus the total clock energy dissipated for all packets to reach their destinations is given by equation 26.
$$E_{clktotal} = \left[ \left( \frac{N_p}{N_n} - 1 \right) \lambda_g + \bar{N}_{clk} \right] E_{clk} \tag{26}$$

The share of a single packet is given by equation 27.

$$\bar{E}_{clk} = \left[ \left( \frac{N_p}{N_n} - 1 \right) \lambda_g + \bar{N}_{clk} \right] \frac{E_{clk}}{N_p} \tag{27}$$

The remaining energies can be calculated using equation \* and the appendix equations. Let $B_i$ be the average blocking time at the $i^{th}$ hop and $W_{ej}$ the average waiting time at the destination for ejection from the network; the average packet latency is then calculated according to the model presented in [11], as described in equation 28.

$$\bar{N}_{clk} = L_p + \bar{D} + \sum_{i=1}^{\bar{D}} B_i + W_{ej} \tag{28}$$

Note that $B_i$ and $W_{ej}$ are calculated using [11] and [14].

### 5. Simulation

A cycle-accurate simulator is implemented in VHDL. An $8 \times 8$ mesh is used as an instance of the K-Ary N-Cubes topology, with 64 processing elements as the IPs. In total 10000 packets of 32 flits each are generated; each packet has one header flit and 31 body flits, and all packets contain random data. Uniform traffic is assumed for destination addresses, and Duato's fully adaptive routing algorithm is implemented. For energy calculation we use the Orion power model [3]. No power-reduction coding is used, and it is assumed that no repeater is needed between two nodes.

## 6. Validation and experimental results

The above model has been compared with and validated against simulation results. In these experiments the supply voltage is 2.5 V and the technology is $0.25\mu m$. The number of virtual channels has been varied to see its effect on the power consumption and performance of the network. (Note that in the remaining sections we may use message and packet interchangeably.) Fig. 2 shows the average message EDP of the network for several numbers of virtual channels under various traffic loads. As shown, the EDP increases as the traffic load increases.
On the other hand, increasing the number of virtual channels (VCs) results in increasing EDP. Networks with more VCs saturate later than networks with fewer VCs. Fig. 3 shows the effect of VCs on the average message latency. As shown in the figure, under high traffic loads more VCs are beneficial, but under low traffic loads they just increase the latency. In all cases, more than 7 virtual channels do not result in any improvement.

FIGURE 2. Average Message EDP in several traffic loads and number of VCs.

The average message energy increases with the number of virtual channels, as shown in Fig. 4. This figure also shows that simulation and model have the same profile and that their results are close for small numbers of VCs, but diverge as the number of VCs increases. In the worst case (VC=20) the accuracy is 83%, which is acceptable.

EDP contains both latency and energy information. Fig. 5 shows the average message EDP calculated by the model for two different message generation rates, together with the simulation results. The results show that under high traffic load ($\lambda=0.008$) a larger number of virtual channels yields an improvement (up to a limit of VC=7), but brings no benefit under low traffic load ($\lambda=0.004$). Comparing Fig. 3 and Fig. 5, latency and EDP appear to behave similarly, so the behavior of one can be inferred from the other. High-level EDP profile analysis therefore becomes easy and fast.

Fig. 6 shows the effect of VCs on each part of the NoC individually. As the relationships above show, the number of VCs does not affect the FIFO energy as much as the arbiter energy, which grows with the square of the number of VCs. The FIFO energy has therefore been omitted here. As the number of VCs increases, the Xbar energy changes almost linearly. For small numbers of VCs the arbiter energy is less than the Xbar energy, but with more than 9 virtual channels the arbiter energy becomes dominant over the Xbar energy.
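The different growth rates can be made concrete with the counts that appear in the model. The Python sketch below evaluates the number of priority flip-flops and the clock energy of equation 24, the per-packet clock share of equation 27, and the $2NVW$ factor of equations 19 and 20. The function names and sample arguments are assumptions for illustration only, and no capacitance values from the paper are used.

```python
def arbiter_factor(n_dims, vcs):
    """The factor (2N-1)V that appears in equations 23 and 24."""
    return (2 * n_dims - 1) * vcs


def priority_flip_flops(n_dims, vcs):
    """Number of priority flip-flops in the arbiter: r(r-1)/2 with r = (2N-1)V."""
    r = arbiter_factor(n_dims, vcs)
    return r * (r - 1) // 2


def arbiter_clock_energy(n_dims, vcs, c_ff_clock, v_dd):
    """Equation 24: energy of clocking the arbiter once."""
    r = arbiter_factor(n_dims, vcs)
    return 0.5 * (r * (r - 1) / 4) * c_ff_clock * v_dd ** 2


def clock_energy_per_packet(n_packets, n_nodes, lam_g, n_clk_avg, e_clk):
    """Equation 27: total clock energy (equation 26) divided by the packet count."""
    cycles = (n_packets / n_nodes - 1) * lam_g + n_clk_avg
    return cycles * e_clk / n_packets


def crossbar_line_factor(n_dims, vcs, width):
    """The 2NVW factor that scales the crossbar terms in equations 19 and 20."""
    return 2 * n_dims * vcs * width


if __name__ == "__main__":
    # Two-dimensional network with a hypothetical 32-bit link width.
    for vcs in (1, 2, 4, 8):
        print(vcs, priority_flip_flops(2, vcs), crossbar_line_factor(2, vcs, 32))
```

In this sketch, doubling the number of virtual channels doubles the crossbar factor while the flip-flop count grows roughly fourfold, which is the quadratic-versus-linear behavior described for the arbiter and the Xbar.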
Clock energy here is the internal router clock energy and is a part of the arbiter energy. It is dissipated by clocking the priority flip-flops in the arbiter. The inter-router clock signal energy is not shown here.

FIGURE 3. Average Message Latency in three different loads vs. number of VCs.

FIGURE 4. Average Message Energy vs. number of VCs using simulation and model.

FIGURE 5. Average Message EDP with 2 different loads vs. number of VCs.

FIGURE 6. The effect of number of VCs on Xbar and Arbiter Energy.

## 7. Conclusion

In this paper we proposed a model to calculate the average packet energy in K-Ary n-Cubes. This model can be used to avoid time-consuming and complex simulations. The model also shows the relation between network parameters (e.g. the number of virtual channels) and packet energy dissipation. In addition, the effect of the number of virtual channels was studied using both the model and simulation, and the results were compared. It was shown that increasing the number of virtual channels improves the latency under high traffic loads but, on the other hand, increases the packet energy. For lower traffic loads there is no improvement in either latency or energy. It was shown that, except under high traffic loads, increasing the number of virtual channels has undesirable effects on EDP. In addition, it was shown that the EDP profile is very similar to the latency profile, but with a higher slope. Finally, the effect of the number of virtual channels on each part of the router energy was studied. The results show that increasing the number of virtual channels does not affect the FIFO energy, unlike the arbiter energy. Using the proposed model, the limiting number of virtual channels, beyond which increasing the number brings no benefit in the EDP graphs, can be calculated for different situations. In our analysis this threshold was 7 virtual channels per physical channel.

#### REFERENCES

1. W. J. Dally, B. Towles, Route Packets, Not Wires: On-Chip Interconnection Networks, DAC, 2001.
2. D. Rahmati, A. Kiasari, S. Hessabi, H. Sarbazi-Azad, A Performance and Power Analysis of WK-Recursive and Mesh Networks for Network-on-Chips, IEEE Int'l Conference in Computer Design, ICCD Proceedings, 2006.
3. H.-S. Wang, X. Zhu, L.-S. Peh, S. Malik, Orion: A Power Performance Simulator for Interconnection Networks, in Proc. Micro 35, Nov. 2002.
4. N. Eisley, L. Peh, High-Level Power Analysis for On-Chip Networks, CASES, Sep. 2004.
5. X. Chen, L.-S. Peh, Leakage Power Modeling and Optimization in Interconnection Networks, Symp. on Low Power Electronics and Design, 2003, 90-95.
6. V. Zyuban, P. Kogge, The Energy Complexity of Register Files, in Proc. International Symposium on Low Power Electronics and Design, 1998.
7. M. Nadi, M. Hosein Ghadiry, M. T. Manzuri Shalmani, D. Rahmati, Effect of Number of Faults on NoC Power and Performance, ICPADS, IEEE, 2007.
8. T. T. Ye, L. Benini, G. D. Micheli, Analysis of Power Consumption on Switch Fabrics in Network Routers, DAC, ACM, New Orleans, Louisiana, USA, 2002.
9. P. P. Pande, C. Grecu, M. Jones, A. Ivanov, R. Saleh, Performance and Design Trade-Offs for Network-on-Chip Interconnect Architectures, IEEE Transactions on Computers, IEEE Computer Society, Vol. 54, No. 8, 2005.
10. M. Nadi, M. Hosein Ghadiry, M. T. Manzuri Shalmani, Power and Performance Comparison of Routing Algorithms on NoC, in Proc. ICEE, 2007.
11. M. Ould-Khaoua, A Performance Model for Duato's Fully Adaptive Routing Algorithm in k-Ary n-Cubes, IEEE Transaction on Computer Design, Vol. 48, No. 12, December 1999.
12. H.-S. Wang, L.-S. Peh, S. Malik, A Power Model for Routers: Modeling Alpha 21364, IEEE Micro, 2003.
13. W. J. Dally, C. L. Seitz, Deadlock-Free Message Routing in Multi-computer Interconnection Networks, IEEE Trans. Computers, Vol. 36, No. 5, May 1987, 547-553.
14. N. Alzeidi, M. Ould-Khaoua, A. Khonsari, A Queuing Model for Wormhole Routing with Finite Buffers, UKPEW, 2004.
15. S. Penolazzi, A. Jantsch, A High Level Power Model for the Nostrum NoC, EUROMICRO, 2006.

#### Appendix

Table 1. Parameters and equations of a FIFO buffer with 1 read port and 1 write port

| Group | Symbol or equation | Description |
|---|---|---|
| Canonical structure and notation | | A FIFO buffer with 1 read port and 1 write port ($B$ rows, sense amp) |
| Architectural parameters | $B$ | Buffer size in flits |
| | $F$ | Flit size in bits |
| Technological parameters | $h_{cell}$ | Memory cell height |
| | $w_{cell}$ | Memory cell width |
| | $d_w$ | Wire spacing |
| Equations | $L_{wl} = F(w_{cell} + 4d_w)$ | Wordline length |
| | $L_{bl} = B(h_{cell} + 2d_w)$ | Bitline length |
| | $C_{wl} = 2FC_g(T_n) + C_d(T_{wd}) + C_w(L_{bl})$ | Wordline cap. |
| | $C_{br} = BC_d(T_p) + C_d(T_c) + C_w(L_{bl})$ | Read bitline cap. |
| | $C_{bw} = BC_d(T_p) + C_a(T_{bd}) + C_w(L_{bl})$ | Write bitline cap. |
| | $C_{chg} = C_g(T_c)$ | Precharge cap. |
| | $C_{cell} = 2(P_r + P_w)C_d(T_p) + 2C_a(T_m)$ | Memory cell cap. |

Table 2. Some equations in context

| Symbol | Description |
|---|---|
| $C_g(T_x)$ | Gate cap. of transistor $T_x$ |
| $C_d(T_x)$ | Diffusion cap. of transistor $T_x$ |
| $C_a(T_x) = C_d(T_x) + C_g(T_x)$ | Diffusion cap. plus gate cap. of transistor $T_x$ |
| $C_w(L) = C_{Line\_unit}L$ | Wire cap. |
| $C_{Line\_unit}$ | Cap. per unit length of wire |

Table 3. Some parameters and equations of Matrix crossbars Model [3]

Table 4. Parameters and equations of Arbiter [3]

Mahdieh Nadi Senejani received her BS and MS from Islamic Azad University of Arak under the direction of Dr. M.T. Manzuri. Since 2008 she has been at the Islamic Azad University of Ashtian.
Her research interests focus on the analysis of power and performance of Systems on Chip and Networks on Chip.

Department of Computer, Azad University, Arak branch. e-mail: nadi\_mahdieh@yahoo.com

Mahdiar Hossein Ghadiry received his BS and MS from Islamic Azad University of Arak under the direction of Dr. M.T. Manzuri. Since 2008 he has been at the Islamic Azad University of Arak. His research interests focus on the analysis of power and performance of Systems on Chip and Networks on Chip.

Department of Computer, Azad University, Arak branch. e-mail: m.ghadiry@yahoo.com

Mohamad Khalily Dermany received his BS from Islamic Azad University, Arak branch, and his MS from Amirkabir University under the direction of Mehdi Sedighi. Since 2008 he has been at the Islamic Azad University of Khomein. His research interests focus on the performance of Systems on Chip and Networks on Chip and related modules.

Department of Computer, Azad University, Khomein branch. e-mail: mkhalili@iaukhomein.ac.ir