1. Introduction
Internet traffic has increased rapidly over the past several decades and has demanded more powerful forwarding engines. IP address lookup is one of the key functions required for forwarding packets in a router: it determines the nexthop information for a given destination IP address. IP address lookup is computationally expensive because it requires Longest Prefix Matching (LPM) due to Classless Inter-Domain Routing (CIDR) [1]. The growth of routing information also demands a high-performance IP lookup engine. Recent routing tables contain more than 600,000 network prefixes.
Many IP address lookup schemes have been studied, and they generally fall into one of two categories: trie-based and TCAM-based. A trie-based approach often requires multiple accesses to the trie structure, so most research has focused on reducing the number of memory accesses [2]. The multibit trie is one of the basic enhancements for reducing memory accesses. In a multibit trie, a prefix can be expanded to a longer node using a technique called prefix expansion [3]. Multibit trie-based approaches search with a long stride, unlike a binary trie (or unibit trie).
Ternary Content-Addressable Memory (TCAM) is a specialized memory that searches all entries in parallel in a single cycle. Each cell can represent * (don't care) as well as 0 or 1, so it is suitable for storing prefixes. It can search for the longest matching prefix if priority-based selection logic is provided. However, its size is restricted because it consumes more power and has longer latency than SRAM. The power consumption can be reduced by routing table compaction and partitioning [4]. Furthermore, hybrid TCAM-SRAM architectures have been proposed to exploit the advantages of both [5][6].
Update overhead is one of the major issues in a high-performance IP lookup engine. The update time may affect lookup speed because an update consumes computational resources and may lock the lookup operation while it is in progress. The number of memory entries needed to store a single prefix affects the update time. A longer stride in a multibit trie enables faster lookup, but it lengthens the update time because more entries are often associated with a single prefix.
In this paper, we propose a novel IP address lookup architecture that focuses on memory efficiency and fast updatability. The architecture uses both TCAM and SRAM to store a multibit trie-based lookup structure. Unlike the conventional multibit trie, our scheme performs lookups with at most two memory accesses. The size of each subtrie is controlled to satisfy memory efficiency and updatability criteria. An elaborate technique is also devised to control the number of TCAM accesses incurred by a single update.
The rest of this paper is organized as follows. Section 2 reviews related works, and Section 3 explains conventional IP lookup methods employing the multibit trie and presents our motivation. Section 4 describes our proposed architecture with a lookup scheme, a construction algorithm for the Indexed Multibit Trie (IMT), and an update algorithm. Experimental results are presented and discussed in Section 5. Lastly, Section 6 concludes the paper.
2. Related Works
There has been much research on the power consumption of TCAM [4][7-9]. In CoolCAMs, the routing table is partitioned into sub-tables (TCAM buckets) to reduce power consumption [7]. Lu and Sahni enhanced CoolCAMs and proposed several architectures based on the use of wide SRAM [8]. Reducing the number of entries in a TCAM-based routing table is crucial in terms of power consumption, cost, and lookup delay. Liu suggested some techniques to reduce the number of entries [4].
Several TCAM-based IP lookup engines have been proposed. Akhbarizadeh et al. proposed a TCAM-based IP forwarding engine using a prefix segregation scheme [10]. In that scheme, prefixes in the routing table are partitioned into two groups: one consists of non-overlapping prefixes, and the other covers the remaining prefixes. The two groups are stored in two separate TCAMs. The TCAM containing non-overlapping prefixes needs neither a priority encoder nor sorting of the prefixes by length, so it achieves good lookup performance and fast updates. Akhbarizadeh et al. also partitioned prefixes and stored them in several TCAM blocks [11]. In this scheme, multiple selectors deliver incoming traffic to those blocks in parallel. It exploits popular prefixes stored in early-stage memory to reduce contention in the TCAM blocks and enhance lookup performance.
In TCAM-based lookup schemes, all entries should be sorted to guarantee longest prefix matching, which makes incremental updates difficult. Shah and Gupta proposed a novel prefix ordering scheme, chain-ancestor order, which reduces the number of memory movements on updates [12].
Luo et al. proposed a hybrid architecture which consists of a TCAM-based lookup engine and an SRAM-based pipelined engine [5]. In that architecture, the FIB is partitioned into two sets. Disjoint leaf prefixes are mapped into the TCAM-based engine and the other, overlapping prefixes are mapped into the SRAM-based engine to increase throughput and reduce the update overhead. Kim et al. proposed a hybrid architecture based on SRAM and TCAM [6]. In that architecture, 16-bit indices are stored in SRAM while prefixes longer than 16 bits are distributed over TCAM blocks. Since a TCAM block is selectively activated by an index, high throughput is achieved with low power. Prefixes are evenly distributed over the TCAM blocks by an elaborate technique, so the size of the TCAM is minimized in terms of both entries and width.
Several trie-based or multibit trie-based schemes have been proposed [3][13-20]. Srinivasan and Varghese proposed leaf pushing and the multibit trie [3]. Le et al. devised an SRAM-based bidirectional pipeline for a high-throughput IP lookup engine [13]. In that scheme, a leaf-pushed unibit trie is used for IP address lookup. They tried to optimize memory balancing, which is crucial in a pipelined architecture. Basu and Narlikar designed an SRAM-based pipeline to process IP lookup [14]. In the pipeline, a leaf-pushed trie is allocated to the stages, and balancing the allocation is crucial. The write bubble technique was proposed to handle updates in the pipeline efficiently. Lee and Lim proposed a multi-stride decision trie which satisfies some characteristics [15]. Wu et al. proposed a pipelined IP lookup scheme based on a prefix trie [16]. In that scheme, prefixes are classified into short and long groups. The short group is contained in a quick table and the long group is indexed by the quick table.
Chu et al. proposed a GPU-based parallel architecture [17]. A novel variable-stride multibit trie-based data structure was devised to reduce search steps and to optimize memory accesses on the GPU. Li et al. devised a fixed-stride multibit trie to compact the data structure and fully utilize the GPU's memory access pattern [18]. In that scheme, each unit is either a nexthop or a pointer to a child by means of leaf-pushing, i.e., it is a disjoint multibit trie. Li et al. used a multibit trie-based table and pursued efficient use of the on-chip memory containing the lookup table by splitting the table [21]. However, for off-chip access, they have to use hashing, which needs parallel memory accesses or sparse slots to avoid conflicts. It is assumed that each traversing step in the multibit trie is pipelined and the split table is accessed in parallel.
3. Conventional Multibit Tries for IP lookup
This section describes in detail the background on the multibit trie as previously developed. A trie is a kind of search tree structure used to find the Longest Matching Prefix (LMP) for a destination IP address. A binary trie (or unibit trie) is the most fundamental trie, whose degree is two [2]. A node in a binary trie is represented by a bit string which indicates the path from the root to that node. Fig. 1(a) shows an instance of a binary trie in which prefixes are represented by filled circles (a ~ h). The LMP is the last prefix visited when the trie is traversed from the root node. For example, given a destination address 0100100 (7 bits long for simplicity), the LMP is the prefix h, though the prefix a also matches. In a binary trie, the IP lookup time is determined by the number of accessed nodes. The number of memory accesses for each prefix is shown in Fig. 1(b). In the worst case, it equals the trie depth, which is 32 in IPv4 and 128 in IPv6.
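The binary-trie search described above can be sketched in Python as follows. This is a minimal illustration rather than the implementation used in this paper: the names `Node`, `insert`, and `lookup` are hypothetical, and the prefix set is made up for the example because the prefixes of Fig. 1 are not reproduced here. The function also returns the number of visited nodes (counting the root), which is the cost measure discussed above.

```python
class Node:
    """One binary-trie node; nexthop stays None unless the node is a prefix."""
    def __init__(self):
        self.child = {"0": None, "1": None}
        self.nexthop = None

def insert(root, prefix, nexthop):
    """Create the path for a prefix (bit string) and mark its last node."""
    node = root
    for bit in prefix:
        if node.child[bit] is None:
            node.child[bit] = Node()
        node = node.child[bit]
    node.nexthop = nexthop

def lookup(root, address):
    """Return (nexthop of the longest matching prefix, nodes visited)."""
    node, best, visited = root, root.nexthop, 1
    for bit in address:
        node = node.child[bit]
        if node is None:
            break
        visited += 1
        if node.nexthop is not None:
            best = node.nexthop  # remember the last prefix seen so far
    return best, visited

# Hypothetical prefixes, chosen only for illustration.
root = Node()
for prefix, hop in [("0", "A"), ("0100100", "H"), ("11", "C")]:
    insert(root, prefix, hop)

print(lookup(root, "0100100"))
print(lookup(root, "0111111"))
```

In this made-up table, the address 0100100 matches both the short prefix 0 and the 7-bit prefix, and the longer one is returned after every node on the path has been visited.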
Fig. 1. An example of prefixes and the corresponding memory accesses
The number of memory accesses can be reduced if several bits are checked in a node. In a multibit trie, several bits can be checked at once in a node [3]. The number of bits checked per node is called the stride. Fig. 2 shows a multibit trie corresponding to Fig. 1(a). In the figure, each of S1 ~ S4 represents a multibit trie node, which we call a subtrie of the multibit trie. Fig. 2(a) depicts a conceptual view in which each subtrie range is projected onto the binary trie. Fig. 2(b) shows the actual memory state for the multibit trie. In that figure, each entry consists of the nexthop associated with a prefix and a child pointer. In Fig. 2(c), prefix a is expanded to leaves by a technique called leaf-pushing [3]. In a leaf-pushed multibit trie, all the prefixes are located at leaves and matching always occurs at leaves. Note that in a leaf-pushed multibit trie each entry contains either the nexthop or a child pointer, but not both. So a leaf-pushed multibit trie requires less memory space.
Compared to a binary trie, a multibit trie requires fewer memory accesses for each lookup. In Fig. 1(b), the average number of memory accesses of the multibit trie is 1.88 while that of the binary trie is 4.63. However, the multibit trie has some disadvantages. First, it still requires several memory accesses, and the number can be considerably high depending on the depth of the multibit trie. In Fig. 2, at least three memory accesses are required to match the prefix g or h. Second, while a longer stride decreases the number of memory accesses, it causes poor memory efficiency. In Fig. 2(b), only one of the 8 entries in S2 contains a prefix when the stride is 3. Also, a long stride produces too many entries per prefix, so updatability may suffer. Third, leaf-pushing causes many memory accesses for a single prefix update. For example, in Fig. 2(c), when the prefix a is updated, all 7 leaf-pushed entries have to be accessed, even those in different subtries.
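The prefix expansion discussed above, and the update cost it implies, can be illustrated with a short Python sketch. It is a simplified, single-subtrie illustration under stated assumptions: the function name `expand` and the three prefixes are hypothetical and do not correspond to Fig. 2.

```python
def expand(prefixes, stride):
    """Expand prefixes (bit string -> nexthop) of length <= stride
    into a direct-indexed table with 2**stride entries."""
    table = [None] * (1 << stride)
    # Write shorter prefixes first so that longer ones overwrite them.
    for prefix in sorted(prefixes, key=len):
        free = stride - len(prefix)
        base = (int(prefix, 2) << free) if prefix else 0
        for offset in range(1 << free):
            table[base + offset] = prefixes[prefix]
    return table

# Hypothetical prefixes inside one subtrie of stride 3.
table = expand({"0": "A", "01": "B", "110": "C"}, 3)
print(table)
```

Here the 1-bit prefix is written to four entries before two of them are overwritten by the 2-bit prefix, which shows why a longer stride ties more entries to a single prefix and why half of the table can remain unused or duplicated.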
Fig. 2. Two types of multibit tries
4. IP Address Lookup Using an Indexed Multibit Trie
4.1 Overall Scheme
The prefix expansion in multibit tries considerably reduces the number of steps needed to reach the longest matching prefix compared to binary tries. However, all the subtries on the traversed path have to be visited, so the matching process still takes considerable time if the trie is deep. We propose a new scheme that achieves fast lookup by visiting only the last matching subtrie, directly through an index, without visiting intermediate subtries.
Fig. 3 shows an Indexed Multibit Trie (IMT) constructed by our scheme. The construction algorithm will be described in Section 4.4. Unlike in Fig. 2, each subtrie in the multibit trie of Fig. 3 is independent and directly accessible without going through the intermediate subtries. All subtries are stored in SRAM and each entry is accessed using direct addressing, as in conventional multibit tries. To identify a subtrie, we use the subtrie index, which is the root of that subtrie. Since the lengths of subtrie roots vary, we exploit TCAM to store and search those values. Fig. 3(b) shows the subtrie indexes in TCAM and all the subtries in SRAM. IP address lookup can be performed very fast irrespective of the length of the matched prefix, since it is always completed by one access to TCAM and one to SRAM. Let us consider a straightforward example. Given an input IP address 0100110 (a 7-bit address for simplicity), the longest matching prefix g is found in two steps without going through the subtries S1 and S2. First, the TCAM is searched with 0100110, and the fourth entry is returned as the longest matching entry, though the first and the third entries also match. Since the length of S4's root is 5, the 5 most significant bits of 0100110 are masked out. Then, the next 2 bits are used as the offset because the stride is 2. Finally, the SRAM is accessed using the subtrie pointer S4 and the offset 2 (= \(10_2\)).
Each SRAM entry has only a Next Hop Identifier (NHI) field; a field for the pointer to a child subtrie is not required because the intermediate subtries along the path need not actually be traversed. So the required SRAM capacity is about half that of other multibit tries that do not use leaf-pushing. Note that some null entries remain in SRAM. In Fig. 3(b), there are four null entries, A ~ D, in the SRAM. In our scheme, the root prefix of a subtrie is not expanded because the corresponding NHI is stored in the TCAM entry. Accordingly, the prefix k is not expanded. When the entry A or B is accessed, the final result is in TCAM. The entry C cannot be accessed because the longer root (S4) will be matched instead. For the entry D, the final result should be the NHI of prefix a. To obtain that result, the prefix a would have to be expanded to the entry D by leaf-pushing. Such expansion incurs a large number of updates when the original prefix is updated. In the next section, we describe how to find the correct matching result while avoiding leaf-pushing and its large update overhead.
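The two-step lookup in the example above can be sketched as follows. This is an illustrative software emulation, not the hardware design: the function names are hypothetical, the TCAM is modeled as a plain list, and the entries named `Sx` and `Sy` are placeholders. Only S4's root length (5) and stride (2) are taken from the text; its root bits are the first five bits of the example address.

```python
def tcam_search(tcam, address):
    """Emulate a TCAM with priority logic: the longest matching root wins."""
    matches = [e for e in tcam if address.startswith(e["root"])]
    return max(matches, key=lambda e: len(e["root"]), default=None)

def sram_offset(address, root_len, stride):
    """Skip the root_len leading bits, then read the next stride bits."""
    return int(address[root_len:root_len + stride], 2)

tcam = [
    {"root": "0",     "stride": 2, "subtrie": "Sx"},  # placeholder entry
    {"root": "011",   "stride": 1, "subtrie": "Sy"},  # placeholder entry
    {"root": "01001", "stride": 2, "subtrie": "S4"},  # length 5, stride 2
]

hit = tcam_search(tcam, "0100110")
offset = sram_offset("0100110", len(hit["root"]), hit["stride"])
print(hit["subtrie"], offset)
```

For the address 0100110 the sketch selects S4 and computes the offset from the two bits that follow the 5-bit root, which is the value 2 used in the example.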
Fig. 3. Overall lookup scheme
4.2 Organization
We propose an architecture to support the Indexed Multibit Trie scheme, in which each subtrie of a multibit trie is indexed and accessed using that index. In that architecture, TCAM contains the indexes to subtries and the subtries themselves are contained in SRAM. For more efficient use of the multibit trie, we split the TCAM into two parts, pTCAM and nTCAM, as shown in Fig. 4. All subtries are divided into two groups. In one group, every subtrie root is a prefix, whereas in the other group all the subtrie roots are non-prefix nodes. pTCAM and nTCAM accommodate the indexes to the former group and the latter group, respectively. For a given input IP address, the two TCAM parts search for their longest matching results separately and in parallel.
In Fig. 4, the selection logic chooses the index result (③) from the two results of the TCAM parts (① and ②). The selection logic is straightforward to design: it decides by the length of the subtrie roots, selecting the longer (more specific) subtrie root. If leng of ② is greater than that of ①, the selection logic chooses ②. In that case, the NHI field of ③ is filled with that of ①. The SRAM-based search engine locates an entry using leng, ptr, stride, and the given IP address, and obtains the NHI from that SRAM entry. However, if the accessed entry is null, the final result is the NHI carried in ③.
Let us consider the example in Fig. 5. We do not use leaf-pushing in our scheme. When the entry D of SRAM in Fig. 3 is accessed, a null entry is found even though the result should be the nexthop for prefix a. To resolve this problem, prefix expansion is partially applied to such a prefix in our scheme. To avoid heavy updates, the expansion is limited to the immediate descendant subtrie roots, which is only the root of subtrie S2 in this example. We call this subtrie-pushing instead of leaf-pushing because the prefix is expanded not to leaves but to subtrie roots.
Consequently, S1 and S2 have prefixes in their roots while S3 and S4 do not. So the indexes for the former are contained in pTCAM, whereas those for the latter are contained in nTCAM. Note that the prefixes in subtrie roots are not expanded in SRAM because the corresponding NHIs are already in pTCAM. Consider the null entries A ~ D in SRAM. Whenever a null entry is met in SRAM, the lookup refers to the NHI delivered from pTCAM. Let us consider an input IP address, 0100101. The longest matching results for pTCAM and nTCAM are < n1, 4, S2, 1 > and < φ, 5, S4, 2 >, respectively. Since leng (= 5) in nTCAM's result is larger, < n1, 5, S4, 2 > is delivered to the SRAM-based search engine. Note that n1 comes from pTCAM's result. Now, the SRAM-based search engine accesses the entry D in S4 using the last two bits, 01, of the IP address 0100101. The entry D is null; however, the lookup engine can determine the final matching result because it already has n1 as the NHI, which comes from pTCAM. When the prefix a is updated, only the roots of immediate descendant subtries, such as S2, need to be accessed. So the subtrie-pushing technique can save many of the memory accesses incurred by a prefix update.
Fig. 5. An example for IMT-based lookup engine
4.3 Lookup Algorithm
Fig. 6 shows the procedure the lookup engine performs whenever an IP address is given. First, it searches both TCAMs in parallel. Then, it determines which of the two results has the longer subtrie root. The SRAM is accessed using the longer result, and the NHI it returns is the final result. If the result of the SRAM is NULL, the NHI from pTCAM is the final result.
Fig. 6. Lookup algorithm
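The lookup procedure can be sketched in Python as follows. This is a minimal software model under stated assumptions, not the algorithm of Fig. 6 verbatim: the names `Index`, `longest`, and `lookup` are hypothetical, the two TCAMs are modeled as lists searched one after the other rather than in parallel, and only the part of Fig. 5 that the text spells out (S2 and S4) is included. The first SRAM entry of S2 (`"nx"`) is a placeholder, and the layout of S4 is inferred from the examples in Sections 3 and 4.

```python
from collections import namedtuple

# One TCAM entry <NHI, leng, ptr, stride>; leng is len(root).
Index = namedtuple("Index", "root nhi ptr stride")

def longest(tcam, addr):
    """Longest matching entry of one TCAM part, or None."""
    hits = [e for e in tcam if addr.startswith(e.root)]
    return max(hits, key=lambda e: len(e.root), default=None)

def lookup(ptcam, ntcam, sram, addr):
    p = longest(ptcam, addr)  # result (1): root is a prefix, carries an NHI
    n = longest(ntcam, addr)  # result (2): root is a non-prefix node
    if p is None and n is None:
        return None
    fallback = p.nhi if p is not None else None
    # Selection logic: the longer (more specific) subtrie root wins.
    if p is None or (n is not None and len(n.root) > len(p.root)):
        chosen = n
    else:
        chosen = p
    leng = len(chosen.root)
    offset = int(addr[leng:leng + chosen.stride], 2)
    entry = sram[chosen.ptr][offset]
    # A null SRAM entry falls back to the NHI delivered from pTCAM.
    return entry if entry is not None else fallback

ptcam = [Index("0100", "n1", "S2", 1)]    # <n1, 4, S2, 1>
ntcam = [Index("01001", None, "S4", 2)]   # <phi, 5, S4, 2>
sram = {
    "S2": ["nx", None],          # "nx" is a placeholder; second entry is eclipsed by S4
    "S4": ["h", None, "g", "g"],  # the null entry at offset 01 is entry D
}

print(lookup(ptcam, ntcam, sram, "0100101"))
```

For the address 0100101 the sketch selects the nTCAM result, reads the null entry D, and returns n1, following the example in Section 4.2.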
4.4 Construction of an Indexed Multibit Trie
Since a multibit trie is constituted by a set of subtries, we first obtain the subtries from a binary trie to construct an IMT. When constructing subtries, we use some metrics to ensure that each constructed subtrie is memory-efficient and fast updatable.
♦ Memory efficiency of a subtrie
The SRAM efficiency of a subtrie can be measured by
\(E_{SRAM} = n_{pref} / n_{ent} = n_{pref} / 2^{stride}\) (1)
where \(n_{pref}\) is the number of prefixes and \(n_{ent}\) is the number of expanded entries in the subtrie. When \(E_{SRAM}\) is calculated, \(n_{pref}\) excludes the root prefix of a subtrie because it is not actually stored in SRAM.
If the subtrie is overlapped by other descendant subtries, the SRAM efficiency will be
\(E_{SRAM} = n_{pref} / \left(2^{stride} - \Delta_{ent}\right)\) (2)
where \(\Delta_{ent}\) is the total number of eclipsed entries in the subtrie. We allow subtries to overlap, which makes it possible to construct larger subtries efficiently. For example, in Fig. 5, S1 and S3 overlap, and \(\Delta_{ent}\) is 1. Also, S2 and S4 overlap, and \(\Delta_{ent}\) is 1.
The TCAM efficiency of a subtrie is
\(E_{TCAM} = 1 - 1 / n_{pref}\) (3)
Both SRAM and TCAM efficiency increase as \(n_{pref}\) increases. However, \(E_{SRAM}\) tends to decrease as the stride increases. \(E_{TCAM}\) is defined to be 0 when \(n_{pref}\) is 1, since every subtrie has at least one prefix. In Fig. 5, \(E_{SRAM}\) of S1 is 2/3 = 0.67. Likewise, those of S2, S3, and S4 are 1.0, 1.0, and 0.5, respectively. \(E_{TCAM}\) of S1, S2, S3, and S4 are 0.67, 0, 0.5, and 0.5, respectively.
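Equations (1) to (3) can be checked numerically with a short sketch. The function names are hypothetical. The inputs for S4 (stride 2, two prefixes g and h) follow the text; the stride of 2 for S1 is an assumption inferred from its quoted efficiency of 2/3 with one eclipsed entry.

```python
def e_sram(n_pref, stride, delta_ent=0):
    """Equations (1)-(2): stored prefixes per usable SRAM entry."""
    return n_pref / (2 ** stride - delta_ent)

def e_tcam(n_pref):
    """Equation (3); evaluates to 0 when n_pref is 1."""
    return 1 - 1 / n_pref

# S4: stride 2, prefixes g and h, not overlapped by a descendant subtrie.
print(e_sram(2, 2))
# S1: two non-root prefixes, one eclipsed entry, stride 2 (assumed).
print(round(e_sram(2, 2, delta_ent=1), 2))
# A subtrie with a single prefix has zero TCAM efficiency.
print(e_tcam(1))
```

The sketch makes the trade-off visible: raising the stride by one doubles the denominator of \(E_{SRAM}\), so the number of prefixes must also roughly double to keep the same efficiency.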
♦ Update overhead of a subtrie
The average update overhead of SRAM for a subtrie can be measured by
\(U_{SRAM} = t_{aff} / n_{bt\text{-}node} = t_{aff} / \left(2^{stride+1} - 1\right)\) (4)
where \(t_{aff}\) is the total number of entries affected by an update on each binary-trie node in the subtrie and \(n_{bt\text{-}node}\) is the number of expanded binary-trie nodes in the subtrie except the root. In Fig. 5, assuming binary-trie nodes are expanded to all possible positions in the subtrie, S4 has 7 (expanded) binary-trie nodes, so \(n_{bt\text{-}node}\) of S4 is 6 excluding the root. \(t_{aff}\) is the sum, over the binary-trie nodes, of the entries affected by an update on each node. It can be computed by
\(t_{aff} = \sum_{i} A\left(b_{i}\right)\) (5)
where \(b_{i}\) is the i-th binary-trie node in the subtrie except the root and \(A\left(b_{i}\right)\) denotes the number of entries affected by binary-trie node \(b_{i}\). For example, in S4, an update on prefix g affects 2 entries in the subtrie, so \(A(g) = 2\). When all \(A\left(b_{i}\right)\) are computed similarly and summed up, \(t_{aff}\) of S4 is 7.
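The computation of \(A(b_i)\), \(t_{aff}\), and the maximum over all nodes can be sketched as follows. The sketch rests on an assumption that the text does not state explicitly: an expanded entry counts as affected by node b only when b is the longest node covering it, so entries already owned by a longer prefix are not counted. Under this reading the sketch reproduces the values quoted above for S4. Node names are bit strings relative to the subtrie root, and the positions of g (`"1"`) and h (`"00"`) inside S4 are inferred from the lookup examples; the function names are hypothetical.

```python
from itertools import product

def affected(node, prefixes, stride):
    """A(b): expanded entries under node b that no longer prefix overrides."""
    count = 0
    for bits in product("01", repeat=stride - len(node)):
        entry = node + "".join(bits)
        # The entry is owned by the longest node (prefix or b) covering it.
        owners = [p for p in prefixes | {node} if entry.startswith(p)]
        if max(owners, key=len) == node:
            count += 1
    return count

def update_costs(prefixes, stride):
    """Return (t_aff, largest A(b)) over all non-root binary-trie nodes."""
    nodes = ["".join(b) for n in range(1, stride + 1)
             for b in product("01", repeat=n)]
    costs = [affected(b, prefixes, stride) for b in nodes]
    return sum(costs), max(costs)

s4_prefixes = {"1", "00"}  # g and h, relative to the root of S4
print(affected("1", s4_prefixes, 2))
print(update_costs(s4_prefixes, 2))
```

The second value returned by `update_costs` is the largest single-update cost in the subtrie, which is the quantity the construction algorithm bounds.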
If the subtrie is overlapped by other descendant subtries, the average update overhead will be
\(U_{SRAM} = t_{aff} / \left(2^{stride+1} - 1 - \Delta_{bt\text{-}node}\right)\) (6)
where \(\Delta_{bt\text{-}node}\) is the total number of eclipsed binary-trie nodes in the subtrie. When calculating \(t_{aff}\), updates on eclipsed nodes are excluded. Since all binary-trie nodes except the root affect only SRAM, we do not consider the average update overhead of TCAM for a subtrie.
For a given subtrie, the maximum number of entries affected by an update is
\(m_{aff} = \operatorname{Max}\left(A\left(b_{1}\right), A\left(b_{2}\right), \ldots, A\left(b_{n}\right)\right)\) (7)
When constructing each subtrie, we consider \(E_{SRAM}\) but not \(U_{SRAM}\), because \(U_{SRAM}\) does not in general increase as \(E_{SRAM}\) increases. In addition, \(m_{aff}\) is considered in order to control excessive accesses to SRAM. Fig. 7 shows our IMT construction algorithm. The algorithm uses two parameters, α and β, to decide whether the current subtrie should be enlarged. α is the lower bound of \(E_{SRAM}\), i.e., every constructed subtrie has an \(E_{SRAM}\) of at least α. β is the upper bound of \(m_{aff}\), i.e., every constructed subtrie has an \(m_{aff}\) of at most β.
In the IMT construction algorithm, among all binary-trie nodes only prefix nodes are visited, in reverse-level order, i.e., from bottom to top. The function next_prefix_node() gives such nodes in turn. The currently visited prefix node p directly constitutes a subtrie. Then, the algorithm repeatedly checks whether the current subtrie can be enlarged to cover the sibling node with the stride incremented by one. If the enlarged subtrie does not satisfy \(\alpha \leq E_{SRAM}\) and \(\beta \geq m_{aff}\), the previous subtrie is established. Whenever a subtrie is constituted, the algorithm just marks the root node in the binary trie and sets its stride instead of performing the actual prefix expansion. The prefix expansion is straightforward and is done in a later phase. Every prefix is reviewed and contained just once in a subtrie by dynamic programming.
Fig. 7. IMT construction algorithm
In Fig. 5, we have shown that prefix a is expanded to the roots of the immediate descendant subtries by subtrie-pushing. For a given binary-trie node b, the number of immediate descendant subtrie roots of b is denoted by δ(b), counted when there is no prefix node on the path from b to the immediate descendant subtrie roots, including those roots. If b is the root of a subtrie, δ(b) = 0. Suppose δ(a) is large in Fig. 5. This implies that an update on the prefix a causes many accesses to TCAM. In that case, we need to control excessive subtrie-pushing by creating a new subtrie. Note that we do not have to do subtrie-pushing if the prefix a becomes the root of a new subtrie. Fig. 8 shows such a splitting algorithm. γ is a parameter that limits the number of immediate descendant subtries. For some prefix p, if δ(p) is greater than γ, a new subtrie is split off with p as the root.
Fig. 8. Subtrie splitting algorithm
4.5 Update Algorithm
Each update message on a prefix is associated with a binary-trie node. There are three cases depending on the position of that node: at a subtrie root, inside a subtrie, and outside any subtrie. Fig. 9 depicts the algorithm according to the position of the updated node. First, if the updated node is a subtrie root, only TCAM is accessed for the update. This incurs at most one memory access each to pTCAM and nTCAM, apart from the movements needed to preserve the order of prefixes in TCAM. Techniques for reducing the memory movements needed to preserve order in TCAM have been studied [12]. Our scheme has far fewer TCAM entries than other schemes, so we expect the memory movement for preserving order not to be critical. Second, if a new prefix is inserted outside any subtrie, it is checked whether the new prefix can be merged with an existing subtrie. The criterion for merging is whether \(m_{aff}\) exceeds β after merging. If the prefix cannot be merged, a new subtrie is created and pTCAM is accessed once. Then, several subtrie-pushed entries in pTCAM have to be updated. Lastly, if the updated node is inside some subtrie, it is checked whether the subtrie can be split, using the splitting criterion described in Section 4.4. In all cases, the numbers of SRAM and TCAM accesses are limited by β and γ, respectively.
Fig. 9. Update Algorithm
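The first step of the update algorithm, deciding which of the three cases applies, can be sketched as follows. This is only an illustration of the case analysis, not the full algorithm of Fig. 9: the function name `classify` is hypothetical, subtries are modeled as a mapping from a root bit string to its stride, and the two roots used are those of S2 and S4 as inferred from the example in Section 4.2.

```python
def classify(node, subtries):
    """Classify an updated binary-trie node against the subtrie set.

    subtries maps a root bit string to the stride of that subtrie.
    """
    if node in subtries:
        return "root"      # only TCAM is accessed
    for root, stride in subtries.items():
        # A node lies inside a subtrie if it is at most stride bits below the root.
        if node.startswith(root) and len(node) - len(root) <= stride:
            return "inside"   # SRAM entries are updated; a split may be checked
    return "outside"          # merge into an existing subtrie or create a new one

subtries = {"0100": 1, "01001": 2}  # roots and strides of S2 and S4

print(classify("01001", subtries))
print(classify("010011", subtries))
print(classify("0101", subtries))
```

Keeping the case analysis separate from the memory operations makes it easy to see which memory each kind of update touches.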
5. Evaluation
In this section, the proposed architecture is evaluated in terms of memory efficiency and updatability using real-world public routing tables. For the experiment, the routing tables were collected on three different dates from ripe.net [22]. Since the network prefixes do not vary much across locations, we used routing data from various dates to observe changes over time. Table 1 depicts the characteristics of the routing tables. All the routing tables were converted into IPv4 Forwarding Information Bases (FIBs) for our experiment. The experiment was performed on an Intel(R) Core(TM) i7-2600 (3.4 GHz) with 4 GB of main memory.
Table 1. Routing tables
Table 2 summarizes the IMT construction results with parameters (α = 0.5, β = 64, γ = 16). In the table, the number of prefixes increases sharply over time, so memory efficiency is crucial for an IP lookup engine to accommodate more prefixes. Note that the other characteristics in the table tend to be time-invariant, except the requirements of pTCAM, nTCAM, and SRAM. For example, the average length of the subtrie roots is about 21, irrespective of the collection date.
Table 2. IMT construction results (α = 0.5, β = 64, γ = 16)
In Table 2, ‘multibit-trie depth’ means the highest level of the trie, i.e., the maximum number of subtries on the path from the root to each leaf node. It is related to the number of memory accesses when a conventional SRAM-based multibit-trie is used. Its average value is about 5 and the maximum value is 12. On the other hand, IMT takes at most one TCAM access and one SRAM access.
Moreover, IMT requires less memory compared to the conventional SRAM-based multibit-trie. In rrc0-2017, about 123 K entries are required for TCAM and 1.1 M entries are required for SRAM. Since each SRAM entry stores only the NHI field, it requires at most 16 bits. As a result, the total SRAM requirement is 2.2 MB in IMT. Since the conventional SRAM-based multibit-trie does not use TCAM, it only requires SRAM of 2.2 M entries. However, in that case, each SRAM entry needs a pointer to the child subtrie, so each entry requires 32 bits additionally and total SRAM requirement becomes 6.6 MB. In Table 2, ‘eclipsed’ is the size of the region overlapped by descendant subtries, as explained in Section 4.4. The eclipsed space can be reused like free space, so we can reduce the total SRAM requirement by the number of eclipsed entries. As a result, the required SRAM is reduced by 14%.
TCAM plays an important role in the parallel search of subtries, but it may consume much power. Agrawal and Sherwood presented a TCAM power consumption and delay model useful for network system design [23]. We obtained the TCAM power and search delay of IMT using their model and tool. For the calculation of power and delay, the TCAM size and the technology feature size are used as input parameters. For 90 nm technology, the resulting TCAM power and delay are shown in Table 2.
In Section 4.4, \(m_{aff}\) denotes the maximum number of entries affected by a single node update. In effect, it represents the maximum number of SRAM accesses incurred by a single update. Fig. 10 shows the distribution of \(m_{aff}\), in which most subtrie updates (99%) incur fewer than 17 SRAM accesses. The largest value of \(m_{aff}\) is 1276 when the parameters are set to α = 0.5, β = ∞, γ = ∞. With α = 0.5, β = 64, γ = 16, every subtrie is constructed so that \(m_{aff}\) is at most 64; however, the percentage of subtries with \(m_{aff}\) between 17 and 64 is less than 1%, as stated above. In addition, the actual SRAM access time can be reduced because contiguous entries can be accessed in burst mode and each entry is merely 16 bits long. In Fig. 10, some count values are higher than others, especially when the number of SRAM accesses is a power of two. The reason is that the number of entries in a subtrie is always a power of two and the maximum number of entries affected by a single update is most often a power of two.
In the earlier section, δ(b) denotes the number of immediate descendant subtrie roots of a binary-trie node b when there is no prefix node on the path from b to the subtrie roots, including those roots. It indicates how many subtrie roots are affected when a binary-trie node is updated. Since the subtrie roots are always stored in TCAM, δ is the number of TCAM accesses incurred by a binary-trie node update. Fig. 11 shows the distribution of δ; the number of TCAM accesses incurred by a single update is lower than 4 in 99% of cases. The largest value of δ is 890; however, the value can be limited to 16 when the parameters are set to α = 0.5, β = 64, γ = 16.
Fig. 10. Distribution of \(m_{aff}\) in rrc0-2017
Fig. 11. Distribution of δ in rrc0-2017
Fig. 12 shows how the number of subtries and the stride size change with α. The number of subtries steadily increases with α because the size of each subtrie shrinks as α increases. On the other hand, the average stride goes down as α increases. When α is higher than 0.5, the average stride becomes about 1, which implies that each subtrie has merely two entries on average.
Fig. 13 shows the number of entries in TCAM and SRAM as α varies. As α grows, the required SRAM size sharply decreases while the required TCAM size gradually increases. Fig. 14 and Fig. 15 show the required TCAM size and SRAM size, respectively. In Fig. 14, the size of pTCAM increases significantly compared to that of nTCAM when α is higher than 0.5.
This is due to the increase in the number of subtrie roots contained in the pTCAM as the number of small subtries increases. The memory efficiency and the update overhead are depicted in Fig. 16 and Fig. 17, respectively. There are 6 graphs in each figure, according to the routing data collection dates and the type of memory. Fig. 16 shows the memory efficiency of TCAM and SRAM. The efficiencies of the two memories are inversely related to each other. However, the trend of the change is independent of the routing data collection dates. Fig. 17 shows the average number of memory accesses incurred by a single update. These values for TCAM and SRAM are also inversely related to each other, but they do not change over time. This implies that the characteristics of the IMT, such as memory efficiency and update overhead, hardly change with time.
Fig. 12. Subtries and stride in rrc0-2017
Fig. 13. Memory requirements in rrc0-2017
Fig. 14. TCAM requirement in rrc0-2017
Fig. 15. SRAM requirement in rrc0-2017
Fig. 16. Memory efficiency
Fig. 17. Update overheads
IMT is compared to several architectures with respect to the number of accesses to TCAM and SRAM, and also their required size, in Table 3. ‘Uniform TCAM’ denotes the architecture which simply consists of a single TCAM. ‘CoolCAM-subtree’ and ‘CoolCAM-postorder’ are the schemes described in [7]. ‘1-12Wc’ and ‘M-12Wb’ are the best-effort schemes in [8]. Our IMT is constructed with α = 0.5, β = 64, γ = 16. For the compared schemes, we used a bucket size of 128 entries because power consumption in those schemes becomes the smallest when the bucket size is 128 [8]. In the table, the size of TCAM and SRAM is measured by the number of entries. In IMT, the size of SRAM is reduced by a factor of 144 bits since 144-bit wide SRAM was assumed in [8]. Our scheme shows the smallest memory requirement as well as fewer TCAM searches.
Table 3. Comparison on memory size and accesses in rrc0-2017
Overall memory efficiency can be evaluated by considering the relative cost of TCAM and SRAM. Fig. 18 depicts the overall memory efficiency in each scheme under the general fact that the numbers of transistors per cell of TCAM and SRAM are 16 and 6, respectively. IMT gives better overall memory efficiency than other schemes while α ≤ 0.5. The overall memory efficiency can be also controlled by α in our scheme while the other schemes have little change with their bucket size.
The proposed scheme not only targets memory efficiency and update overhead but also has the advantage of being controllable through three parameters, α, β, and γ. The experimental results above are summarized and discussed below in terms of memory efficiency and update overhead.
First, the TCAM requirement is much lower than the SRAM requirement, although the memory requirements of TCAM and SRAM are inversely related to each other. Considering that the number of transistors per cell and the power consumption are high in the TCAM, it is desirable to make the TCAM as small as possible. In addition, the TCAM requirement increases sharply when α is larger than 0.5, while the SRAM requirement is generally low when α is 0.4 or more. Therefore, a near-optimal memory requirement is expected when α is around 0.5. The memory efficiencies of TCAM and SRAM are also inversely related to each other. Considering the number of transistors per cell of TCAM and SRAM, the overall memory efficiency of IMT gradually decreases with α. However, if α is 0.5 or less, it always gives better results than the other schemes.
Second, the update overhead also shows an inverse relationship between TCAM and SRAM, as does the memory efficiency. With α = 0.5, a single update requires on average 1.06 SRAM accesses and 0.22 TCAM accesses. In other words, the number of SRAM accesses is higher than that of TCAM accesses. However, the effective update overhead of SRAM is expected to be relatively low because SRAM latency is much lower than TCAM latency and SRAM can operate in burst mode.
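The argument about effective update overhead can be made concrete by weighting each access count with a per-access latency. The sketch below is only an illustration under stated assumptions: the access counts (1.06 and 0.22) are the values reported above for α = 0.5, while the function name and the latency values are hypothetical placeholders, not figures from the paper.

```python
def effective_update_cost(sram_accesses: float, tcam_accesses: float,
                          sram_latency: float, tcam_latency: float) -> float:
    """Latency-weighted cost of one update (hypothetical helper)."""
    return sram_accesses * sram_latency + tcam_accesses * tcam_latency


# Average accesses per update at alpha = 0.5, as reported in the text.
SRAM_ACCESSES = 1.06
TCAM_ACCESSES = 0.22

# Placeholder latencies in arbitrary time units; NOT values from the paper.
SRAM_LATENCY = 1.0
TCAM_LATENCY = 4.0

cost = effective_update_cost(SRAM_ACCESSES, TCAM_ACCESSES,
                             SRAM_LATENCY, TCAM_LATENCY)

# Share of accesses versus share of time attributable to SRAM.
sram_access_share = SRAM_ACCESSES / (SRAM_ACCESSES + TCAM_ACCESSES)
sram_time_share = SRAM_ACCESSES * SRAM_LATENCY / cost
```

With any TCAM latency larger than the SRAM latency, the share of update time spent in SRAM is smaller than its share of accesses, which is the effect the paragraph above describes. Burst-mode operation would reduce the effective SRAM latency further, but it is not modelled here.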
Fig. 18. Comparison on overall memory efficiency in rrc0-2017
6. Conclusion
A TCAM-based IP address lookup engine can find the longest matching prefix with one access using parallel search; however, its power consumption and cost have been problems. Even the current FIB size is too large to be contained in a single TCAM. On the other hand, trie-based IP address lookup uses SRAM, which is cheaper and consumes less power than TCAM. However, trie-based IP address lookup usually has to traverse many nodes, which degrades the lookup performance. Although a multibit trie-based approach can reduce the number of traversal steps significantly, it still needs to access the SRAM several times. Also, an update requires access to several SRAM entries and sometimes to a large number of entries.
In this paper, we propose a novel multibit trie scheme, the Indexed Multibit Trie (IMT), and an architecture based on it. In the IMT, each subtrie is indexed and accessed directly without going through intermediate subtries. We use TCAM to store the index because only the longest matching index can be used to access the target subtrie. In the proposed architecture, IP address lookup is very fast, requiring at most two memory accesses: one access for a subtrie index and the other for a subtrie entry, regardless of the depth in the IMT.
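The two-access lookup summarized above can be sketched in software as follows. This is a simplified model under stated assumptions, not the proposed hardware architecture: the TCAM is emulated by a linear scan for the longest matching subtrie root, each subtrie is assumed to hold fully expanded next hops (including those inherited from shorter prefixes), and all names (`Subtrie`, `tcam_search`, `lookup`, `ip`) and the sample routes are hypothetical.

```python
from dataclasses import dataclass
from typing import List, Optional

ADDR_BITS = 32  # IPv4 address width


@dataclass
class Subtrie:
    root_bits: int      # subtrie root prefix, right-aligned
    root_len: int       # length of the root prefix in bits
    stride: int         # address bits consumed inside this subtrie
    entries: List[str]  # 2**stride next hops, modelling one SRAM block


def ip(a: int, b: int, c: int, d: int) -> int:
    """Pack a dotted-quad address into a 32-bit integer."""
    return (a << 24) | (b << 16) | (c << 8) | d


def tcam_search(index: List[Subtrie], addr: int) -> Optional[Subtrie]:
    """Emulate the TCAM: return the subtrie with the longest matching root."""
    best = None
    for st in index:
        if addr >> (ADDR_BITS - st.root_len) == st.root_bits:
            if best is None or st.root_len > best.root_len:
                best = st
    return best


def lookup(index: List[Subtrie], addr: int) -> Optional[str]:
    st = tcam_search(index, addr)              # access 1: subtrie index
    if st is None:
        return None
    shift = ADDR_BITS - st.root_len - st.stride
    slot = (addr >> shift) & ((1 << st.stride) - 1)
    return st.entries[slot]                    # access 2: subtrie entry


# Hypothetical sample table with two subtries.
top = Subtrie(root_bits=0, root_len=0, stride=8, entries=["default"] * 256)
top.entries[10] = "hopA"                       # 10.0.0.0/8
deep = Subtrie(root_bits=0x0A01, root_len=16, stride=8,
               entries=["hopA"] * 256)         # root 10.1.0.0/16 inherits hopA
deep.entries[2] = "hopB"                       # 10.1.2.0/24
index = [top, deep]
```

For example, `lookup(index, ip(10, 1, 2, 5))` selects the deeper subtrie directly through the emulated TCAM and then reads a single entry, without visiting the top-level subtrie first. An address that matches only the zero-length root, such as `ip(192, 168, 0, 1)`, is resolved in the top-level subtrie with the same two steps.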
Subtrie partitioning is crucial for saving memory and enabling fast updates. Generally, a larger subtrie increases the SRAM requirement but decreases the TCAM requirement. In this paper, we use three criteria, α, β, and γ, to construct the IMT efficiently. The sizes of SRAM and TCAM can be well controlled using those parameters. The parameters also keep the update overhead under control so that an update does not access the memories excessively. Experimental results with real-world FIBs show that the proposed scheme can achieve good memory efficiency as well as fast updates when the parameters are set appropriately.
References
- V. Fuller, T. Li, J. Yu, and K. Varadhan, "Classless Inter-Domain Routing (CIDR): An Address Assignment and Aggregation Strategy," RFC 1519, 1993.
- M. A. Ruiz-Sanchez, E. W. Biersack, and W. Dabbous, "Survey and Taxonomy of IP Address Lookup Algorithms," IEEE Network, vol. 15, issue 2, pp. 8-23, March/April 2001. https://doi.org/10.1109/65.912716
- V. Srinivasan and G. Varghese, "Fast Address Lookups Using Controlled Prefix Expansion," ACM Transactions on Computer Systems, vol. 17, no. 1, pp. 1-40, February 1999. https://doi.org/10.1145/296502.296503
- H. Liu, "Routing Table Compaction in Ternary CAM," IEEE Micro, vol. 22, issue 1, pp. 58-64, Jan./Feb. 2002. https://doi.org/10.1109/40.988690
- L. Luo, G. Xie, Y. Xie, L. Mathy, and K. Salamatian, "A Hybrid Hardware Architecture for High-Speed IP Lookups and Fast Route Updates," IEEE/ACM Transactions on Networking, vol. 22, no. 3, June 2014.
- J. Kim, M.-C. Ko, H.-K. Kang, and J. Kim, "A Hybrid IP Forwarding Engine with High Performance and Low Power," in Proc. of International Conference on Computational Science and Its Applications, pp. 888-899, 2009.
- F. Zane, G. Narlikar, and A. Basu, "CoolCAMs: Power-Efficient TCAMs for Forwarding Engines," in Proc. of IEEE INFOCOM, vol. 1, pp. 42-52, 2003.
- W. Lu and S. Sahni, "Low-Power TCAMs for Very Large Forwarding Tables," IEEE/ACM Transactions on Networking, vol. 18, no. 3, June 2010.
- G. Wang and N.-F. Tzeng, "Exact Forwarding Table Partitioning for Efficient TCAM Power Savings," in Proc. of IEEE NCA 2007, pp. 249-252, 2007.
- M. J. Akhbarizadeh, M. Nourani, and C. D. Cantrell, "Prefix Segregation Scheme for a TCAM-Based IP Forwarding Engine," IEEE Micro, vol. 25, issue 4, pp. 48-63, July-August 2005. https://doi.org/10.1109/MM.2005.73
- M. J. Akhbarizadeh, M. Nourani, R. Panigrahy, and S. Sharma, "A TCAM-Based Parallel Architecture for High-Speed Packet Forwarding," IEEE Transactions on Computers, vol. 56, no. 1, pp. 58-72, January 2007. https://doi.org/10.1109/TC.2007.250623
- D. Shah and P. Gupta, "Fast Updating Algorithms for TCAMs," IEEE Micro, vol. 21, no. 1, pp. 36-47, Jan.-Feb. 2001. https://doi.org/10.1109/40.903060
- H. Le, W. Jiang, and V. K. Prasanna, "A SRAM-based Architecture for Trie-based IP Lookup using FPGA," in Proc. of 16th IEEE International Symposium on Field-Programmable Custom Computing Machines, pp. 33-42, 2008.
- Anindya Basu and Girija Narlikar, "Fast Incremental Updates for Pipelined Forwarding Engines," IEEE/ACM Transactions on Networking, vol. 13, no. 3, pp. 690-703, June 2005. https://doi.org/10.1109/TNET.2005.850216
- J. Lee and H. Lim, "Multi-Stride Decision Trie for IP Address Lookup," IEIE Transactions on Smart Processing & Computing, vol. 5, no. 5, pp. 331-336, 2016. https://doi.org/10.5573/IEIESPC.2016.5.5.331
- Y. Wu, G. Nong, and M. Hamdi, "Scalable Pipelined IP lookup with Prefix Tries," Computer Networks, vol. 120, pp. 1-11, June 2017. https://doi.org/10.1016/j.comnet.2017.03.017
- Hung-Mao Chu, Tsung-Hsien Li, and Pi-Chung Wang, "IP Address Lookup by Using GPU," IEEE Transactions on Emerging Topics in Computing, vol. 4, issue 2, April-June 2016.
- Yanbiao Li, Dafang Zhang, Alex X. Liu, and Jintao Zheng, "GAMT: A Fast and Scalable IP Lookup Engine for GPU-based Software Routers," in Proc. of 9th ACM/IEEE ANCS'13, pp. 1-12, 2013.
- Sartaj Sahni and Kun Suk Kim, "Efficient Construction of Multibit Tries for IP Lookup," IEEE/ACM Trans. on Networking (TON), vol. 11, issue 4, pp. 650-662, August 2003. https://doi.org/10.1109/TNET.2003.815288
- Stefan Nilsson and Gunnar Karlsson, "IP-Address Lookup Using LC-Tries," IEEE Journal on Selected Areas in Communications, vol. 17, no. 6, pp. 1083-1092, June 1999. https://doi.org/10.1109/49.772439
- Yanbiao Li, Dafang Zhang, Kun Huang, Dacheng He, and Weiping Long, "A Memory-Efficient Parallel Routing Lookup Model with Fast Updates," Computer Communications, vol. 38, pp. 60-71, 2014. https://doi.org/10.1016/j.comcom.2013.10.005
- RIS Raw Data.
- B. Agrawal and T. Sherwood, "Ternary CAM Power and Delay Model: Extensions and Uses," IEEE Trans. on Very Large Scale Integration (VLSI) Systems, vol. 16, issue 5, pp. 554-564, 2008. https://doi.org/10.1109/TVLSI.2008.917538