Routing algorithms Yaakov (J) Stein Sept 2009 Chief Scientist RAD Data Communications.
-
Upload
opal-james -
Category
Documents
-
view
218 -
download
1
Transcript of Routing algorithms Yaakov (J) Stein Sept 2009 Chief Scientist RAD Data Communications.
RoutingRoutingalgorithmsalgorithms
Yaakov (J) Stein Sept 2009 Chief ScientistRAD Data Communications
Routing Slide 2
OutlineOutline
Control Plane
principles of IP routing
longest prefix match
RIBs and FIBs
Forwarding plane
classification and lookup mechanisms
switching fabrics
Introduction
Routing Slide 3
RoutersRouters
A router is a combination of hardware, software, and memorythat is responsible for forwarding packets towards their destinations
Routers generally work at ISO layer 3 (network layer)but can also function at layer “2.5” (for MPLS) and may inspect higher layers, but only for optimization (QoS management, load balancing, etc.)
Note that Ethernet switches technically filter rather than forward
In order to correctly fulfill their function (i.e. to know where to forward)routers usually run routing protocols
to exchange information between themselvesEthernet switches do not need such protocols as they learn how
to filter
So the router performs 2 distinct algorithms : forwarding algorithm (forwarding component) routing algorithm (control component)
Routing Slide 4
switch
router
What does a router do ?What does a router do ?
Control plane (routing algorithm) run routing protocols identify interface and next hop L2 addresses populate RIBs (if Link State, perform SPFs) scan all RIBs, and produce FIB (entries map FEC to NH)
Data plane (forwarding algorithm) deframing (CRC/checksum/defragmentation/reassembly/demapping…) parsing (pulling values from appropriate fields – simple IPv4 DA,
complex finding URL or MIB variable) FEC classification (add metadata, based on DA, DA+ToS, MPLS, …) lookup / search packet modification and replication framing traffic management and queuing compression, encryption, etc.
Routing Slide 5
IP networksIP networks
IP networks are made up of hosts middleboxes (e.g., firewalls, NATs, NAPTs, Application Layer GWs) routers (obsolete terminology – gateways)
It will be useful to differentiate between core routers (connect to other routers) edge routers (connect to hosts)
We will see shortly that it is more complex than that
To understand how a router is different from other network elementswe need to know the basic principles the IP protocol architecture
We will mainly deal with IPv4 unicast forwarding
Routing Slide 6Routing Slide 6
The first principle of IP is the end-to-end (E2E) principleAll functionality should be implemented only with the knowledgeand help of the application at the end points
The second principle is the hourglass model IP (l3) is the common layer below IP (L3) is not part of IP suite, above is
Thus : most functionality and state is in the hosts middlebox functionality is severely limited routers are limited to forwarding packets
without extensive packet manipulation (exception - TTL)
The third principle is that forwarding is connectionless on a hop-by-hop basis
Routing Slide 7Routing Slide 7
IP
TCP, UDP, …
ftp, http, …
Ethernet, PPP UTP, optic, wireless
The basics (1)The basics (1)
The basics (2)The basics (2)The fourth principle is that unicast IP forwarding is performed
based on a Destination Address (DA) Addresses must usually be unique (end-to-end principle) Hosts usually have a single IP address, routers have many addresses It is the responsibility of a service provider (SP) to allocate addresses
IP addresses are not arbitrary, like Ethernet MAC addresses
The fifth principle is that IP addresses are aggregated into subnetworks All addresses in a subnetwork share a common prefix Subnetworks may be further aggregated
The sixth principle is that it is the responsibility of the router to forward towards the host’s subnet but it is not its responsibility to deliver the packet on the subnet
the IP suite starts above L2 subnet’s L2 (e.g., Ethernet, PPP) delivers the packet to the host
Routing Slide 8
Y(J)S MPLS Slide 9
IP Routing typesIP Routing types
Distance Vector (Bellman-Ford), e.g. RIP, RIPv2, IGRP, EIGRP
– send <addr,cost> to neighbors– routers maintain cost to all destinations– need to solve “count to problem“
Path Vector, e.g. BGP– send <addr,cost,path> to neighbors– similar to distance vector, but w/o “count to problem“– like distance vector has slow convergence*– doesn’t require consistent topology– can support hierarchical topology => exterior protocol (EGP)
Link State, e.g. OSPF, IS-IS– send <neighbor-addr,cost> to all routers– determine entire flat network topology (SPF - Dijkstra’s algorithm)– fast convergence*, guaranteed loopless => interior routing protocol (IGP)
*convergence time is the time taken until all routers work consistently before convergence is complete packets may be misforwarded, and there may be loops
1.1
Control Plane
(Routing)
Routing Slide 10
FIBsFIBs
Based on the 6 principles we can understand what a router does
1. The router looks at the packet’s DA2. It deduces to which subnet the packet belongs3. If the router can directly interface that subnet
it must use the appropriate L2 to send the packet to the host
4. Otherwise it must retrieve the next hop (router)that sends the packet towards the subnet
5. The next router does the same6. If routing has converged there will be no loops or black holes
but there may be during transients
The information needed by the router to properly forward packetsis stored in the Forwarding Information Base (FIB)
The FIB associates address prefixes with Next Hops (NHs)(and, to save an additional lookup, usually with L2 addresses as well)
Do not confuse the FIB with a Routing Information Base (RIB) Routing Slide 11
More on FIBsMore on FIBs
Simple (and primitive) routers have a routing tablemodern large routers have several different databases
The FIB is designed to be fast to searchwe will talk about its data structure later
RIBs are designed to be fast to updatewhich is quite a different structure
There may be many RIBs, one for each routing protocol runningand static routes may be entered into any of them
There are sometime other databases as wellfor example – link state routing protocols require a LSDB from which the RIB is built
The basic idea is that we build the FIB from the RIBs
Routing Slide 12
Router interfacesRouter interfaces
Routers connect to hosts and to other routers via interfaces (from 1 to many thousands of interfaces per router)
Routers are responsible for forwarding packets arriving at an ingress interface to an egress interface
Interfaces have layer 3 and above propertiesand also contain layer a and 2 properties (ports)
Each interface is assigned a unique IP address
Interfaces are grouped into subnetworksAll interfaces on a subnetwork share the same prefix
Routing Slide 13
0.0.0.0/0
192.0.2.128/25
192.0.2.0/25
.1.2
.3
.129
.130
.131
Prefixes and masksPrefixes and masks
Since 1993 (RFC 1519 - CIDR) subnets can have any length prefix
There are two ways of specifying the prefix length slash notation, e.g., 192.168.16.0/20
note – unspecified bits are set to zero mask notation, e.g., 192.168.16.0 with mask 255.255.240.0
Note that 192.168.16.0/20 means all addresses from 192.168.16.0 through 192.168.31.255
Note that it contains 192.168.16.0/21, 192.168.24.0/22, etc. since they are in the range and have longer prefixes (larger masks)
/32 are fully qualified IP addresses
0.0.0.0/0 matches every IP address – it is the default route
route taken when there is no matching entry in FIB the gateway of last resort
Routing Slide 14
Prefix tablesPrefix tables
Routing Slide 15
slash MaskA.B.C.D/32 255.255.255.255
A.B.C.D/31 255.255.255.254
A.B.C.D/30 255.255.255.252
A.B.C.D/29 255.255.255.248
A.B.C.D/28 255.255.255.240
A.B.C.D/27 255.255.255.224
A.B.C.D/26 255.255.255.192
A.B.C.D/25 255.255.255.128
A.B.C.0/24 255.255.255.0
A.B.C.0/23 255.255.254.0
A.B.C.0/22 255.255.252.0
A.B.C.0/21 255.255.248.0
A.B.C.0/20 255.255.240.0
A.B.C.0/19 255.255.224.0
A.B.C.0/18 255.255.192.0
A.B.0.0/17 255.255.128.0
slash maskA.B.0.0/16 255.255.0.0
A.B. 0.0/15 255.254.0.0
A.B. 0.0/14 255.252.0.0
A.B. 0.0/13 255.248.0.0
A.B. 0.0/12 255.240.0.0
A.B. 0.0/11 255.224.0.0
A.B. 0.0/10 255.192.0.0
A.B.0.0/9 255.128.0.0
A.0.0.0/8 255.0.0.0
A.0.0.0/7 254. 0.0.0
A.0.0.0/6 252. 0.0.0
A.0.0.0/5 248. 0.0.0
A.0.0.0/4 240. 0.0.0
A.0.0.0/3 224. 0.0.0
A.0.0.0/2 192. 0.0.0
A.0.0.0/1 128. 0.0.0
0.0.0.0/0 0.0.0.0
Note :for /25 D=0 or 128for /26 D= 0, 64, 128, or 192etc.
Special IP addressesSpecial IP addresses
Some IP addresses are reserved for special purposesthey are not assigned by IANA and may require special treatment by router
Routing Slide 16
prefix range purpose
0.0.0.0/8 0.0.0.0 – 0.255.255.255 defaults
10.0.0.0/8 10.0.0.0 – 10.255.255.255 private addresses
127.0.0.0/8 127.0.0.0 – 127.255.255.255 loopback addresses
169.254.0.0/16 169.254.0.0 - 169.254.255.255 zeroconf
172.16.0.0/12 172.16.0.0 - 172.31.255.255 private addresses
192.0.2.0/24 192.0.2.0 - 192.0.2.255 Documentation
192.88.99.0/24 192.88.99.0 - 192.88.99.255 IPv6-IPv4 relay
192.168.0.0/16 192.168.0.0 - 192.168.255.255 private addresses
198.18.0.0/15 198.18.0.0 - 198.19.255.255 device benchmark
224.0.0.0/4 224.0.0.0 – 239.255.255.255 multicast
240.0.0.0/4 240.0.0.0 – 255.255.255.255 reserved
Longest prefix match and FECsLongest prefix match and FECs
To find the subnet, we need to look at the packet’s DA and to find the best match for the DA that is known find known the FIB entry that matches the longest prefix of the DA (LPM)
All packets that are forwarded in the same way are grouped into a Forwarding Equivalence Class (FEC)
In the simplest case, a FEC is simply a known IP address prefix i.e., the packet’s subnet
In more complex cases it might get more complexfor example, ToS field, source address, etc.
Every packet has to be looked up in the FIB and classified to a FEC
and this forwarding has to be done fast
Routing Slide 17
ExampleExample
Assume the following FIB
and let’s look up 192.0.2.131 (last byte 10000011)
It matches the first entry with prefix length 0 (everything matches)
It matches the second with length 24 (first three bytes)
It matches the third entry with prefix length 25 (last byte 1xxxxxxx)
It does not match the fourth entry (11xxxxxx )
So the packet is forwarded to next hop C
Routing Slide 18
prefix next hop IP address
0.0.0.0/0 (gateway of last resort) A
192.0.2.0/24 B
192.0.2.128/25 C
192.0.2.192/26 D
Packet processing timePacket processing time
100 Mbps 1 Gbps 10 Gbps
64 B 5 sec 500 nanosec 50 nanosec
256 B 20 sec 2 sec 200 nanosec
1500 B 120sec 12 sec 1.2 sec
Routing Slide 19
How much time do we have to process a packet ?
disregarding L2 overhead, IPGs, etc.
So the FIB data structure has to be optimized for fast look-up
Policy and autonomyPolicy and autonomy
The FIB information is based on : static routes dynamic routes determined by routing protocols gateway of last resort (default route) 0.0.0.0/0
In general we need to apply policy, since there are conflicting sources of information may not want to use, or even believe, information received from peers it is all a matter of autonomy an Autonomous System can request service from another
but can not force it to provide service
Routing Slide 20
Autonomous systemsAutonomous systems
Routers are grouped into Autonomous Systems (ASs)
ASs may be grouped into domains
AS look to the outside world as single entity (they usually have an AS ID)
Routers in the same AS obey a common policy, and trust each other
AS are truly autonomousone AS can request another to forward a packet, but can not force it to
Inside ASs we run Internal Gateway Protocols (IGPs) e.g. OSPF, IS-IS
Between ASs we run External Gateway Protocols (EGPs) e.g., BGP
A router that runs both is called an AS Border Router (ASBR)
Routing Slide 21
More of the storyMore of the story
Actually, it can get a lot more complicated
In general, a router will be running multiple routing protocols
For example : one or more IGPs (RIP,OSPF, IS-IS) between routers in the same AS internal BGP (iBGP) between routers in the same AS
(usually a full mesh, but when too complex we can use route reflectors) external BGP (eBGP) between ASBRs in different ASs
How does a router know if a BGP session is iBGP or eBGP ? by the AS number !
IGP is used to find a path to another router (including ASBR) in the same AS
eBGP is used by ASBRs to learn / distribute routes to other ASs
iBGP is used for ASBR to inform core routers of external routes
Routing Slide 22
Simplest exampleSimplest example
Stub ASs (my home router) single homed to outside world single internal subnet, so don’t need IGP single homed, so don’t need to run BGP to ISP don’t need to have an AS number
Routing Slide 23
0.0.0.0/0
192.168.0.0/24 .1
.101 .104.102 .103 .105
More complex exampleMore complex exampleConnecting to a server connected to another ISP with dual homing
Routers 3 and 6 learn from eBGP how to reach A.B.C.D Policy determines that 3 will be used (see later) Router 1 learns from iBGP session that A.B.C.D is reachable via router 3 Router 1 learns from IGP that router 3 is reachable via router 2 Router 2 knows how to directly reach router 3 because of IGP adjacency Packet from a.b.c.d is forwarded via 1-2-3 to AS 2 and to A.B.C.D
Routing Slide 24
AS 1 AS 2Full mesh of iBGP sessions
eBGP sessions
1 2 3
4 5 6
A.B.C.D
a.b.c.d
7
Even more complex exampleEven more complex example
Three ASs, with one possibly acting as a transit domain
Routing Slide 25
AS 1
AS 2
AS3AS 1 would like to hand off the traffic to AS 2AS 2 has no economic incentive to carry this trafficBut AS 2 gets the route from AS3What can AS 2 do to stop this ? (remember autonomy!)What can AS 1 then do ?
Rules for customer ASsRules for customer ASs
Stub ASSingle-homed AS does not need to learn routes from providerIt only has to send all traffic via its unique exit point (0.0.0.0/0)Provider gets routes from static or IGP or private-AS eBGP
Multihomed Nontransit ASAS advertises only its own routes to both SPsAS filters out traffic for foreign routes that reach it via static/default routingeBGP is not needed, but recommended for route propagation and filtering
Multihomed Transit ASUses eBGP to SPs and iBGP for transit traffic
Routing Slide 26
BGP rulesBGP rules
There are : internal routes external routes customer routes
When eBGP learns a route it is repeated via iBGP to all others in ASthus all routers in AS learn it
When iBGP learns a route it is repeated only to externals via eBGPsince internals also get it directly
When there is another ASBR that can reach the same other ASa second route is repeated by iBGP to the ASBR
The ASBR will then make the decision as to which to use !
Routing Slide 27
IGP rulesIGP rules
IGPs are used between routers in the same AS so IGPs do not have sophisticated policy control routers usually blindly accept all information received
For proper operation (no routing loops) all routers in AS must have the same IGP RIB for link state protocols (OSPF, IS-IS)
there is a Link State Data Base (LSDB)from which IGP RIBs can be constructed (will be explained shortly)
Because all routers have the same LSDB
Although the forwarding is hop-by-hopthe result is the same as if there were coordination
IGPs do not scale to inifinity require complete knowledge are not suitable for interaction with non-trusted routers
since a single misconfiguration can be fatalRouting Slide 28
LSDBsLSDBs
We said before that LS routing protocols have another database
LSDB contains representation of every router and link in the ASimplicitly holding the complete topology of the network
In addition, the LSDB associate costs (metric) with every link RIP - the metric is always hop count, no non-trivial metric OSPF - the metric is more general, for example link length These costs form a matrix
The topology is symmetric, but the costs need not beRouting Slide 29
From \ to Router A Router B Router C
Router A M(A,B) M(A,C)
Router B M(B,A) M(B,C)
Router C M(C,A) M(C,B)
LSDBs and IGP RIBsLSDBs and IGP RIBs
Each router can independently calculate the least-cost pathto every other router in the AS
A Shortest Path First (SPF) algorithm (e.g., Dijkstra’s algorithm)is used to compute a tree of the shortest paths to all destinations
Each route in the SPF tree is an entire pathbut for each router we can extract the next hopand build the RIB for that router (each router has its own RIB)
From the RIBs we build the FIB needed for efficient forwarding of packets
Routing Slide 30
Graph search algorithmsGraph search algorithms
There are many algorithms for search on graphs :
Breadth first– Bellman-Ford– Iterative deepening
Depth first (backtracking)– Depth limited
Best first– Greedy algorithms
Dijkstra’s algorithm– Beam search– A*– B*
etc. etc.
We will discuss graphs, trees, etc. later onRouting Slide 31
1
0
2 3
4 65 7 8
Dijkstra’s algorithmDijkstra’s algorithm
Graph search algorithm first described by Edsger Dijkstra in 1959It assumes additive, non-negative, costs for each link in graphIt is a best-first greedy algorithm
Think of a city street mapWe want to from initial intersection to destination one with the least walking
Start at the initial intersection – its distance is zero
Measure and label the distances to all adjacent intersections (breadth first)Choose the closest one (this is the greedy step)
Consider all the neighbors of the chosen intersectionIf the distance (sum of the distance to the chosen intersection and the distance from chosen
intersection to neighbor) is the shortest known way to get to that neighbor then remember that distance (not a tree!)
Once you have considered all neighbors of the intersectionmark the chosen intersection as visited (its distance is now known)
Choose the unvisited intersection with shortest distance
Continue until all intersections have been visitedRouting Slide 32
Dijkstra’s algorithm - formalDijkstra’s algorithm - formal
Let's call the node we are starting with an initial node. The COST of node X will be the distance from the initial node, i.e., the sum of distances of all links along the path from the initial node to X
InitializationSet initial node’s COST to zero, all others to infinity Set initial node as current, all other as unvisited
Main stepFor all unvisited neighbours of current node :
Calculate their distances from the initial node as COST(neighbor) = COST(current) + DIST(current to neighbor)
If this is less than what is presently marked, overwrite the marking
When all neighbors have been considered, mark current node as visited(once visited, this node’s COST is final)
RecurseSelect the unvisited node with the smallest COST as current nodeGo to Main step
Routing Slide 33
Implementation issuesImplementation issues
If our graph has N nodes and L links
In a straightforward implementation of Dijkstra’s algorithm finding the unvisited node with lowest cost takes O(N) this is done N times so the total computation for this is O(N2) the computation of the distances to every node takes O(L)
since each link is followed once (to the node it lands on)
So the total complexity is O(N2 + L) ≈ O(N2)
By using more sophisticated data structures (Fibonacci heap) this can be reduced to O(N log N + L) ≈ O(N log N)
Routing Slide 34
RIBs to FIBRIBs to FIB
So we are finally ready to see how the FIB is populated
First rejection rules are applied, for example : do not accept routes from ASs without agreements do not accept routes that loop
(e.g., BGP advertisements with AS number in the AS-PATH)
Then install FIB entries according to policy, for example :1. first install Static routes2. then routes from IGP RIB3. choose eBGP before iBGP (hot
potato rule- get it out of my network - let someone else handle)4. if there are different routes from BGP
choose the route with highest local preference 5. if routes have equal local preference
choose the route with the shortest AS-PATH6. if routes have equal AS-PATHs choose
the route with the lowest origin number7. if still equal choose highest BGP peer address
Routing Slide 35
AdvertisementAdvertisement
Not everything received is accepted for inclusion in the FIBNot everything accepted for inclusion in the FIB is further advertisedNever advertise information not accepted to FIB !
Routing Slide 36
RIB in RIB local RIB out
FIB
Best path
import
policy
export
policy
Data Plane
(Forwarding)
Routing Slide 37
ForwardingForwarding
Now that we have a FIB installed, let’s forward some packets !
There are main steps to forwarding : classification switching misc. (scheduling, queuing, QoS, compression, encryption, …)
The operations may be stateless vs. statefull may change over time may involve peeking into the packet (Deep Packet Inspection) may be recursive
Packet forwarding can be done by SW or HW or some combination
We normally differentiate between the fast path (simple forwarding) the slow path (control protocol packets)
Routing Slide 38
Lookup and data structuresLookup and data structuresLookup comes in several varieties, such as : Exact match (e.g., MAC addresses, VLANs, IP multicast) Longest Prefix Matching (LPM) (needed for searching FIBs) Range matching (e.g., ports, firewalls)
In order to optimize lookup, we use appropriate data structures
Wirth’s law Programs = algorithms + data structures
We need to perform the following on our data structures : Insert Delete Modify Search
and to check the following metrics : Time complexity for each of the above Size (spatial) efficiency Scalability
Routing Slide 39
LookUp TableLookUp TableThe simplest (and fastest) data structure is the LookUp Table (LUT)
AKA indexed array, Location Addressable Memory (LAM)
The incoming address is used as an index to access the (NH) information
We can put in more info, e.g., L2 type and address, to save further lookupsExample :
Limitations only for exact match, not LPM limitation only for small number of possible addresses, e.g. VLANs
We can use LUTs after other data structures that return a key as an index
Routing Slide 40
address interface NH L2 type L2 NH address
0 1 192.0.2.0 Ethernet 00-17-42-F7-14-14
1 2 192.0.2.16 PPP -
2 3 192.0.2.128 SDH VC ID
Hash tablesHash tables
Hash tables enable handling a large number of potential addresses
A hashing function is a function from a large variable (large number of bits or large number of bytes) to a small variable (small number of bits or bytes)which is white (small changes in the input create widely different outputs)
Hashing long addresses returns a short index
The problem is that the hashing function is not 1:1so there will always be the probability of hash clashes (collisions)
Solutions : perfect hashing – only when addresses are known ahead of time index in table returns list of all addresses stored multiple hashing – linear probing, quadratic probing
Hashing is very good for exact match (e.g. MAC addresses)but is not suitable for LPM
Routing Slide 41
Hash implementationHash implementationFrom an efficiency point of view
hash tables are between LUTs and search tables (to be discussed next)
To control collisions, we need a relatively large table (birthday paradox !)
Example : 99% probability of collisionwhen 3000 entries are put into a hash table of size 1 million
Using multiple hashingaverage computational load is O(1 + keys/table-size)
A primitive hash function is the modulo hashH(key) = addr mod table-size table-size should be prime – must not be a power of 2 or close
A better hash function is (for appropriate integer m and fraction f)H(key) = Trunc [ m Frac( f addr) ]
The quality of the hashing function depends on fFor Fibonacci hashing : f = 1 / γ
γ is the golden ratio ½ (√5 – 2) ≈ 1.618Routing Slide 42
23 people – ½ 57 people – 99%
Search table Search table
The most spatially efficient data structure is the search table
It is similar to a LUTbut the addresses being looked up are not indexesrather, we need to sequentially search the table for the address
We can reduce the search time by ordering the table
Search tables are good for LPM !
For example :Order FIB from most specific (longest prefix) to least (0.0.0.0/0)
Loop through FIB list until find the first matchRouting Slide 43
row prefix interface NH
0 192.168.16.0/24 1 10.10.1.1
1 192.168.196.0/20 2 10.16.54.2
2 192.168.0.0/17 2 10.16.1.16
3 0.0.0.0/0 3 10.1.1.0
Limitations of search tablesLimitations of search tables
Search tables are not limited in size like LUTsbut this comes at a price
it is expensive to search for an address it is expensive to modify the information for an address it is very expensive to insert or delete addresses (copy!)
Search table FIBs can be rebuilt from RIBs each time
For exact match it is possible to speed up by binary search order by address guess position in middle of table range where key should be choose new range by comparison
Routing Slide 44
Linked listsLinked listsSearch tables are hard to maintain
if we need to insert or delete an element we need extensive moving
Linked lists are designed to simplify mechanics of such updatesstill need exhaustive search to find where to insert
Linked lists can be singly or doubly linked
Skip lists increase efficiency by enabling skipping over ranges
Routing Slide 45
start
startend
start
Linked list implementationLinked list implementation
If we already know where to insert or what to delete or whom to modifythen linked lists are very efficient
However, search is O(N) where N is the number of entriesworst case Naverage case N / 2
Properly constructed multi-level skip lists take O(log N) on averagebut are still O(N) in the worst case
But if we already need to use double pointersthen trees are better
Routing Slide 46
Tree structuresTree structures
A graph is a collection of nodes and edges connecting themIn a directed graph the edges have direction
and so or every edge there is a father node and a child nodeNodes without children are called leavesA forest is a directed graph without loops (only one path between 2 nodes)A tree is a forest with a single root -- it thus defines a partial order(it is conventional to draw the tree upsidedown)
A binary tree has no more than 2 output edges per node
Trees can be implemented using arrays, pointers (like linked lists), heaps, etc.
Routing Slide 47
forest treebinarytree
Search treesSearch trees
Search trees may store data in their nodes in leaves only on edges combinations of the above
For general search trees searching can be breadth-first or depth-first
Breadth-first start at the root find all children of root and check for desired data if not found, find all children’s children recurse
Depth first start at the root find first child and check for desired data if not found, find first child of first child recurse until data found or leaf if leaf backtrack to father and try the next child
Routing Slide 48
Binary treesBinary treesBinary Search Trees simplify storage and manipulationWe call the children of a node – L and R
Perfect binary trees have exactly 2 children for each internal nodeThus there are exactly 2H leaves, where H is the height of the treeAltogether there are (2H+1 – 1) nodes
To store keys in a binary search tree place keys in all nodes for every intermediate node N
– key(L) < key(N)– key(R) > key(N)
To search for a key in a BST start at root (set current node to root) check if key is stored in current node if not : if key < key(N) then set current to L else set current to R
In the worst case this takes only H comparisons (for perfect trees O(log N))
Routing Slide 49
N
RL
Balanced binary treesBalanced binary trees
Since the search complexity is proportional to the BST’s height Hwe would like to use BSTs with minimal height
Balanced BSTs are not perfect BSTs, but as close as possible to perfectIt is more complex to build a balanced BST, but faster to search
Routing Slide 50
4
2 6
1 3 5 7
2
1
7
3
5
4
6
balancedBST
unbalancedBST
TriesTries
From retrieve, but usually pronounced tryA trie is an ordered tree with subvalues on edges values at leaves
Tries are related to prefix search representations
Tries were originally developed for exact match
Tries can be binary or not
Tries are good for all lexicographical searches O(log n) but particularly efficient for LPM
Trie variants are the fastest known lookups - O(log log n) LC-trie used in modern Linux router implementations
Routing Slide 51
Example trieExample trieAssume the following FIB :10.0.128.0/1610.0.0.0/8192.168.0.0/16192.168.128.0/240.0.0.0/0
Note:in the plain trie, all IP prefixes have the same length
Routing Slide 52
10 1920
0
0
0
/0
0
0 128
0
/16
0
/8
0
168
128
/24/16
0 0
Example compact trieExample compact trieSame FIB :10.0.128.0/1610.0.0.0/8192.168.0.0/16192.168.128.0/240.0.0.0/0
Now maximum length is 2We could also save memory by exploiting common suffixes
Routing Slide 53
10.0 192.168
0.0.0.0/0
128.0/160.0/8 0.0/16 128.0/24
More trie variantsMore trie variants
Patricia triePractical Algorithm to Retrieve Information Coded in Alphanumeric Binary trie which store in nodes the number of bits
to skip before next decision point For LPM store prefixes in internal nodes
Bucket tries Store multiple keys in leaves
Multibit tries (M-tries) Fewer branching decisions than binary tries Multibit strides (usually variable strides)
Level-compressed tries (LC-tries) Replace perfect subtrees in binary trie with single degree 2k nodes LPC trie – level and path compressed
And there are many more (Lulea, full-tree, …)Routing Slide 54
Example LC-trieExample LC-triebinary trie LC-trie
Routing Slide 55
Content Addressable MemoryContent Addressable Memory
CAM (AKA associative memory)Addressable by content, rather than by location (LAM)Special purpose hardware Fastest possible lookup (essentially searches entire table in one clock) but limited in size usually drives regular memory for additional storage
Binary CAM (BCAM) stores 0 or 1 in each bitTernary CAM (TCAM) allows wildcards Can be used for LPM Can prioritize solution by
– number of bits matched– order in table
CAM technology today 32 to 144 bit keys 128K – 512 K memories hundreds of millions searches per second
Routing Slide 56
Example – 3 bit BCAMExample – 3 bit BCAM
Routing Slide 57
M1[1]
Q[1] Q[2] Q[3]
M2[1]
M3[1]
M1[2] M1[3]
M2[2] M2[3]
M3[2] M3[3]
search lines
match lines
search word
encoder
encode outputto access LAM
Other uses of lookupOther uses of lookup
Deep Packet Inspection URL lookup (often partial or with wildcards)
– can use Trees and tries XML information, patterns, etc. multiple encapsulations
– e.g. Ethernet in IP, MPLS over IP, etc.– there are special-purpose languages to describe such cases
Firewalls, Access Control source/destination IP addresses + source/destination ports 4-tuples TCP/UDP ports in ranges
Routing Slide 58
SwitchingSwitching
Once the packet has been classified we need to properly forward it
The classifier result (the FEC) becomes a metadata field that controls the switch
A switch fabric is combination of HW and SW components that enable moving the packet from input interface to output interface
The metaphor is borrowed from woven material
At low packets per second (pps) switching can be all software
At high speeds highly parallel hardware is needed
Routing Slide 59
BlockingBlocking
The major design constraint in switches is blocking
Blocking occurs when some resource is being used and the present packet can not be processed
We differentiate three types of blocking1. Output blocking (the egress port is in use)2. Internal blocking (some internal resource of the switch is in use)3. Input (head of line) blocking (packet can not enter the switch)
A switch which is designed such that there is never internal blockingis called a nonblocking switch
To alleviate blocking we can add buffers to store the waiting packets
According to the type of blocking we wish to avoid we have output queues internal buffers input queues
Routing Slide 60
Switch typesSwitch types
In the TDM days there were two switching mechanism types : time domain switches (time slot interchange) spatial domain switches
For packet networks there are no time slotsbut time domain switching can still be used
shared memory switches (packets from all inputs placed in single memory) shared bus switches (all inputs and outputs share a ring/hypercube, etc. )
In spatial switching the packets destined for different output interfacestravel different internal paths
The simplest spatial switch is the crossbarinvented for analog telephone circuitsdoes not scale well – O(N2) possible connections
Multilayer crossbars can scale better
Routing Slide 61
1 2 4 5 6 7 831234567
Shared memory/busShared memory/bus
Shared memory switches are simple and cheap to implementThe memory is the heart of the fabric
it determines the throughput and delayThe memory must run N times faster than the ingress ratesPackets may be divided into frame buffers Scalability can be achieved by parallelism (bit or byte slicing)The architecture breaks down at high speeds
Shared bus switches are similar in principle to shared memory onesThe shared medium is the heart of the switch
and determines its characteristicsScalability can be achieved by parallelism (parallel buses)The architecture breaks down at high speeds
Routing Slide 62
Multistage crossbarsMultistage crossbarsCrossbars are fast, but are not nonblocking do not scale
These problems are solved by multistage crossbars
The only blocking of a single stage crossbar is output blockingmore than one packet needs to leave the egress interface
Time-space switches solve this by sorting the frames before switching
More complex architectures are time-space-time switches
There are many multistage solutions to the scaling problem …
Routing Slide 63
1 2 3 42 3 1 4
3 4 1 24 1 2 3
1 2 3 42 3 4 1
3 4 1 24 1 2 3
Clos networkClos network
The Clos network is more efficient than a single N*N crossbarWe divide N into r groups of n N = n rIt has 3 stages of crossbars the first stage has r groups of n*n crossbars the second stage has n groups of r*r crossbars the third stage has r groups of n*n crossbars
The shuffle rule : Xth output of Yth crossbar connects to Yth input of Xth crossbar
Instead of N2 = n2r2 connections just r2 n + 2 r n2
For example:if r = n = √Nthen instead of n4 just 3n 3
a reduction by 3/n = 3 / √N
Routing Slide 64
shuffle shuffle
Benes networkBenes network
The Benes network contains only 2*2 crossbarsand is built recursively
For N = 2n, the network has 2n-1 stages and 4 N log (N-1/2) connectionsThe shuffle rule is that 1st outputs go to the 1st block, 2nd to the 2nd
Routing Slide 65
Benes(n-1)
Benes(n-1)
Banyan networkBanyan network
The Banyan network is a binary tree of 2*2 crossbarswith exactly one path from each ingress interface to each egress interface
At each stage the next bit of the identifier determines the switchingBanyan switches are time and space efficient O(log N) delay ½ N log N 2*2 crossbarsThe main problem is internal blocking at any of the crossbarsOne way to relieve this is to speed up the crossbars and to add buffers
Routing Slide 66shuffle shuffle
Batcher’s sortBatcher’s sort
To prevent internal blocking in a Banyan switchwe can first sort the frames according to egress interface
When this is done with Batcher’s parallel sorter ½ log N ( log N + 1) stages of simple sorters
we get the Batcher-Banyan switch
Routing Slide 67
4th order Batcher sorter to Banyan switch