Fat-tree Data Center Topology -...

Post on 13-Sep-2018

214 views 0 download

Transcript of Fat-tree Data Center Topology -...

A"Scalable,"Commodity"Data"Center"Network"Architecture"

Al9Fares,"Loukissas,"Vahdat,""A"Scalable,"Commodity"Data"Center"Network"Architecture,""Proc."of"ACM"

SIGCOMM"'08,"38(4):63974,"Oct."2008.""

Presenter:"William"Beyer"

Paper"Goals"

•  Point"out"faults"with"current"data"center"designs"

•  Propose"new"architecture"based"on"fat9tree"– Scalable"interconnecUon"bandwidth"– Economies"of"scale"– Backward"compaUbility"

A"Typical"Data"Center"

•  Data"center"topology"is"typically"293"level"tree"of"switches"and"routers"

OversubscripUon"

•  RaUo"of"worst9case"achievable"aggregate"bandwidth"among"end9hosts"to"the"total"bisecUon"bandwidth"of"the"network"topology"– Ability"of"hosts"to"fully"uUlize"their"uplink"capacity"

•  1:1"–"All"hosts"can"use"full"uplink"capacity"•  5:1"–"Only"20%"of"host"bandwidth"may"be"available"

•  Typical"raUo"is"2.5:1"(400"Mbps)"to"8:1"(125"Mbps)"

MulU9path"RouUng"

•  “MulU9rooted”"tree"required"to"communicate"at"full"bandwidth"for"large"clusters"– Otherwise"limited"to"max"bandwidth"of"a"single"expensive"switch"(1289port"10"GigE)"

•  Use"mulU9path"rouUng"technique"such"as"ECMP"– Performs"staUc"load"splidng,"cannot"account"for"flow"sizes"

– RouUng"tables"become"very"large"with"mulUple"paths"

Cost"Analysis"

Cost"Analysis" Fat9tree"Architecture"

•  k9ary"fat9tree:"three9layer"topology"(edge,"aggregaUon,"core)"–  k"pods,"each"consists"of"(k/2)2"hosts"and"two"layers"(edge/aggregate)"each"with"k/2"k9port"switches"

–  Each"edge"switch"connects"to"k/2"hosts"and"k/2"aggregate"switches"

–  Each"aggregate"switch"connects"to"k/2"edge"and"k/2"core"switches"

–  (k/2)2"core"switches:"each"connects"to"k"pods"–  Supports"k3/4"hosts!"

Fat9tree"Topology"with"k"="4" Issues"with"Fat9tree"Topologies"

•  Backwards"compaUble"with"IP/Ethernet"– Good"thing,"but"rouUng"algorithms"will"naively"choose"a"single"shortest"path"to"use"between"subnets"

– Leads"to"boilenecks"quickly"–  (k/2)2"shortest"paths"available,"should"use"them"all"equally"

•  Complex"wiring"due"to"lack"of"high"speed"ports"

Addressing"in"Fat9tree"

•  Use"10.0.0.0/8"private"addressing"block"•  Pod"switches"have"address"10.pod.switch.1"– Pod"and"switch"in"[0,"k91]"based"on"posiUon"

•  Core"switches"have"address"10.k.j.i"–  i"and"j"denote"core"posiUon"in"(k/2)2"core"switches"

•  Hosts"have"address"10.pod.switch.ID"–  ID"is"host"ID"in"switch"subnet"([2,"(k/2)"+"1])"– k"<"256,"this"scheme"does"not"scale"indefinitely"

Two9Level"Lookup"Table"

•  Prefixes"used"for"forwarding"intra9pod"traffic"•  Suffixes"used"for"forwarding"inter9pod"traffic"

Two9Level"Lookup"ImplementaUon"

•  Implemented"in"hardware"using"a"TCAM"– Can"perform"parallel"lookups"across"table"– Stores"don’t"care"bits,"suitable"for"storing"variable"length"prefixes"

•  Prefixes"preferred"over"suffixes"

RouUng"Algorithm"

•  Prefixes"in"two9level"table"prevent"intra9pod"traffic"from"leaving"pod"

•  Inter9pod"traffic"handled"by"suffix"table"– Suffixes"based"off"host"IDs,"ensures"spread"of"traffic"across"core"switches"

– Prevents"packet"reordering"by"having"staUc"path"•  Each"host9to9host"communicaUon"has"a"single"staUc"path"– Beier"than"having"a"single"path"between"subnets"

RouUng"Algorithm"(cont.)"

•  Core"switches"contain"(10.pod.0.0/16,"port)"entries"–  StaUcally"forwards"inter9pod"traffic"on"specified"port"

•  Aggregate"switches"contain"(10.pod.switch.0/24,"port)"entries"–  Switch"value"is"the"edge"switch"number"

•  Assumes"a"central"enUty"with"full"knowledge"of"topology"generates"these"rouUng"tables"– Also"responsible"for"detecUng"switch"failures"and"re9rouUng"traffic"

RouUng"Algorithm"Example"

Dynamic"RouUng"Techniques"•  AlternaUves"to"two9level"rouUng"table"– Aiempt"to"classify"and"schedule"flows"rather"than"use"staUc"rouUng"

•  Flow"ClassificaUon"–  Periodically"reassigns"flow"output"ports"–  Prevents"compeUUon"between"flows"for"a"single"port"

•  Flow"Scheduling"–  IdenUfy"large"flows"and"establish"reserved"paths"for"them"

–  Requires"communicaUon"between"edge"switches"and"a"central"flow"scheduler"

Fault"Tolerance"

•  Many"possible"paths"between"hosts"leads"to"“easy”"fault"tolerance"

•  Each"switch"maintains"BidirecUonal"Forwarding"DetecUon"session"with"neighbors"– Allows"switch"to"determine"when"neighbors"fail"

•  Two"primary"types"of"link"failure"– Between"lower"and"upper"switches"– Between"upper"and"core"switches"

Router"Power"and"Heat"DissipaUon" Topology"Power/Heat"DissipaUon"

Cafarella,"2013"

Hamilton,"2008"

Cafarella,"2013"

Emerson"Network"Power,"2007"

Cafarella,"2013"

EPA,"2007"

Sotware"ImplementaUon"

•  Validated"in"sotware"using"Click"– Click"is"a"modular"sotware"router"architecture"–  Implement"routers"on"PCs,"supports"experimental"router"designs"

•  Click"modules"called"“elements”"– Each"element"performs"a"specified"task"– RouUng"table"lookup,"decrement"packet"TTL,"etc…"

•  Implemented"elements"for"two9level"table,"flow"classifier,"and"flow"scheduler"

EvaluaUon"Setup"

•  Uses"a"49port"fat9tree"as"seen"previously"– Two9level"table"and"flow9based"schemes"analyzed"– Compared"against"hierarchical"tree"with"oversubscripUon"raUo"of"3.6:1"

•  Both"evaluated"using"Click"– Emulate"switches"and"hosts"on"PCs"

•  All"hosts"generate"96"Mbit/s"of"outgoing"traffic"– This"value"prevents"CPU"from"throiling"test"

EvaluaUon"Results"•  Percentages"indicate"aggregate"network"bandwidth"– Measured"as"amount"of"incoming"traffic"received"by"hosts""

Flow"Scheduler"Requirements"

•  Minimal"Ume"and"memory"requirements"for"flow"scheduler"

•  Feasible"to"use"at"least"unUl"k"grows"extremely"large"

Packaging"Problem"

•  Fat9tree"has"significant"cabling"overhead"– 1"GigE"switches"used"to"reduce"cost"– Lack"of"10"GigE"ports"leads"to"more"cabling"

•  Present"a"packaging"soluUon"for"k=48"– Generalizes"to"other"values"of"k"

Packaging"SoluUon" Strengths"

•  Fat9tree"architecture"seems"to"outperform"hierarchical"soluUon"

•  Excellent"power"and"heat"reducUons"over"hierarchical"approach"

•  EvaluaUon"methods"were"good"overall"with"tests"performed"

•  Data"centers"can"easily"switch"to"this"new"method"

Weaknesses"

•  Language"used"in"paper"was"confusing"at"Umes"– Referred"to"pod"switches"as"“aggregate"switch”,"“upper9layer"switch”,"and"“upper"pod"switch”"at"various"points"

•  EvaluaUon"performed"with"small"value"of"k=4"– Would"have"been"nice"to"see"higher"values"of"k"tested"

– Academic"project"and"resources"were"obviously"a"factor"for"evaluaUon"

References"•  Al9Fares,"Loukissas,"Vahdat,""

A"Scalable,"Commodity"Data"Center"Network"Architecture,""Proc.&of&ACM&SIGCOMM&'08,"38(4):63974,"Oct."2008."

•  Cafarella,"M."(2013,"April"20)."Datacenters."EECS&485."Lecture"conducted"from"University"of"Michigan,"Ann"Arbor."

•  "Energy"Efficient"Cooling"SoluUons"for"Data"Centers.""Emerson&Network&Power."2007"Web."28"Oct."2013."<hip://www.emersonnetworkpower.com/documents/en9us/latest9thinking/edc/documents/white%20paper/energy_efficient_cooling_soluUons_for_data_centers.pdf>."

•  Hamilton,"James.""PerspecUves"9"Cost"of"Power"in"Large9Scale"Data"Centers.""Perspec>ves&@&James&Hamilton's&Blog."N.p.,"28"Nov."2008."Web."28"Oct."2013."<hip://perspecUves.mvdirona.com/2008/11/28/CostOfPowerInLargeScaleDataCenters.aspx>."

•  "Report"to"Congress"on"Server"and"Data"Center"Energy"Efficiency.""Energy&Star."U.S."Environmental"ProtecUon"Agency,"2"Aug."2007."Web."28"Oct."2013."<hip://www.energystar.gov/ia/partners/prod_development/downloads/EPA_Datacenter_Report_Congress_Final1.pdf?db729bf5a>."