Page 1

Totoro: A Scalable and Fault-Tolerant Data Center Network by Using Backup Port

Junjie Xie 1, Yuhui Deng 1, Ke Zhou 2

1 Department of Computer Science, Jinan University
2 School of Computer Science & Technology, Huazhong University of Science & Technology

NPC 2013: The 10th IFIP International Conference on Network and Parallel Computing. Guiyang, China.

Page 2

Agenda

• Motivation

• Challenges

• Related work

• Our idea

• System architecture

• Evaluation

• Conclusion


Page 3

• The Explosive Growth of Data ⇒ Large Data Center
  Industrial manufacturing, E-commerce, Social network...
  IDC: 1,800 EB of data in 2011, 40-60% annual increase
  YouTube: 72 hours of video are uploaded per minute.
  Facebook: 1 billion active users upload 250 million photos per day.

Image from http://www.buzzfeed.com

Page 4

Feb. 2011, Science: On the Future of Genomic Data

Feb. 2011, Science: Climate Data Challenges in the 21st Century

Jim Gray: The global amount of information would double every 18 months (1998).

Page 5

• IDC report: Most of the data would be stored in data centers.

• Large Data Center ⇒ Scalability
  Google: 19 data centers, >1 million servers
  Facebook, Microsoft, Amazon…: >100k servers

• Large Data Center ⇒ Fault Tolerance
  Google MapReduce: 5 nodes fail during a job, 1 disk fails every 6 hours

Figure: Google Data Center

Therefore, the data center network has to be very scalable and fault tolerant.

Page 6

• Tree-based Structure: Bandwidth bottleneck, Single points of failure, Expensive

• Fat-tree: High capacity, Limited scalability

Figures: Tree-based Structure, Fat-tree

Page 7

• DCell: Scalable, Fault-tolerant, High capacity, Complex, Expensive

• DCell is a level-based, recursively defined interconnection structure.

• It requires multiport (e.g., 3, 4 or 5) servers.

• DCell scales doubly exponentially with the server node degree.

• It is also fault tolerant and supports high network capacity.

• Downside: It trades expensive core switches/routers for multiport NICs and a higher wiring cost.

C. Guo, H. Wu, K. Tan, L. Shi, Y. Zhang and S. Lu. DCell: A Scalable and Fault-Tolerant Network Structure for Data Centers. In: Proc. of the ACM SIGCOMM’08, Aug 2008

Page 8

• FiConn: Scalable, Fault-tolerant, Low capacity

D. Li, C. Guo, H. Wu, K. Tan, and S. Lu. FiConn: Using Backup Port for Server Interconnection in Data Centers. In: Proc. of the IEEE INFOCOM, 2009.

• FiConn utilizes servers with two built-in ports and low-end commodity switches to form the structure.

• FiConn has a lower wiring cost than DCell.

• Routing in FiConn also makes a balanced use of links at different levels and is traffic-aware to better utilize the link capacities.

• Downside: it has lower aggregate network capacity.

Other architectures: Portland, VL2, Camcube…

Page 9

• What we achieve:
  Scalability: Millions of servers
  Fault-tolerance: Structure & Routing
  Low cost: Commodity devices
  High capacity: Multi-redundant links

Figure: Totoro Structure of One Level

Page 10

Figure: Totoro structure with N = 4, n = 4, K = 2. Servers are labeled from 0, 0, 0 to 3, 3, 3; switches are labeled 1-0, 0 to 1-3, 1 and 2-0 to 2-3. Legend: Level-1 Link, Level-2 Link.

Page 11

• Architecture
  Two-port servers
  Low-end switches
  Recursively defined

• Building Algorithm

Figure labels: k-level Totoro, two-port NIC

Page 12

• Connect N servers to an N-port switch

• Here, N=4

• Basic partition: Totoro0

• Intra-switch

Figure: A Totoro0 structure

Page 13

• Available ports in Totoro0: c. Here, c=4

• Connect n Totoro0s to n-port switches by using c/2 ports

• Inter-switch

A Totoro1 structure consists of n Totoro0s.

Page 14

• Connect n Totoro_(i-1)s to n-port switches to build a Totoro_i

• Recursively defined

• Half of available ports ⇒ Open & Scalable

• The number of paths among Totoro_is is n/2 times the number of paths among Totoro_(i-1)s ⇒ Multi-redundant links ⇒ High network capacity
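The port accounting implied by this recursive rule can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `port_budget` and the row layout are hypothetical, and it assumes that exactly half of the open ports of each sub-structure are consumed at every level and that each level-i switch takes one port from each of the n sub-structures.

```python
def port_budget(N, n, K):
    """Return one row per level i = 1..K:
    (level, level-i switches inside one Totoro_i, ports still open in one Totoro_i).
    """
    rows = []
    available = N  # a Totoro_0 has one spare port per server, so c = N
    for level in range(1, K + 1):
        used = available // 2              # half of the open ports are used at this level
        switches = used                    # one switch per used port of a sub-structure
        available = n * (available - used) # the n sub-structures keep the other half open
        rows.append((level, switches, available))
    return rows


if __name__ == "__main__":
    # Parameters of the example structure: N = 4, n = 4, K = 2.
    for level, switches, open_ports in port_budget(4, 4, 2):
        print(level, switches, open_ports)
```

For N = 4, n = 4, K = 2 the sketch yields 2 level-1 switches per Totoro1 and 4 level-2 switches per Totoro2, which agrees with c/2 = 2 for c = 4.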

Page 15

TotoroBuild(N, n, K) {
  Define t_K = N * n^K
  Define server = [a_K, a_(K-1), …, a_i, …, a_1, a_0]
  For tid = 0 to (t_K - 1)
    For i = 0 to (K - 1)
      a_(i+1) = (tid / (N * n^i)) mod n
    a_0 = tid mod N
    Define intra-switch = (0 - a_K, a_(K-1), …, a_1, a_0)
    Connect(server, intra-switch)
    For i = 1 to K
      If ((tid - 2^(i-1) + 1) mod 2^i == 0)
        Define inter-switch (u - b_(K-u), …, b_i, …, b_0)
        u = i
        For j = i to (K - 1)
          b_j = (tid / (N * n^(j-1))) mod n
        b_0 = (tid / 2^u) mod (N / n * (n/2)^u)
        Connect(server, inter-switch)
}

The key: work out the level of the outgoing link of this server.
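A minimal runnable sketch of the building procedure follows. It keeps the level test and the b_0 formula from the slide, but it is not the authors' implementation: the switch identifiers (`("intra", ...)` and `("inter", level, Totoro_u index, b0)` tuples) and the function names are assumptions made for illustration, and the sketch assumes 2^u divides N * n^(u-1) for every level u.

```python
from collections import defaultdict


def link_level(tid, K):
    """Level (1..K) of the server's second-port link, or None if the port stays open."""
    for i in range(1, K + 1):
        if (tid - 2 ** (i - 1) + 1) % (2 ** i) == 0:
            return i
    return None


def totoro_build(N, n, K):
    """Return {switch_id: [server tid, ...]} for one Totoro_K."""
    total = N * n ** K
    switches = defaultdict(list)
    for tid in range(total):
        # First port: the intra-switch of the server's Totoro_0.
        switches[("intra", tid // N)].append(tid)
        # Second port: at most one inter-switch, chosen by the level test.
        u = link_level(tid, K)
        if u is not None:
            size_u = N * n ** u                     # servers in one Totoro_u
            per_sub = (N * n ** (u - 1)) // 2 ** u  # equals N / n * (n/2)^u
            b0 = (tid // 2 ** u) % per_sub
            switches[("inter", u, tid // size_u, b0)].append(tid)
    return dict(switches)


if __name__ == "__main__":
    topo = totoro_build(4, 4, 2)
    print(len(topo), "switches")
    print(topo[("inter", 2, 0, 0)])
```

With N = 4, n = 4, K = 2 every inter-switch ends up with n = 4 attached servers, one from each sub-structure, and the servers whose second port stays open are the ones for which `link_level` returns `None`.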

Page 16

N    n    u    t_u
16   16   2    4096
24   24   2    13824
32   32   2    32768
16   16   3    65536
24   24   3    331776
32   32   3    1048576

Millions of servers
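The server counts in this table follow from t_u = N * n^u, the size defined in the building algorithm. A short check, with the function name `totoro_size` chosen here only for illustration:

```python
def totoro_size(N, n, u):
    """Total number of servers in a level-u Totoro: t_u = N * n^u."""
    return N * n ** u


if __name__ == "__main__":
    for N, n, u in [(16, 16, 2), (24, 24, 2), (32, 32, 2),
                    (16, 16, 3), (24, 24, 3), (32, 32, 3)]:
        print(N, n, u, totoro_size(N, n, u))
```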

Page 17

• Totoro Routing Algorithm (TRA): Basic, not fault-tolerant

• Totoro Broadcast Domain (TBD): Detect & share link states

• Totoro Fault-tolerant Routing (TFR): TRA + Dijkstra algorithm (based on TBD)

Page 18

Totoro Routing Algorithm (TRA)

• Divide & Conquer algorithm

• Path from src to dst?

Page 19

Totoro Routing Algorithm (TRA)

Step 1: src and dst belong to two different partitions.

Page 20

Totoro Routing Algorithm (TRA)

Step 2: Take a link between these two partitions

Page 21

Totoro Routing Algorithm (TRA)

m and n are the intermediate servers.
The intermediate path is from m to n.

Page 22

Totoro Routing Algorithm (TRA)

Step 3: If src (dst) and m (n) are in the same basic partition, just return the direct path

Page 23

Totoro Routing Algorithm (TRA)

Step 3: Otherwise, return to Step 1 to work out the path from src (dst) to m (n)

Page 24

Totoro Routing Algorithm (TRA)

Step 4: Join P(src, m), P(m, n) and P(n, dst) into a full path
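The divide-and-conquer steps can be sketched as a short recursive function. This is a hedged illustration rather than the authors' TRA: servers are identified by the integer tid from the building algorithm, the rule for picking the link between two partitions (the level-u link nearest to src, called `slot` here) is an assumption, and `hop_is_valid` is a hypothetical helper that only checks that two consecutive servers share a switch. The sketch assumes 2^u divides N * n^(u-1) at every level.

```python
def tra_route(src, dst, N, n, K):
    """Return the list of servers visited on a path from src to dst."""
    total = N * n ** K
    if not (0 <= src < total and 0 <= dst < total):
        raise ValueError("server id out of range")
    if src == dst:
        return [src]
    if src // N == dst // N:
        # Same basic partition (Totoro_0): one hop through the intra-switch.
        return [src, dst]
    # Step 1: find the lowest level u whose Totoro_u contains both servers.
    u = 1
    while src // (N * n ** u) != dst // (N * n ** u):
        u += 1
    sub = N * n ** (u - 1)  # servers in one Totoro_(u-1)
    # Step 2: take a level-u link between the two partitions.
    slot = (src % sub) // 2 ** u               # assumed choice: link nearest to src
    offset = slot * 2 ** u + 2 ** (u - 1) - 1  # local id of a server owning a level-u link
    m = (src // sub) * sub + offset            # intermediate server on the src side
    nn = (dst // sub) * sub + offset           # intermediate server on the dst side
    # Steps 3 and 4: solve both halves recursively and join the pieces.
    return tra_route(src, m, N, n, K) + tra_route(nn, dst, N, n, K)


def hop_is_valid(a, b, N, n, K):
    """True if servers a and b are attached to a common switch."""
    if a // N == b // N:
        return True
    for u in range(1, K + 1):
        step, sub = 2 ** u, N * n ** (u - 1)
        both_own = a % step == b % step == step // 2 - 1
        same_totoro_u = a // (sub * n) == b // (sub * n)
        same_slot = (a % sub) // step == (b % sub) // step
        if both_own and same_totoro_u and same_slot:
            return True
    return False


if __name__ == "__main__":
    print(tra_route(0, 63, 4, 4, 2))
```

In the N = 4, n = 4, K = 2 structure, routing from server 0 to server 63 with this sketch visits servers 0, 1, 49, 48, 60, 63: one level-2 link between the two Totoro1s, then one level-1 link inside the destination Totoro1. The sketch makes no claim to reproduce the path-length statistics reported for TRA.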

Page 25

Totoro Routing Algorithm (TRA)

• The performance of TRA is close to that of the shortest path (SP) algorithm across networks of different sizes.

• Simple & Efficient

                              TRA             Shortest Path Algorithm
N    n    u   t_u     M_u     Mean   StdDev   Mean   StdDev
24   24   1   576     6       4.36   1.03     4.36   1.03
32   32   1   1024    6       4.40   1.00     4.39   1.00
48   48   1   2304    6       4.43   0.96     4.43   0.96
24   24   2   13824   10      7.61   1.56     7.39   1.32
32   32   2   32768   10      7.68   1.50     7.45   1.26

Table: The mean value and standard deviation of path length for TRA and the SP algorithm in Totoro_u of different sizes. M_u is the maximum distance between any two servers in Totoro_u. t_u indicates the total number of servers.

Page 26

Totoro Broadcast Domain (TBD)

• Fault-tolerance ⇒ Detect and share link states

• Time cost & CPU load ⇒ Global strategy is impossible

• Divide Totoro into several TBDs

Figure legend: Green: inner-server, Yellow: outer-server

Page 27

Totoro Fault-tolerant Routing (TFR)

• Two strategies:
  Dijkstra algorithm within TBD
  TRA between TBDs

• Proxy: a temporary destination

• Next hop: the next server on P(src, proxy/dst)

Page 28

Totoro Fault-tolerant Routing (TFR)

• If the proxy is unreachable

Page 29

Totoro Fault-tolerant Routing (TFR)

• Reroute the packet to another proxy by using local redundant links
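The proxy fallback can be illustrated with a small sketch: run Dijkstra over the links known to be alive inside one TBD, and if the preferred proxy is unreachable, try the next candidate. Everything here is an assumption made for illustration (the adjacency-dict graph format, the `failed` set of broken links, the ordered `proxies` list, and the function names); the slides do not specify how TFR ranks or selects proxies.

```python
import heapq


def dijkstra_path(adj, src, dst, failed=frozenset()):
    """Shortest path from src to dst using live links only; None if unreachable."""
    dist = {src: 0}
    prev = {}
    heap = [(0, src)]
    while heap:
        d, node = heapq.heappop(heap)
        if node == dst:
            path = [dst]
            while path[-1] != src:
                path.append(prev[path[-1]])
            return path[::-1]
        if d > dist[node]:
            continue  # stale heap entry
        for nxt, cost in adj.get(node, {}).items():
            if frozenset((node, nxt)) in failed:
                continue  # link state says this link is down
            nd = d + cost
            if nd < dist.get(nxt, float("inf")):
                dist[nxt] = nd
                prev[nxt] = node
                heapq.heappush(heap, (nd, nxt))
    return None


def next_hop(adj, src, proxies, failed=frozenset()):
    """Return (proxy, next hop) for the first reachable proxy, or None."""
    for proxy in proxies:
        path = dijkstra_path(adj, src, proxy, failed)
        if path and len(path) > 1:
            return proxy, path[1]
    return None


if __name__ == "__main__":
    # Toy broadcast domain: src "s" can reach proxy "p1" via "a" or proxy "p2" via "b".
    adj = {
        "s": {"a": 1, "b": 1},
        "a": {"s": 1, "p1": 1},
        "b": {"s": 1, "p2": 1},
        "p1": {"a": 1},
        "p2": {"b": 1},
    }
    print(next_hop(adj, "s", ["p1", "p2"]))
    print(next_hop(adj, "s", ["p1", "p2"], failed={frozenset(("a", "p1"))}))
```

In the toy graph the first call selects proxy `p1`, and once the link between `a` and `p1` is marked as failed the second call falls back to `p2`.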

Page 30

• Evaluating Path Failure: Totoro vs. Shortest Path Algorithm (Floyd-Warshall)

• Evaluating Network Structure: Totoro vs. Tree-based structure, Fat-Tree, DCell & FiConn

Page 31

Evaluating Path Failure

• Types of failures: Link, Node, Switch & Rack failures

• Comparison: TFR vs. SP

• Platform:
  Totoro1 (N=48, n=48, K=1, t_K=2,304 servers)
  Totoro2 (N=16, n=16, K=2, t_K=4,096 servers)

• Failure ratios: 2% - 20%

• Communication mode: All-to-all

• Simulation times: 20
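As a rough illustration of the metric, the sketch below computes a path failure ratio for all-to-all traffic on a toy network. It assumes that the ratio is the fraction of ordered server pairs for which no path exists once the failed links are removed, which corresponds to an ideal shortest-path baseline; the slides do not spell out the exact definition, and the graph format and function names are hypothetical.

```python
from collections import deque
from itertools import permutations


def reachable(adj, src, failed):
    """Set of nodes reachable from src by breadth-first search over live links."""
    seen, queue = {src}, deque([src])
    while queue:
        node = queue.popleft()
        for nxt in adj[node]:
            if nxt not in seen and frozenset((node, nxt)) not in failed:
                seen.add(nxt)
                queue.append(nxt)
    return seen


def path_failure_ratio(adj, servers, failed):
    """Fraction of ordered (src, dst) server pairs left without any path."""
    pairs = list(permutations(servers, 2))
    lost = sum(1 for s, d in pairs if d not in reachable(adj, s, failed))
    return lost / len(pairs)


if __name__ == "__main__":
    # Toy network: servers a, b on switch sw0; c, d on switch sw1;
    # b and c use their second port on inter-switch sw2.
    adj = {
        "a": ["sw0"], "b": ["sw0", "sw2"], "c": ["sw1", "sw2"], "d": ["sw1"],
        "sw0": ["a", "b"], "sw1": ["c", "d"], "sw2": ["b", "c"],
    }
    servers = ["a", "b", "c", "d"]
    print(path_failure_ratio(adj, servers, failed=set()))
    print(path_failure_ratio(adj, servers, failed={frozenset(("b", "sw2"))}))
```

Losing the single link between `b` and `sw2` splits the toy network in two, so 8 of the 12 ordered pairs lose their path.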

Page 32

Evaluating Path Failure

• Path failure ratio vs. node failure ratio.
  The performance of TFR is almost identical to that of SP.
  Maximize the usage of redundant links when a node failure occurs.

Page 33

Evaluating Path Failure

• Path failure ratio vs. link failure ratio.
  TFR performs well when the link failure ratio is small (i.e., <4%).
  The performance gap between TFR and SP becomes larger and larger as the link failure ratio increases.
  TFR is not globally optimal and is not guaranteed to find an existing path.
  There is a large potential for performance improvement.

Page 34

Evaluating Path Failure

• Path failure ratio vs. switch failure ratio.
  TFR performs almost as well as SP in Totoro1.
  The performance gap between TFR and SP becomes larger and larger in Totoro2.

Page 35

Evaluating Path Failure

• Path failure ratio vs. switch failure ratio.
  The path failure ratio of SP is lower in a higher-level Totoro.
  More redundant high-level switches help bypass the failure.

Page 36

Evaluating Path Failure

• Path failure ratio vs. rack failure ratio.
  In a low-level Totoro, TFR achieves results very close to SP.
  The capacity of TFR in a relatively high-level Totoro can be improved.

Page 37

Evaluating Network Structure

• Low degree
  Approaches but never reaches 2
  Lower degree ⇒ Lower deployment and maintenance overhead.

Structure   Degree       Diameter            Bisection Width
Tree        --           2 log_(d-1) T       1
Fat-Tree    --           2 log_2 T           T/2
DCell       k + 1        < 2 log_n T - 1     T / (4 log_n T)
FiConn      2 - 1/2^k    O(log T)            O(T / log T)
Totoro      2 - 1/2^k    O(T)                T / 2^(k+1)

N: the number of ports on an intra-switch
n: the number of ports on an inter-switch
T: the total number of servers. For Totoro, there is

Page 38

Evaluating Network Structure

• Relatively large diameter
  Smaller diameter ⇒ More efficient routing mechanism
  In practice, the diameter of a Totoro3 with 1M servers is only 18.
  This can be improved.


Page 39

Evaluating Network Structure

• Large bisection width
  Large bisection width ⇒ Fault-tolerant & Resilient
  For a small k, the bisection width is large: BiW = T/4, T/8, T/16 when k = 1, 2, 3.

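The degree and bisection-width expressions for Totoro can be checked numerically. The sketch below counts used ports per server with the level test from the building algorithm and compares the average against 2 - 1/2^k, then evaluates T / 2^(k+1). The function names are illustrative, and the sketch assumes every server uses one port for its intra-switch and at most one port for an inter-switch.

```python
def average_degree(N, n, K):
    """Average number of used ports per server in a Totoro_K, by direct counting."""
    total = N * n ** K
    ports = 0
    for tid in range(total):
        ports += 1  # port connected to the intra-switch
        if any((tid - 2 ** (i - 1) + 1) % 2 ** i == 0 for i in range(1, K + 1)):
            ports += 1  # port connected to an inter-switch at some level
    return ports / total


def bisection_width(T, K):
    """Bisection width T / 2^(K+1) as given for Totoro."""
    return T // 2 ** (K + 1)


if __name__ == "__main__":
    print(average_degree(4, 4, 2), 2 - 1 / 2 ** 2)
    for K in (1, 2, 3):
        print(K, bisection_width(1048576, K))
```

Half of the servers own a level-1 link, a quarter own a level-2 link, and so on, so the fraction of servers with a used second port is 1 - 1/2^K, which is where the closed form 2 - 1/2^K comes from.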

Page 40

• Scalability: Millions of servers & Open structure

• Fault-tolerance: Structure & Routing mechanism

• Low cost: Two-port servers & Commodity switches

• High capacity: Multi-redundant links

Totoro is a viable interconnection solution for data centers!


Page 41

• Fault-tolerance
  Structure: How to be more resilient?
  Routing under complex failures: More robust rerouting techniques?

• Network capacity
  Data locality: Mapping between servers and switches? Data storage allocation policies?
