A Data Center by Ulrike Talbiersky, Holger Wichert, Christian Lohrengel, André Augustyniak Case Study Source: D. Menasce, V.A. Almeida, L.W. Dowdy Performance by Design: Computer Capacity Planning by Example Prentice Hall, 2004


A Data Center

by Ulrike Talbiersky, Holger Wichert, Christian Lohrengel, André Augustyniak

Case Study

Source:

D. Menasce, V.A. Almeida, L.W. Dowdy

Performance by Design: Computer Capacity Planning by Example

Prentice Hall, 2004


Table of Contents:

• Introduction

• The Data Center

• First Model Attempt: Markov Chain

• Tasks

• Second Model Attempt: Two-Device QN

• Cost Analysis


Introduction

• Data centers offer a variety of services
• Trend: service-based data centers
• Problems:
  - Compliance with SLAs (fault tolerance, privacy, security, ...)
  - Too expensive
• How to choose the optimal size? (cost)


The Data Center

Machine-Repair-Model:

• M machines (functionally identical)
• N repair people
• Diagnostic system:
  - Detects failures of the machines
  - Maintains a queue of machines waiting to be repaired
  - Logs failure times
  - Records repair times


GSPN-Model (Sharpe)

Figure: GSPN model annotated with the failure rate and the repair rate. Labels:

• MiO: machines in operation
• MBR: machines being repaired
• MWR: machines waiting to be repaired


Queueing Model

Figure: queueing model showing machines in operation, machines waiting to be repaired, and machines being repaired.


Parameters

• Failure rate: $\lambda = 1/MTTF$ (MTTF: Mean Time to Failure)
• Repair rate: $\mu = 1/(\text{time to repair a machine})$
• MTTR: Mean Time to Repair
• MTBF: Mean Time Between Failures


Building a Model~1~

Example: Markov Chain

k number of failed machines

k →k+1 transition when a machine fails

k →k-1 transition when a machine is repaired

Aggregate failure rate: $\lambda_k = (M-k)\lambda$

Aggregate repair rate:

$$\mu_k = \begin{cases} k\mu, & k = 1,\dots,N \\ N\mu, & k = N+1,\dots,M \end{cases}$$
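A minimal sketch of the two aggregate rates in Python. The function names are illustrative and do not come from the slides; `lam` and `mu` stand for the per-machine failure rate and the per-repair-person repair rate.

```python
def aggregate_failure_rate(k, M, lam):
    """Rate of the k -> k+1 transition: M - k machines are still running."""
    return (M - k) * lam


def aggregate_repair_rate(k, N, mu):
    """Rate of the k -> k-1 transition: at most N machines are repaired in parallel."""
    return min(k, N) * mu
```

With 10 failed machines and 5 repair people, only 5 repairs proceed at once, so the repair rate saturates at `5 * mu`.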


Building a Model~2~

1-dim. Generalized Birth-Death (GBD)

$$p_k = p_0 \prod_{i=0}^{k-1} \frac{\lambda_i}{\mu_{i+1}}, \qquad k = 0, 1, 2, \dots$$

In state k, M-k machines are in operation.
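The product-form solution can be evaluated numerically by building the unnormalized weights one state at a time and normalizing at the end. This is a sketch under the rate definitions of the Markov chain slide; the function name `steady_state` is an assumption made for illustration.

```python
def steady_state(M, N, lam, mu):
    """Return [p_0, ..., p_M], where p_k is the probability of k failed machines."""
    weights = [1.0]                        # unnormalized p_0
    for k in range(1, M + 1):
        birth = (M - k + 1) * lam          # lambda_{k-1}
        death = min(k, N) * mu             # mu_k
        weights.append(weights[-1] * birth / death)
    total = sum(weights)                   # normalization: probabilities sum to 1
    return [w / total for w in weights]


if __name__ == "__main__":
    p = steady_state(M=120, N=5, lam=0.002, mu=0.05)
    print(f"p_0 = {p[0]:.6f}")
```

Working with the ratio of consecutive states avoids evaluating large factorials directly.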


Building a Model~3~

Average aggregate rate at which machines fail (which equals the average aggregate rate at which machines are repaired):

$$X_f = \sum_{k=0}^{M-1} \lambda_k \, p_k = \sum_{k=0}^{M-1} (M-k)\,\lambda\, p_k$$


Building a Model~4~

Interactive Response Time Law:

$$R = \frac{M}{X_0} - Z$$

• Client workstations ↔ machines in operation
• Average think time Z ↔ MTTF
• Average response time R ↔ MTTR
• System throughput $X_0$ ↔ $X_f$

$$MTTR = \frac{M}{X_f} - MTTF = \frac{M}{X_f} - \frac{1}{\lambda}$$


Building a Model~5~

Little's Law (repair box): $N = R \times X$

• R ↔ MTTR
• X ↔ $X_f$
• $N_f$ = average number of failed machines

$$N_f = X_f \times MTTR = M - \frac{X_f}{\lambda}$$


Building a Model~6~

Little's Law (operational machines): $N = R \times X$

• R ↔ MTTF
• X ↔ $X_f$
• $N_o$ = average number of operational machines

$$N_o = X_f \times MTTF = \frac{X_f}{\lambda} \qquad (M = N_o + N_f)$$
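As a quick consistency check, using only the laws stated on these slides, the two applications of Little's Law and the Interactive Response Time Law fit together as follows:

```latex
\begin{aligned}
N_f &= X_f \times MTTR
     = X_f \left( \frac{M}{X_f} - MTTF \right)
     = M - X_f \times MTTF
     = M - N_o \\
\Rightarrow\quad N_o + N_f &= M
\end{aligned}
```

Every machine is therefore either operational or failed (waiting or in repair), as expected.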


Values for the Example

• M = 120 machines
• MTTF = 500 min, so $\lambda = 1/500 = 0.002$ per min
• Time to repair a machine = 20 min, so $\mu = 1/20 = 0.05$ per min


Task 1

Given:

• failure rate of machines $\lambda = 0.002$ per min
• number of machines M = 120
• repair rate of machines $\mu = 0.05$ per min

What is the probability that exactly j machines are operational?


Task 1

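Use:

$p_{\text{exactly } j \text{ machines in operation}} = p_{M-j}$

$$p_k = \begin{cases} p_0 \dbinom{M}{k} \left(\dfrac{\lambda}{\mu}\right)^k, & k = 1,\dots,N \\[2ex] p_0 \dbinom{M}{k} \dfrac{k!}{N!\,N^{k-N}} \left(\dfrac{\lambda}{\mu}\right)^k, & k = N+1,\dots,M \end{cases}$$

$$p_0 = \left[ \sum_{k=0}^{N} \binom{M}{k} \left(\frac{\lambda}{\mu}\right)^k + \sum_{k=N+1}^{M} \binom{M}{k} \frac{k!}{N!\,N^{k-N}} \left(\frac{\lambda}{\mu}\right)^k \right]^{-1}$$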
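The closed-form expressions can be evaluated directly with Python's exact integer arithmetic for the binomial and factorial terms. This is only a sketch: the function names are made up for illustration, and it assumes the ratio of failure rate to repair rate is what is raised to the power k.

```python
from math import comb, factorial


def state_probabilities(M, N, lam, mu):
    """Return [p_0, ..., p_M] from the closed-form machine-repair solution."""
    rho = lam / mu

    def weight(k):
        # p_k / p_0 for k failed machines
        if k <= N:
            return comb(M, k) * rho**k
        scale = comb(M, k) * factorial(k) / (factorial(N) * N ** (k - N))
        return scale * rho**k

    weights = [weight(k) for k in range(M + 1)]
    p0 = 1.0 / sum(weights)
    return [p0 * w for w in weights]


def prob_exactly_operational(j, M, N, lam, mu):
    """Probability that exactly j machines are in operation, i.e. p_{M-j}."""
    return state_probabilities(M, N, lam, mu)[M - j]


if __name__ == "__main__":
    for N in (2, 5, 10):
        print(N, prob_exactly_operational(110, 120, N, 0.002, 0.05))
```

When N = M every machine fails and recovers independently, so the distribution reduces to a binomial, which gives a convenient sanity check.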


Task 1 N = 2,5,10


Task 2

Given:

• failure rate of machines $\lambda = 0.002$ per min
• number of machines M = 120
• number of repair people N
• repair rate of machines $\mu = 0.05$ per min

What is the probability $P_j$ that at least j machines are operational?


Task 2

Use Task 1 and:

$$P_j = \sum_{i=j}^{M} p_{M-i}$$

• Once the personnel become overloaded, the system tends towards failure.
• If M >> N: having extra machines is pointless.
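The sum can be sketched as a small helper that takes the state probabilities as a list indexed by the number of failed machines. The helper name and the toy probability vector below are illustrative only, not values from the case study.

```python
def prob_at_least_operational(j, p):
    """P_j = sum over i = j..M of p[M - i], with p[k] = P(k machines failed)."""
    M = len(p) - 1
    return sum(p[M - i] for i in range(j, M + 1))


if __name__ == "__main__":
    toy = [0.5, 0.3, 0.2]                      # hypothetical p_0, p_1, p_2 for M = 2
    print(prob_at_least_operational(1, toy))   # p_1 + p_0
```

Since i machines operational means M - i machines failed, the sum simply collects the states with at most M - j failures.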


Task 3

Given:

• failure rate of machines $\lambda = 0.002$ per min
• number of machines M = 120
• required probability: $P_j = 0.9$
• time to repair a machine = 20 min

How many repair people are necessary to guarantee that at least two thirds of the machines are operational with $P_j = 0.9$?
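One way to approach this question is a simple search: increase N until the probability of having at least j operational machines reaches the target. The sketch below assumes the birth-death model of the earlier slides; all function names are illustrative.

```python
def steady_state(M, N, lam, mu):
    """Return [p_0, ..., p_M], where p_k is the probability of k failed machines."""
    weights = [1.0]
    for k in range(1, M + 1):
        weights.append(weights[-1] * (M - k + 1) * lam / (min(k, N) * mu))
    total = sum(weights)
    return [w / total for w in weights]


def prob_at_least(j, M, N, lam, mu):
    """Probability that at least j machines are operational (at most M - j failed)."""
    p = steady_state(M, N, lam, mu)
    return sum(p[: M - j + 1])


def min_repair_people(j, target, M, lam, mu):
    """Smallest N with P(at least j operational) >= target, or None if none exists."""
    for N in range(1, M + 1):
        if prob_at_least(j, M, N, lam, mu) >= target:
            return N
    return None


if __name__ == "__main__":
    # Two thirds of 120 machines = 80 machines, target probability 0.9
    print(min_repair_people(j=80, target=0.9, M=120, lam=0.002, mu=0.05))
```

A linear search is sufficient here because M is small; each candidate N costs only one pass over the M + 1 states.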


Task 2,3 N = 2,3,4,5,10


Task 4

Given are the values:

• 120 machines
• MTTF = 500 min
• $\lambda = 0.002$ per min
• Time to repair a machine = 20 min
• $\mu = 0.05$ per min

What is the effect of the size of the repair team, N, on the MTTR of a machine?


Task 4

Computation:

1. $p_0$ (Task 1 formula)
2. $p_k$ (Task 1 formula, with $p_{\text{exactly } j \text{ machines in operation}} = p_{M-j}$)
3. $X_f = \sum_{k=0}^{M-1} (M-k)\,\lambda\, p_k$ (Building a Model~3~)
4. $MTTR = \frac{M}{X_f} - MTTF$ (Building a Model~4~)
5. $N_o = X_f \times MTTF$ (Building a Model~6~)
6. $N_f = X_f \times MTTR$ (Building a Model~5~)
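The computation steps can be chained into one function. This is a sketch rather than the authors' implementation: it obtains the state probabilities iteratively instead of through the closed form, and the names `steady_state` and `repair_metrics` are assumptions.

```python
def steady_state(M, N, lam, mu):
    """Steps 1 and 2: probabilities p_0..p_M of k failed machines."""
    weights = [1.0]
    for k in range(1, M + 1):
        weights.append(weights[-1] * (M - k + 1) * lam / (min(k, N) * mu))
    total = sum(weights)
    return [w / total for w in weights]


def repair_metrics(M, N, lam, mu):
    """Steps 3 to 6: throughput, MTTR, operational and failed machine counts."""
    p = steady_state(M, N, lam, mu)
    mttf = 1.0 / lam
    x_f = sum((M - k) * lam * p[k] for k in range(M))   # step 3
    mttr = M / x_f - mttf                               # step 4
    n_o = x_f * mttf                                    # step 5
    n_f = x_f * mttr                                    # step 6
    return {"X_f": x_f, "MTTR": mttr, "N_o": n_o, "N_f": n_f}


if __name__ == "__main__":
    for N in (2, 3, 4, 5, 10):
        m = repair_metrics(120, N, 0.002, 0.05)
        print(f"N={N:2d}  N_o={m['N_o']:7.2f}  N_f={m['N_f']:6.2f}  MTTR={m['MTTR']:8.2f} min")
```

MTTR here includes both the waiting time for a free repair person and the repair itself, so it can never drop below 1/mu = 20 minutes.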


Task 4: Effect of the Number of Repair People

• N: number of repair people
• $N_o$: average number of operational machines
• $N_f$: average number of failed machines
• MTTR: Mean Time to Repair


Task 4

• If the number of repair people is increased beyond 5, the further decrease in MTTR is minimal.

With 5 repair people:

• 111 machines operational
• down time of 38 minutes (MTTR = 38 min: 20 min repair, 18 min wait)


Task 4

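Case N = M = 120: every failed machine is repaired immediately (no waiting), so $MTTR = 1/\mu$.

$$X_f = \frac{M}{MTTF + MTTR} = \frac{M}{\frac{1}{\lambda} + \frac{1}{\mu}}$$

$$N_o = X_f \times MTTF = \frac{M \times MTTF}{MTTF + MTTR}$$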
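Plugging the example values into this special case gives a worked check. It assumes only the values listed for the example: M = 120, MTTF = 500 min, repair time 20 min.

```latex
\begin{aligned}
MTTR &= \frac{1}{\mu} = 20 \text{ min} \\
X_f  &= \frac{M}{MTTF + MTTR} = \frac{120}{500 + 20} \approx 0.231 \text{ per min} \\
N_o  &= X_f \times MTTF \approx 115.4 \\
N_f  &= M - N_o \approx 4.6
\end{aligned}
```

This is the best case for availability: adding repair people beyond this point cannot shorten the 20 minute repair itself.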


Task 5

Given are the values:

• 120 machines
• MTTF = 500 min
• $\lambda = 0.002$ per min
• N = 5

What is the effect of a repair person's skill level on the overall down time?


Task 5

Given are the values:

• 120 machines
• MTTF = 500 min
• $\lambda = 0.002$ per min
• N = 5

How does the skill level affect the percentage of operational machines?


Task 5: Effect of the Repair Rate

• $N_o$: average number of operational machines
• $N_f$: average number of failed machines
• MTTR: Mean Time to Repair


Second Modeling Attempt~1~

The failure-recovery model can also be modeled by a two-device QN:

• 1st device: delay server (machines in operation)

• 2nd device: load-dependent server (repair people)


Second Modeling Attempt~2~

Delay server:

• A repaired machine goes back into operation without queuing.

• The time a machine stays in operation depends only on its MTTF.


Second Modeling Attempt~3~

Load-dependent server:

The total rate at which machines are repaired (TRMR) depends on:

- number of failed machines k

- number of repair people N

Service rate:

$$\mu(k) = \begin{cases} k\mu, & k = 1,\dots,N \\ N\mu, & k = N+1,\dots,M \end{cases}$$


Second Modeling Attempt~4~

Use the MVA method with load-dependent devices to solve this model.

Required: service rate multipliers for k = 1,...,M (see Ch. 14)

$$\alpha(k) = \frac{\mu(k)}{\mu(1)} = \begin{cases} k, & k = 1,\dots,N \\ N, & k = N+1,\dots,M \end{cases}$$


Second Modeling Attempt~5~

The solution of this MVA model gives us:

• average throughput: $X$

• average residence time at the LD device: $R'_{LD}$ = MTTR

Little's Law applied to the LD device:

• av. number of failed machines: $N_f = X \times R'_{LD}$

• av. number of machines in op.: $N_o = M - N_f$
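A sketch of the exact MVA recursion for this two-device network follows. It assumes the standard load-dependent MVA equations (residence time, throughput, and queue-length distribution updated for n = 1..M customers), with the delay server's think time set to MTTF and the multipliers set to min(j, N). The function and variable names are illustrative.

```python
def mva_machine_repair(M, N, mttf, repair_time):
    """Exact MVA: one delay server (MTTF) plus one load-dependent repair station."""
    p_prev = [1.0]                     # P(j failed | 0 machines in the network)
    X, R_ld = 0.0, 0.0
    for n in range(1, M + 1):
        # Residence time at the load-dependent device with n machines in total
        R_ld = repair_time * sum(
            j / min(j, N) * p_prev[j - 1] for j in range(1, n + 1)
        )
        X = n / (mttf + R_ld)          # throughput = aggregate failure rate
        p = [0.0] * (n + 1)
        for j in range(1, n + 1):
            p[j] = repair_time * X / min(j, N) * p_prev[j - 1]
        p[0] = 1.0 - sum(p[1:])        # remaining probability mass
        p_prev = p
    N_f = X * R_ld                     # Little's Law at the repair station
    return {"X": X, "MTTR": R_ld, "N_f": N_f, "N_o": M - N_f}


if __name__ == "__main__":
    result = mva_machine_repair(M=120, N=5, mttf=500.0, repair_time=20.0)
    print({name: round(value, 3) for name, value in result.items()})
```

Computing the empty-station probability as one minus the rest is known to lose precision when that probability becomes very small, so results for heavily overloaded repair teams (very small N) should be cross-checked against the Markov chain solution.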


A Cost Analysis

• Cp: annual personnel cost

• Cm: annual cost per machine

• constant revenue multiplier

• No: average number of machines in operation

• Mmin: minimum number of machines that need to be in operation for the data center not to have to pay a penalty

• Cα: cost

• Rα: revenue


A Cost Analysis

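cost:

$$C_\alpha = N \times C_p + M \times C_m$$

revenue:

$$R_\alpha = (\text{revenue multiplier}) \times (N_o - M_{min})$$

profit:

$$P = R_\alpha - C_\alpha = (\text{revenue multiplier}) \times (N_o - M_{min}) - N \times C_p - M \times C_m$$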


A Cost Analysis


A Cost Analysis

• Negative profit for small numbers of personnel, because of low machine availability

• With more than 6 personnel, cost increases more than revenue; thus 6 service personnel are optimal


References

Lecture notes and talks by Menasce, CS672_Performance

cs672-07CaseStudy-III-DataCenter.pdf

cs672-03QuantifyingPerformanceModels.pdf

Lecture notes SN1

Haverkort: Computer Communication Systems

Performance Analysis