Lecture slides from last week are posted on course web ...bianca/lecture_slides/lec2_2010.pdf ·...
1
2
• Lecture slides from last week are posted on course web page.
• Project suggestions & deadlines are posted on web page
• Reading list is posted.
• Volunteer now! :-)
• Need one presenter for next week …
3
• Sept. 30: Project proposal
• Oct. 7: Related work
• Oct. 28: Status report I
• Nov. 20: Status report II
• Dec. 20: Final report
4
• Two case studies from my own research
• Some project suggestions
• A few words about paper presentations
• Probably next week:
– Queueing terminology
– First operational laws
– Little’s law
5
6
Size-based scheduling for better response times.
[Figure: a standard web server using PS (time-sharing) across sockets 1, 2 and 3, next to an SRPT (shortest-remaining-time) web server (kernel-level implementation) serving sockets 1, 2 and 3, with requests labeled S, M and L.]
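To make size-based scheduling concrete, here is a minimal sketch of preemptive SRPT on a single server. It is compared against FCFS rather than the PS policy on the slide, since PS takes more code to simulate. The function names and the three-job example are made up for illustration; this is not the kernel-level implementation mentioned above.

```python
import heapq


def mean_response_fcfs(jobs):
    """Mean response time under FCFS; jobs is a list of (arrival, size) sorted by arrival."""
    t = 0.0
    total = 0.0
    for arrival, size in jobs:
        t = max(t, arrival) + size  # wait for the server, then run to completion
        total += t - arrival
    return total / len(jobs)


def mean_response_srpt(jobs):
    """Mean response time under preemptive SRPT for the same job list."""
    pending = []  # min-heap of (remaining_work, arrival_time)
    t = 0.0
    total = 0.0
    i = 0
    n = len(jobs)
    while i < n or pending:
        if not pending:
            t = max(t, jobs[i][0])  # server idle: jump to the next arrival
        while i < n and jobs[i][0] <= t:
            heapq.heappush(pending, (jobs[i][1], jobs[i][0]))
            i += 1
        remaining, arrival = heapq.heappop(pending)  # shortest remaining work first
        next_arrival = jobs[i][0] if i < n else float("inf")
        if remaining <= next_arrival - t:
            t += remaining  # job finishes before anyone else arrives
            total += t - arrival
        else:
            # run until the next arrival, then re-decide (possible preemption)
            heapq.heappush(pending, (remaining - (next_arrival - t), arrival))
            t = next_arrival
    return total / n


# Hypothetical workload: one large job followed by two small ones.
example_jobs = [(0.0, 10.0), (1.0, 1.0), (2.0, 1.0)]
```

On this toy workload the two small jobs no longer wait behind the large one under SRPT, which is where the lower mean response time comes from.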
7
[Figure: response time (ms) vs. load (0 to 1), PS vs. SRPT, measured once with Workload generator 1 and once with Workload generator 2.]
WHY? Mean file size, file size distribution, access pattern, request rate, CPU utilization, bandwidth, network effects: ALL THE SAME!
Tuning knob: request rate
Tuning knobs: number of users, think time
8
[Figure: response time (sec) vs. load (0 to 1) for a web server, app server and database (PostgreSQL) setup, measured once with TPC-W generator X and once with TPC-W generator Y.]
Tuning knob: request rate
Tuning knobs: number of users, think time
9
• Based on trace from top-10 online auctioning site.
[Figure: response time (sec, 0 to 20) vs. load (0 to 1) under PLJF, PS and SRPT, once from Simulator 1 and once from Simulator 2.]
Tuning knob: request rate
Tuning knobs: number of users, think time
10
• Simulation based on trace from Pittsburgh Supercomputing Center.
[Figure: FCFS on Cray J90/C90; response time (min, log scale from 10^2 to 10^5) vs. load (0.1 to 1), once from Simulator A and once from Simulator B, with curves labeled 10 clnt., 100 clnt. and 1000 clnt.]
Tuning knob: request rate
Tuning knobs: number of users, think time
11
[Figure: panels labeled PLJF / PS / PSJF, 10 / 100 / 1000 clnt., and PS / SRPT.]
Tuning knob: request rate
Tuning knobs: number of users, think time
12
Model of user behavior
User requests web page, receives page, reads page, clicks on new link
• Arrivals triggered by completions.
• Fixed number of users, called the Multi-Programming-Level (MPL)
[Diagram: each user cycles through think, send, receive.]
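The closed model above can be sketched as a small simulation: a fixed number of users (the MPL), each issuing its next request only after the previous one completes and a think time has passed. The single FCFS server, the exponential think and service times, and all names below are assumptions made for this sketch, not details from the slides.

```python
import heapq
import random


def simulate_closed(mpl, mean_think, mean_service, n_requests, seed=0):
    """Mean response time of a closed system with one FCFS server (illustrative sketch)."""
    rng = random.Random(seed)
    # Every user starts out thinking; the heap holds each user's next request time.
    next_request = [(rng.expovariate(1.0 / mean_think), user) for user in range(mpl)]
    heapq.heapify(next_request)
    server_free_at = 0.0
    total_response = 0.0
    for _ in range(n_requests):
        arrival, user = heapq.heappop(next_request)  # earliest outstanding request
        start = max(arrival, server_free_at)
        finish = start + rng.expovariate(1.0 / mean_service)
        server_free_at = finish
        total_response += finish - arrival
        # Arrivals are triggered by completions: the user thinks, then sends again.
        think = rng.expovariate(1.0 / mean_think)
        heapq.heappush(next_request, (finish + think, user))
    return total_response / n_requests
```

Because every user is either thinking or has exactly one request outstanding, at most `mpl` requests can ever be in the system at once.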
13
[Diagram: new arrivals enter the server at their arrival times; the next arrival time is taken from the trace.]
• Arrivals are independent of completions
• There is no max number of simultaneous users
Trace / probability distribution
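For contrast with the closed model, here is a minimal sketch of an open system in which arrival times are drawn from a probability distribution and do not depend on completions. Poisson arrivals, exponential service times, a single FCFS server and the function name are all assumptions for illustration; a trace-driven generator would read the arrival times from a file instead.

```python
import random


def simulate_open(arrival_rate, mean_service, n_requests, seed=0):
    """Mean response time of an open system with one FCFS server (illustrative sketch)."""
    rng = random.Random(seed)
    now = 0.0
    server_free_at = 0.0
    total_response = 0.0
    for _ in range(n_requests):
        # The next arrival is generated regardless of what the server is doing.
        now += rng.expovariate(arrival_rate)
        start = max(now, server_free_at)
        server_free_at = start + rng.expovariate(1.0 / mean_service)
        total_response += server_free_at - now
    return total_response / n_requests
```

Nothing here limits how many requests can be queued, so the backlog can grow without bound as the load approaches 1.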
14
Surge, SPECWeb, TPC-W, Sclient, RUBiS, WebBench, Webjamma
• Generators for same purpose use different models!
• Often not clear which model generators use!
15
• Very little …
• Limited to FCFS single server queue.
– Response times under open system higher than under closed [Bondi and Whitt 1986].
– For MPL → ∞, closed system converges to open system [Schatte83, Schatte84].
16
– How large is the difference in response times?
– What is the speed of convergence?
– How does variability (heavy tails) affect results?
– How are different scheduling disciplines affected?
– …. in practice?
17
• How large is the difference in response times?
– Orders of magnitude!
[Figure (analysis): mean response time (log scale, 10 to 1000) vs. load (0 to 1), Open vs. Closed (MPL=50).]
• Why?
– Bounded number of jobs in closed system.
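The "bounded number of jobs" argument can be checked with a back-of-the-envelope calculation. The sketch below uses exact mean value analysis (MVA) for a closed system with one exponential server and a think station, and the M/M/1 formula E[T] = S / (1 - ρ) for an open system at the same server utilization. The exponential assumptions and function names are choices made for this sketch; the slides do not say how their analysis was done.

```python
def closed_mva(mpl, think_time, service_time):
    """Exact MVA for one queueing server plus a think station (mpl >= 1).

    Returns (mean response time, server utilization).
    """
    queue_len = 0.0  # mean number at the server with n - 1 users
    for n in range(1, mpl + 1):
        response = service_time * (1.0 + queue_len)  # arrival theorem
        throughput = n / (think_time + response)     # response time law
        queue_len = throughput * response            # Little's law
    return response, throughput * service_time


def open_mm1_response(load, service_time):
    """Mean response time of an M/M/1 queue at the given load."""
    if load >= 1.0:
        return float("inf")
    return service_time / (1.0 - load)
```

In the closed system a request can find at most MPL - 1 others ahead of it, so the mean response time never exceeds MPL × S. The open formula has no such cap and grows without bound as the load approaches 1.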
18
• How does variability affect open/closed response times?
– Huge effect on open, limited effect on closed system.
[Figure (analysis): mean response time (0 to 1500) from low to high variability; labels: Open, Closed (MPL=50), Closed (MPL=100), Closed (MPL=1000), Web Workloads.]
• Why?
– Dependency between completions and arrivals in closed system reduces burstiness.
19
• Can we make closed look like open, by increasing MPL?
[Figure: mean response time (0 to 1500) from low to high variability; labels: Open, Closed (MPL=50), Closed (MPL=100), Closed (MPL=1000), Web Workloads.]
20
• What is the impact of scheduling?
– Huge in open system, almost none in closed system.
[Figure (analysis): two panels, each comparing PLJF, FCFS, PS and PSJF.]
• Why?
– Scheduling takes advantage of variability in the system.
– Closed systems reduce the effect of variability.
21
1. Is there a more realistic model?
2. What’s most representative of real systems?
22
[Diagram: new arrivals enter the server; after a think, send, receive cycle a user either leaves the system or, with probability q, returns to the system.]
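If each return decision is independent and uses the same q (an assumption of this sketch), the number of requests per visit is geometrically distributed with mean 1 / (1 - q). The two helpers below, with hypothetical names, convert between q and the mean number of requests per visit.

```python
def mean_requests_per_visit(q):
    """Mean requests per visit when a user returns with probability q after each request."""
    if not 0.0 <= q < 1.0:
        raise ValueError("q must be in [0, 1)")
    return 1.0 / (1.0 - q)


def return_probability(mean_requests):
    """Inverse mapping: the q that yields the given mean number of requests per visit."""
    if mean_requests < 1.0:
        raise ValueError("a visit contains at least one request")
    return 1.0 - 1.0 / mean_requests
```

For example, q = 0.5 corresponds to 2 requests per visit on average, and q = 0 means every user leaves after a single request.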
23
[Figure: mean response time (0 to 300) vs. mean think time (1 to 1000), PS vs. SRPT.]
24
q → 1: number of requests per visit ↑ ?
q → 0: number of requests per visit ↓ ?
[Diagram: new arrivals enter the server; after a think, send, receive cycle a user either leaves the system or, with probability q, returns to the system.]
25
[Figure: mean response time (0 to 300) vs. mean number of requests per visit (0 to 20); labels: PS, SRPT, PS open, PS closed; regions marked OPEN and CLOSED.]
26
Open or Closed? Use partly-open system to decide
Real web workloads (#req. / visit)
• A site being “Slashdotted” (1.2)
• Financial service provider (1.4)
• CMU web server (1.8)
• Kasparov vs Deep Blue (2.4)
• Large corporate web site (2.4)
• Science Institute USGS (3.6)
• Online dept. store (5.4)
• Supercomp. site (6.0)
• World cup site (11.6)
• Online gaming site (12.9)
27
28
Storage system (RAID)
• Depends on probability that after one drive fails, a second drive fails while reconstructing data.
29
• Need probability of second failure during reconstruction
[Figure: probability (%) on an axis from 0 to 6 x 10^-3, for 1 hour reconstruction time.]
– Standard approach: Use datasheet MTTF and exponential distr.
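A minimal sketch of the standard approach, assuming independent drives with exponentially distributed lifetimes: with k surviving drives, the probability that at least one fails within a reconstruction window of t hours is 1 - exp(-k t / MTTF). The function name and the numbers in the usage note are hypothetical, not values from the slides.

```python
import math


def p_second_failure_exponential(mttf_hours, rebuild_hours, surviving_drives):
    """P(at least one surviving drive fails during the rebuild), exponential lifetimes."""
    combined_rate = surviving_drives / mttf_hours  # failure rates of independent drives add up
    # -expm1(-x) equals 1 - exp(-x) but stays accurate for very small x
    return -math.expm1(-combined_rate * rebuild_hours)
```

With a hypothetical MTTF of 1,000,000 hours, 7 surviving drives and a 1 hour rebuild, this gives roughly 7 x 10^-6. The memoryless exponential model ignores drive age and any correlation between failures.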
30
• Need probability of second failure during reconstruction
[Figure: probability (%) on an axis from 0 to 6 x 10^-3, for 1 hour reconstruction time.]
– Standard approach: Use datasheet MTTF and exponential distr.
– Estimate based on data
31
• Need probability of second failure during reconstruction
[Figure: probability (%) on an axis from 0 to 6 x 10^-3, for 1 hour reconstruction time.]
– Standard approach: Use datasheet MTTF and exponential distr.
– Use measured MTTF and exponential distribution
– Estimate based on data
32
• Need probability of second failure during reconstruction
[Figure: probability (%) on an axis from 0 to 6 x 10^-3, for 1 hour reconstruction time.]
– Standard approach: Use datasheet MTTF and exponential distr.
– Use measured MTTF and exponential distribution
– Use measured MTTF and Weibull distribution
– Estimate based on data
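To show how a Weibull model differs from the exponential one, the sketch below computes the probability that a single drive fails within a window, given that it has survived to a certain age, for a Weibull lifetime whose mean matches a given MTTF. The shape parameter is a free input here; the slides do not state the fitted value, so any number plugged in is an assumption.

```python
import math


def weibull_scale_for_mean(mean, shape):
    """Scale parameter that gives a Weibull distribution the requested mean."""
    return mean / math.gamma(1.0 + 1.0 / shape)


def p_fail_within(age, window, mean, shape):
    """P(failure in (age, age + window] | survived to age) for a Weibull lifetime."""
    scale = weibull_scale_for_mean(mean, shape)

    def log_survival(x):
        return -((x / scale) ** shape)

    # 1 - S(age + window) / S(age), computed stably via expm1
    return -math.expm1(log_survival(age + window) - log_survival(age))
```

With shape = 1 this reduces to the exponential case and the result does not depend on age. With shape < 1 the hazard rate decreases over time, so the same window is riskier for a young drive than for an old one.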
33
• Need probability of second failure during reconstruction
[Figure: probability (%) on an axis from 0 to 1.6 x 10^-2 vs. reconstruction time.]
– Standard approach: Use datasheet MTTF and exponential distr.
– Use measured MTTF and exponential distribution
– Use measured MTTF and Weibull distribution
– Estimate based on data
34
• Intuition is not always good enough
• Need back-of-the-envelope calculations and analytical tools to answer questions.
• Workload / fault load matters hugely
• Important to understand what the real world looks like!