12. Experimental Evaluation 18-749: Fault-Tolerant Distributed Systems
Tudor Dumitraş & Prof. Priya Narasimhan
Carnegie Mellon University
Recommended readings and these lecture slides are available
on CMU’s BlackBoard
Electrical & Computer Engineering
What Are We Going To Do Today?
Overview of experimental techniques
Case study: “Fault-Tolerant Middleware and the Magical 1%”
Experimental requirements for the project
Overview of Experimental Techniques
Basics
– Probability distributions, density functions
– Outlier detection: 3σ test
Visual representation of data
– Boxplots
– 3D, contour plots
– Multivariate plots
Do’s and don’ts of experimental science
Experimental Research
“God has chosen that which is the most simple in hypotheses and the most rich in phenomena [...] But when a rule is extremely complex, that which conforms to it passes for random.”
Gottfried Wilhelm Leibniz, Discours de Métaphysique, 1686
Statistical Distributions
If a metric is measured repeatedly, we can estimate its probability density function (PDF)
– PDF(x) is the probability density of the metric at the value x
– $\int_a^b \mathrm{PDF}(x)\,dx = \Pr[a \le \mathrm{metric} \le b]$
– Matlab function ksdensity
Common statistics
– Mean = sum of values / #measurements (mean)
– Median = half the measured values are below this point (median)
– Mode = measurement that appears most often in the dataset
– Standard deviation (σ) = how widely spread the data points are (std)
$\sigma = \sqrt{\frac{1}{n-1}\sum_{i=1}^{n}\left(X_i - \bar{X}\right)^2}$
where $X_i$ is a measurement and $\bar{X}$ is the mean
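The common statistics above can be sketched in a few lines of Python. This is only an illustration: the function name `summarize` and the sample values are made up, and the standard deviation uses the n−1 normalization from the formula above.

```python
import math
from collections import Counter

def summarize(samples):
    """Return mean, median, mode and sample standard deviation (n-1 form)."""
    n = len(samples)  # assumes at least two measurements
    mean = sum(samples) / n
    ordered = sorted(samples)
    mid = n // 2
    # Odd n: the middle value; even n: average of the two middle values.
    median = ordered[mid] if n % 2 else (ordered[mid - 1] + ordered[mid]) / 2
    # Most frequent measurement (ties resolved by first appearance).
    mode = Counter(samples).most_common(1)[0][0]
    variance = sum((x - mean) ** 2 for x in samples) / (n - 1)
    return {"mean": mean, "median": median, "mode": mode,
            "std": math.sqrt(variance)}

stats = summarize([2, 4, 4, 4, 5, 5, 7, 9])  # hypothetical latencies
```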
Statistical Tools
Percentiles
– “The Nth percentile” is a value X such that N% of the measured samples are less than X
– The median is the 50th percentile
– Matlab function prctile
Outlier detection: 3σ test
– Any value that is more than 3 standard deviations away from the mean is an outlier
– For example, for latency (only the high side matters): $\mathrm{Latency}_{\mathrm{outlier}} > \overline{\mathrm{Latency}} + 3\sigma_{\mathrm{Latency}}$
– In Matlab: outliers = a(a > mean(a) + 3*std(a))
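A minimal Python sketch of both tools follows. The function names are hypothetical, and the percentile uses the simple nearest-rank rule, which is an assumption: Matlab's prctile interpolates between samples and can return slightly different values.

```python
import math

def percentile(samples, p):
    """Nearest-rank percentile for an integer p in 1..100 (assumed rule)."""
    ordered = sorted(samples)
    rank = max(1, math.ceil(p * len(ordered) / 100))
    return ordered[rank - 1]

def three_sigma_outliers(samples):
    """Values more than 3 standard deviations above the mean."""
    n = len(samples)
    mean = sum(samples) / n
    std = math.sqrt(sum((x - mean) ** 2 for x in samples) / (n - 1))
    threshold = mean + 3 * std
    return [x for x in samples if x > threshold]

latencies = [10] * 19 + [1000]            # hypothetical latencies in µs
outliers = three_sigma_outliers(latencies)  # the single 1000 µs spike
```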
Basic Plots
Line plot (plot)
– Y-axis is a function of X-axis values
– Can use error bars to show standard deviation
– Can also do an area plot to emphasize overhead or difference between similar metrics
Scatter plot (plot, scatter)
– Determine a relationship between two variables
– Reveal clustering of data
Bar graphs (bar, bar3)
– Compare discrete values
Pie charts (pie, pie3)
– Breakdown of a metric into its constituent components
[Figures: example plots. Recoverable axis labels: Rounds and Nodes reached (series: 0% data upsets, 50% data upsets); Latency [in µs] and Client-perceived throughput [bytes/s]]
Boxplots
A “box and whisker” plot describes a probability distribution
– The box represents the size of the inter-quartile range (the difference between the 25th and 75th percentiles of the dataset)
– The whiskers indicate the maximum and minimum values
– The median is also shown
– Matlab function boxplot
In 1970, US Congress instituted a random selection process for the military draft
– All 366 possible birth dates were placed in a rotating drum and selected one by one
– The order in which the dates were drawn defined the priority for drafting
The boxplots show that men born later in the year were more likely to be drafted
From http://lib.stat.cmu.edu/DASL/Stories/DraftLottery.html
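The numbers a boxplot draws can be computed directly. The sketch below is an illustration, not the Matlab boxplot algorithm: the function name is hypothetical, quartiles are found by linear interpolation between sorted samples (one of several common conventions), and the whiskers are the minimum and maximum as described above.

```python
def boxplot_summary(samples):
    """Return the values a simple box-and-whisker plot would display."""
    ordered = sorted(samples)

    def quantile(q):
        # Linear interpolation between the two closest sorted samples.
        pos = q * (len(ordered) - 1)
        lo = int(pos)
        hi = min(lo + 1, len(ordered) - 1)
        return ordered[lo] + (ordered[hi] - ordered[lo]) * (pos - lo)

    q1, median, q3 = quantile(0.25), quantile(0.5), quantile(0.75)
    return {"min": ordered[0], "q1": q1, "median": median,
            "q3": q3, "max": ordered[-1], "iqr": q3 - q1}

summary = boxplot_summary([1, 2, 3, 4, 5, 6, 7, 8, 9])
```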
Impact of Two Variables
3D plots
– Z axis is a function of X and Y values
– Surface plots: mesh, surf
– Scatter plots: plot3, scatter3
– Volume: display convex hull using convhulln and trisurf
Contour plots
– Represents a function of 2 variables (the X and Y axes)
– Suggests the values of the function through color and annotations
– Displays the isolines (variable combinations that yield the same value) of the function
[Figure: contour plot with numeric isoline annotations; axis labels include p and p_upset]
Impact of Many Variables
Multi-variate plot
Do …
Make Results Comparable
– Use same hardware for all the experiments
– Use same versions of your software
– Avoid interference from other programs or make sure you always get the same interference
– Vary one parameter at a time
Make Results Reproducible
– Record and report all the parameters of your experimental setup
– Archive and publish raw data
Be Rigorous
– Minimize the impact of your monitoring infrastructure
– Report number of runs
– Report mean values and standard deviations
– Examine statistical distributions (modes, long tails, etc.)
Don’t …
Forget to label the axes of your figures
Use different axis limits when comparing results
Plot mean values without looking at the error margin
[Figures: plots of latency versus number of clients, illustrating each of the mistakes listed above]
FT Middleware and the Magical 1%
Unpredictability of FT middleware
Unpredictability limited to 1% of remote invocations
T. Dumitraş and P. Narasimhan. Fault-Tolerant Middleware and the Magical 1%. In ACM/IFIP/USENIX Conference on Middleware, Grenoble, France, Nov.-Dec. 2005. http://www.ece.cmu.edu/~tdumitra/public_documents/dumitras05magical.pdf
Predictability in FT Middleware Systems?
[Figure: a replicated client and a replicated server. Each replica stacks the application (Cli or Srv), CORBA, the Replicator, and the host OS; the replicas communicate through group communication over the network]
Faults are inherently unpredictable
What about the fault-free case?
System Configuration for Predictability
Can we configure an FT CORBA system for predictable latency?
Software configuration
– Operating system: RedHat Linux w/ TimeSys 3.1 kernel
– Group Communication: Spread v. 1.3.1
– Replication: MEAD v. 1.1
– ORB: TAO Real Time ORB v. 1.4
– Micro-benchmark: 10,000 remote invocations per client
Hardware configuration
– 25 hosts on the Emulab test bed
– Pentium III at 850 MHz
– 100 Mb/s LAN
Experimental Methodology
Parameters varied:
– Replication style: active, warm passive
– Replication degree: 1, 2, 3 replicas
– Number of clients: 1, 4, 7, 10, 13, 16, 19, 22 clients
– Request arrival rates: 0, 0.5, 2, 8, 32 ms client pause
– Sizes of reply messages: 16, 256, 4096, 65536 bytes
Tested all 960 combinations, collected 9.1 Gb of data
– Trace available at: www.ece.cmu.edu/~tdumitra/MEAD_trace
Statistical analysis of end-to-end latency:
– Means, medians, standard deviations
– Maximum and minimum values
– 1st, 5th, 95th, 99th percentiles
– Numbers and sizes of the outliers
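The count of 960 follows from the parameter lists on this slide: two replication styles, three replication degrees, eight client counts, five client pauses and four reply sizes.

```latex
\[
2 \times 3 \times 8 \times 5 \times 4 = 960
\]
```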
Example of Unpredictability
Maximum latency can be several orders of magnitude larger than the average
Distribution is skewed to the right and has a long tail
Long tail occurs on only one side because the latency cannot be arbitrarily low
– MEAD latency is lower-bounded by CORBA and group communication latency
Systematic Unpredictability
Average values increase linearly with the number of clients
Maximum values are unpredictable
Counting the Outliers
An outlier is a measurement that fails the 3σ test
In most cases, less than 1% of the measured latencies are outliers
Outliers originate in various modules of the system:
– The ORB
– The group communication
– The application
The “Magical” 1%
The “haircut” effect of removing 1% of the highest remote latencies
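The haircut can be sketched as follows. This is an illustration with a hypothetical function name and made-up data, not the analysis code used in the paper: it sorts the latencies, drops the highest 1%, and lets you compare the maximum before and after.

```python
def haircut(latencies, percent=1):
    """Drop the highest `percent` percent of samples; return the rest, sorted."""
    ordered = sorted(latencies)
    drop = len(ordered) * percent // 100  # integer count of samples removed
    return ordered[:len(ordered) - drop]

samples = [100] * 99 + [50000]   # hypothetical latencies in µs, one spike
trimmed = haircut(samples)
before, after = max(samples), max(trimmed)  # maximum with and without the top 1%
```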
Observable Trends
The 99th percentile helps us identify trends in the data
– E.g., latency increases with request rate and size
[Figures: maximum latency and 99th-percentile latency, each plotted against request rate [req/s] and request size [bytes]]
Interpretation
Predictable maximum latencies are hard to achieve
– Tried to achieve predictability by selecting a good FT CORBA configuration
– Even in the fault-free case, end-to-end latencies have skewed distributions for almost all 960 parameter combinations
– Maximums are several orders of magnitude higher than averages
– Unpredictability cannot be isolated to a single component
Magical 1%: achieving predictability through statistical approaches
– We remove 1% of the highest measured latencies
– Remaining samples have more deterministic properties
• 99th percentile helps us identify trends in the data
– This allows us to extract tunable, predictable behavior out of fairly complex, dependable systems
Experimental Evaluation of 18-749 Projects
Requirements for experimental evaluation
– List of client invocations
– Probes
– Graphs
Tips
Digging deeper
Requirements for Experimental Evaluation
Things to hand in:
– List of client invocations: the server methods you’re going to exercise
– Raw data from the 7 probes in your application
– Graphs of end-to-end latency
– Interpretation of the results
Constraints
– All clients must run on separate machines
– Each client must issue at least 10,000 requests
– All requests must receive a reply (two-way invocations)
– The middle tier must have 2 replicas (e.g., primary & backup)
– Try all 48 combinations of the following:
• Number of clients: 1, 4, 7, 10
• Size of reply message: original, 256, 512, 1024 bytes
• Inter-request time: 0 (no pause), 20, 40 ms
Administrative
– Each team must designate a chief experimenter
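One way to make sure no combination is skipped is to generate the run list from the three parameter lists in the constraints. The sketch below assumes nothing beyond those lists; the names are hypothetical and "original" stands for the unmodified reply size.

```python
from itertools import product

CLIENTS = [1, 4, 7, 10]
REPLY_SIZES = ["original", 256, 512, 1024]   # bytes, except "original"
INTER_REQUEST_MS = [0, 20, 40]

def project_runs():
    """Every (clients, reply size, inter-request time) combination to run."""
    return list(product(CLIENTS, REPLY_SIZES, INTER_REQUEST_MS))

runs = project_runs()   # 4 * 4 * 3 = 48 experiment configurations
```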
List of Client Invocations
METHOD       ONE_WAY  DB_ACCESS  SZ_REQUEST  SZ_REPLY
createObj()  No       Yes        16          4
getInfo()    No       Yes        4           256
deleteObj()  No       Yes        4           4
METHOD: name of the remote invocation
ONE_WAY: is it a one-way invocation (no reply)?
DB_ACCESS: does it require a DB access (all 3 tiers are involved)?
SZ_REQUEST: size of the forward message before marshaling (the combined sizes of all the in and inout parameters)
SZ_REPLY: size of the return message before marshaling (the combined sizes of all the out and inout parameters)
Application Modifications
Use only two-way invocations
– The client must receive a reply from the server for each invocation
– Suggestion: have at least 2 different invocations in your benchmark
Tunable size of replies
– Add a variable-sized parameter that is returned by the server (e.g., sequence<octet>)
– Try the following reply sizes: original, 256 bytes, 512 bytes and 1024 bytes
Inter-request time
– Insert a pause in-between requests
– Try the following pauses: 0 (no pause), 20, 40 ms
– CAUTION:
• sleep(0) inserts a non-zero pause
• On most Linux kernels, you cannot pause for less than 10 ms
• For more information: http://www.atl.lmco.com/projects/QoS/RTOS_html/periodic.html
Experiments Make Your Life Meaningful
Stages of an Invocation
[Figure: the request and reply paths between client and server. On each side the message crosses the application, replication, middleware and network layers, with an “in” and an “out” point at each layer; the server also accesses the database]
Data Probes (1 of 7)
Probe P1
Legend
${STY} Replication style (ACTIVE or WARM_PASSIVE)
${C} Number of clients
${IRT} Inter-request time (in µs)
${BYT} Reply size (in bytes)
${HOST} Hostname
${N} Your team number
File name: DATA749_app_out_cli_${STY}_2srv_${C}cli_${IRT}us_${BYT}req_${HOST}_team${N}.txt
Data: Time (in µs) when each request is issued
Example:
67605
69070
69877
72807
...
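Building the file names by hand for every run is error-prone, so a small helper can fill in the template. The sketch below follows the file-name pattern from the slide; the function and parameter names are hypothetical, and the argument values in the usage line are made up.

```python
def probe_file_name(probe, side, style, clients, irt_us, reply_bytes, host, team):
    """Fill in the DATA749 template.

    probe: "out", "in", "msg" or "source"; side: "cli" or "srv";
    style: "ACTIVE" or "WARM_PASSIVE"; irt_us: inter-request time in µs.
    """
    return (f"DATA749_app_{probe}_{side}_{style}_2srv_{clients}cli_"
            f"{irt_us}us_{reply_bytes}req_{host}_team{team}.txt")

name = probe_file_name("out", "cli", "ACTIVE", 4, 20000, 256, "black", 3)
```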
Data Probes (2 of 7)
Probe P2
File name: DATA749_app_in_cli_${STY}_2srv_${C}cli_${IRT}us_${BYT}req_${HOST}_team${N}.txt
Data: Time (in µs) when each reply is received
Example:
67605
69070
69877
72807
...
Data Probes (3 of 7)
Probe P3
File name: DATA749_app_msg_cli_${STY}_2srv_${C}cli_${IRT}us_${BYT}req_${HOST}_team${N}.txt
Data: Name of each invocation
Example:
createObj()
createObj()
getInfo()
deleteObj()
...
Data Probes (example)
Example:
probe1.record(new Long(gettimeofday()));   // P1: time the request is issued
remoteFactory.createObj();                 // the two-way remote invocation
probe2.record(new Long(gettimeofday()));   // P2: time the reply is received
probe3.record(new String("createObj()"));  // P3: name of the invocation
Data Probes (4 of 7)
Probe P4
File name: DATA749_app_in_srv_${STY}_2srv_${C}cli_${IRT}us_${BYT}req_${HOST}_team${N}.txt
Data: Time (in µs) when each request is received
Example:
67605
69070
69877
72807
...
Data Probes (5 of 7)
Probe P5
File name: DATA749_app_out_srv_${STY}_2srv_${C}cli_${IRT}us_${BYT}req_${HOST}_team${N}.txt
Data: Time (in µs) when each reply is completed
Example:
67605
69070
69877
72807
...
Data Probes (6 of 7)
Probe P6
File name: DATA749_app_msg_srv_${STY}_2srv_${C}cli_${IRT}us_${BYT}req_${HOST}_team${N}.txt
Data: Name of each invocation
Example:
createObj()
createObj()
getInfo()
deleteObj()
...
Data Probes (7 of 7)
Probe P7
File name: DATA749_app_source_srv_${STY}_2srv_${C}cli_${IRT}us_${BYT}req_${HOST}_team${N}.txt
Data: Hostname of client sending the invocation
Example:
black
black
blue
magenta
...
Probe Invariant
[Figure: invocation diagram with client-side probes P1 to P3 and server-side probes P4 to P7]
Probes at the same side and same level must have the same number of records!
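The invariant is easy to check automatically before any analysis. The sketch below assumes the grouping implied by the probe file names (P1 to P3 on the client, P4 to P7 on the server, all at the application level); the function name and the counts are hypothetical.

```python
# Assumed grouping, taken from the _cli / _srv parts of the probe file names.
PROBE_GROUPS = {"client": ["P1", "P2", "P3"],
                "server": ["P4", "P5", "P6", "P7"]}

def check_probe_invariant(record_counts):
    """Return the sides whose probes disagree on the number of records.

    record_counts maps a probe name ("P1".."P7") to its number of records.
    """
    violations = []
    for side, probes in PROBE_GROUPS.items():
        counts = {p: record_counts[p] for p in probes}
        if len(set(counts.values())) > 1:
            violations.append((side, counts))
    return violations

counts = {"P1": 10000, "P2": 10000, "P3": 10000,
          "P4": 40000, "P5": 40000, "P6": 39999, "P7": 40000}
problems = check_probe_invariant(counts)   # flags the server side
```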
Computing End-To-End Latency
For request i: $\mathrm{Latency}(i) = P_2(i) - P_1(i)$
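As a small sketch, the end-to-end latency is the element-wise difference between the P2 and P1 timestamps of one client. The function name and the sample timestamps are made up for illustration.

```python
def end_to_end_latency(p1, p2):
    """Latency(i) = P2(i) - P1(i), in µs, for each request of one client."""
    if len(p1) != len(p2):
        raise ValueError("P1 and P2 must have the same number of records")
    return [reply - request for request, reply in zip(p1, p2)]

# Hypothetical timestamps in µs: request issued (P1), reply received (P2).
latencies = end_to_end_latency([67605, 69877], [69070, 72807])
```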
Computing the Components of Latency
For request i:
$\mathrm{Server}(i) = P_5(i) - P_4(i)$
$\mathrm{Middleware}(i) = \mathrm{Latency}(i) - \mathrm{Server}(i)$
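The two components can be computed side by side. This sketch assumes the simplest case, a single client, so that record i on the server corresponds to request i on the client; with several clients the server records would first have to be matched to each client (for example using P7). All names and values are illustrative.

```python
def latency_components(latency, p4, p5):
    """Split each end-to-end latency into server time and middleware time.

    latency: Latency(i) in µs; p4, p5: server-side timestamps in µs.
    Assumes record i of every list refers to the same request.
    """
    server = [done - received for received, done in zip(p4, p5)]
    middleware = [total - srv for total, srv in zip(latency, server)]
    return server, middleware

server, middleware = latency_components([1465, 2930], [100, 2000], [600, 3000])
```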
Computing the Request Arrival Rate
For request i: $\mathrm{Req\_rate}(i) = \dfrac{10^6}{P_4(i) - P_4(i-1)}$
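Because the timestamps are in µs, the factor 10^6 converts the gap between consecutive arrivals into requests per second. A minimal sketch with a hypothetical function name; how to treat two requests with the same timestamp is an assumption (reported here as an infinite rate).

```python
def request_rates(p4):
    """Req_rate(i) = 10^6 / (P4(i) - P4(i-1)), in requests per second."""
    rates = []
    for previous, current in zip(p4, p4[1:]):
        gap_us = current - previous
        # Assumption: identical timestamps are reported as an infinite rate.
        rates.append(1e6 / gap_us if gap_us > 0 else float("inf"))
    return rates

rates = request_rates([0, 1000, 3000])   # hypothetical arrival times in µs
```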
Computing the Server Throughput
For request i: $\mathrm{Throughput}(i) = \dfrac{10^6}{P_4(i) - P_4(i-1)} \cdot \mathrm{replySize}$
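The throughput formula scales the per-second request rate by the reply size, giving bytes per second. In this sketch the function name is hypothetical, and skipping pairs of requests with identical timestamps is an assumption made to avoid a division by zero.

```python
def server_throughput(p4, reply_size):
    """Throughput(i) = 10^6 / (P4(i) - P4(i-1)) * replySize, in bytes/s."""
    return [1e6 * reply_size / (current - previous)
            for previous, current in zip(p4, p4[1:])
            if current > previous]   # assumption: skip zero-length gaps

throughput = server_throughput([0, 1000, 3000], reply_size=256)
```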
Graphs Required
Line plots of latency for increasing number of clients and different reply sizes (no pause)
Area plots of (mean, max) latency and (mean, 99%) latency, sorted by increasing mean values
Bar graphs of latency component break-down for outliers and normal requests
3D scatter plots of reply size and request rate impact on max and 99% latency
Latency vs. throughput
Interpretation of Results
Short write-up containing the “lessons learned” from the experiments
What did you learn about your system?
– What can you tell (good or bad) about the performance, dependability and robustness of your application?
– Were the results surprising?
– If you observed some behavior you didn’t expect, how can you explain it?
– What further experiments would be needed to verify your hypothesis?
Do your results confirm or refute the magical 1% theory?
Tips for Experimental Evaluation
Avoid interference
– Use separate machines for each client, server replica, NamingService/JNDI, FT manager, database, etc.
– Make sure there are no other processes using your CPU or bandwidth
Minimize impact of monitoring
– Store data in pre-allocated memory buffer
– Flush buffers to the disk at the end
– Record timestamps as time from the start of the process
• Use 4-byte integers (long) for the timestamps
Automate the experimental process as much as possible
– Create scripts for launching the servers and clients, for collecting data, for analyzing it and for creating the graphs
Use Matlab for graphs and data processing
– This is installed on the ECE cluster and is available to students
• Can also download it from https://www.cmu.edu/myandrew/
– If you need help with plotting your graphs, please send email to us
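The monitoring tips above (pre-allocated buffer, flush at the end, timestamps relative to process start) might look like the sketch below. The `Probe` class is hypothetical and written in Python only for illustration; it deviates from the slide in one respect, storing 8-byte integers rather than 4-byte ones, which is an assumption made so that long runs cannot overflow.

```python
import time
from array import array

class Probe:
    """Records µs timestamps in memory and writes them out only at the end."""

    def __init__(self, capacity):
        self._start = time.perf_counter()           # reference point: probe creation
        self._buffer = array("q", [0]) * capacity   # pre-allocated 8-byte integers
        self._count = 0

    def record(self):
        # No I/O and no allocation on the measurement path.
        elapsed_us = int((time.perf_counter() - self._start) * 1_000_000)
        self._buffer[self._count] = elapsed_us
        self._count += 1

    def flush(self, path):
        # Called once, after the experiment has finished.
        with open(path, "w") as out:
            for i in range(self._count):
                out.write(f"{self._buffer[i]}\n")
```

A client would create one probe per data file with a capacity of at least the number of requests (10,000 or more), call `record()` around each invocation, and call `flush()` once after the last reply.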
Digging Deeper
Do the same thing while injecting faults
Other probes
– CPU usage (time spent in kernel, user mode)
– Memory (total, resident set)
– Bandwidth usage
– Context switches
– Major/minor page faults (page not in physical memory)
Other ways to represent data
– Boxplots for end-to-end latency
– Impact of varying #clients, size, request rate on #outliers, size of outliers, latency, etc.
– Do you see multi-modal distributions (can you explain them)?
Interpretation of results
– Are outliers isolated or do they come in bursts?
– What is the source of the outliers?
– Can you predict anything about the behavior of your system?
– What questions can you answer by looking at this data?
Summary of Lecture
What matters to you?
– What experiments should you run?
– What data should you collect?
– How should you present your data?
– What should you analyze?
– What lessons might you learn about your system?
Email all questions to the course mailing list
– The other two TAs and I (Tudor) are on this list
– We’re happy to sit down and work out the details with you and to help you run your experiments
It might sound like a lot of work, but the hard part is behind you: you’ve already built your system
– Now, it’s time to understand what you actually built!