8/7/2019 31_Performanceforcasting_findingbottlenecksbeforetheyhappen
1/24
Performance Forecasting:
Finding bottlenecks before they happen
Ali Saidi, Nathan Binkert, Steve Reinhardt, Trevor Mudge
University of Michigan / HP Labs / AMD Research / ARM R&D
June 23, 2009
-
Performance Analysis Challenge

Today's workloads and systems are complex: many layers of HW (disk, network) and SW (application, OS).
How do we evaluate systems in the design stage? Where are the bottlenecks?
Conventional tools are inadequate.

[Diagram: two machines, each a stack of Application / Kernel / Protocol Driver / Hardware (NIC, disk), connected by a network.]
-
Solution: Global Critical Path

Directly identifies true bottlenecks; accounts for overlapped latencies.
Used successfully in the past in isolated domains:
  Fields et al.: out-of-order CPUs
  Barford and Crovella: TCP
  Yang and Miller: MPI
-
Building a Global Critical Path

Requires a global event dependence graph.
Challenge: typically requires detailed knowledge across many domains!
Solution: automatically extract the dependence graph from interacting state machines.
-
End Result

Our simulation technique directly identifies:
  the current bottleneck
  how much improvement until the next bottleneck
  what the next bottleneck will be
(minutes)

Conventional simulation approach:
  hypothesize a bottleneck
  prototype a solution
  simulate the solution
  test the hypothesis; repeat if incorrect
(hours/days)
-
Constructing a Dependence Graph

Systematically map state machines into a global dependence graph.
Most HW is already specified as state machines; extract implicit state machines from SW.

[Diagram: App -> Protocol -> Driver -> NIC on Machine 1, across the Network, to NIC -> Driver -> Protocol -> App on Machine 2.]
-
Explicit State Machine Conversion

State machine -> dependence graph:
  states become dependence edges
  transitions become dependence nodes
Dependence edge weight = time spent in the state.
-
Explicit State Machine Conversion (example)

[Diagram: a state machine with states A, B, C, D takes transitions at times t1-t4. The resulting dependence graph is the chain of transition nodes Start, A->B, B->D, D->B, B->C, with edge weights t1-t0, t2-t1, t3-t2, t4-t3 (the time spent in each state).]
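The conversion in the example above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' tool; the "src->dst" node naming and the "Start" sentinel are assumptions.

```python
def trace_to_dependence_edges(trace, t0=0):
    """trace: time-ordered (timestamp, from_state, to_state) transitions.
    Returns (prev_node, node, weight) dependence edges: each node is a
    transition, and each edge weight is the time spent in the state
    between the two transitions."""
    edges = []
    prev_node, prev_t = "Start", t0
    for t, src, dst in trace:
        node = f"{src}->{dst}"
        edges.append((prev_node, node, t - prev_t))
        prev_node, prev_t = node, t
    return edges

# The slide's example: A->B at t1, B->D at t2, D->B at t3, B->C at t4
print(trace_to_dependence_edges([(1, "A", "B"), (2, "B", "D"),
                                 (3, "D", "B"), (4, "B", "C")]))
```

Feeding in any timestamped transition log produces the weighted chain directly, which is why hardware models that are already state machines need no extra instrumentation.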
-
What about software?

F() { int a; H(); S(); }
H() { int z; return L(); }
L() { int z; }

[Diagram: timestamps t1-t4 mark the calls and returns. The implicit software state machine yields the dependence graph chain Start, F->H, H->L, L->F, F->S, with edge weights t1-t0, t2-t1, t3-t2, t4-t3.]
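Extracting the implicit software state machine can be sketched as follows: treat the currently executing function as the state, and each call or return as a transition. This is an illustrative assumption about the extraction (the function names follow the slide; the tail call in H is modeled here as an ordinary call/return pair, not the direct L-to-F return the slide shows).

```python
def call_trace_to_transitions(events):
    """events: time-ordered (timestamp, kind, fn) with kind 'call' or 'ret'.
    Returns (timestamp, from_state, to_state) transitions between the
    implicit states (the functions on the call stack)."""
    stack, transitions = ["Start"], []
    for t, kind, fn in events:
        if kind == "call":
            transitions.append((t, stack[-1], fn))   # caller -> callee
            stack.append(fn)
        else:                                        # 'ret' from fn
            stack.pop()
            transitions.append((t, fn, stack[-1]))   # callee -> caller
    return transitions

# F calls H, H calls L, L and H return, F calls S
print(call_trace_to_transitions([
    (0, "call", "F"), (1, "call", "H"), (2, "call", "L"),
    (3, "ret", "L"), (3, "ret", "H"), (4, "call", "S")]))
```

The resulting transition list can be fed through the same state-machine-to-dependence-graph conversion used for hardware.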
-
State Machine Interactions

Link up pieces of the dependence graph through these interactions.
Queues are the interaction points:
  without them, back pressure can't be modeled
  abstract entities
Interactions are annotated in models and code, developed iteratively; the analysis can pinpoint problems.
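A sketch of how queue annotations could link two machines' graphs (the event format, node names, and FIFO matching are assumptions, not the authors' annotation scheme): each dequeue gets a zero-weight dependence edge from the enqueue that produced its item, and once the queue has been full, the next enqueue gets a back-pressure edge from the dequeue that freed a slot.

```python
from collections import deque

def queue_edges(events, capacity):
    """events: time-ordered ('enq'|'deq', dependence_graph_node) pairs
    for one queue. Returns zero-weight cross-machine dependence edges."""
    edges, pending = [], deque()   # pending: enqueue nodes still queued
    last_deq, was_full = None, False
    for kind, node in events:
        if kind == "enq":
            if was_full and last_deq is not None:
                edges.append((last_deq, node, 0))       # back pressure
                was_full = False
            pending.append(node)
            was_full = was_full or len(pending) == capacity
        else:
            edges.append((pending.popleft(), node, 0))  # data dependence
            last_deq = node
    return edges

# Two-entry queue: E3 must wait for D1 to free a slot
print(queue_edges([("enq", "E1"), ("enq", "E2"), ("deq", "D1"),
                   ("enq", "E3"), ("deq", "D2"), ("deq", "D3")], 2))
```

The back-pressure edges are what let the analysis blame an upstream producer for a stall that physically occurs downstream.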
-
Interaction Example

Simplified IP-stack state machine: Get Packet -> Append hdrs -> Find Dest -> Enqueue in TXQ.
Simplified TX NIC driver state machine: Wait on TXQ -> Dequeue from TXQ -> Process Packet -> Enqueue in TX ring.
-
Interaction Example (continued)

[Diagram: the two state machines unrolled into a single dependence graph over time. The IP-stack chain (Get Pkt, App hdrs, Find dest, EnQ, repeated for successive packets) and the NIC-driver chain (Wait TXQ, DeQ, Process Pkt, EnQ) are linked through the TXQ interaction. Edge weights shown include 5, 4, 2, 4, 8 and 13, 3, 10, 2, plotted against a running time axis.]
-
Finding the Global Critical Path

Use standard graph-analysis techniques: locate the longest path through the graph.

[Diagram: the dependence graph from the previous slide with the longest (critical) path highlighted; edge weights 5, 4, 2, 4, 8, 0 (13), 3, 10, 2.]
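Because the dependence graph is a DAG whose nodes come out of a timestamped trace already in topological order, the longest path falls out of standard dynamic programming. A sketch (the node/edge representation is assumed; this is not the authors' implementation):

```python
def critical_path(nodes, edges):
    """nodes: topologically ordered node list; edges: (src, dst, weight).
    Returns (length, path) of the longest path starting at nodes[0]."""
    succ = {n: [] for n in nodes}
    for s, d, w in edges:
        succ[s].append((d, w))
    dist = {n: float("-inf") for n in nodes}
    pred = {n: None for n in nodes}
    dist[nodes[0]] = 0
    for n in nodes:                      # relax in topological order
        if dist[n] == float("-inf"):
            continue
        for d, w in succ[n]:
            if dist[n] + w > dist[d]:
                dist[d], pred[d] = dist[n] + w, n
    end = max(dist, key=dist.get)        # farthest reachable node
    path = []
    while end is not None:
        path.append(end)
        end = pred[end]
    return dist[path[0]], path[::-1]

# Tiny example: the 13-cycle wait dominates the 5+4 work chain
print(critical_path(["Start", "GetPkt", "WaitTXQ", "EnQ"],
                    [("Start", "GetPkt", 5), ("Start", "WaitTXQ", 13),
                     ("GetPkt", "EnQ", 4), ("WaitTXQ", "EnQ", 3)]))
```

This is linear in the size of the graph, which is what makes re-running the analysis after a what-if change cheap.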
-
Critical States & Predicting Speedup

Aggregate states on the critical path; the most frequent state is the bottleneck.
The dependence graph contains all transitions and interactions, not just the ones that compose the critical path or where waiting occurred.
To predict speedup: modify weights on the critical path, then re-analyze the data to see how the critical path changes; the length of the next critical path gives the potential speedup.
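The two steps above can be sketched as follows. Aggregating by total time spent is my interpretation of the slide's "most frequent state"; the state and time values are made up; and speedup_bound is a deliberate simplification that only rescales the current path, whereas the real analysis re-runs the longest-path search because shrinking one state can move the critical path elsewhere.

```python
from collections import Counter

def bottleneck(path_states):
    """path_states: (state, time_spent) pairs along the critical path.
    Returns the (state, total_time) accounting for the most time."""
    totals = Counter()
    for state, t in path_states:
        totals[state] += t
    return totals.most_common(1)[0]

def speedup_bound(path_states, state, factor):
    """What-if: scale time in `state` by `factor` and compare path
    lengths. An upper bound only, since the path may shift."""
    old = sum(t for _, t in path_states)
    new = sum(t * factor if s == state else t for s, t in path_states)
    return old / new

path = [("WaitTXQ", 13), ("Copy", 4), ("WaitTXQ", 10), ("DMA", 3)]
print(bottleneck(path))                       # ('WaitTXQ', 23)
print(speedup_bound(path, "WaitTXQ", 0.0))    # eliminate the wait entirely
```

Comparing this bound against the re-analyzed critical path shows how much headroom remains before the next bottleneck takes over.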
-
Resource Dependence Loops

The critical path can sometimes be improved without reducing the latency of any task: in resource-constrained environments, the critical path can be shortened by providing more resources.
-
Resource Dependence Loops (continued)

The analysis automatically finds candidates; the addition of buffering changes the critical path.

[Diagram: timelines for three interacting state machines across two iterations; machine A cycles A0 -> A1 -> A2 -> A3, B cycles B0 -> B3, and C cycles C0 -> C2, with a resource dependence loop threading through all three each iteration.]
-
Workloads

Linux 2.6.18.
SinkGen: streaming benchmark from CERN; we analyzed the transmit side.
Lighttpd: high-performance web server; uses non-blocking I/O to manage connections; used by large websites.
The metric is the bandwidth achieved.
-
TCP Transmit

Start with the default M5 system parameters:
1. Capture bottleneck data from that system
2. Locate the current bottleneck
3. Predict performance when the bottleneck is removed
4. Repeat steps 2 and 3 for successive bottlenecks
5. Verify that the locations and predictions are correct
-
TCP Streaming Benchmark

[Bar chart: predicted vs. actual bandwidth, 0-4000, for Configs 2-4, starting from the initial configuration.]

How did we do? Run experiments making the suggested changes.
-
TCP Streaming Benchmark (continued)

[Bar chart: predicted vs. actual bandwidth for Configs 3 and 4, predicting again, this time starting from configuration 2.]

Prediction error: Config 3: 1%, Config 4: 15%.
-
TCP Streaming Benchmark (continued)

[Bar chart: predicted vs. actual bandwidth for Config 4, predicting one last time, starting from configuration 3.]

Prediction error: 3%.
-
Experiments and Errors

Additional experiments are in the paper (multi-core speedup of the web server).
We describe why errors occurred by comparing the modified dependence graph to the observed graph from the new simulation.
-
Conclusion

Architects are increasingly looking at system-level issues for performance.
We apply critical path analysis to complete systems composed of concurrent components:
  spanning multiple layers of HW & SW
  automating the extraction of dependence graphs
We identify end-to-end bottlenecks in networked systems:
  critical tasks
  resource dependence loops
  performance of hypothetical systems, in minutes not hours
-
Questions?