
    Performance Forecasting:

    Finding bottlenecks before they happen

    Ali Saidi, Nathan Binkert, Steve Reinhardt, Trevor Mudge

    University of Michigan

    HP Labs

    AMD Research, ARM R&D


    June 23, 2009


    Performance Analysis Challenge

    Today's workloads & systems are complex
    Many layers of HW (disk, network) and SW (app, OS)

    How do we evaluate systems in the design stage?
    Where are the bottlenecks?

    Conventional tools inadequate

    [Figure: two networked machines; each runs an application, kernel, and protocol driver over hardware (NIC, disk), connected through the network]

    [Figure: the same two-machine stack diagram, repeated for context]

    Solution: Global Critical Path

    Directly identifies true bottlenecks
    Accounts for overlapped latencies

    Used successfully in the past in isolated domains:
    Fields et al.: out-of-order CPUs
    Barford and Crovella: TCP
    Yang and Miller: MPI


    Building a Global Critical Path

    Requires global event dependence graph

    Challenge: typically requires detailed knowledge across many domains!

    Solution: automatically extract the dependence graph from interacting state machines


    End Result

    Our simulation technique directly identifies, in minutes:
    The current bottleneck
    How much improvement is possible until the next bottleneck
    What the next bottleneck will be

    Conventional simulation approach, taking hours or days per iteration:
    Hypothesize a bottleneck
    Prototype a solution
    Simulate the solution
    Test the hypothesis; repeat if incorrect


    Constructing a Dependence Graph

    Systematically map state machines into a global dependence graph

    Most HW is already specified as state machines
    Extract implicit state machines from SW

    [Figure: the system as a chain of state machines: App → Protocol → Driver → NIC → Network → NIC → Driver → Protocol → App, spanning Machine 1 and Machine 2]

    Explicit State Machine Conversion

    State Machine → Dependence Graph:
    Nodes (states) become edges
    Edges (transitions) become nodes

    dependence edge weight = time spent in state


    Explicit State Machine Conversion (example)

    [Figure: a state machine with states A, B, C, D makes transitions at times t1-t4; the resulting dependence graph is the chain Start → A→B → B→D → D→B → B→C, with edge weights t1-t0, t2-t1, t3-t2, t4-t3, each the time spent in the state just left]
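    To make the conversion concrete, here is a minimal Python sketch (not the authors' tooling; Transition and build_chain are illustrative names) that turns a timestamped transition trace like the one above into a dependence-graph chain:

        # Minimal sketch: convert a timestamped state-machine trace into a
        # dependence-graph chain. Each transition becomes a node; the edge
        # into it is weighted by the time spent in the state being left.
        from dataclasses import dataclass

        @dataclass
        class Transition:
            src: str     # state being left
            dst: str     # state being entered
            time: float  # timestamp of the transition

        def build_chain(start_time, transitions):
            nodes = ["Start"]
            edges = []  # (from_node, to_node, weight)
            prev_time = start_time
            for t in transitions:
                node = f"{t.src}->{t.dst}"
                edges.append((nodes[-1], node, t.time - prev_time))
                nodes.append(node)
                prev_time = t.time
            return nodes, edges

        # The slide's example: transitions at t1..t4 (here 1..4), with t0 = 0
        trace = [Transition("A", "B", 1), Transition("B", "D", 2),
                 Transition("D", "B", 3), Transition("B", "C", 4)]
        print(build_chain(0, trace))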


    What about software?

    void H(void);
    void L(void);
    void S(void);   /* called by F; body not shown on the slide */

    void F(void) {
        int a;
        H();
        S();
    }

    void H(void) {
        int z;
        L();        /* the slide's "ret L();": H returns through L */
    }

    void L(void) {
        int z;
    }

    Resulting dependence-graph chain: Start → F→H → H→L → L→F → F→S, with edge weights t1-t0, t2-t1, t3-t2, t4-t3 (the time spent in F, then H, then L, then F again)
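    One way to extract such an implicit state machine from software is to hook function entry and exit. A sketch using Python's standard sys.setprofile hook (standing in for the paper's instrumentation, which is not shown here):

        # Sketch: record call/return transitions with timestamps via the
        # profiling hook; the event list is the raw state-machine trace.
        import sys, time

        events = []  # (kind, function_name, timestamp)

        def tracer(frame, event, arg):
            if event in ("call", "return"):
                events.append((event, frame.f_code.co_name, time.perf_counter()))

        def L(): z = 0
        def H(): z = 0; return L()
        def S(): pass
        def F():
            a = 0
            H()
            S()

        sys.setprofile(tracer)
        F()
        sys.setprofile(None)
        for kind, name, ts in events:
            print(kind, name, ts)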


    State Machine Interactions

    Link up pieces of the dependence graph through these interactions

    Queues are the interaction points (sketched below)
    Without them, back pressure can't be modeled
    Abstract entities

    Annotated in models and code
    Developed iteratively
    Analysis can pinpoint problems
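    A minimal sketch of such an annotated queue (AnnotatedQueue and its log format are assumptions for illustration, not the paper's API); the bounded capacity is what lets back pressure show up in the graph:

        # Sketch: a queue that logs enqueue/dequeue/block events so that the
        # producer and consumer state machines can be linked in the graph.
        import collections, time

        class AnnotatedQueue:
            def __init__(self, name, capacity):
                self.name, self.capacity = name, capacity
                self.items = collections.deque()
                self.log = []  # (event, item_id, timestamp)

            def enqueue(self, item_id):
                if len(self.items) >= self.capacity:
                    # Producer must wait: this is where back pressure appears
                    self.log.append(("block", item_id, time.perf_counter()))
                    return False
                self.items.append(item_id)
                self.log.append(("enq", item_id, time.perf_counter()))
                return True

            def dequeue(self):
                item_id = self.items.popleft()
                self.log.append(("deq", item_id, time.perf_counter()))
                return item_id

        txq = AnnotatedQueue("TXQ", capacity=2)
        txq.enqueue(1); txq.enqueue(2); txq.enqueue(3)  # third enqueue blocks
        print(txq.log)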


    Interaction Example

    [Figure: Simplified IP stack state machine: Get Packet → Append hdrs → Find Dest → Enqueue in TXQ. Simplified TX NIC driver state machine: Wait on TXQ → Dequeue from TXQ → Process Packet → Enqueue in TX ring. The TXQ is the interaction point between the two machines.]


    Interaction Example (continued)

    [Figure: the two state machines unrolled over time into dependence-graph nodes (Get Pkt→App hdrs, App hdrs→Find dest, Find dest→EnQ, EnQ→Get pkt; EnQ→Wait TXQ, Wait TXQ→DeQ, DeQ→Process Pkt, Process Pkt→EnQ), with edge weights 5, 4, 2, 4, 8 and 13, 3, 10, 2; the current time advances through 0, 5, 9, 11, 13, 15, 16, 23, 26]


    Finding Global Critical Path

    Use standard graph analysis techniques
    Locate the longest path through the graph (sketched below)

    [Figure: the dependence graph from the previous slide with the longest (critical) path highlighted; edge weights 5, 4, 2, 4, 8 and 0 (13), 3, 10, 2]
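    Because the dependence graph is acyclic, the longest path can be found in linear time by relaxing edges in topological order. A small sketch (critical_path is an illustrative name; graphlib is Python's standard topological sorter):

        # Sketch: longest (critical) path in a weighted DAG.
        # edges maps node -> list of (successor, weight).
        import graphlib

        def critical_path(edges):
            preds = {n: set() for n in edges}
            for u, outs in edges.items():
                for v, _ in outs:
                    preds.setdefault(v, set()).add(u)
            order = graphlib.TopologicalSorter(preds).static_order()

            dist = {n: float("-inf") for n in preds}
            parent = {}
            dist["Start"] = 0
            for u in order:
                for v, w in edges.get(u, ()):
                    if dist[u] + w > dist[v]:
                        dist[v] = dist[u] + w
                        parent[v] = u

            end = max(dist, key=dist.get)   # node where the longest path ends
            path, n = [end], end
            while n in parent:
                n = parent[n]
                path.append(n)
            return dist[end], path[::-1]

        edges = {"Start": [("A->B", 5)], "A->B": [("B->C", 4)], "B->C": []}
        print(critical_path(edges))  # (9, ['Start', 'A->B', 'B->C'])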


    Critical States & Predicting Speedup

    Aggregate states on the critical path
    The most frequent state is the bottleneck

    The dependence graph contains all transitions and interactions
    Not just the ones that compose the critical path or where waiting occurred

    Modify weights on the critical path
    Re-analyze the data to see how the critical path changes
    The next critical path's length gives the potential speedup (see the sketch below)
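    A sketch of that what-if step, reusing critical_path from the previous sketch (the rule for matching the bottleneck state to edges is illustrative):

        # Sketch: shrink the weights attributed to the bottleneck state and
        # recompute. An edge leaving node "X->S" is time spent in state S,
        # so scale edges whose source transition ends in the bottleneck.
        def predict_speedup(edges, state, factor):
            modified = {
                u: [(v, w * factor if u.endswith("->" + state) else w)
                    for v, w in outs]
                for u, outs in edges.items()
            }
            old_len, _ = critical_path(edges)
            new_len, new_path = critical_path(modified)
            return old_len / new_len, new_path

        edges = {"Start": [("A->B", 5)], "A->B": [("B->C", 4)], "B->C": []}
        print(predict_speedup(edges, "B", 0.5))  # halve time in state B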


    Resource Dependence Loops

    The critical path can sometimes be improved without reducing the latency of any task
    In resource-constrained environments, the critical path can be shortened by providing more resources


    Resource Dependence Loops

    Analysis automatically finds candidates
    Adding buffering changes the critical path (see the sketch below)

    [Figure: timelines of three interacting state machines A (states A0-A3), B (B0-B3), and C (C0-C2) over two iterations; extra buffering shifts the critical path between iterations 1 and 2]
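    The graph effect of buffering can be sketched directly: with a buffer of capacity k, the producer of item i must wait for the consumer of item i-k, which adds a resource dependence edge; enlarging the buffer removes those edges (illustrative names):

        # Sketch: resource dependence edges induced by a bounded buffer.
        def resource_edges(n_items, capacity):
            edges = []
            for i in range(capacity, n_items):
                edges.append((f"consume[{i - capacity}]", f"produce[{i}]"))
            return edges

        print(resource_edges(4, 1))  # tight buffer: three resource edges
        print(resource_edges(4, 4))  # ample buffer: no resource edges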


    Workloads

    Linux 2.6.18
    SinkGen: streaming benchmark from CERN; analyzed the transmit side

    Lighttpd: high-performance web server
    Uses non-blocking I/O to manage connections
    Used by large websites

    Metric is bandwidth achieved


    TCP Transmit

    Start with default M5 system parameters
    1. Capture bottleneck data from that system
    2. Locate the current bottleneck
    3. Predict performance when the bottleneck is removed
    4. Repeat steps 2 and 3 for successive bottlenecks
    5. Verify that the locations and predictions are correct


    TCP Streaming Benchmark

    How did we do? Run experiments making the suggested changes.

    [Figure: predicted vs. actual bandwidth (axis 0-4000) for Config 2, Config 3, and Config 4, alongside the initial configuration]


    TCP Streaming Benchmark

    Predict performance again, this time starting with configuration 2.

    [Figure: predicted vs. actual bandwidth (axis 0-4000) for Config 3 and Config 4; prediction errors: Config 3: 1%, Config 4: 15%]


    TCP Streaming Benchmark

    Predict performance one last time, starting with configuration 3.

    [Figure: predicted vs. actual bandwidth (axis 0-3500) for Config 4; 3% error]


    Experiments and Errors

    Additional experiments are in the paper
    Multi-core speedup of the web server

    Describe why errors occurred
    Compare the modified dependence graph to the observed graph from the new simulation


    Conclusion

    Architects are increasingly looking at system-level issues for performance

    Apply critical path analysis to complete systems composed of concurrent components
    Span multiple layers of HW & SW
    Automate extraction of dependence graphs

    Identify end-to-end bottlenecks in networked systems
    Critical tasks
    Resource dependence loops
    Performance of hypothetical systems
    Minutes, not hours


    Questions?