Cluster Parallel


Transcript of Cluster Parallel

  • Goals for future from last year

    1. Finish scaling up. I want a kilonode program.

    2. Native learning reductions. Just like more complicated losses.

    3. Other learning algorithms, as interest dictates.

    4. Persistent Demonization



  • Some design considerations

    Hadoop compatibility: Widely available, scheduling and robustness

    Iteration-friendly: Lots of iterative learning algorithms exist

    Minimum code overhead: Don't want to rewrite learning algorithms from scratch

    Balance communication/computation: Imbalance on either side hurts the system

    Scalable: John has nodes aplenty

  • Current system provisions

    Hadoop-compatible AllReduce

    Various parameter averaging routines

    Parallel implementations of adaptive GD, CG, and L-BFGS

    Robustness and scalability tested up to 1K nodes and thousands of node hours

  • Basic invocation on single machine

    ./spanning_tree

    ../vw --total 2 --node 0 --unique_id 0 -d $1 --span_server localhost > node_0 2>&1 &

    ../vw --total 2 --node 1 --unique_id 0 -d $1 --span_server localhost

    killall spanning_tree

  • Command-line options

    --span_server: Location of the server for setting up the spanning tree

    --unique_id (=0): Unique id for the cluster parallel job

    --total (=1): Total number of nodes used in the cluster parallel job

    --node (=0): Node id in the cluster parallel job

  • Basic invocation on a non-Hadoop cluster

    Spanning-tree server: Runs on the cluster gateway, organizes communication

    ./spanning_tree

    Worker nodes: Each worker node runs VW

    ./vw --span_server <server> --total <total> --node <node_id> --unique_id <job_id> -d <data>

  • Basic invocation in a Hadoop cluster

    Spanning-tree server: Runs on the cluster gateway, organizes communication

    ./spanning_tree

    Map-only jobs: A map-only job is launched on each node using Hadoop streaming

    hadoop jar $HADOOP_HOME/hadoop-streaming.jar -Dmapred.job.map.memory.mb=2500 -input <input> -output <output> -file vw -file runvw.sh -mapper runvw.sh -reducer NONE

    Each mapper runs VW

    Model stored in <output>/model on HDFS

    runvw.sh calls VW; used to modify VW arguments

  • mapscript.sh example

    # Hadoop streaming has no direct option for the number of mappers,
    # so we calculate the split size indirectly from the total input size.
    total=
    mapsize=`expr $total / $nmappers`
    maprem=`expr $total % $nmappers`
    mapsize=`expr $mapsize + $maprem`

    # Start the spanning-tree server on the gateway.
    # Note the argument mapred.min.split.size, used to control the number of mappers.
    ./spanning_tree

    hadoop jar $HADOOP_HOME/hadoop-streaming.jar \
        -Dmapred.min.split.size=$mapsize \
        -Dmapred.map.tasks.speculative.execution=true \
        -input $in_directory -output $out_directory \
        -file ../vw -file runvw.sh -mapper runvw.sh -reducer NONE

  • Communication and computation

    Two main additions in cluster-parallel code:

    Hadoop-compatible AllReduce communication

    New and old optimization algorithms modified for AllReduce

  • Communication protocol

    Spanning-tree server runs as a daemon and listens for connections

    Workers connect via TCP with a node-id and a job-id

    Two workers with the same job-id and node-id are duplicates; the faster one is kept (speculative execution)

    Both ids are available as mapper environment variables in Hadoop:

    mapper=`printenv mapred_task_id | cut -d "_" -f 5`

    mapred_job_id=`echo $mapred_job_id | tr -d job_`

  • Communication protocol contd.

    Each worker connects to the spanning-tree server

    Server creates a spanning tree on the n nodes and communicates its parent and children to each node

    Node connects to parent and children via TCP

    AllReduce run on the spanning tree
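    To make the tree setup concrete, here is a minimal sketch of how a server could assign parent and child links, assuming the n nodes are laid out as a complete binary tree in heap order (node 0 at the root). This is only an illustration; the actual spanning_tree server may build the tree differently.

    // Sketch: assigning a spanning tree over n nodes, assuming a complete
    // binary tree in heap order. Illustrative only; not VW's spanning_tree code.
    #include <cstdio>

    struct TreeLinks {
        int parent;   // -1 for the root
        int child[2]; // -1 if the child does not exist
    };

    TreeLinks links_for(int node, int n) {
        TreeLinks t;
        t.parent = (node == 0) ? -1 : (node - 1) / 2;
        int left = 2 * node + 1, right = 2 * node + 2;
        t.child[0] = (left < n) ? left : -1;
        t.child[1] = (right < n) ? right : -1;
        return t;
    }

    int main() {
        int n = 7; // e.g., the 7-node tree on the AllReduce slides
        for (int i = 0; i < n; i++) {
            TreeLinks t = links_for(i, n);
            std::printf("node %d: parent %d, children %d %d\n",
                        i, t.parent, t.child[0], t.child[1]);
        }
        return 0;
    }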

  • AllReduce

    Every node begins with a number (vector); every node ends up with the sum

    [Figure: AllReduce on a binary tree of 7 nodes holding the values 1-7. The values are summed up the tree (partial sums 11 and 16 at the internal nodes, 28 at the root), then the total 28 is broadcast back down so that every node holds 28.]

    Extends to other functions: max, average, gather, . . .
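    The mechanics can be sketched in a few lines. The following single-process sketch simulates a tree AllReduce (sum) over a heap-ordered tree of values; it is illustrative only, while VW runs the same two phases (reduce up, broadcast down) over TCP connections between real nodes.

    // Sketch: tree AllReduce (sum) simulated in one process.
    // Phase 1: each node adds its children's subtree sums (reduce up the tree).
    // Phase 2: the root's total is pushed back down (broadcast).
    #include <cstdio>
    #include <vector>

    // Reduce: returns the sum of the subtree rooted at `node` (heap order).
    double reduce_up(std::vector<double>& val, int node) {
        int n = (int)val.size();
        for (int c = 2 * node + 1; c <= 2 * node + 2; c++)
            if (c < n) val[node] += reduce_up(val, c);
        return val[node];
    }

    // Broadcast: overwrite every node in the subtree with `total`.
    void broadcast_down(std::vector<double>& val, int node, double total) {
        int n = (int)val.size();
        val[node] = total;
        for (int c = 2 * node + 1; c <= 2 * node + 2; c++)
            if (c < n) broadcast_down(val, c, total);
    }

    int main() {
        std::vector<double> val = {1, 2, 3, 4, 5, 6, 7}; // one value per node
        double total = reduce_up(val, 0);                // 28 at the root
        broadcast_down(val, 0, total);                   // every node now holds 28
        for (double v : val) std::printf("%g ", v);
        std::printf("\n");
        return 0;
    }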


  • AllReduce Examples

    Counting: n = allreduce(1)

    Average: avg = allreduce(n_i) / allreduce(1)

    Non-uniform averaging: weighted_avg = allreduce(n_i w_i) / allreduce(w_i)

    Gather: node_array = allreduce((0, 0, ..., 1, ..., 0)), with the 1 in position i

    Current code provides 3 routines:

    accumulate(): Computes vector sums

    accumulate_scalar(): Computes scalar sums

    accumulate_avg(): Computes weighted and unweighted averages
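    All of these reduce to a single sum primitive. Below is a small sketch in which allreduce_sum() is a local stand-in for the real AllReduce (it just sums one contribution per node); the helper is hypothetical and is not VW's accumulate*() API.

    // Sketch: counting, averaging, and gather expressed with one sum-AllReduce.
    // allreduce_sum() sums one vector contribution per "node"; in a real cluster
    // each node would instead make a single AllReduce call.
    #include <cstdio>
    #include <vector>

    std::vector<double> allreduce_sum(const std::vector<std::vector<double>>& per_node) {
        std::vector<double> out(per_node[0].size(), 0.0);
        for (const auto& v : per_node)
            for (size_t j = 0; j < v.size(); j++) out[j] += v[j];
        return out;
    }

    int main() {
        // Each node i holds a value n_i and a weight w_i.
        std::vector<double> n = {3, 5, 2}, w = {1, 2, 1};
        size_t k = n.size();

        std::vector<std::vector<double>> ones(k, {1.0}), ni(k), niwi(k), wi(k), unit(k);
        for (size_t i = 0; i < k; i++) {
            ni[i] = {n[i]};
            niwi[i] = {n[i] * w[i]};
            wi[i] = {w[i]};
            unit[i] = std::vector<double>(k, 0.0);
            unit[i][i] = n[i]; // gather: node i contributes its value in slot i
        }

        double count = allreduce_sum(ones)[0];                       // counting
        double avg = allreduce_sum(ni)[0] / count;                   // average
        double wavg = allreduce_sum(niwi)[0] / allreduce_sum(wi)[0]; // weighted average
        std::vector<double> gathered = allreduce_sum(unit);          // gather

        std::printf("count=%g avg=%g wavg=%g gathered =", count, avg, wavg);
        for (double v : gathered) std::printf(" %g", v);
        std::printf("\n");
        return 0;
    }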

  • Machine learning with AllReduce

    Previously: Single node SGD, multiple passes over data

    Parallel: Each node runs SGD, averaging parameters after every pass (or more often!)

    Code change:

    if (global.span_server != "") {
      if (global.adaptive)
        accumulate_weighted_avg(global.span_server, params->reg);
      else
        accumulate_avg(global.span_server, params->reg, 0);
    }

    Weighted averages are computed for adaptive updates, which weight features differently
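    A sketch of the difference between plain and weighted parameter averaging is shown below. The per-feature scales g stand in for the adaptive (per-feature) learning-rate statistics each node has accumulated; the arrays and names are illustrative, not VW's internal representation.

    // Sketch: plain vs. weighted parameter averaging across nodes.
    // With adaptive updates, a feature seen often on a node has a larger scale
    // g[i][j] there, so its weight from that node should count for more.
    #include <cstdio>
    #include <vector>

    int main() {
        // 2 nodes, 3 features: w = local weights, g = per-feature adaptive scale.
        std::vector<std::vector<double>> w = {{0.5, 0.0, 1.0}, {0.3, 0.8, 1.2}};
        std::vector<std::vector<double>> g = {{10.0, 1.0, 4.0}, {2.0, 9.0, 4.0}};
        size_t nodes = w.size(), dim = w[0].size();

        std::vector<double> plain(dim, 0.0), weighted(dim, 0.0), norm(dim, 0.0);
        for (size_t i = 0; i < nodes; i++)
            for (size_t j = 0; j < dim; j++) {
                plain[j] += w[i][j] / nodes;      // uniform average
                weighted[j] += g[i][j] * w[i][j]; // numerator: allreduce(g_i * w_i)
                norm[j] += g[i][j];               // denominator: allreduce(g_i)
            }

        for (size_t j = 0; j < dim; j++)
            std::printf("feature %zu: plain %.3f, weighted %.3f\n",
                        j, plain[j], weighted[j] / norm[j]);
        return 0;
    }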

  • Machine learning with AllReduce contd.

    L-BFGS requires gradients and loss values

    One call to AllReduce for each

    Parallel synchronized L-BFGS updates

    Same with CG, another AllReduce operation for Hessian

    Extends to many other common algorithms
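    The synchronization pattern can be sketched as follows: each node computes the loss and gradient on its own shard, both are summed with AllReduce, and every node then takes an identical step. allreduce_sum() and the placeholder gradient values below are stand-ins, not VW's API.

    // Sketch: one synchronized batch iteration (as in L-BFGS or CG) with AllReduce.
    // allreduce_sum() is a local stand-in: it sums per-node contributions and
    // gives every node the total, which is exactly what AllReduce provides.
    #include <vector>

    void allreduce_sum(std::vector<std::vector<double>>& per_node) {
        std::vector<double> total(per_node[0].size(), 0.0);
        for (auto& v : per_node)
            for (size_t j = 0; j < v.size(); j++) total[j] += v[j];
        for (auto& v : per_node) v = total; // every node ends up with the sum
    }

    int main() {
        size_t nodes = 4, dim = 3;
        std::vector<double> w(dim, 0.0); // parameters, identical on every node

        // Local gradient and loss on each node's shard (placeholder values).
        std::vector<std::vector<double>> grad(nodes, std::vector<double>(dim, 0.1));
        std::vector<std::vector<double>> loss(nodes, std::vector<double>(1, 0.7));

        allreduce_sum(grad); // one AllReduce for the gradient
        allreduce_sum(loss); // one AllReduce for the loss value

        // Every node applies the same update to the same global gradient, so the
        // parameters stay synchronized; L-BFGS would apply its two-loop recursion
        // here, and CG would add one more AllReduce for the Hessian computation.
        double step = 0.5;
        for (size_t j = 0; j < dim; j++) w[j] -= step * grad[0][j];
        return 0;
    }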


  • Communication and computation

    Two main additions in cluster-parallel code:

    Hadoop-compatible AllReduce communication

    New and old optimization algorithms modified for AllReduce

    Hybrid optimization for rapid convergence


  • Hybrid optimization for rapid convergence

    SGD converges fast initially, but is slow to squeeze out the final bit of precision

    L-BFGS converges rapidly towards the end, once in a good region

    Each node performs a few local SGD iterations, averaging after every pass

    Switch to L-BFGS with synchronized iterations using AllReduce

    Two calls to VW
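    Schematically, the two calls correspond to the two phases below: a warm start of averaged online passes, then synchronized L-BFGS iterations on the warm-started parameters. The three phase functions are placeholders for the SGD-averaging and synchronized L-BFGS pieces sketched earlier, not actual VW entry points.

    // Sketch of the hybrid schedule mirroring the two VW invocations.
    // run_sgd_pass(), average_params(), and lbfgs_step() are placeholders.
    #include <vector>

    std::vector<double> run_sgd_pass(std::vector<double> w)          { return w; } // local online pass (stub)
    std::vector<double> average_params(const std::vector<double>& w) { return w; } // AllReduce average (stub)
    std::vector<double> lbfgs_step(const std::vector<double>& w)     { return w; } // synchronized L-BFGS step (stub)

    int main() {
        std::vector<double> w(10, 0.0);

        // Phase 1 (first VW call): a few online passes, averaging after each pass.
        for (int pass = 0; pass < 5; pass++) {
            w = run_sgd_pass(w);
            w = average_params(w);
        }

        // Phase 2 (second VW call): warm-started, synchronized L-BFGS iterations.
        for (int iter = 0; iter < 20; iter++)
            w = lbfgs_step(w);

        return 0;
    }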


  • Speedup

    Near linear speedup

    [Figure: speedup plotted against the number of nodes (10 to 100); the speedup axis runs from 1 to 10.]

  • Hadoop helps

    Naive implementation is driven by the slowest node

    Speculative execution ameliorates the problem

    Table: Distribution of computing time (in seconds) over 1000 nodes. The first three columns are quantiles. The first row is without speculative execution; the second row is with speculative execution.

                            5%    50%    95%    Max    Comm. time
    Without spec. exec.     29    34     60     758    26
    With spec. exec.        29    33     49     63     10

  • Fast convergence

    auPRC curves for two tasks, higher is better

    [Figure: two panels of auPRC vs. iteration, each comparing Online, L-BFGS w/ 5 online passes, L-BFGS w/ 1 online pass, and L-BFGS. Left panel: auPRC roughly 0.2 to 0.55 over 50 iterations; right panel: auPRC roughly 0.466 to 0.484 over 20 iterations.]

  • Conclusions

    AllReduce is quite general, yet easy to use for machine learning

    Marriage with Hadoop is great for robustness

    Hybrid optimization strategies are effective for rapid convergence

    John gets his kilonode program