Dynamic Parameter Allocation in Parameter Servers
Alexander Renz-Wieland¹, Rainer Gemulla², Steffen Zeuch¹,³, Volker Markl¹,³
¹TU Berlin, ²University of Mannheim, ³DFKI
VLDB 2020
Takeaways
- Key challenge in distributed Machine Learning (ML): communication overhead
- Parameter Servers (PSs)
  - Intuitive
  - Limited support for common techniques to reduce overhead
- How to improve support? Dynamic parameter allocation
- Is this support beneficial? Up to two orders of magnitude faster
Background: Distributed Machine Learning
- Distributed training is a necessity for large-scale ML tasks
- Parameter management is a key concern
- Parameter servers (PSs) are widely used
[Figure: Logical view: workers access the parameter server via push() and pull(). Physical view: parameters are partitioned across nodes, each node hosting a server shard and several workers.]
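As a rough illustration of the classic PS interface sketched above, here is a minimal, hypothetical sketch in Python. The class and method names (`ParameterServer`, `pull`, `push`, `worker_step`) and the key/value layout are assumptions for illustration, not the actual PS-Lite or Lapse API.

```python
import numpy as np

# Minimal, hypothetical sketch of the classic parameter-server interface:
# workers pull() current parameter values by key and push() updates back.

class ParameterServer:
    def __init__(self, dim):
        self.store = {}          # key -> parameter vector
        self.dim = dim

    def pull(self, keys):
        # return current values; initialize lazily with zeros
        return [self.store.setdefault(k, np.zeros(self.dim)) for k in keys]

    def push(self, keys, updates):
        # apply additive updates (e.g., negative gradients)
        for k, u in zip(keys, updates):
            self.store[k] = self.store.get(k, np.zeros(self.dim)) + u


def worker_step(ps, keys, grad_fn, lr=0.1):
    values = ps.pull(keys)                   # may require network round trips
    grads = grad_fn(keys, values)            # local computation on a data shard
    ps.push(keys, [-lr * g for g in grads])  # send updates back
```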
Problem: Communication Overhead
- Communication overhead can limit scalability
- Performance can fall behind a single node

Training knowledge graph embeddings (RESCAL, dimension 100):

[Figure: Epoch run time in minutes vs. parallelism (nodes x threads, 1x4 to 8x4). Classic PS (PS-Lite): 4.5h, 4h, 2.4h, 1.5h. Classic PS with fast local access: 1.2h on a single node. Dynamic Allocation PS (Lapse), incl. fast local access: 0.6h, 0.4h, 0.2h on 2x4, 4x4, 8x4.]
How to reduce communication overhead?I Common techniques to reduce overhead:
DATA
PARAMETERS
Data clustering
DATA
PARAMETERS
Parameter blocking
DATA
PARAMETERS
Latency hiding
I Key is to avoid remote accessesI Do PSs support these techniques?
I Techniques require local access at different nodes over timeI But PSs allocate parameters statically
5 / 11
worker 1 worker 2
parameter access
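To make one of these techniques, latency hiding, concrete, here is a small hypothetical sketch that builds on the illustrative ParameterServer above: the worker issues the pull for the next batch's parameters before processing the current one, so communication overlaps with computation. All function and variable names are assumptions for illustration.

```python
from concurrent.futures import ThreadPoolExecutor

def run_epoch_with_latency_hiding(ps, batches, keys_of, grad_fn, lr=0.1):
    """Overlap parameter pulls with computation (latency hiding).

    `batches` is a list of data batches; `keys_of(batch)` returns the
    parameter keys that batch touches. Names are illustrative.
    """
    if not batches:
        return
    with ThreadPoolExecutor(max_workers=1) as io:
        # prefetch parameters for the first batch
        pending = io.submit(ps.pull, keys_of(batches[0]))
        for i, batch in enumerate(batches):
            values = pending.result()                  # wait for prefetched values
            if i + 1 < len(batches):                   # start the next pull early
                pending = io.submit(ps.pull, keys_of(batches[i + 1]))
            grads = grad_fn(keys_of(batch), values)    # compute while next pull runs
            ps.push(keys_of(batch), [-lr * g for g in grads])
```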
How to reduce communication overhead?I Common techniques to reduce overhead:
DATA
PARAMETERS
Data clustering
DATA
PARAMETERS
Parameter blocking
DATA
PARAMETERS
Latency hiding
I Key is to avoid remote accessesI Do PSs support these techniques?
I Techniques require local access at different nodes over timeI But PSs allocate parameters statically
5 / 11
worker 1 worker 2
parameter access
Dynamic Parameter Allocation
- What if the PS could allocate parameters dynamically?
  - Localize(parameters)
- Would provide support for
  - Data clustering ✓
  - Parameter blocking ✓
  - Latency hiding ✓
- We call this dynamic parameter allocation
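A hypothetical sketch of how a worker might use such a Localize primitive, extending the illustrative ParameterServer above with a notion of parameter location (the actual Lapse API and relocation mechanism differ): relocate the parameters it is about to work on, then access them locally.

```python
# Hypothetical extension of the illustrative ParameterServer: each key has an
# owner node, and localize() relocates keys to the calling node so that
# subsequent pull()/push() calls on those keys are local. Illustration only.

class DynamicAllocationPS(ParameterServer):
    def __init__(self, dim, node_id):
        super().__init__(dim)
        self.node_id = node_id
        self.owner = {}                      # key -> node that currently holds it

    def localize(self, keys):
        for k in keys:
            # in a real system this triggers a relocation protocol;
            # here we only record the new owner
            self.owner[k] = self.node_id

    def is_local(self, key):
        return self.owner.get(key) == self.node_id


def process_partition(ps, partition, grad_fn, lr=0.1):
    """`partition` is a list of (keys, data) pairs assigned to this worker."""
    all_keys = sorted({k for keys, _ in partition for k in keys})
    ps.localize(all_keys)                    # move parameters to this node first
    for keys, data in partition:             # then all accesses are local
        values = ps.pull(keys)
        grads = grad_fn(keys, values, data)
        ps.push(keys, [-lr * g for g in grads])
```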
The Lapse Parameter Server
- Features
  - Dynamic allocation
  - Location transparency
  - Retains sequential consistency
- PS per-key consistency guarantees (for synchronous operations):

                    Classic   Lapse   Stale
  Eventual            ✓         ✓       ✓
  PRAM                ✓         ✓       ✓
  Causal              ✓         ✓       ✗
  Sequential          ✓         ✓       ✗
  Serializability     ✗         ✗       ✗

- Many system challenges (see paper)
  - Manage parameter locations
  - Route parameter accesses to current location
  - Relocate parameters
  - Handle reads and writes during relocations
  - All while maintaining sequential consistency
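The challenges above revolve around tracking where a parameter currently lives. Here is a very rough, hypothetical sketch of the idea (not Lapse's actual protocol): each key has a fixed home node that records the current owner, accesses are routed via that record, and a relocation updates it. Names and the message flow are assumptions for illustration.

```python
# Hypothetical sketch of location management for dynamic parameter allocation.
# Each key has a fixed "home" node (e.g., by hashing) that records which node
# currently owns the parameter; accesses are routed to the current owner, and
# a relocation updates the home node's record. Not Lapse's actual protocol.

NUM_NODES = 4

def home_node(key):
    return hash(key) % NUM_NODES             # fixed, computable by every node

class LocationManager:
    def __init__(self):
        self.current_owner = {}               # maintained by the key's home node

    def lookup(self, key):
        # if the key was never relocated, it still lives at its home node
        return self.current_owner.get(key, home_node(key))

    def relocate(self, key, new_owner):
        # in a real system, reads and writes that arrive during the relocation
        # must be forwarded or queued so that sequential consistency holds
        self.current_owner[key] = new_owner
```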
Experimental study
Tasks: matrix factorization, knowledge graph embeddings, word vectors
Cluster: 1–8 nodes, each with 4 worker threads, 10 Gbit Ethernet

1. Performance of classic PSs: 2–8 nodes barely outperformed 1 node in all tested tasks
2. Effect of dynamic parameter allocation: 4–203x faster than a classic PS, up to linear speed-ups
3. Comparison to bounded staleness PSs: 2–28x faster and more scalable
4. Comparison to manual management: competitive with a specialized low-level implementation
5. Ablation study: combining fast local access and dynamic allocation is key
Comparison to Bounded Staleness PS
- Matrix factorization (matrix with 1b entries, rank 100)
- Parameter blocking

[Figure: Epoch run time in minutes (0–40) vs. parallelism (nodes x threads, 1x4 to 8x4) for the bounded staleness PS (Petuum) with client sync., the bounded staleness PS (Petuum) with server sync., and the Dynamic Allocation PS (Lapse). Annotations: single-node overhead and non-linear scaling for Petuum; relative speedups of 0.6x, 2.9x, and 8.4x.]
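For context on the "parameter blocking" setup used in this comparison, here is a hypothetical sketch of the technique for matrix factorization: item (column) parameters are split into as many blocks as there are nodes, each subepoch assigns a different block to each node in rotation, and with dynamic allocation a node can localize its block before using it. The function names (`parameter_blocking_epoch`, `train_block`, the `ps.localize` call from the earlier sketch) are illustrative, not the actual Lapse or Petuum API.

```python
# Hypothetical sketch of parameter blocking for matrix factorization with
# dynamic parameter allocation. Illustration only.

def parameter_blocking_epoch(ps, node_id, num_nodes, item_blocks, train_block):
    """`item_blocks[b]` is the list of item-parameter keys in block b."""
    for subepoch in range(num_nodes):
        block = (node_id + subepoch) % num_nodes   # rotate blocks across nodes
        ps.localize(item_blocks[block])            # relocate the block before use
        train_block(block)                         # process data touching this block
        # (a barrier between subepochs would go here in a real implementation)
```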
Dynamic Parameter Allocation in Parameter Servers
- Key challenge in distributed Machine Learning (ML): communication overhead
- Parameter Servers (PSs)
  - Intuitive
  - Limited support for common techniques to reduce overhead
- How to improve support? Dynamic parameter allocation
- Is this support beneficial? Up to two orders of magnitude faster
- Lapse is open source: https://github.com/alexrenz/lapse-ps