Cluster Computing with DryadLINQ
description
Transcript of Cluster Computing with DryadLINQ
Cluster Computing with DryadLINQ
Mihai Budiu Microsoft Research, Silicon Valley
Intel Research Berkeley, Systems Seminar SeriesOctober 9, 2008
3
The other ‘60s
OS/360
Multics
ARPANET
PDP/8
Spacewars
Virtual memory
Time-sharing
(defun factorial (n) (if (<= n 1) 1 (* n (factorial (- n 1)))))
4
What about us, now?
5
Layers
Networking
Storage
Distributed Execution
Scheduling
Resource Management
Applications
Identity & Security
Caching and Synchronization
Programming Languages and APIs
Ope
ratin
g Sy
stem
7
This Work
8
• Introduction• Dryad • DryadLINQ• DryadLINQ Applications
Outline
9
Dryad• Continuously deployed since 2006• Running on >> 104 machines• Sifting through > 10Pb data daily• Runs on clusters > 3000 machines• Handles jobs with > 105 processes each• Platform for rich software ecosystem• Used by >> 100 developers
• Written at Microsoft Research, Silicon Valley
10
Bibliography
DryadLINQ: A System for General-Purpose Distributed Data-Parallel Computing Using a High-Level Language
Yuan Yu, Michael Isard, Dennis Fetterly, Mihai Budiu, Úlfar Erlingsson, Pradeep Kumar Gunda, and Jon Currey
Symposium on Operating System Design and Implementation (OSDI), San Diego, CA, December 8-10, 2008
Dryad: Distributed Data-Parallel Programs from Sequential Building Blocks
Michael Isard, Mihai Budiu, Yuan Yu, Andrew Birrell, and Dennis Fetterly
European Conference on Computer Systems (EuroSys), Lisbon, Portugal, March 21-23, 2007
11
SQL
Software Stack
Windows Server
Cluster Services
Distributed Filesystem (Cosmos)
Dryad
Distributed Shell
PSQL
DryadLINQSQL
server
Windows Server
Windows Server
Windows Server
C++
CIFS/NTFS
legacycode
sed, awk, perl, grep
SSISScope
C#MachineLearning
.Net
Job
queu
eing
, mon
itorin
g
Distributed Data Structures
GraphsData
mining
Applications
12
Goal
13
Design Space
ThroughputLatency
Internet
Privatedata
center
Data-parallel
Sharedmemory
DryadSearch
HPC
Grid
Transaction
14
Data Partitioning
RAM
DATA
DATA
15
2-D Piping• Unix Pipes: 1-D
grep | sed | sort | awk | perl
• Dryad: 2-D grep1000 | sed500 | sort1000 | awk500 | perl50
16
Virtualized 2-D Pipelines
17
Virtualized 2-D Pipelines
18
Virtualized 2-D Pipelines
19
Virtualized 2-D Pipelines
20
Virtualized 2-D Pipelines• 2D DAG• multi-machine• virtualized
21
Dryad Job Structure
grep
sed
sortawk
perlgrep
grepsed
sort
sort
awk
Inputfiles
Vertices (processes)
Outputfiles
ChannelsStage
22
Channels
X
M
Items
Finite streams of items
• distributed filesystem files (persistent)• SMB/NTFS files (temporary)• TCP pipes (inter-machine)• memory FIFOs (intra-machine)
23
Dryad System Architecture
Files, TCP, FIFO, Networkjob schedule
data plane
control plane
NS PD PDPD
V V V
Job manager cluster
Fault Tolerance
25
Policy Managers
R R
X X X X
Stage RR R
Stage X
Job Manager
R managerX ManagerR-X
Manager
Connection R-X
26
• Introduction• Dryad • DryadLINQ• DryadLINQ Applications
Outline
27
LINQ
Dryad
=> DryadLINQ
28
LINQ = .Net+ Queries
Collection<T> collection;bool IsLegal(Key);string Hash(Key);
var results = from c in collection where IsLegal(c.key) select new { Hash(c.key), c.value};
29
LINQ System ArchitectureLocal machine
.Netprogram(C#, VB, F#, etc)
LINQProvider
Execution engine
Query
Objects
•LINQ-to-obj•PLINQ•LINQ-to-SQL•LINQ-to-WS•DryadLINQ•Fickr•Oracle•LINQ-to-XML•Your own
30
Collection<T> collection;bool IsLegal(Key k);string Hash(Key);
var results = from c in collection where IsLegal(c.key) select new { Hash(c.key), c.value};
DryadLINQ = LINQ + Dryad
C#
collection
results
C# C# C#
Vertexcode
Queryplan(Dryad job)Data
31
DryadLINQ Data Model
Partition
Collection
.Net objects
The DryadLINQ Provider
32
DryadLINQClient machine
(11)
Distributedquery plan
.Net
Query Expr
Data center
Output TablesResults
Input TablesInvoke Query
Output DryadTable
Dryad Execution
.Net Objects
Dryad JM
ToCollection
foreach
Vertexcode
33
Demo
34
Example: Histogrampublic static IQueryable<Pair> Histogram( IQueryable<LineRecord> input, int k){ var words = input.SelectMany(x => x.line.Split(' ')); var groups = words.GroupBy(x => x); var counts = groups.Select(x => new Pair(x.Key, x.Count())); var ordered = counts.OrderByDescending(x => x.count); var top = ordered.Take(k); return top;}
“A line of words of wisdom”
[“A”, “line”, “of”, “words”, “of”, “wisdom”]
[[“A”], [“line”], [“of”, “of”], [“words”], [“wisdom”]]
[ {“A”, 1}, {“line”, 1}, {“of”, 2}, {“words”, 1}, {“wisdom”, 1}]
[{“of”, 2}, {“A”, 1}, {“line”, 1}, {“words”, 1}, {“wisdom”, 1}]
[{“of”, 2}, {“A”, 1}, {“line”, 1}]
35
Histogram PlanSelectMany
SortGroupBy+SelectHashDistribute
MergeSortGroupBy
SelectSortTake
MergeSortTake
36
Map-Reduce in DryadLINQ
public static IQueryable<S> MapReduce<T,M,K,S>( this IQueryable<T> input, Expression<Func<T, IEnumerable<M>>> mapper, Expression<Func<M,K>> keySelector, Expression<Func<IGrouping<K,M>,S>> reducer) { var map = input.SelectMany(mapper); var group = map.GroupBy(keySelector); var result = group.Select(reducer); return result;}
37
Map-Reduce Plan
M
R
G
M
Q
G1
R
D
MS
G2
R
static dynamic
X
X
M
Q
G1
R
D
MS
G2
R
X
M
Q
G1
R
D
MS
G2
R
X
M
Q
G1
R
D
M
Q
G1
R
D
MS
G2
R
X
M
Q
G1
R
D
MS
G2
R
X
M
Q
G1
R
D
MS
G2
R
MS
G2
R
map
sort
groupby
reduce
distribute
mergesort
groupby
reduce
mergesort
groupby
reduce
consumer
map
parti
al a
ggre
gatio
nre
duce
S S S S
A A A
S S
T
dynamic
38
Distributed Sorting in DryadLINQ
public static IQueryable<TSource>DSort<TSource, TKey>(this IQueryable<TSource> source, Expression<Func<TSource, TKey>> keySelector, int pcount){ var samples = source.Apply(x => Sampling(x)); var keys = samples.Apply(x => ComputeKeys(x, pcount)); var parts = source.RangePartition(keySelector, keys); return parts.OrderBy(keySelector);}
39
Distributed Sorting Plan
O
DS
H
D
M
S
DS
H
D
M
S
DS
D
DS
H
D
M
S
DS
D
M
S
M
S
static dynamic dynamic
40
Language Summary
WhereSelectGroupByOrderByAggregateJoinApplyMaterialize
41
Combining Query Providers
PLINQ
Local machine
.Netprogram(C#, VB, F#, etc)
LINQProvider
Execution engines
Query
Objects
SQL Server
DryadLINQ
LINQProvider
LINQProvider
LINQProvider
LINQ-to-obj
42
Using PLINQQuery
DryadLINQ
PLINQ
Local query
43
LINQ to SQL
Using LINQ to SQL Server
Query
DryadLINQ
Query Query Query Query Query
LINQ to SQL
44
Using LINQ-to-objects
Query
DryadLINQ
Local machine
Cluster
LINQ to obj
debug
production
45
• Introduction• Dryad • DryadLINQ• DryadLINQ Applications
Outline
46
Sample applications written using DryadLINQ Class
Distributed linear algebra Numerical
Accelerated Page-Rank computation Web graph
Privacy-preserving query language Data mining
Expectation maximization for a mixture of Gaussians Clustering
K-means Clustering
Linear regression Statistics
Probabilistic Index Maps Image processing
Principal component analysis Data mining
Probabilistic Latent Semantic Indexing Data mining
Performance analysis and visualization Debugging
Road network shortest-path preprocessing Graph
Botnet detection Data mining
Epitome computation Image processing
Neural network training Statistics
Parallel statistical algorithms Statistics
Distributed query caching Optimization
Web indexing structure Web graph
47
Linear Algebra & Machine Learning in DryadLINQ
Dryad
DryadLINQ
Large Vector
Machine learningData analysis
48
Operations on Large Vectors: Map 1
U
T
T Uf
f
f preserves partitioning
49
V
Map 2 (Pairwise)
T Uf
V
U
T
f
50
Map 3 (Vector-Scalar)T U
fV
V
50
U
T
f
Reduce (Fold)
51
U UU
U
f
f f f
fU U U
U
52
Linear Algebra
T U Vnmm ,,=, ,
T
53
Linear Regression
• Data
• Find
• S.t.
mt
nt yx ,
mnA
tt yAx
},...,1{ nt
54
Analytic Solution
X×XT X×XT X×XT Y×XT Y×XT Y×XT
Σ
X[0] X[1] X[2] Y[0] Y[1] Y[2]
Σ
[ ]-1
*
A
1))(( Ttt t
Ttt t xxxyA
Map
Reduce
55
Linear Regression Code
Vectors x = input(0), y = input(1);Matrices xx = x.Map(x, (a,b) => a.OuterProd(b));OneMatrix xxs = xx.Sum();Matrices yx = y.Map(x, (a,b) => a.OuterProd(b));OneMatrix yxs = yx.Sum();OneMatrix xxinv = xxs.Map(a => a.Inverse());OneMatrix A = yxs.Map(xxinv, (a, b) => a.Mult(b));
1))(( Ttt t
Ttt t xxxyA
Expectation Maximization (Gaussians)
56
• 160 lines • 3 iterations shown
57
Probabilistic Index MapsImages
features
Conclusions
58
Visual Studio
LINQ
Dryad
58
=
59
Backup Slides
60
Data-Parallel Computation
Storage
Execution
Application
Parallel Databases
Map-Reduce
GFSBigTable
CosmosNTFS
Dryad
DryadLINQScope,PSQL
Sawzall, Pig
61
Dryad = Execution Layer
Job (application)
Dryad
Cluster
Pipeline
Shell
Machine≈
62
DryadLINQ• Declarative programming • Integration with Visual Studio• Integration with .Net• Type safety• Automatic serialization• Job graph optimizations static dynamic
• Conciseness
X[0] X[1] X[3] X[2] X’[2]
Completed vertices Slow vertex
Duplicatevertex
Dynamic Graph Rewriting
Duplication Policy = f(running times, data volumes)
64
S S S S
A A A
S S
T
S S S S S S
T
# 1 # 2 # 1 # 3 # 3 # 2
# 3# 2# 1
static
dynamic
rack #
Dynamic Aggregation
TT[0-?) [?-100)
Range-Distribution Manager
S
D D D
S S
S S S
Tstatic
dynamic65
Hist
[0-30),[30-100)
[30-100)[0-30)
[0-100)
66
Goal: Declarative Programming
X
T
S
X X
S S
T T T
X
static dynamic
JM code
vertex code
Staging1. Build
2. Send .exe
3. Start JM
5. Generate graph
7. Serializevertices
8. MonitorVertex execution
4. Querycluster resources
Cluster services6. Initialize vertices
68
“What’s the point if I can’t have it?”
• Glad you asked• We’re opening Dryad+DryadLINQ to
a few academic partners• Expect an official announcement soon• Please talk to us if you are interested• Most likely not open-source (for now)