The Dryad ecosystem
Rebecca Isaacs, Microsoft Research Silicon Valley
Outline
• Introduction
  – Dryad
  – DryadLINQ
• Vertex performance prediction
  – Scheduling for heterogeneous clusters
  – Causal models for performance debugging
• Support software
  – Quincy scheduler
  – TidyFS distributed filesystem
  – Artemis monitoring system
  – Nectar data management
• Current status
Data-parallel programming
• Partition large data sets and process the pieces in parallel
• Programming frameworks have made this easy
  – The execution environment (e.g. Dryad, Hadoop) deals with scheduling of tasks, movement of data, and fault tolerance
  – A high-level language (e.g. DryadLINQ, Pig Latin) allows the programmer to express the parallelism in a declarative fashion
Dryad (Isard et al, EuroSys 07)
• Generalized MapReduce
• Programs are dataflow graphs (DAGs)
  – Vertices (nodes) connected by channels (edges)
  – Channels are implemented as shared-memory FIFOs, TCP streams, or files
• The scheduler dispatches vertices onto machines to run the program
Dryad components
[Diagram: the job manager (JM) controls a set of daemons (D) over the control plane; the daemons run the vertices (V), which exchange data over the data plane via files, FIFOs, or the network, as directed by the job schedule.]
Dryad computations
[Diagram: a Dryad dataflow graph. Input files feed stages of vertices (processes) labelled M, X, and R; each row of vertices is a stage, connected to the next stage by channels, and the final stage writes the output files.]
DryadLINQ (Yu et al, OSDI 08)
• LINQ is a set of .NET constructs for programming with datasets
  – Relational databases, XML, ...
  – Supported by new language features in C#, Visual Basic, F#
  – Lazy evaluation on the data source
• DryadLINQ extends LINQ with
  – Partitioned datasets
  – Some additional operators
  – Compilation of LINQ expressions into data-parallel operations expressed as a Dryad dataflow graph
DryadLINQ example
• Join: find the lines in a file that start with one of the keywords in a second file
DryadTable<LineRecord> table = ddc.GetPartitionedTable<LineRecord>(mydata);
DryadTable<LineRecord> keywords = ddc.GetTable<LineRecord>(keys);
IQueryable<LineRecord> matches = table.Join(keywords,
    l1 => l1.line.Split(' ').First(), /* first key */
    l2 => l2.line,                    /* second key */
    (l1, l2) => l1);                  /* keep first line */
Dryad execution graph for join
[Diagram: the compiled join graph. The data file has 2 partitions, the keys file has 1 partition, and the output file has 2 partitions. Work is distributed: each word is sent to a machine based on its hash (a sketch of this step follows).]
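To make the distribution step concrete, here is a minimal sketch (our own illustration, not Dryad's actual code; the partition count is fixed by the graph) of hash-partitioning a join key to a downstream machine:

static class HashPartitioner
{
    // Sketch only: map a join key to one of the downstream partitions.
    // Masking off the sign bit keeps the modulus non-negative.
    public static int PartitionFor(string key, int numPartitions) =>
        (key.GetHashCode() & int.MaxValue) % numPartitions;
}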
Outline
• Introduction
  – Dryad
  – DryadLINQ
• Vertex performance prediction
  – Scheduling for heterogeneous clusters
  – Causal models for performance debugging
• Support software
  – Quincy scheduler
  – TidyFS distributed filesystem
  – Artemis monitoring system
  – Nectar data management
• Current status
Joint work with Paul Barham and Richard Black (MSR Cambridge/Silicon Valley) and Simon Peter and Timothy Roscoe (ETH Zurich)
How do vertices behave?
• Use the Performance Analyzer tool from Microsoft (search for xperf on MSDN)
• Detailed view of one vertex
  – “Select” operation
  – Reads and writes 1 million 1K records on local disk
Select vertex, version 1
• Hardware:
  – 2 quad-core processors, 2.66 GHz
  – 2 disks configured as a striped volume
[Trace: Disk1, Disk2, and CPU utilization over time, with reads in red and writes in blue.]
Select vertex, version 2
• Hardware:
  – 1 quad-core processor, 2 GHz
  – 1 disk
[Trace: disk and CPU utilization, with reads in red and writes in blue. Data is read and then written in batches: a reader thread issues the reads, while other threads pick up the I/O completions, sometimes issuing writes. A second view shows per-thread activity.]
NB the processors are 95% idle during execution of this vertex.
Observations
• The bottleneck resource changes every few seconds
  – And may not be 100% utilized
• Vertices are multi-threaded, consuming multiple resources simultaneously
• Dryad is engineered for throughput
  – Sequential I/O
  – Batched in 256KB chunks
  – Requests are pipelined, typically 4+ deep
• Most DryadLINQ vertices are standard data processing operators with predictable behaviour
Factors affecting vertex execution times
• Hardware:
  – CPU speed
  – Number of CPUs
  – Disk transfer rate
  – Network transfer rate
• Workload:
  – I/O size
  – We assume file access patterns stay the same
• Placement relative to parent(s):
  – Channels can be local (read from or write to disk)
  – Or remote (read from a remote or local file via SMB)
Key idea: identify vertex phases
• Trace a reference execution of the vertex
• Identify phases within which resource demands are consistent
• Phase boundaries are where the resource demands change
  – E.g. start reading, stop reading, etc.
• Similar phases, in terms of resource consumption, are grouped together
Phases in the Select vertex
[Diagram: the trace segmented into phases, annotated with per-phase resource demands, e.g. Dcpu = 70 ms, Ddisk = 30 ms; Dcpu = 20 ms, Ddisk = 40 ms; Dcpu = 40 ms.]
Predicting phase runtimes
• Each phase has the following attributes, i.e. its demands on each resource:
  – Type: read, write, both, compute, overhead
  – “Concurrency histogram”
  – File being read/written
  – Number of bytes read/written
• Simple operational laws can be applied to each phase individually
  – Can predict its runtime on different hardware (a sketch follows below)
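As an illustration of such an operational-law calculation, the following minimal sketch (ours, not the paper's code; the real models also use the concurrency histogram) scales each demand by the target hardware's rate and takes the bottleneck:

using System;

// A sketch of our own: since vertices are multi-threaded, demands overlap,
// so the most heavily demanded resource determines the phase runtime.
class Phase
{
    public double CpuSeconds;     // CPU demand measured on the reference run
    public long BytesRead;        // disk read demand
    public long BytesWritten;     // disk write demand
}

class Hardware
{
    public double CpuSpeedup;     // target CPU speed relative to the reference
    public double ReadMBps;       // target disk read rate
    public double WriteMBps;      // target disk write rate
}

static class PhasePredictor
{
    public static double PredictSeconds(Phase p, Hardware h)
    {
        double cpu   = p.CpuSeconds   / h.CpuSpeedup;
        double read  = p.BytesRead    / (h.ReadMBps  * 1e6);
        double write = p.BytesWritten / (h.WriteMBps * 1e6);
        return Math.Max(cpu, Math.Max(read, write));
    }
}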
Expectations of accuracy
• Inherent variability in running times:
  – Layout of the file on disk
    • Inner or outer track
    • File fragmentation
  – Background processes
    • Logging and scanning services
  – Unanticipated network effects
  – Model deficiencies
    • Memory contention
    • Caching
    • Garbage collection
• Prediction within 30% of actual would be good...
Prediction accuracy evaluation
Label          | Read (MB/s) | Write (MB/s) | CPU (GHz × cores) | I/O size     | Pred (s) | Avg (s) | % error
Reference      | 140         | 128          | 2.66 × 8          | 1.0          | 20.7     | 18.7    | 9.9
½ size I/O     | 140         | 128          | 2.66 × 8          | 0.5          | 10.3     | 10.1    | 1.9
Desktop        | 42          | 42           | 2.39 × 2          | 1.0          | 51.6     | 48.8    | 5.7
Remote         | 11.5        | 128          | 2.66 × 8          | 1.0 (remote) | 18.1     | 29.6    | 38.9
File server    | 210         | 180          | 2.0 × 4           | 1.0          | 18.8     | 16.7    | 12.6
Laptop, remote | 20          | 20           | 1.83 × 1          | 1.0 (remote) | 79.9     | 91.9    | 13.1

Merge vertex, 1 input, 1 output, averaged over 10 runs: predicting the running time on different hardware.
Outline
• Introduction
  – Dryad
  – DryadLINQ
• Vertex performance prediction
  – Scheduling for heterogeneous clusters
  – Causal models for performance debugging
• Support software
  – Quincy scheduler
  – TidyFS distributed filesystem
  – Artemis monitoring system
  – Nectar data management
• Current status
The parallelism spectrum
[Diagram: the parallelism spectrum, from a shared-memory multiprocessor (cores sharing memory and a disk) at one end, to homogeneous clusters and data centres at the other, with small, heterogeneous clusters in between.]
Ad-hoc clusters
• Small, heterogeneous clusters are everywhere
  – In the workplace
  – In my house... and yours?
• Could be pretty useful for data-parallel programming
  – Data mining
  – Video editing
  – Scientific applications
  – ...
A data-parallel programming framework for ad-hoc clusters?
• Why?
  – Exploit unused machines with no hardcoded assumptions about hardware and availability
  – “Easy” to write and run the code
• Why not?
  – Heterogeneity: the wrong schedule can make it go badly wrong
  – Built-in assumptions about failure don’t apply
• Our solution:
  1. Construct vertex performance models
  2. Apply a constraint-based search procedure to find a good assignment of vertices to the physical computers
Default scheduling in Dryad
• The DryadLINQ compiler creates an XML description of the vertices and how they are connected
• The Job Manager places the vertices on available nodes according to constraints specified in the XML file
  – Greedy scheduling approach
  – The programmer and/or the DryadLINQ compiler can provide hints
Heterogeneity can cause problems for greedy scheduling
Add a performance-aware planner to the end-to-end picture
[Diagram: a logging service on each node produces CPU and I/O logs; the vertex phase analyser turns these into vertex phase summaries; the performance planner combines the summaries with the job's XML graph to produce an updated XML graph.]
Planning algorithm
• Implemented with a constraint logic programming system (ECLiPSe)
• Constraints prune the search space
• Heuristics reduce search time
  – E.g. decide where to place the longest-running vertices first
• The greedy schedule gives an upper bound (see the sketch below)
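As an analogy for how the bound and the pruning interact, here is a branch-and-bound sketch of the placement search (ours alone; the actual planner is written in ECLiPSe, orders the longest-running vertices first, and models contention):

using System;
using System.Linq;

// Sketch: runtime[v, m] is the predicted runtime of vertex v on machine m,
// taken from the per-phase models above. Plan() returns the makespan of
// the best assignment found.
static class Planner
{
    static double best;

    public static double Plan(double[,] runtime)
    {
        int nV = runtime.GetLength(0), nM = runtime.GetLength(1);

        // The greedy schedule (each vertex to the currently least-loaded
        // machine) provides the initial upper bound for pruning.
        var load = new double[nM];
        for (int v = 0; v < nV; v++)
        {
            int m = Array.IndexOf(load, load.Min());
            load[m] += runtime[v, m];
        }
        best = load.Max();

        Search(runtime, 0, new double[nM]);
        return best;
    }

    static void Search(double[,] runtime, int v, double[] load)
    {
        if (load.Max() >= best) return;          // prune: already no better
        if (v == runtime.GetLength(0)) { best = load.Max(); return; }
        for (int m = 0; m < load.Length; m++)
        {
            load[m] += runtime[v, m];
            Search(runtime, v + 1, load);
            load[m] -= runtime[v, m];
        }
    }
}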
Contention between vertices
[Diagram: predicted schedules for the Hash(0), Hash(1), Merge(0), Merge(1), Join(0), and Join(1) vertices over a 50–200 s timeline, comparing the schedule produced without a contention model against the one produced with it.]
Workloads for experimental eval
Physical config of cluster
Machine | Num CPUs | CPU (GHz) | RAM (GB) | Disk read (MB/s) | Disk write (MB/s) | Network (Mbps)
Laptop  | 1        | 1.8       | 2        | 20               | 20                | 1000
Desktop | 2        | 2.4       | 2        | 42               | 42                | 1000
Server  | 4        | 2.0       | 4        | 210              | 180               | 1000
3 machines, quite heterogeneous
Overall speed-up vs greedy
Workload | Greedy (s) min/med/max | Exhaustive (s) | Achieved (s) min/med/max | Speedup (%)
Algebra  | 114/120/155            | 73             | 71/73/87                 | 39
Join     | 165/171/206            | 115            | 114/123/144              | 28
Terasort | 155                    | 127            | 123/143/204              | 8
Outline
• Introduction
  – Dryad
  – DryadLINQ
• Vertex performance prediction
  – Scheduling for heterogeneous clusters
  – Causal models for performance debugging
• Support software
  – Quincy scheduler
  – TidyFS distributed filesystem
  – Artemis monitoring system
  – Nectar data management
• Current status
Edison
• New project
  – Position paper to appear at HotOS 11
  – Joint work with Moises Goldszmidt
• Performance problems in Dryad clusters
  – Resource contention
  – Data or computation skew
  – Hardware issues
  – Often transient
• Use active intervention
  – Re-run the vertex in a sandbox on the cluster
  – Construct experiments using its causal model
  – Systematically probe behavior: fix some variables while altering others
Circuit blueprint
[Diagram: a logic circuit with inputs A and B and gates G1–G4.]
• Given partial observations, lets us make inferences about the state of the circuit
• If we intervene and fix some inputs, lets us make inferences about the state of the circuit
Blueprint of a vertex
[Diagram: a causal model of a vertex, with nodes for running time, phase, CPU, reading time, computing time, data size, input data rate, and disk and network congestion.]
• A blueprint for inferring state from both observations and interventions
• Answers “what-if” questions
• Root-cause analysis
Overview
• Introduction
  – Dryad
  – DryadLINQ
• Vertex performance prediction
  – Scheduling for heterogeneous clusters
  – Causal models for performance debugging
• Support software
  – Quincy scheduler
  – TidyFS distributed filesystem
  – Artemis monitoring system
  – Nectar data management
• Current status
Quincy (Isard et al, SOSP 09)
• Need to share the cluster between multiple, concurrently executing jobs
• Goals are fairness and data locality
  – If job x takes t seconds when it is run exclusively on the cluster, then x should take no more than Jt seconds when the cluster has J jobs
  – Very large datasets stored on the cluster itself mean that unnecessary data movement is costly
• These goals conflict
  – Optimal data locality => delay the job until resources are available
  – Fairness => allocate resources as soon as they are available
Quincy (cont)
• Strategies for fairness:
  – Sacrifice locality
  – Kill already-running jobs
  – Admission control
• Fairness is achieved at a cost to throughput
Quincy (cont)
• Quantify every scheduling decision
  – Data transfer cost
  – Cost in wasted time if a task is killed
• Express the scheduling problem as a flow network
  – Represent all worker tasks that are ready to run, with their preferred locations, and all currently running tasks
  – Edge weights and capacities encode the scheduling policy (a sketch of the edge construction follows)
  – Produces a set of scheduling assignments for all jobs at once that satisfies a global criterion
• Solve online with a standard min-cost flow algorithm
  – The graph is updated whenever anything changes
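To illustrate the encoding (a sketch based on our reading of the paper, not Quincy's code; all names and costs here are hypothetical), each ready task contributes one weighted edge per scheduling alternative:

using System.Collections.Generic;

record Edge(string From, string To, int Capacity, double Cost);

static class QuincyEdges
{
    // Sketch: a task can run where its data lives (cheap), elsewhere in its
    // rack, or anywhere in the cluster; the costs stand in for the expected
    // data-transfer cost the paper derives from data locality.
    public static IEnumerable<Edge> ForReadyTask(
        string task,
        IEnumerable<string> preferredComputers,  // where its input data lives
        string rack,
        double costViaRack,                      // pulling data within the rack
        double costCrossRack)                    // pulling data across racks
    {
        foreach (var computer in preferredComputers)
            yield return new Edge(task, computer, 1, 0.0);  // data is local: free
        yield return new Edge(task, rack, 1, costViaRack);
        yield return new Edge(task, "cluster", 1, costCrossRack);
    }
}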
TidyFS (Fetterly et al, USENIX ATC 11)
• Simple distributed file system
  – Like HDFS or GFS
• Highly optimized to perform well for data-parallel computations:
  – Data streams are striped across cluster nodes
  – Stream parts are read or written in parallel
    • By a single process
  – I/O is sequential for high throughput
  – Streams are replaced rather than modified
  – In case of failure, missing parts of output streams can easily be regenerated
TidyFS (cont)
• Data streams contain parts
  – Parts are replicated lazily
    • Failure before replication completes is handled by Dryad regenerating the missing part(s)
  – Parts are “native” files, e.g. NTFS files or SQL Server databases
    • Read and written using native APIs
• Centralized metadata server
  – Replicated for fault tolerance
  – Replicas synchronize using Paxos
Artemis (Cretu-Ciocarlie et al, WASL 08)
• Management and analysis of Dryad logs
  – Each vertex produces around 1 MB/s of log data per process
  – A single Dryad job can easily produce >1 TB of log data
• Runs and logs continuously on the cluster
  – To locate and collate the log data for a particular job, Artemis itself runs a DryadLINQ computation on the cluster
  – Combines job manager and vertex logs with over 80 Windows performance counters
• Sophisticated GUI for post-processing and visualization
  – Histograms and time series are especially helpful for performance debugging
Nectar (Gunda et al, OSDI 10)
• Key idea: data and the computation that generates it are interchangeable
  – Datasets are uniquely identified by the programs that produce them
• Automatic data management
  – Cluster-wide caching service: re-use of common datasets saves computation and space
  – Garbage collection of obsolete datasets: data can be transparently regenerated
• The Nectar client service interposes on the DryadLINQ compiler
  – Consults the Nectar cache server and rewrites the program appropriately (a sketch of the cache keying follows)
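The caching idea can be sketched as follows (our illustration, not Nectar's API): a dataset is identified by a fingerprint of the program that produces it together with the fingerprints of its inputs, so the cache can answer "has this computation already run?" before the job is submitted.

using System;
using System.Collections.Generic;
using System.Security.Cryptography;
using System.Text;

// Sketch only: all names here are hypothetical stand-ins for Nectar's
// cache server and program fingerprinting.
static class NectarSketch
{
    static readonly Dictionary<string, string> cache = new();  // fingerprint -> dataset URI

    public static string Fingerprint(string programText, IEnumerable<string> inputFingerprints)
    {
        var bytes = Encoding.UTF8.GetBytes(programText + "|" + string.Join("|", inputFingerprints));
        return Convert.ToHexString(SHA256.HashData(bytes));
    }

    // If the fingerprint is cached, the job can be rewritten to read the
    // stored dataset instead of recomputing it.
    public static bool TryGetCached(string fingerprint, out string datasetUri) =>
        cache.TryGetValue(fingerprint, out datasetUri);
}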
Outline
• Introduction
  – Dryad
  – DryadLINQ
• Vertex performance prediction
  – Scheduling for heterogeneous clusters
  – Causal models for performance debugging
• Support software
  – Quincy scheduler
  – TidyFS distributed filesystem
  – Artemis monitoring system
  – Nectar data management
• Current status
Current status
• Imminent release of Dryad and DryadLINQ on Windows HPC
  – Uses the HPC scheduler and other cluster services
  – Includes TidyFS as DSC (Distributed Storage Catalog)
  – Can download and try the Technology Preview
• Ongoing research...
  – Naiad: allowing cycles in Dryad graphs
    • Strongly connected components: a more general programming model
    • Loops and convergence tests without the need for driver programs
    • Continuous queries on streaming data
Conclusion
• 256-server cluster at MSR-SV
  – In continuous use by researchers and interns
  – Used for information retrieval, machine learning, vision, algorithms, planning, network trace analysis...
  – The mature components are in active use (Dryad, DryadLINQ, Quincy, TidyFS, Artemis)
• “Real users” drive the continuous enrichment of the Dryad ecosystem
  – But many good research problems remain!