Definition DryadLINQ is a simple, powerful, and elegant programming environment for writing...
-
Upload
paulina-anderson -
Category
Documents
-
view
212 -
download
0
Transcript of Definition DryadLINQ is a simple, powerful, and elegant programming environment for writing...
DryadLINQ
Definition
• DryadLINQ is a simple, powerful, and elegant programming environment for writing large-scale data parallel applications running on large PC clusters.
• The goal of DryadLINQ is to make distributed computing on large compute cluster simple enough for every programmer.
• DryadLINQ combines two important pieces of Microsoft technology: the Dryad distributed execution engine and the .NET Language Integrated Query (LINQ).
The Evolution of DryadLINQ
• Dryad had its roots in an idea developed in October 2004 by Isard
• Yu recognized the potential of LINQ to serve as the front-end programming tool for Dryad, and started the DryadLINQ project in September 2006
• By early 2008, the Dryad/DryadLINQ combination was made available within Microsoft
• The DryadLINQ research paper won a best-paper award in 2008 during the eighth USENIX Symposium on Operating Systems Design and Implementation
Current Status
• Works with any LINQ enabled language– C#, VB, F#, IronPython, …
• Works with multiple storage systems– NTFS, SQL, Windows Azure, Cosmos DFS
• Released internally within Microsoft– Used on a variety of applications
• External academic release announced at PDC– DryadLINQ in source, Dryad in binary– UW, UCSD, Indiana, ETH, Cambridge, …
Dryad Definition
• Dryad is an infrastructure which allows a programmer to use the resources of a computer cluster or a data center for running data-parallel programs. A Dryad programmer can use thousands of machines, each of them with multiple processors or cores, without knowing anything about concurrent programming.
The Structure of Dryad Jobs
Dryad services
•An API to create distributed applications (jobs), by specifying which processes have to be executed and communication channels linking them.•Scheduling of the processes on the cluster machines.•Fault-tolerance through re-execution of processes after transient failures.•Monitoring of the computation and statistics collection.•Job visualization.•An API for run-time resource management policies.•Support for efficient bulk data transfer between processes.
8
ImageProcessing
Cosmos DFSSQL Servers
Software Stack
Windows Server
Cluster Services
Azure Platform
Dryad
DryadLINQ
Windows Server
Windows Server
Windows Server
Other Languages
CIFS/NTFS
MachineLearning
GraphAnalysis
DataMining
Applications
…Other Applications
9
Dryad System Architecture
Files, TCP, FIFO, Networkjob schedule
data plane
control plane
NS PD PDPD
V V V
Job manager cluster
LINQ Framework
PLINQ
Local machine
.Netprogram(C#, VB, F#, etc)
Execution engines
Query
Objects
LINQ-to-SQL
DryadLINQ
LINQ-to-XML
LIN
Q p
rovi
der i
nter
face
Scalability
Single-core
Multi-core
Cluster
• Extremely open and extensible
DryadLINQ Operators
• Operators present in LINQ which are implemented by DryadLINQ. • Adaptations of operators present in LINQ which return scalar values
(i.e., not IQueryable), but which are modified to return an IQueryable instead. For example, Count returns an integer, while CountAsQueryable returns an IQueryable whose actual contents will be a single integer. The AsQueryable variants can be chained together to produce complex queries, while using the scalar variants would require breaking queries into small sub-queries, which could decrease efficiency
• New operators, which exist only in DryadLINQ. We have added new operators which cannot be synthesized efficiently from compositions of primitive LINQ operators, and which can substantially improve the performance of queries in the context of a distributed execution environment like Dryad.
12
Combining with LINQ-to-SQL
DryadLINQ
Subquery Subquery Subquery Subquery Subquery
Query
LINQ-to-SQL LINQ-to-SQL
DryadLINQ and LINQ
• C# and LINQ data objects become distributed partitioned files.
• LINQ queries become distributed Dryad jobs.• C# methods become code running on the
vertices of a Dryad job.
DryadLINQ representation
DryadLInq features• Declarative programming: computations are expressed in a high-level language similar to SQL• Automatic parallelization: from sequential declarative code the DryadLINQ compiler generates highly parallel query plans
spanning large computer clusters. For exploiting multi-core parallelism on each machine DryadLINQ relies on thePLINQ parallelization framework.
• Integration with Visual Studio: programmers in DryadLINQ take advantage of the comprehensive VS set of tools: Intellisense, code refactoring, integrated debugging, build, source code management.
• Integration with .Net: all .Net libraries, including Visual Basic, and dynamic languages are available.• Type safety: distributed computations are statically type-checked.• Automatic serialization: data transport mechanisms automatically handle all .Net object types.• Job graph optimizations
– static: a rich set of term-rewriting query optimization rules is applied to the query plan, optimizing locality and improving performance.
– dynamic: run-time query plan optimizations automatically adapt the plan taking into account the statistics of the data set processed.• Conciseness: the following line of code is a complete implementation of the Map-Reduce computation framework in
DryadLINQ:• public static IQueryable<R>
MapReduce<S,M,K,R>(this IQueryable<S> source, Expression<Func<S,IEnumerable<M>>> mapper, Expression<Func<M,K>> keySelector, Expression<Func<K,IEnumerable<M>,R>> reducer){ return source.SelectMany(mapper).GroupBy(keySelector, reducer);}
DryadLINQ System Architecture
16
DryadLINQClient machine
(11)
Distributedquery plan
.NET program
Query Expr
Data center
Output TablesResults
Input TablesInvoke Query
Output DryadTable
Dryad Execution
.Net Objects
JM
ToTable
foreach
Vertexcode
A Query provider translates IQueryable objects to a suitable format and ships them to a remote execution engine. It also
transforms the remote data into C# objects. DryadLINQ is just an instance of such a provider which interfaces
with the Dryad remote execution framework.
Query Provider
Execute
Local program
(5)
(11)
Transform
C#
C#
(1)
(12)
Query obj
C# Objects
(3)
Remote execution
Data (8)Results
Data
Query Invoke Query Query(2)
Transform
(7)
(9)
(10)
(4) (6)
Execution stages of a Dryad Job
Partitioned File Structure
Reductions (Aggregations)
• var result = input.Aggregate((x,y) => x+y);
[Associative]int Add(int x, int y); var sum = input.Aggregate((x,y)=>Add(x,y));
ApplyThe Select delegate receives each element
individually, while the one of Apply receives the whole stream.
22
MapReduce in DryadLINQ
MapReduce(source, // sequence of Ts mapper, // T -> Ms keySelector, // M -> K reducer) // (K, Ms) -> Rs{ var map = source.SelectMany(mapper); var group = map.GroupBy(keySelector); var result = group.SelectMany(reducer); return result; // sequence of Rs}
Map-Reduce Plan(When reduce is combiner-enabled)
M
Q
G1
C
D
MS
G2
R
M
Q
G1
C
D
MS
G2
R
M
Q
G1
C
D
MS
G2
R
MS
G2
R
map
sort
groupby
combine
distribute
mergesort
groupby
reduce
mergesort
groupby
reducem
apD
ynam
ic a
ggre
gatio
nre
duce
An Example: PageRankRanks web pages by propagating scores along hyperlink structure
Each iteration as an SQL query:
1. Join edges with ranks2. Distribute ranks on edges3. GroupBy edge destination4. Aggregate into ranks5. Repeat
Multi-Iteration PageRankpages ranks
Iteration 1
Iteration 2
Iteration 3
Memory FIFO
Dryad Enters the Market
• A big step is coming, as Dryad and DryadLINQ become fully productized as part of the Microsoft HPC Server suite.
• It will be integrated with Microsoft SQL Server and Windows Azure to give customers from academia to the business community a new, powerful computing tool.
• Offering an easy-to-use but powerful, data-intensive computing tool
• It benefits a whole new set of Microsoft customers
Windows Azure and DryadLINQ
• Windows Azure is a platform for building scalable, highly reliable, multi-tiered web service applications. It is hosted on Microsoft’s large data centers in the United States, Europe, and Asia. Windows Azure has both compute and data resources. The compute resources are designed to allow applications to scale to thousands of servers and data resources.
• There is no port of Hadoop or Dryad/LINQ currently available. However, Windows Azure is an excellent platform for experimenting with new variations on large-scale map-reduce algorithms, as these patterns are easily coded as worker role networks.
“We’re convinced that we will delight our customers, both with the pure capability of the system, as well as its ease of use. What I really like about Dryad is that is not just about handling a problem in a better way, it is also about new possibilities in computing that you couldn’t imagine before.”
Resources• http://research.microsoft.com/en-us/projects/dryadlinq/• http://research.microsoft.com/en-us/projects/dryad/• DryadLINQ: A System for General-Purpose Distributed Data-Parallel
Computing Using a High-Level Language - Yuan Yu, Michael Isard, Dennis Fetterly, Mihai Budiu, Úlfar Erlingsson, Pradeep Kumar Gunda, Jon Currey
• Distributed Aggregation for Data-Parallel Computing: Interfaces and Implementations - Yuan Yu, Pradeep Kumar Gunda, Michael Isard
• Distributed Data-Parallel Computing Using a High-Level Programming Language - Michael Isard, Yuan Yu
• Some sample programs written in DryadLINQ - Yuan Yu, Michael Isard, Dennis Fetterly, Mihai Budiu, Ulfar Erlingsson, Pradeep Kumar Gunda, Jon Currey, Frank McSherry, and Kannan Achan
• http://blogs.msdn.com/b/dryad/archive/2009/11/24/what-is-dryad.aspx
• http://research.microsoft.com/en-us/projects/azure/faq.aspx• http://research.microsoft.com/en-us/news/features/dryad-012611.a
spx
Questions?
Thank you for your attention!