CSE6331: Cloud Computing · frameworks emerging in the Hadoop ecosystem a tool for comparing these...
Transcript of CSE6331: Cloud Computing · frameworks emerging in the Hadoop ecosystem a tool for comparing these...
CSE6331: Cloud Computing
Leonidas Fegaras
University of Texas at Arlingtonc©2016 by Leonidas Fegaras
MRQL
Apache MRQL
Apache MRQL (incubating)
MRQL: a Map-Reduce Query Language
Web site: http://mrql.incubator.apache.org/
Wiki: http://wiki.apache.org/mrql/
History:
Fall 2010: started at UTA as an academic research project
March 2013: enters Apache Incubation
4 releases under Apache so far: latest MRQL 0.9.6
CSE6331: Cloud Computing MRQL 2
Motivation
Map-Reduce is not the only player in the Hadoop ecosystem any more
designed for batch processingnot well-suited for some big data workloads:real-time analytics, continuous queries, iterative algorithms, ...
Alternatives:
Spark, Flink, Pregel, Giraph, ...
New distributed stream processing engines:
Spark Streaming, Flink Streaming, Storm, S4, Samza, ...
CSE6331: Cloud Computing MRQL 3
The new API-Based Distributed Processing Frameworks
They provide powerful abstractions as building blocks
They hide the details of parallel and distributed computing
They automate fault tolerance
. . . but, steep learning curve
Hard to develop, optimize, and maintain non-trivial applicationscoded in a programming language
Too many competing Big Data processing systems
Hard to tell which one of these systems will prevail in the near future
applications coded in one of these paradigms may have to be rewrittenas technologies evolve
... or you can express your applications in a query language that isindependent of the underlying distributed platform!
CSE6331: Cloud Computing MRQL 4
MRQL: Design Objectives
Wanted to develop a powerful and efficient query processing systemfor complex data analysis applications on big data
more powerful than existing query languagesable to capture most complex data analysis tasks declarativelyable to work on read-only, raw (in-situ), complex dataHDFS as the physical storage layerplatform-independent:
the same query can run on multiple platforms on the same clusterallowing developers to experiment with various platforms effortlessly
efficient!
We envision MRQL to be:
a common front-end for the multitude of distributed processingframeworks emerging in the Hadoop ecosystema tool for comparing these systems (functionality & performance)
CSE6331: Cloud Computing MRQL 5
Is this yet Another SQL for Map-Reduce?
MRQL is NOT SQL!
MRQL is an SQL-like query language for large-scale, distributed dataanalysis on a computer cluster
Unlike SQL, MRQL supports
a richer data model (nested collections, trees, ...)arbitrary query nestingmore powerful query constructsuser-defined types and functions
MRQL queries can run on multiple distributed processing platforms
currently Apache Hadoop MapReduce, Hama, Spark, Flink, and (soon)Storm/Trinder
The MRQL syntax and semantics have been influenced by
modern database query languages (mostly, XQuery and ODMG OQL)functional programming languages (sequence comprehensions, algebraicdata types, type inference)
CSE6331: Cloud Computing MRQL 6
Language Features
The MRQL query language:
provides a rich type system that supports hierarchical data and nestedcollections uniformly
general algebraic datatypes – can be recursive
JSON and XML are user-defined types
pattern matching over data constructions (similar to Scala match)local type inference (similar to Scala)
allows nested queries at any level and at any place
no need for awkward nulls and outer-joins
supports UDFs
provided that they don’t have side effects
allows to operate on the grouped data using queries
as is done in XQuery and Pigimproves SQL group-by with aggregation
CSE6331: Cloud Computing MRQL 7
Language features
The MRQL query language:
supports custom aggregations and reductions using UDFs
provided they have certain properties (associativity & commutativity)
supports iteration declaratively
to capture iterative algorithms, such as PageRank
supports custom parsing and custom data fragmentation
provides syntax-directed construction and deconstruction of data
to capture domain-specific languages
CSE6331: Cloud Computing MRQL 8
The MRQL Data Model
MRQL supports the following types:
a basic type: bool, short, int, long, float, double, string.
a tuple (t1, . . . , tn),
a record < A1 : t1, ...,An : tn >,
a list (sequence) [t] or list(t),
a bag (multiset) {t} or bag(t),
a user-defined type
an algebraic data type T
CSE6331: Cloud Computing MRQL 9
Algebraic Data Types
A data type T is defined at the top level:
data T = C1: t1 | ... | Cn: tn;
where C1, ..., Cn are data constructors and t1, ..., tn are types
It can be recursive
For example, an integer list can be defined as follows:
data IList = Cons: (int,IList) | Nil: ();
Then, Cons(1,Cons(2,Nil())) constructs the list [1,2]
XML is a predefined data type, defined as:
data XML = Node: ( String, bag( (String,String) ),
list(XML) )
| CData: String;
For example, <a x=’1’><b>text</b></a> is constructed using
Node(’a’,{(’x’,’1’)},[Node(’b’,{},[CData(’text’)])])
CSE6331: Cloud Computing MRQL 10
Patterns
Patterns are used in select-queries and case statements
A pattern can be:
a pattern variable that matches any data and binds the variable to data,a constant basic value,a * that matches any data,a data construction C (p1, . . . , pn),a tuple (p1, . . . , pn),a record < A1 : p1, . . . ,An : pn >,a list [p1, . . . , pn]
A case statement takes the following form:
case e { p1: e1 | ... | pn: en }
Example:
case e { Node(*,*,Node(’a’,*,cs)): cs | *: [] }
CSE6331: Cloud Computing MRQL 11
Accessing Data Sources
Parsing text files:
source(line,’employee.txt’,’,’,
type ( < name: string, dno: int, phone: any,address: string > ))
Loading binary data:
source(binary,’graph.bin’)
Parsing XML Documents:
XMark = source(xml,’xmark.xml’,{’person’});
Also, syntax to parse JSON documents and define custom parsers
Statements to dump the results of query e to a binary or a CSV textfile fname:
store fname from e;
dump fname from e;
CSE6331: Cloud Computing MRQL 12
The Select-Query Syntax
The select-query syntax takes the form: ([ ... ] is optional)
select [ distinct ] e
from p in e, ..., p in e
[ where e ]
[ group by p: e [ having e ] ]
[ order by e [ limit e ] ]
where e are expressions/queries and p are patternsExample:
select (n,cn)
from < name: n, children: cs > in Employees,
< name: cn > in cs
Same as:
select (e.name,c.name)
from e in Employees,
c in e.children
CSE6331: Cloud Computing MRQL 13
More Examples
select ( d, c, sum(s) )
from <dno:dn,salary:s> in Employees
group by (d,c): ( dn, salary>=100000 )
having avg(s) >= 80000
select (k,sum(o.TOTALPRICE))
from o in Orders,
c in Customers
where o.CUSTKEY=c.CUSTKEY
group by k: c.NAME;
select (c.NAME,avg(select o.TOTALPRICE
from o in Orders
where o.CUSTKEY=c.CUSTKEY))
from c in Customers;
CSE6331: Cloud Computing MRQL 14
Simple example: matrix multiplication
A sparse matrix X is represented as a bag of (Xij , i , j).
Zij =∑k
Xik ∗ Ykj
select ( sum(z), i, j )
from (x,i,k) in X, (y,k,j) in Y, z = x*y
group by i, j
CSE6331: Cloud Computing MRQL 15
An XML example
Group all persons according to their interests and the number of openauctions they watch. For each such group, return the number of personsin the group:
select ( cat, os, count(p) )
from p in XMARK,
i in p.profile.interest
group by cat: i.@category,
os: count(p.watches.@open_auctions)
CSE6331: Cloud Computing MRQL 16
Another XML example
Invert the citation graph in DBLP by grouping the items by their citationsand by ordering these groups by the number of citations they received:
select ( select text(a.title)
from a in DBLP
where a.@key = x,
count(a) )
from a in DBLP,
c in a.cite
group by x: text(c)
order by inv(count(a))
CSE6331: Cloud Computing MRQL 17
Example: k-means clustering
Derive k clusters from a set of points P:
repeat centroids = ...
step select < X: avg(s.X), Y: avg(s.Y) >
from point in Points
group by k: (select c from c in centroids
order by distance(point,c))[0]
CSE6331: Cloud Computing MRQL 18
Example: the PageRank algorithm
Simplified PageRank:
A graph node is associated with a PageRank and its outgoing links:
< id: 23, rank: 0.0, adjacent : { 10, 45, 35 } >
Propagate the PageRank of a node to its outgoing links;each node gets a new PageRank by accumulating the propagatedPageRanks from its incoming links:
repeat nodes = ...step select < id: m.id, rank: n.rank, adjacent : m.adjacent >
from n in (select < id: key, rank: sum(c.rank) >from c in ( select < id: a,
rank: n.rank/count(n.adjacent) >from n in nodes,
a in n.adjacent )group by key: c. id ),
m in nodeswhere n.id = m.id
CSE6331: Cloud Computing MRQL 19
Architecture
Query translation stages:
1 type inference
2 query translation and normalization
3 simplification
4 algebraic optimization
5 plan generation
6 plan optimization
7 compilation to Java code
CSE6331: Cloud Computing MRQL 20
Algebraic operators
Algebraic operations on bags:
groupBy ( X: {(κ, α)} ) : {(κ, {α})}flatMap ( f: α→ {β}, X: {α} ) : {β}reduce ( ⊕: (α, α)→ α , X: {α} ) : αunion ( X: {α}, Y: {α} ) : {α}
Join: coGroup ( X: {(κ, α)}, Y: {(κ, β)} ) : {(κ, {α}, {β})}map-reduce(m,r,X) = flatMap(r,groupBy(flatMap(m,X)))
List operations: orderBy, append
Iteration: repeat ( f: {α} → {α}, X: {α} ) : {α}
CSE6331: Cloud Computing MRQL 21
Query optimization
The query optimizer:
uses a cost-based optimization framework to map algebraic terms toefficient workflows of physical operations
handles dependent joins (used for nested collections)
unnest deeply nested queries and converts them to join plans
CSE6331: Cloud Computing MRQL 22
Current Work: Distributed Stream Processing
Support for continuous queries over multiple streams of data
Data come in incremental batches ∆X
Batch streaming based on sliding windows
Query q(X1,X2;Y ) over one invariant and two streaming data sources
CSE6331: Cloud Computing MRQL 23
Current Work: Distributed Stream Processing
select (k,avg(p.Y))from p in stream(binary,"points")
group by k: p.X
Currently, works on Spark Streaming
Soon, on Flink Streaming and Spark/Trinder
CSE6331: Cloud Computing MRQL 24
Current Work: Incremental Query Processing
Problem: translate any batch program (eg, PageRank) to anincremental program automatically
Solution: Break the query q(X1,X2;Y ) = g(f (X1,X2;Y )) such as:
f (X1 ]∆X1,X2 ]∆X2;Y ) = f (X1,X2;Y ) ⊗ f (∆X1,∆X2;Y )
Requires program analysis & transformation
CSE6331: Cloud Computing MRQL 25