An Intermediate Algebra for Optimizing RDF Graph Pattern Matching on MapReduce
Padmashree Ravindra, HyeongSik Kim, Kemafor Anyanwu
COUL – Semantic COmpUting research Lab
Outline
• Introduction
• Background
  • MapReduce, Pig and Join Processing
  • RDF Graph Pattern Matching in Pig
• Approach
  • TripleGroup data model and Nested TripleGroup Algebra (NTGA)
  • Comparing NTGA based plans and Pig Latin plans for graph pattern matching queries
• Evaluation
• Related Work
• Conclusion and Future Work
Basics: MapReduce
• Large scale processing of data on a cluster of commodity grade machines
• Users encode a task as map / reduce functions, which are executed in parallel across the cluster
• Apache Hadoop*: open-source implementation
• Key Terms
  • Hadoop Distributed File System (HDFS)
  • Slave nodes / Task Tracker: Mappers (Reducers) execute the map (reduce) function
  • Master node / Job Tracker: manages and assigns tasks to Mappers / Reducers
* http://hadoop.apache.org/
Data Processing on Hadoop
• Supports partition parallelism
• Each MR cycle incurs I/O and communication costs
[Figure: the Job Tracker assigns tasks to Mapper 1..N (map()) and Reducer 1..M (reduce()). Data flow: Input (HDFS reads) → map exec → local writes to disk → Sort / Shuffle (remote reads) → reduce exec → Output (HDFS writes). Example: the mappers emit (k1, v1), (k1, v2), (k1, v3); the reducer receives (k1, {v1, v2, v3}) and emits (k1, val).]
Joins in MapReduce
• Map phase: scan input records
  • map function annotates each record based on the join column, e.g. (joinKey, Record)
• Reduce phase: records with the same joinKey are collected by the same reduce task
  • reduce function joins the tuples
  • Output is written into HDFS
• Single join workload
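The two phases above can be sketched in a few lines of plain Python. This is only an in-memory illustration of a reduce-side join, not Hadoop code: the function names (map_phase, shuffle, reduce_phase, reduce_side_join), the relation tags "A" and "B", and the small sample rows are assumptions made for the example.

```python
from collections import defaultdict

def map_phase(records, relation, key_index):
    """Annotate each record with its join key and the relation it came from."""
    for rec in records:
        yield rec[key_index], (relation, rec)

def shuffle(pairs):
    """Collect all annotated records with the same join key (stand-in for sort/shuffle)."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(values):
    """Join the records of relation A with those of relation B for one join key."""
    left = [rec for rel, rec in values if rel == "A"]
    right = [rec for rel, rec in values if rel == "B"]
    return [l + r for l in left for r in right]

def reduce_side_join(a, a_key, b, b_key):
    pairs = list(map_phase(a, "A", a_key)) + list(map_phase(b, "B", b_key))
    out = []
    for _key, values in sorted(shuffle(pairs).items()):
        out.extend(reduce_phase(values))
    return out

# Illustrative rows only: (srcIP, destURL) and (pageRank, pageURL).
user_visits = [("158.112.27.3", "url1"), ("150.121.18.6", "url1"), ("158.112.27.3", "url2")]
ranking = [(11, "url1"), (23, "url2")]
joined = reduce_side_join(user_visits, 1, ranking, 1)
```

Every join expressed this way costs one full map, shuffle, and reduce pass, which is why the number of joins drives the number of MR cycles.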
Data Processing in Pig
• Express data flow using high-level query primitives
  • usability, code reuse, automatic optimization
• Pig Latin
  • Data model: atom, tuple, bag (nesting)
  • Operators: LOAD, STORE, JOIN, GROUP BY, COGROUP, FOREACH, SPLIT, aggregate functions
  • Ex. Equijoin on REL A (column 0) and REL B (column 1): JOIN A by $0, B by $1;
• Extensibility support via UDFs
• Dataflow is compiled into a workflow of MapReduce jobs
Example Pig Query Plan
    SELECT ?vlabel ?hpage ?price ?prod
    WHERE {
      ?v homepage ?hpage .
      ?v label ?vlabel .
      ?v country ?vcountry .
      ?o vendor ?v .
      ?o price ?price .
      ?o delDays ?delDays .
      ?o product ?prod .
    }
Plan:
    A = LOAD Input.rdf FILTER(homepage)
    B = LOAD Input.rdf FILTER(label)
    T1 = JOIN A ON Sub, B ON Sub;        (MR1)
    C = LOAD Input.rdf FILTER(country)
    T2 = JOIN C ON Sub, T1 ON Sub;       (MR2)
    …….
    H = LOAD Input.rdf FILTER(product)
    T3 = JOIN H ON Sub, T7 ON Sub;       (MR6)
    STORE
• #MR cycles = #Joins = 6
• (I/O & communication costs) * 6
• Loads have I/Os as well: expensive!!! (SPLIT operator)*
Possible Optimizations: m-way Join
    SELECT ?vlabel ?hpage ?price ?prod
    WHERE {
      ?v homepage ?hpage .
      ?v label ?vlabel .
      ?v country ?vcountry .
      ?o vendor ?v .
      ?o price ?price .
      ?o delDays ?delDays .
      ?o product ?prod .
    }
[Figure: the two star-joins SJ1 and SJ2 and the join between stars J1 run as three MR cycles (MR1, MR2, MR3). Each cycle reads its input (initially from HDFS), runs map and reduce, and writes its result to disk.]
• #MR cycles reduced from 6 to 3
BUT?
• Still expensive! 1 MR cycle per star-join
• Many pattern matching queries involve multiple star-join subpatterns
  • 50% of BSBM* benchmark queries have two or more star patterns
• Our proposal: coalesce the computation of ALL star-join subpatterns into a single MR cycle
• How? Don't think of them as a set of joins! Think of it as a GROUP BY operation
Sub       Prop            Obj
&V1       type            Vendor
&V1       label           Vendor1
&V1       country         US
&V1       homepage        www.ven...
&Offer1   type            Offer
&Offer1   vendor          &V1
&Offer1   product         &P1
&Offer1   price           108
&Offer1   delDays         2
&Offer1   validToDate     01/01/2011
&Offer1   validFromDate   08/01/2011
&Rev1     type            Review
&Rev1     reviewFor       &P1
&Rev1     rating1         9
&Rev1     reviewer        &R1

    WHERE {
      ?v homepage ?hpage . ?v label ?vlabel . ?v country ?vcountry .
      ?o vendor ?v . ?o price ?price . ?o delDays ?delDays . ?o product ?prod .
    }
GROUP BY Subject: 1 MapReduce cycle!!!
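A rough in-memory sketch of this idea, assuming triples are plain (subject, property, object) tuples: a single grouping pass on Subject yields the candidates for every star subpattern at once. The helper names group_by_subject and matches_star are hypothetical, and the truncated homepage value is copied as it appears on the slide.

```python
from collections import defaultdict

triples = [
    ("&V1", "label", "Vendor1"),
    ("&V1", "country", "US"),
    ("&V1", "homepage", "www.ven..."),
    ("&Offer1", "vendor", "&V1"),
    ("&Offer1", "product", "&P1"),
    ("&Offer1", "price", 108),
    ("&Offer1", "delDays", 2),
    ("&Rev1", "reviewFor", "&P1"),
]

def group_by_subject(triples):
    """One grouping pass: every subject maps to the triples that share it."""
    groups = defaultdict(list)
    for s, p, o in triples:
        groups[s].append((s, p, o))
    return dict(groups)

def matches_star(group, required_props):
    """A group matches a star subpattern if it has all of the required properties."""
    return required_props <= {p for _, p, _ in group}

# The two star subpatterns of the example query (?v star and ?o star).
vendor_star = {"homepage", "label", "country"}
offer_star = {"vendor", "price", "delDays", "product"}

groups = group_by_subject(triples)
vendor_matches = [s for s, g in groups.items() if matches_star(g, vendor_star)]
offer_matches = [s for s, g in groups.items() if matches_star(g, offer_star)]
```

Both star subpatterns are answered from the same grouped data, which is the reason a single MR cycle suffices for all star-joins.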
What are we proposing?
A new data model (TripleGroup) and algebra (Nested TripleGroup Algebra - NTGA) for more efficient graph pattern matching on MapReduce platforms
Our Approach: RAPID+
• Goal: minimize I/O and communication costs by reducing MR cycles
• Reinterpret and refactor operations into a more suitable (coalesced) set of operators: NTGA
• Foundation: re-interpreting multiple star-joins as a grouping operation leads to "groups of triples" (TripleGroups) instead of n-tuples
  • different structure BUT "content equivalent"
• NTGA: algebra on TripleGroups
NTGA: Data Model
• Data model based on nested TripleGroups
  • More naturally captures graphs
• TripleGroup: group of triples sharing the Subject / Object component
  • Can be nested at the Object component
TripleGroup:
    {(&Offer1, price, 108), (&Offer1, vendor, &V1), (&Offer1, product, &P1), (&Offer1, delDays, 2)}
Nested TripleGroup:
    {(&Offer1, price, 108), (&Offer1, vendor, {(&V1, label, vendor1), (&V1, country, US), (&V1, homepage, www.vendors….)}), (&Offer1, product, &P1), (&Offer1, delDays, 2)}
NTGA Operators…(1)
• TG_Unnest: unnest a nested TripleGroup
  Input:
    {(&Offer1, price, 108), (&Offer1, vendor, {(&V1, label, vendor1), (&V1, country, US), (&V1, homepage, www.ven..)}), (&Offer1, product, &P1), (&Offer1, delDays, 2)}
  Output:
    {(&Offer1, price, 108), (&Offer1, vendor, &V1), (&V1, label, vendor1), (&V1, country, US), (&V1, homepage, www.ven..), (&Offer1, product, &P1), (&Offer1, delDays, 2)}
• TG_Flatten: generate the equivalent n-tuple ("content equivalence")
  Input:
    {(&V1, label, vendor1), (&V1, country, US), (&V1, homepage, www.ven...)}
  Output (triples t1, t2, t3 concatenated):
    (&V1, label, vendor1, &V1, country, US, &V1, homepage, www.ven...)
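As a rough illustration of these two operators, the sketch below models a TripleGroup as a Python list of (subject, property, object) tuples and a nested TripleGroup as a triple whose object is itself such a list. That representation and the function names tg_unnest and tg_flatten are assumptions for the example, not the RAPID+ implementation.

```python
def tg_unnest(tg):
    """Replace each nested object by its root subject and splice in the nested triples.

    Assumes every nested TripleGroup is non-empty and that all of its triples
    share the subject of its first triple.
    """
    out = []
    for s, p, o in tg:
        if isinstance(o, list):              # the object is a nested TripleGroup
            inner = tg_unnest(o)
            out.append((s, p, inner[0][0]))  # keep the join triple, e.g. (&Offer1, vendor, &V1)
            out.extend(inner)
        else:
            out.append((s, p, o))
    return out

def tg_flatten(tg):
    """Concatenate the triples of a (non-nested) TripleGroup into one n-tuple."""
    return tuple(component for triple in tg for component in triple)

vendor_tg = [("&V1", "label", "vendor1"), ("&V1", "country", "US"), ("&V1", "homepage", "www.ven...")]
nested_tg = [
    ("&Offer1", "price", 108),
    ("&Offer1", "vendor", vendor_tg),
    ("&Offer1", "product", "&P1"),
    ("&Offer1", "delDays", 2),
]

unnested = tg_unnest(nested_tg)
flat = tg_flatten(vendor_tg)
```

The flattened tuple carries the same components as the TripleGroup, only in n-tuple form, which is the sense of "content equivalence" used on the slide.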
NTGA Operators…(2)
• TG_GroupFilter: retain only TripleGroups that satisfy the required query substructure
  • Structure-based filtering
  • Eliminates TripleGroups with missing triples (edges)
TG_GroupFilter(TG, {price, vendor, delDays, product})
  Input TG:
    { { (&V1, label, vendor1), (&V1, country, US), (&V1, homepage, www.ven..) },
      { (&Offer1, price, 108), (&Offer1, vendor, &V1), (&Offer1, product, &P1), (&Offer1, delDays, 2) },
      { (&Offer2, vendor, &V2), (&Offer2, product, &P3), (&Offer2, delDays, 1) } }
  Output TG{price, vendor, delDays, product}:
    { (&Offer1, price, 108), (&Offer1, vendor, &V1), (&Offer1, product, &P1), (&Offer1, delDays, 2) }
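A minimal sketch of structure-based filtering, with TripleGroups modeled as lists of (subject, property, object) tuples. The function name tg_group_filter is hypothetical, and treating "satisfies the substructure" as "contains at least the required properties" is an assumption of this example.

```python
def tg_group_filter(triplegroups, required_props):
    """Keep only TripleGroups that contain every required property (edge)."""
    required = set(required_props)
    return [tg for tg in triplegroups if required <= {p for _, p, _ in tg}]

vendor1 = [("&V1", "label", "vendor1"), ("&V1", "country", "US"), ("&V1", "homepage", "www.ven..")]
offer1 = [("&Offer1", "price", 108), ("&Offer1", "vendor", "&V1"),
          ("&Offer1", "product", "&P1"), ("&Offer1", "delDays", 2)]
offer2 = [("&Offer2", "vendor", "&V2"), ("&Offer2", "product", "&P3"), ("&Offer2", "delDays", 1)]

kept = tg_group_filter([vendor1, offer1, offer2], {"price", "vendor", "delDays", "product"})
```

&Offer2 is dropped because its price triple is missing, and the vendor TripleGroup is dropped because it has none of the required edges.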
NTGA Operators…(3)
• TG_Filter: filter out triples that do not satisfy the filter condition (FILTER clause)
  • Value-based filtering
  • Eliminates TripleGroups with triples that do not satisfy the filter condition
TG_Filter_{price<200}(TG)
  Input TG{price, vendor, delDays, product}:
    { { (&Offer1, price, 108), (&Offer1, vendor, &V1), (&Offer1, product, &P1), (&Offer1, delDays, 2) },
      { (&Offer3, vendor, &V2), (&Offer3, product, &P3), (&Offer3, price, 306), (&Offer3, delDays, 1) } }
  Output TG{price, vendor, delDays, product}:
    { (&Offer1, price, 108), (&Offer1, vendor, &V1), (&Offer1, product, &P1), (&Offer1, delDays, 2) }
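Value-based filtering can be sketched in the same list-of-tuples model. The function name tg_filter and its (property, predicate) signature are assumptions for this example; so is the choice to let TripleGroups without the filtered property pass through unchanged.

```python
def tg_filter(triplegroups, prop, predicate):
    """Drop every TripleGroup that has a `prop` triple whose object fails `predicate`.

    Assumption: a TripleGroup without any `prop` triple is kept.
    """
    return [
        tg for tg in triplegroups
        if all(predicate(o) for _, p, o in tg if p == prop)
    ]

offer1 = [("&Offer1", "price", 108), ("&Offer1", "vendor", "&V1"),
          ("&Offer1", "product", "&P1"), ("&Offer1", "delDays", 2)]
offer3 = [("&Offer3", "vendor", "&V2"), ("&Offer3", "product", "&P3"),
          ("&Offer3", "price", 306), ("&Offer3", "delDays", 1)]

cheap = tg_filter([offer1, offer3], "price", lambda price: price < 200)
```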
NTGA Operators…(4)
• TG_Join: join between TripleGroups of different structure based on join triple patterns
TG_Join with join triple patterns ?o vendor ?v and ?v country ?vcountry
  Input TG{price, vendor, delDays, product}:
    { (&Offer1, price, 108), (&Offer1, vendor, &V1), (&Offer1, product, &P1), (&Offer1, delDays, 2) }
  Input TG{label, country, homepage}:
    { (&V1, label, vendor1), (&V1, country, US), (&V1, homepage, www.ven...) }
  Output:
    {(&Offer1, price, 108), (&Offer1, vendor, {(&V1, label, vendor1), (&V1, country, US), (&V1, homepage, www.ven..)}), (&Offer1, product, &P1), (&Offer1, delDays, 2)}
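The nesting behaviour of this join can be sketched as follows, again with TripleGroups as lists of (subject, property, object) tuples. The function name tg_join and its signature are hypothetical; the sketch only covers the Object-Subject case shown here, where the object of the join property on the left equals the subject of a TripleGroup on the right.

```python
def tg_join(left_tgs, join_prop, right_tgs):
    """Nest a right TripleGroup under the `join_prop` triple whose object is its subject."""
    by_subject = {tg[0][0]: tg for tg in right_tgs if tg}
    result = []
    for tg in left_tgs:
        joined, matched = [], False
        for s, p, o in tg:
            if p == join_prop and o in by_subject:
                joined.append((s, p, by_subject[o]))  # object becomes a nested TripleGroup
                matched = True
            else:
                joined.append((s, p, o))
        if matched:                                   # unmatched left groups are dropped
            result.append(joined)
    return result

offer1 = [("&Offer1", "price", 108), ("&Offer1", "vendor", "&V1"),
          ("&Offer1", "product", "&P1"), ("&Offer1", "delDays", 2)]
vendor1 = [("&V1", "label", "vendor1"), ("&V1", "country", "US"), ("&V1", "homepage", "www.ven...")]

nested = tg_join([offer1], "vendor", [vendor1])
```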
Pattern Matching using NTGA in Pig
Mapping to Pig Latin / Relational Algebra:
• LoadFilter (load + TG_Filter)
• StarGroupFilter (TG_GroupBy + TG_GroupFilter)
• RDFJoin (TG_Join)

Input triples:
Subject   Property        Object
&V1       type            VENDOR
&V1       label           Vendor1
&V1       country         US
&V1       homepage        www.ven...
&Offer1   type            OFFER
&Offer1   vendor          &v1
&Offer1   product         &p1
&Offer1   price           108
&Offer1   delDays         2
&Offer1   validToDate     01/01/2011
&Offer1   validFromDate   08/01/2011

After LoadFilter:
Subject   Property   Object
&V1       label      Vendor1
&V1       country    US
&V1       homepage   www.ven..
&Offer1   vendor     &v1
&Offer1   product    &p1
&Offer1   price      108
&Offer1   delDays    2

After StarGroupFilter:
    { { (&V1, label, vendor1), (&V1, country, US), (&V1, homepage, www.ven..) },
      { (&Offer1, price, 108), (&Offer1, vendor, &V1), (&Offer1, product, &P1), (&Offer1, delDays, 2) } }

After RDFJoin:
    {(&Offer1, price, 108), (&Offer1, vendor, {(&V1, label, vendor1), (&V1, country, US), (&V1, homepage, www.ven..)}), (&Offer1, product, &P1), (&Offer1, delDays, 2)}
RDFMap: Efficient Data Representation
Compact representation of intermediate results during TripleGroup based processing
Efficient look-up of triples matching a given Property type via property-based indexing scheme
Ability to represent structure-label information for groups of triples.
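The three properties above can be illustrated with a small property-indexed container. This is a hypothetical sketch, not the actual RDFMap structure: the class name PropertyIndexedGroup, its methods, and the use of a set of property names as the "structure label" are all assumptions for the example.

```python
class PropertyIndexedGroup:
    """Hypothetical stand-in for a compact TripleGroup representation."""

    def __init__(self, subject, triples):
        self.subject = subject      # the shared Subject is stored once, not per triple
        self.by_prop = {}           # property -> list of objects
        for s, p, o in triples:
            if s != subject:
                raise ValueError("all triples must share the group's subject")
            self.by_prop.setdefault(p, []).append(o)

    def lookup(self, prop):
        """Return the objects of all triples with the given property."""
        return self.by_prop.get(prop, [])

    def structure_label(self):
        """Summarize the group's structure as the set of properties it contains."""
        return frozenset(self.by_prop)

offer = PropertyIndexedGroup("&Offer1", [
    ("&Offer1", "price", 108),
    ("&Offer1", "vendor", "&V1"),
    ("&Offer1", "product", "&P1"),
    ("&Offer1", "delDays", 2),
])
vendor_of_offer = offer.lookup("vendor")
label = offer.structure_label()
```

Indexing by property makes a look-up such as the vendor edge a dictionary access rather than a scan over the group's triples.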
Evaluation
• Setup: 5-node to 25-node Hadoop clusters on NCSU's Virtual Computing Lab*
• Dataset: synthetic benchmark dataset generated using the BSBM** tool (max. 40GB data, approx. 175 million triples)
• Evaluation of Pig (Pig_opt) vs. RAPID+
  • Task 1: Scalability with size of RDF graphs
  • Task 2: Scalability with denser star patterns
  • Task 3: Scalability with increasing cluster sizes
* https://vcl.ncsu.edu
** http://www4.wiwiss.fu-berlin.de/bizer/BerlinSPARQLBenchmark/spec/
Experimental Results…(1)
Cost Analysis across Increasing Size of RDF Graphs (5-node)
Key Observations:
• Benefit of TripleGroup based processing seen across data sizes: up to 60% in some cases (RAPID+ << Pig_opt < Pig)
• Pig approaches did not complete for large data size
Experimental Results…(2)
Cost Analysis across Increasing Star Density
%gain of RAPID+ over Pig (10-node / 32GB)
(5-node / 20GB)

Query   #Triple Patterns   #Edges in Stars   %gain
Q1      3                  1:2               56.8
Q2      4                  2:2               46.7
Q3      5                  2:3               47.8
Q4      6                  3:3               51.6
Q5      7                  3:4               57.4
Q6      8                  4:4               58.4
Q7      9                  5:4               58.6
Q8      10                 6:4               57.3
Q9*     6                  2:4               65.4
Q10*    10                 2:4:4             61.5

Key Observations:
• RAPID+ maintains a consistent %gain of 50% across the varying density
• Cost savings by eliminating redundant Subject values and join triples
Experimental Results…(3)
Cost Analysis across Increasing Cluster Sizes
Query pattern with three star-joins and two chain-joins (32GB)
Key Observations:
• RAPID+ has 56% gain for 10-node cluster over Pig approaches
• Pig approaches catch up with increasing cluster size
  • Increasing nodes decrease probability of disk spills with the SPLIT approach
• RAPID+ still maintains 45% gain across the experiments
And some Updates…
• Additional evaluation
  • Up to 65% performance gain on another synthetic benchmark dataset* for three/two star-join queries
  • Experiments extended to 1 billion 3-ary triples (43GB): 31% (10-node) to 41% (30-node) performance gain
• RAPID+ now includes a SPARQL interface
• In future: cost-based optimizations to select Pig vs. NTGA execution plans
• Join us for a demo of RAPID+ @ VLDB 2011**
* Pavlo, A., Paulson, E., Rasin, A., Abadi, D.J., DeWitt, D.J., Madden, S., Stonebraker, M.: A Comparison of Approaches to Large-scale Data Analysis. In: Proc. of the 35th SIGMOD International Conference on Management of data (2009)
** Kim, H., Ravindra, P., Anyanwu, K.: From SPARQL to MapReduce: The Journey using a Nested TripleGroup Algebra. To appear in: Proc. International Conference on Very Large Data Bases (VLDB 2011)
Related Work
MapReduce-based Processing
• Indexing, Partitioning Schemes, Rule-based Optimizations: [Newman08]*, [Hunter08]*, [Afrati10], [Husain10]*, Hadoop++ [Dittrich10], HadoopDB [Abadi09]
• Reasoning: [Urbani07]*
• High-level Dataflow Languages: Pig Latin [Olston08], [HiveQL], [JAQL]
• Other Extensions: Map-Reduce-Merge [Yang07]
Conclusion
• TripleGroup based processing for evaluating pattern matching queries on MapReduce platforms
• NTGA operators re-factored to
  • minimize #MR cycles → minimize costs
  • reduce costs of repeated data handling via operator coalescing
• Efficient data representation (RDFMap)
References
[Dean04] Dean, J., Ghemawat, S.: MapReduce: simplified data processing on large clusters. Commun. ACM 51 (2008) 107–113
[Olston08] Olston, C., Reed, B., Srivastava, U., Kumar, R., Tomkins, A.: Pig Latin: a not-so-foreign language for data processing. In: Proc. International Conference on Management of data. (2008)
[Abadi09] Abouzied, A., Bajda-Pawlikowski, K., Huang, J., Abadi, D.J., Silberschatz, A.: HadoopDB in action: building real world applications. In: Proc. International Conference on Management of data. (2010)
[Sridhar09] Sridhar, R., Ravindra, P., Anyanwu, K.: RAPID: Enabling scalable ad-hoc analytics on the semantic web. ISWC 2009
[Yu08] Yu, Y., Isard, M., Fetterly, D., Badiu, M., Erlingsson, U., Gunda, P.K., Currey, J.: DryadLINQ: A system for general-purpose distributed data-parallel computing using a high-level language. OSDI 2008
[Newman08] Newman, A., Li, Y.F., Hunter, J.: Scalable semantics: The silver lining of cloud computing. In: eScience. IEEE International Conference on. (2008)
[Hunter08] Newman, A., Hunter, J., Li, Y., Bouton, C., Davis, M.: A scale-out RDF molecule store for distributed processing of biomedical data. In: Semantic Web for Health Care and Life Sciences Workshop. (2008)
[Urbani07] Urbani, J., Kotoulas, S., Oren, E., Harmelen, F.: Scalable distributed reasoning using MapReduce. In: Proc. International Semantic Web Conference. (2009)
[Abadi07] Abadi, D.J., Marcus, A., Madden, S.R., Hollenbach, K.: Scalable Semantic Web Data Management Using Vertical Partitioning. VLDB 2007
[Dittrich10] Dittrich, J., Quiane-Ruiz, J., Jindal, A., Kargin, Y., Setty, V., Schad, J.: Hadoop++: Making a Yellow Elephant Run Like a Cheetah (Without It Even Noticing). VLDB 2010/PVLDB
[Yang07] Yang, H., Dasdan, A., Hsiao, R., Parker Jr., D.S.: Map-reduce-merge: simplified relational data processing on large clusters. SIGMOD 2007
[Afrati10] Afrati, F.N., Ullman, J.D.: Optimizing joins in a map-reduce environment. In: Proc. International Conference on Extending Database Technology. (2010)
[Husain10] Husain, M., Khan, L., Kantarcioglu, M., Thuraisingham, B.: Data intensive query processing for large RDF graphs using cloud computing tools. In: Cloud Computing (CLOUD), IEEE International Conference on. (2010)
[HiveQL] http://hadoop.apache.org/hive/
[JAQL] http://code.google.com/p/jaql
Thank You!
Environment
Node Specifications
• Single / duo core Intel X86
• 2.33 GHz processor speed
• 4G memory
• Red Hat Linux
• Pig 0.5.0
• Hadoop 0.20
• Block size 256MB
Benchmark Data*
• Log files of HTTP server traffic
• Column-delimited text file
• Rankings: pageRank | PageURL | avgDuration
• UserVisits: sourceIPAddr | destinationURL | visitDate | adRevenue | UserAgent | cCode | lCode | sKeyword | avgTimeOnSite
* Pavlo, A., Paulson, E., Rasin, A., Abadi, D.J., DeWitt, D.J., Madden, S., Stonebraker, M.: A Comparison of Approaches to Large-scale Data Analysis. In: Proc. of the 35th SIGMOD International Conference on Management of data (2009)
Scripts (Q1)
    -- Load space-delimited triples: $0 = subject, $1 = property, $2 = object
    A = load '/data/' using PigStorage(' ');
    -- Keep only the triples whose property is relevant to the query
    A1 = filter A by $1 eq 'pageRank' or $1 eq 'pageURL' or $1 eq 'destURL' or $1 eq 'srcIP' or $1 eq 'adRevenue' or ($1 eq 'type' and ($2 eq 'Ranking' or $2 eq 'UserVisits'));
    -- Group triples by subject
    B = group A1 by $0 PARALLEL 5;
    C = foreach B generate flatten(ReassembleRDF($1, 'pageURL|destURL', '1'));
    D = group C by $0 PARALLEL 5;
    E = foreach D generate flatten(ReassembleRDF($1, 'srcIP|', '2')) as (srcIP:chararray, vals:bytearray);
    store E into '/q1_app1';
Scripts (Q1)
    A1 = load '/data/' using PigStorage(' ');
    split A1 into pageRank IF $1 eq 'pageRank',
                  srcIP IF $1 eq 'srcIP',
                  pageURL IF $1 eq 'pageURL',
                  destURL IF $1 eq 'destURL',
                  adRevenue IF $1 eq 'adRevenue',
                  typeRanking IF $1 eq 'type' and $2 eq 'Ranking',
                  typeUV IF $1 eq 'type' and $2 eq 'UserVisits';
    Ranking = join pageURL by $0, pageRank by $0, typeRanking by $0 PARALLEL 5;
    UserVisits = join srcIP by $0, destURL by $0, adRevenue by $0, typeUV by $0 PARALLEL 5;
    C1 = join Ranking by $2, UserVisits by $5 PARALLEL 5;
    D1 = foreach C1 generate $11, $17, $5;
    store D1 into '/q1_app2';
Experiment Results
Percentage Performance Gain = ((exec time 1) - (exec time 2)) / (exec time 1), expressed as a percentage
Possible Optimizations (2)
• Coalesce join operations into as few MR cycles as possible
• Compute star patterns via m-way JOIN
  • Star-join using m-way JOIN = 1 MR cycle
  • Reduced #MR cycles → reduced I/O + communication costs
Structured Data Processing in Pig
UserVisits
srcIP          destURL   visitDate    adRevenue   …
158.112.27.3   url1      1979/12/12   339.08142   ….
158.112.27.3   url5      1979/12/15   180.334     ….
150.121.18.6   url1      1979/12/28   550.7889    ….
…              …         …            …           …

Ranking
pageRank   pageURL   avgDur
11         url1      96
23         url2      3
18         url3      87
…          …         …

Query: Retrieve the pageRank and adRevenue of pages visited by particular users between "1979/12/01" and "1979/12/30"
Plan:
    LOAD UserVisits
    LOAD Ranking
    FILTER(visitDate)
    JOIN UserVisits ON destURL, Ranking ON pageURL;
    STORE
JOIN: Pig Latin → MapReduce
    JOIN UserVisits ON destURL, Ranking ON pageURL;
[Figure: the map phase annotates each UserVisits and Ranking record with its join key (url1 or url2). The reduce phase packages the tuples that share a key: Reducer 1 receives the url1 records and Reducer 2 the url2 records.]

Input: UserVisits
srcIP          destURL   visitDate    adRev     …
158.112.27.3   url1      1979/12/12   339.081   …
158.112.27.3   url2      1979/12/15   180.334   …
150.121.18.6   url1      1979/12/28   550.78    …

Input: Ranking
pageRank   pageURL   avgDur
11         url1      96
23         url2      3

Output
…   srcIP          destURL   pageURL   pageRank   …
…   158.112.27.3   url1      url1      339.081    …
…   150.121.18.6   url1      url1      550.78     …
…   158.112.27.3   url2      url2      180.334    …
RDF Data Model (Resource Description Framework)
• Statements (triples): Subject Prop Object, e.g. (&UV1 srcIP 158.112.27.3)
• Graph representation

Sub    Prop          Obj
&R1    type          Ranking
&R1    pageRank      11
&R1    pageURL       Url1
&R1    avgDuration   97
&UV1   type          UserVisits
&UV1   srcIP         158.112.27.3
&UV1   destURL       url1
&UV1   adRevenue     339.08142
&UV1   visitDate     1979/12/12
&UV1   userAgent     SCOPE
&UV1   cCode         VNM
&UV1   iCode         VNM-KH
&UV1   sKeyword      comets
&UV1   avgTime       3

[Figure: graph representation of the Ranking and UserVisits resources.]
Example SPARQL Query
Data: Description of Vendors, their product Offers, and Reviews of products (BSBM* dataset)

Sub       Prop            Obj
&V1       type            Vendor
&V1       label           Vendor1
&V1       country         US
&V1       homepage        www.ven...
&Offer1   type            Offer
&Offer1   vendor          &V1
&Offer1   product         &P1
&Offer1   price           108
&Offer1   delDays         2
&Offer1   validToDate     01/01/2011
&Offer1   validFromDate   08/01/2011
&Rev1     type            Review
&Rev1     reviewFor       &P1
&Rev1     rating1         9
&Rev1     reviewer        &R1

Query: Retrieve the details of US-based Vendors
    SELECT ?vlabel ?hpage
    WHERE {
      ?v type Vendor .
      ?v country ?vcountry .
      ?v label ?vlabel .
      ?v homepage ?hpage .
      FILTER (?vcountry = "US")
    }
* http://www4.wiwiss.fu-berlin.de/bizer/BerlinSPARQLBenchmark/