Data Intensive Query Processing for Large RDF Graphs Using Cloud Computing Tools

Mohammad Farhan Husain, Latifur Khan, Murat Kantarcioglu and Bhavani Thuraisingham
Department of Computer Science, The University of Texas at Dallas
IEEE 2010 Cloud Computing

May 11, 2011
Taikyoung Kim

SNU IDB Lab.

Outline
– Introduction
– Proposed Architecture
– MapReduce Framework
– Results
– Conclusions and Future Works

Introduction
With the explosion of semantic web technologies, the need to store and retrieve large amounts of data is common
– The most prominent standards are RDF and SPARQL
Current frameworks do not scale for large RDF graphs
– E.g. Jena, Sesame, BigOWLIM
– Designed for a single-machine scenario
– Only 10 million triples can be processed in a Jena in-memory (2 GB) model

Introduction
A distributed system can be built to overcome the scalability and performance problems of current Semantic Web frameworks
However, there is no distributed repository for storing and managing RDF data
– Distributed database systems or relational databases are available
  But they have performance and scalability issues
– Possible to construct a distributed system from scratch
A better way is to use a Cloud Computing framework or a generic distributed storage system
– Just tailor it to meet the needs of semantic web data

Introduction
Hadoop is an emerging Cloud Computing tool
– Open source
– High fault tolerance
– Great reliability
– MapReduce programming model
We introduce a schema to store RDF data in Hadoop
Our goal is to answer SPARQL queries as efficiently as possible using summary statistics about the data
– Choose the best plan based on a cost model
– The plan determines the number of jobs and also their sequence and inputs

Introduction
Contributions
1. Design a storage scheme to store RDF data in HDFS*
2. Devise an algorithm that determines the best query plan for a SPARQL query
3. Build a cost model for query processing plans
4. Demonstrate that our approach performs better than Jena
* HDFS: Hadoop Distributed File System

Proposed Architecture
Data Generation and Storage
– Use the LUBM dataset (a benchmark dataset)
  Generates data in RDF/XML serialization format
– Convert the data to N-Triples for storage
  One RDF triple per line of a file
File Organization
– Do not store the data in a single file, since
  A file is the smallest unit of input to a MapReduce job
  A file is always read from the disk (no cache)
– Divide the data into multiple smaller files

RDF/XML:
<owl:Class rdf:ID="AdministrativeStaff">
  <rdfs:label>administrative staff worker</rdfs:label>
  <rdfs:subClassOf rdf:resource="#Employee" />
</owl:Class>

Converted triple:
AdministrativeStaff rdfs:subClassOf Employee
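The conversion step can be illustrated with a minimal Python sketch. It assumes the snippet is wrapped in an `rdf:RDF` root with the usual namespace declarations (not shown on the slide), and it emits the same abbreviated form used above rather than strict N-Triples, which would use full URIs in angle brackets and a trailing period. The function name `subclass_triples` is hypothetical.

```python
import xml.etree.ElementTree as ET

RDF = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
RDFS = "http://www.w3.org/2000/01/rdf-schema#"
OWL = "http://www.w3.org/2002/07/owl#"

# The slide's snippet, wrapped in a root element so that it parses on its own.
SNIPPET = f"""
<rdf:RDF xmlns:rdf="{RDF}" xmlns:rdfs="{RDFS}" xmlns:owl="{OWL}">
  <owl:Class rdf:ID="AdministrativeStaff">
    <rdfs:label>administrative staff worker</rdfs:label>
    <rdfs:subClassOf rdf:resource="#Employee" />
  </owl:Class>
</rdf:RDF>
"""

def subclass_triples(xml_text):
    """Yield (subject, predicate, object) for every rdfs:subClassOf element."""
    root = ET.fromstring(xml_text)
    for cls in root.iter(f"{{{OWL}}}Class"):
        subject = cls.get(f"{{{RDF}}}ID")
        for sub in cls.findall(f"{{{RDFS}}}subClassOf"):
            # Drop the leading '#' of the local reference.
            obj = sub.get(f"{{{RDF}}}resource").lstrip("#")
            yield (subject, "rdfs:subClassOf", obj)

# One triple per line, as in the storage scheme.
lines = [" ".join(triple) for triple in subclass_triples(SNIPPET)]
print(lines)
```

A real converter would handle every predicate and emit full URIs; this sketch only covers the one statement shown on the slide.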

Proposed Architecture
Predicate Split (PS)
– Divide the data according to the predicates
  Can cut down the search space if the query has no variable predicate
– Name the files after the predicates
  E.g. triples with predicate p1:pred go into a file named p1-pred

“rdf-type” file:
John | Student
James | Professor
“ub-advisor” file:
John | James
“ub-takesCourse” file:
John | DB
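A minimal Python sketch of the predicate split, using an in-memory dictionary as a stand-in for HDFS files (an assumption made for illustration; the function name `predicate_split` is hypothetical). Each key plays the role of a file name and each value holds the subject and object columns of that file.

```python
from collections import defaultdict

def predicate_split(triples):
    """Group (subject, object) pairs by predicate; 'p1:pred' becomes file 'p1-pred'."""
    files = defaultdict(list)
    for subject, predicate, obj in triples:
        file_name = predicate.replace(":", "-")
        # The predicate is implied by the file name, so only two columns are kept.
        files[file_name].append((subject, obj))
    return dict(files)

triples = [
    ("John", "rdf:type", "Student"),
    ("James", "rdf:type", "Professor"),
    ("John", "ub:advisor", "James"),
    ("John", "ub:takesCourse", "DB"),
]
ps_files = predicate_split(triples)
for name, rows in ps_files.items():
    print(name, rows)
```

Because the predicate is no longer stored in each row, a query with a concrete predicate only has to read the matching file.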

Proposed Architecture
Predicate Object Split (POS)
– Split Using Explicit Type Information of Object
  Divide the rdf-type file into as many files as the number of distinct objects the rdf:type predicate has
– Split Using Implicit Type Information of Object
  Keep all literal objects intact
  URI objects move into their respective files, named predicate_type
– The type information of a URI object can be retrieved from the rdf-type_* files

“rdf-type” file:
John | Student
James | Professor
URI1 | Professor
is split into
rdf-type_Student:
John
rdf-type_Professor:
James
URI1

“ub-advisor” file:
John | URI1
is split into
ub-advisor_Professor:
John | URI1
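The two POS steps can be sketched in Python on top of PS-style files held in a dictionary. This is an illustration under assumptions: every subject has at most one rdf:type, objects without a known type (including literals) stay in the plain predicate file, and the `ub-name` file and the function name `predicate_object_split` are hypothetical.

```python
def predicate_object_split(ps_files):
    """Split PS files further by the type of the object."""
    pos_files = {}
    type_of = {}
    # Explicit type information: one rdf-type_<Type> file per distinct object.
    for subject, rdf_type in ps_files.get("rdf-type", []):
        pos_files.setdefault(f"rdf-type_{rdf_type}", []).append((subject,))
        type_of[subject] = rdf_type
    # Implicit type information: look up the object's type in the rdf-type data.
    for name, rows in ps_files.items():
        if name == "rdf-type":
            continue
        for subject, obj in rows:
            obj_type = type_of.get(obj)
            target = f"{name}_{obj_type}" if obj_type else name
            pos_files.setdefault(target, []).append((subject, obj))
    return pos_files

ps_files = {
    "rdf-type": [("John", "Student"), ("James", "Professor"), ("URI1", "Professor")],
    "ub-advisor": [("John", "URI1")],
    "ub-name": [("John", '"John Smith"')],  # literal object, stays in place
}
pos_files = predicate_object_split(ps_files)
for name, rows in pos_files.items():
    print(name, rows)
```

The rdf-type files keep only the subject column, since the type is already encoded in the file name.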

Proposed Architecture
Space benefits
Special case (the type in the query is not a leaf in the ontology)
– Search all the files having leaf types of the subtree rooted at that type node
  E.g. type-FullProfessor, type-AssociateProfessor, etc.
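A short Python sketch of the special case: given a class hierarchy, collect the leaf types under a node and map them to file names. The hierarchy below is a hypothetical fragment that contains only the two subclasses named on the slide, and `leaf_types` is an illustrative helper, not part of the original system.

```python
def leaf_types(subclasses, root):
    """Return the leaf classes of the subtree rooted at root (root itself if it is a leaf)."""
    children = subclasses.get(root, [])
    if not children:
        return [root]
    leaves = []
    for child in children:
        leaves.extend(leaf_types(subclasses, child))
    return leaves

# Hypothetical ontology fragment: parent class -> direct subclasses.
hierarchy = {"Professor": ["FullProfessor", "AssociateProfessor"]}

files_to_search = [f"type-{t}" for t in leaf_types(hierarchy, "Professor")]
print(files_to_search)
```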

MapReduce Framework
Challenges
1. Determine the number of jobs needed to answer a query
2. Minimize the size of intermediate files
3. Determine the number of reducers
Use the Map phase for selection and the Reduce phase for join
Often require more than one job
– No inter-process communication
– Each job may depend on the output of the previous job
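The division of labour between the two phases can be simulated in plain Python, without Hadoop. This sketch assumes a single job that joins two triple patterns on their subject variable; the data, the tags and the function names are hypothetical.

```python
from collections import defaultdict
from itertools import product

def map_phase(files, patterns):
    """Selection: emit (join_key, (tag, object)) for every matching triple.

    patterns is a list of (file_name, tag, bound_object); a bound_object of
    None means the object position of the triple pattern is a variable.
    """
    for file_name, tag, bound_object in patterns:
        for subject, obj in files[file_name]:
            if bound_object is None or obj == bound_object:
                yield subject, (tag, obj)

def reduce_phase(pairs, tags):
    """Join: combine the values that share a key and cover every tag."""
    groups = defaultdict(dict)
    for key, (tag, obj) in pairs:
        groups[key].setdefault(tag, []).append(obj)
    joined = []
    for key, by_tag in groups.items():
        if all(tag in by_tag for tag in tags):
            for combo in product(*(by_tag[tag] for tag in tags)):
                joined.append((key, *combo))
    return joined

# Query sketch: ?X ub:advisor ?Y . ?X ub:takesCourse ?Z (join on ?X)
files = {
    "ub-advisor": [("John", "James"), ("Mary", "James")],
    "ub-takesCourse": [("John", "DB")],
}
patterns = [("ub-advisor", "advisor", None), ("ub-takesCourse", "course", None)]
result = reduce_phase(map_phase(files, patterns), ["advisor", "course"])
print(result)
```

In a real job the grouping by key is done by the framework's shuffle step; here a dictionary plays that role.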

MapReduce Framework
Input Files Selection
Select all files when
– P: variable & O: variable & O: has no type info.
– O: concrete
Select all predicate files having objects of that type when
– P: variable & O: has type info.
Select all files for the predicate when
– P: concrete & O: variable & O: has no type info.
Select the predicate file having objects of that type when
– Query has type information of the object
Select all subclasses which are leaves in the subtree rooted at the type node when
– Type associated with a predicate is not a leaf in the ontology

P: predicate, O: object
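The selection rules can be written as one Python function over POS-style file names of the form predicate_type. This is a sketch under assumptions: predicate names contain no underscore, a concrete predicate with a concrete object falls back to all files for that predicate (a case the rules above do not spell out), and the expansion of non-leaf types is left out. The function name and its parameters are hypothetical.

```python
def select_input_files(all_files, predicate=None, object_is_concrete=False, object_type=None):
    """Return the files a triple pattern has to read; predicate=None means variable."""
    def type_part(name):
        parts = name.split("_", 1)
        return parts[1] if len(parts) == 2 else None

    def predicate_part(name):
        return name.split("_", 1)[0]

    if predicate is None:
        if object_type is not None and not object_is_concrete:
            # P variable, O has type info: every predicate file for that type.
            return [f for f in all_files if type_part(f) == object_type]
        # P variable and O untyped, or O concrete: all files.
        return list(all_files)
    if object_type is not None:
        # P concrete and the object's type is known: one predicate_type file.
        return [f for f in all_files if f == f"{predicate}_{object_type}"]
    # P concrete, no type info: all files for the predicate.
    return [f for f in all_files if predicate_part(f) == predicate]

all_files = ["rdf-type_Student", "rdf-type_Professor", "ub-advisor_Professor", "ub-name"]
print(select_input_files(all_files))
print(select_input_files(all_files, object_type="Professor"))
print(select_input_files(all_files, predicate="rdf-type"))
print(select_input_files(all_files, predicate="ub-advisor", object_type="Professor"))
```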

MapReduce Framework
Cost Estimation for Query Processing
Definition 1 (Conflicting Joins, CJ)
– A pair of joins on different variables sharing a triple pattern
– E.g. JoinA (Line 1 & Line 3) and JoinB (Line 3 & Line 4) are CJ (they share Line 3)
Definition 2 (Non-Conflicting Joins, NCJ)
– A pair of joins not sharing any triple pattern, or
– A pair of joins sharing a triple pattern where the joins are on the same variable
– E.g. Join1 (Line 1 & Line 3) and Join2 (Line 2 & Line 4) are NCJ

[Figure: example query with triple patterns Line 1 to Line 4]
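The two definitions reduce to a single predicate. In this Python sketch a join is modelled as a pair (join variable, set of triple-pattern line numbers); the variable names `?X` and `?Y` are assumptions, since the example query itself is not reproduced here.

```python
def is_conflicting(join_a, join_b):
    """True if the joins share a triple pattern and are on different variables."""
    variable_a, patterns_a = join_a
    variable_b, patterns_b = join_b
    shares_pattern = bool(set(patterns_a) & set(patterns_b))
    return shares_pattern and variable_a != variable_b

join_a = ("?X", {1, 3})   # JoinA: Line 1 & Line 3
join_b = ("?Y", {3, 4})   # JoinB: Line 3 & Line 4
join_2 = ("?Y", {2, 4})   # Join2: Line 2 & Line 4

print(is_conflicting(join_a, join_b))  # share Line 3, different variables
print(is_conflicting(join_a, join_2))  # no shared triple pattern
print(is_conflicting(join_b, join_2))  # share Line 4, same variable
```

Any pair that is not conflicting is non-conflicting, so no second function is needed.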

MapReduce Framework
Map Input phase (MI)
– Read the triples from the selected input files
– Cost equals the total number of triples in the selected input files
Map Output phase (MO)
– No bound variable case (e.g. [?X ub:worksFor ?Y])
  MO cost = MI cost (all of the triples are transformed into key-value pairs)
– Bound variable case (e.g. [?Y ub:subOrganizationOf <http://www.U0.edu>])
  Use summary statistics for selectivity
  The cost is the result of the bound component selectivity estimation

MI: cost of Map Input phase
MO: cost of Map Output phase
RI: cost of Reduce Input phase
RO: cost of Reduce Output phase

MapReduce Framework
Reduce Input phase (RI)
– Read Map output via HTTP and then sort it by key values
– RI cost = MO cost
Reduce Output phase (RO)
– Deals with performing the joins
– Use the join triple pattern selectivity summary statistics (no longer used)
– For the intermediate jobs, take an upper bound on the Reduce Output
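The four phase costs can be combined in a small Python sketch. Several points are assumptions rather than statements from the slides: selectivity is modelled as the fraction of triples that survive the map phase (1.0 when nothing is bound), the reduce output bound is supplied by the caller, and the job cost is taken as the plain sum of the four phases. The function name `job_cost` is hypothetical.

```python
def job_cost(pattern_inputs, reduce_output_bound):
    """Estimate the cost of one job.

    pattern_inputs: list of (triples_in_selected_files, selectivity) pairs,
    one per triple pattern handled by the job.
    """
    mi = sum(triples for triples, _ in pattern_inputs)                # Map Input
    mo = sum(triples * sel for triples, sel in pattern_inputs)        # Map Output
    ri = mo                                                           # Reduce Input = MO
    ro = reduce_output_bound                                          # Reduce Output (upper bound)
    return {"MI": mi, "MO": mo, "RI": ri, "RO": ro, "total": mi + mo + ri + ro}

# One unbound pattern over 1000 triples, one bound pattern over 400 triples
# where an assumed 25% of the triples match the bound component.
cost = job_cost([(1000, 1.0), (400, 0.25)], reduce_output_bound=50)
print(cost)
```

For a multi-job plan, the per-job totals would be added up in the same way, with each intermediate job contributing its upper bound as the next job's input size.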

MapReduce Framework
Query Plan Generation
Need to determine the best query plan
– Possible plans to answer a query have different performance (time & space)
Plan Generation
– Greedy approach
  Simple
  Generates a plan very quickly
  No guarantee of the best plan
– Exhaustive search approach (ours)
  Generate all possible plans

MapReduce Framework
Query Plan Generation
Plan Generation by Graph Coloring
– Generate all combinations
– For a job, select a subset of NCJ
  Dynamically determine the number of jobs
– Once the plan is generated, determine the cost using the cost model

[Figure: Triple Pattern Graph (Line 1 to Line 4) and Join Graph, labelled job1 and job2]
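An exhaustive plan enumeration can be sketched in Python as follows. This is a plain recursive enumeration, not the graph coloring formulation itself, and it makes simplifying assumptions: the set of joins is fixed up front, each job takes any non-empty subset of pairwise non-conflicting joins, and the number of jobs is used as a stand-in for the cost model. The join variables and function names are hypothetical.

```python
from itertools import combinations

def is_conflicting(join_a, join_b):
    """Joins conflict if they share a triple pattern and differ in join variable."""
    return bool(join_a[1] & join_b[1]) and join_a[0] != join_b[0]

def enumerate_plans(joins):
    """Yield every plan: a list of jobs, each job a set of non-conflicting join names."""
    def compatible(subset):
        return all(not is_conflicting(joins[a], joins[b])
                   for a, b in combinations(subset, 2))

    def recurse(remaining):
        if not remaining:
            yield []
            return
        for size in range(1, len(remaining) + 1):
            for subset in combinations(remaining, size):
                if compatible(subset):
                    rest = tuple(n for n in remaining if n not in subset)
                    for tail in recurse(rest):
                        yield [set(subset)] + tail

    yield from recurse(tuple(sorted(joins)))

joins = {
    "A": ("?X", {1, 3}),  # Line 1 & Line 3
    "B": ("?Y", {3, 4}),  # Line 3 & Line 4, conflicts with A
    "C": ("?Y", {2, 4}),  # Line 2 & Line 4
}
plans = list(enumerate_plans(joins))
best = min(plans, key=len)  # stand-in cost: number of jobs
print(len(plans), best)
```

Because A and B conflict, no plan can finish in one job; the enumeration finds two-job plans such as running A and C together and B afterwards.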

Results
Comparison with Other Frameworks
Performance comparison between
– Our framework
– Jena In-Memory and SDB model
– BigOWLIM
System for testing Jena and BigOWLIM
– 2.80 GHz quad core processor
– 8 GB main memory (BigOWLIM needed 7 GB for the billion-triple dataset)
– 1 TB disk space
Cluster of 10 nodes
– Pentium IV 2.80 GHz processor
– 4 GB main memory
– 640 GB disk space

Results
Jena In-Memory Model worked well for small datasets
– Became slower as the dataset size grew and eventually ran out of memory
BigOWLIM has significantly higher loading time than ours
– It builds indexes and pre-fetches triples to main memory

Hadoop cluster takes less than 1 minute to start up
– Excluding loading time, ours is faster when there is no bound object

Results

As the size of the dataset grows, the increase in time to answer a query does not grow proportionately

Results
Experiment with Number of Reducers
As the number of reducers increases, queries are answered faster
The sizes of the map output of queries 1, 12 and 13 are so small
– Can be processed with one reducer

Conclusions and Future Works
We proposed
– A schema to store RDF data in plain text files
– An algorithm to determine the best processing plan to answer a SPARQL query
– A cost model to be used by the algorithm
Our system is highly scalable
– Query answer time does not increase as much as data size
We will extend the work in the future
– Build a cloud service based on the framework
– Investigate the skewed distribution of the data
– Experiment with heterogeneous clusters

Thank you

Question?
