Data Intensive Query Processing for Semantic Web Data Using Hadoop and MapReduce


Page 1: Data Intensive Query Processing for Semantic Web Data Using Hadoop and MapReduce

Data Intensive Query Processing for Semantic Web Data Using Hadoop and MapReduce

Dr. Mohammad Farhan Husain (Amazon)
Dr. Bhavani Thuraisingham
Dr. Latifur Khan

Department of Computer Science
University of Texas at Dallas

Page 2: Data Intensive Query Processing for Semantic Web Data Using Hadoop and MapReduce

Outline

•Semantic Web Technologies & Cloud Computing Frameworks
•Goal & Motivation
•Current Approaches
•System Architecture & Storage Schema
•SPARQL Query by MapReduce
•Query Plan Generation
•Experiment
•Future Works

Page 3: Data Intensive Query Processing for Semantic Web Data Using Hadoop and MapReduce

Semantic Web Technologies

•Data in machine understandable format
•Infer new knowledge
•Standards
▫Data representation – RDF triples
▫Ontology – OWL, DAML
▫Query language – SPARQL

Example triple (Subject Predicate Object):

http://test.com/s1 foaf:name "John Smith"
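For reference, such a triple can be built with the Jena API (the framework this project later integrates with). A minimal sketch, assuming the standard FOAF namespace:

```java
import org.apache.jena.rdf.model.Model;
import org.apache.jena.rdf.model.ModelFactory;
import org.apache.jena.rdf.model.Property;
import org.apache.jena.rdf.model.Resource;

public class TripleExample {
    public static void main(String[] args) {
        Model model = ModelFactory.createDefaultModel();
        // Subject: a resource identified by a URI
        Resource subject = model.createResource("http://test.com/s1");
        // Predicate: foaf:name (standard FOAF namespace assumed)
        Property foafName = model.createProperty("http://xmlns.com/foaf/0.1/", "name");
        // Object: a plain literal
        subject.addProperty(foafName, "John Smith");
        // Serialize the single triple in N-Triples form
        model.write(System.out, "N-TRIPLE");
    }
}
```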

Page 4: Data Intensive Query Processing for Semantic Web Data Using Hadoop and MapReduce

Cloud Computing Frameworks

•Proprietary
▫Amazon S3
▫Amazon EC2
▫Force.com
•Open source tools
▫Hadoop – Apache's open source implementation of Google's proprietary GFS file system
▫MapReduce – functional programming paradigm using key-value pairs

Page 5: Data Intensive Query Processing for Semantic Web Data Using Hadoop and MapReduce

Outline

•Semantic Web Technologies & Cloud Computing Frameworks
•Goal & Motivation
•Current Approaches
•System Architecture & Storage Schema
•SPARQL Query by MapReduce
•Query Plan Generation
•Experiment
•Future Works

Page 6: Data Intensive Query Processing for Semantic Web Data Using Hadoop and MapReduce

Goal

•To build efficient storage using Hadoop for large amounts of data (e.g., a billion triples)
•To build an efficient query mechanism
•Publish as an open source project
▫http://code.google.com/p/hadooprdf/
•Integrate with Jena as a Jena Model

Page 7: Data Intensive Query Processing for Semantic Web Data Using Hadoop and MapReduce

Motivation

•Current Semantic Web frameworks do not scale to large numbers of triples, e.g.
▫Jena In-Memory, Jena RDB, Jena SDB, Jena TDB
▫AllegroGraph
▫Virtuoso Universal Server
▫BigOWLIM
▫RDF-3X
▫Sesame
•There is a lack of a distributed framework with persistent storage
•Hadoop runs on low-end hardware and provides a distributed framework with high fault tolerance and reliability

Page 8: Data Intensive Query Processing for Semantic Web Data Using Hadoop and MapReduce

Outline

•Semantic Web Technologies & Cloud Computing Frameworks
•Goal & Motivation
•Current Approaches
•System Architecture & Storage Schema
•SPARQL Query by MapReduce
•Query Plan Generation
•Experiment
•Future Works

Page 9: Data Intensive Query Processing for Semantic Web Data Using Hadoop and MapReduce

Current Approaches

•State-of-the-art approach
▫Store data in HDFS and process queries outside of Hadoop
▫Done in the BIOMANTA [1] project (details of querying could not be found)
•Our approach
▫Store RDF data in HDFS and query through MapReduce programming

1. http://biomanta.org/

Page 10: Data Intensive Query Processing for Semantic Web Data Using Hadoop and MapReduce

Contributions

•Scalable, fault-tolerant framework for large RDF graphs
•Schema optimized for MapReduce
•Query rewriting algorithm leveraging schema and ontology
•Query plan generation algorithms
▫Heuristics-based greedy algorithms
▫Exhaustive search algorithm based on graph coloring
•Query plan generation algorithm for queries with OPTIONAL blocks

Page 11: Data Intensive Query Processing for Semantic Web Data Using Hadoop and MapReduce

Outline

•Semantic Web Technologies & Cloud Computing Frameworks
•Goal & Motivation
•Current Approaches
•System Architecture & Storage Schema
•SPARQL Query by MapReduce
•Query Plan Generation
•Experiment
•Future Works

Page 12: Data Intensive Query Processing for Semantic Web Data Using Hadoop and MapReduce

System Architecture

[Figure: system architecture. An LUBM Data Generator produces RDF/XML, which the Preprocessor (N-Triples Converter, Predicate Based Splitter, Object Type Based Splitter) converts into preprocessed data stored in the Hadoop Distributed File System / Hadoop Cluster. The MapReduce Framework (Query Rewriter, Query Plan Generator, Plan Executor) receives a query (1), submits jobs to the cluster (2), and returns the answer (3).]

Page 13: Data Intensive Query Processing for Semantic Web Data Using Hadoop and MapReduce

Storage Schema

•Data in N-Triples
•Using namespaces
▫Example: http://utdallas.edu/res1 → utd:resource1
•Predicate based Splits (PS)
▫Split data according to predicates
•Predicate Object based Splits (POS)
▫Split further according to the rdf:type of objects

Page 14: Data Intensive Query Processing for Semantic Web Data Using Hadoop and MapReduce

Example

Raw triples:
D0U0:GraduateStudent20 rdf:type lehigh:GraduateStudent
lehigh:University0 rdf:type lehigh:University
D0U0:GraduateStudent20 lehigh:memberOf lehigh:University0

After Predicate Split (PS):
File rdf_type:
D0U0:GraduateStudent20 lehigh:GraduateStudent
lehigh:University0 lehigh:University
File lehigh_memberOf:
D0U0:GraduateStudent20 lehigh:University0

After Predicate Object Split (POS):
File rdf_type_GraduateStudent: D0U0:GraduateStudent20
File rdf_type_University: lehigh:University0
File lehigh_memberOf_University: D0U0:GraduateStudent20 lehigh:University0
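A minimal sketch of the PS/POS file-naming convention illustrated above (class and method names are illustrative, not taken from the HadoopRDF code):

```java
public class SplitNaming {

    // Predicate Split (PS): one file per predicate, e.g. "lehigh_memberOf".
    static String psFile(String predicate) {
        return predicate.replace(':', '_');
    }

    // Predicate Object Split (POS): the PS file is split further by the
    // rdf:type of the object, e.g. "rdf_type_GraduateStudent" or
    // "lehigh_memberOf_University".
    static String posFile(String predicate, String objectType) {
        return psFile(predicate) + "_" + localName(objectType);
    }

    static String localName(String qname) {
        int i = qname.indexOf(':');
        return i < 0 ? qname : qname.substring(i + 1);
    }

    public static void main(String[] args) {
        System.out.println(psFile("lehigh:memberOf"));                       // lehigh_memberOf
        System.out.println(posFile("rdf:type", "lehigh:GraduateStudent"));   // rdf_type_GraduateStudent
        System.out.println(posFile("lehigh:memberOf", "lehigh:University")); // lehigh_memberOf_University
    }
}
```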

Page 15: Data Intensive Query Processing for Semantic Web Data Using Hadoop and MapReduce

Space Gain
•Example: data size at various steps for LUBM1000

Steps                          Number of Files   Size (GB)   Space Gain
N-Triples                      20020             24          --
Predicate Split (PS)           17                7.1         70.42%
Predicate Object Split (POS)   41                6.6         72.5%

Page 16: Data Intensive Query Processing for Semantic Web Data Using Hadoop and MapReduce

Outline
•Semantic Web Technologies & Cloud Computing Frameworks
•Goal & Motivation
•Current Approaches
•System Architecture & Storage Schema
•SPARQL Query by MapReduce
•Query Plan Generation
•Experiment
•Future Works

Page 17: Data Intensive Query Processing for Semantic Web Data Using Hadoop and MapReduce

SPARQL Query
•SPARQL – SPARQL Protocol And RDF Query Language
•Example:

SELECT ?x ?y WHERE {
  ?z foaf:name ?x .
  ?z foaf:age ?y
}

[Figure: the query, sample data, and the resulting bindings.]
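For reference, such a query can be run against a Jena Model with the standard Jena query API; the empty in-memory model below is only a stand-in for the HadoopRDF-backed model the project aims to expose:

```java
import org.apache.jena.query.*;
import org.apache.jena.rdf.model.Model;
import org.apache.jena.rdf.model.ModelFactory;

public class SparqlExample {
    public static void main(String[] args) {
        Model model = ModelFactory.createDefaultModel();  // stand-in model
        String q = "PREFIX foaf: <http://xmlns.com/foaf/0.1/> "
                 + "SELECT ?x ?y WHERE { ?z foaf:name ?x . ?z foaf:age ?y }";
        QueryExecution qe = QueryExecutionFactory.create(q, model);
        try {
            ResultSet results = qe.execSelect();
            while (results.hasNext()) {
                QuerySolution row = results.nextSolution();
                System.out.println(row.get("x") + " " + row.get("y"));
            }
        } finally {
            qe.close();
        }
    }
}
```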

Page 18: Data Intensive Query Processing for Semantic Web Data Using Hadoop and MapReduce

Schema Based SPARQL Query Rewriting• Example query

SELECT ?p WHERE{ ?x rdf:type lehigh:Department ?p lehigh:worksFor ?x ?x subOrganizationOf http://University0.edu}

• Rewritten querySELECT ?p WHERE{ ?p lehigh:worksFor_Department ?x ?x subOrganizationOf http://University0.edu}

Page 19: Data Intensive Query Processing for Semantic Web Data Using Hadoop and MapReduce

Schema Based SPARQL Query Rewriting
•Based on rdfs:range defined in the ontology
•Example:
ub:subOrganizationOf rdfs:range ub:University
•Rewritten query:

SELECT ?p WHERE {
  ?p ub:worksFor_Department ?x .
  ?x ub:subOrganizationOf_University http://University0.edu
}
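A minimal sketch of this rdfs:range-based rewriting rule, assuming a simple predicate-to-range lookup table extracted from the ontology (class and method names are illustrative):

```java
import java.util.Map;

public class QueryRewriterSketch {
    // Maps predicate -> rdfs:range type, loaded from the ontology.
    private final Map<String, String> rdfsRange;

    QueryRewriterSketch(Map<String, String> rdfsRange) {
        this.rdfsRange = rdfsRange;
    }

    // If the ontology defines rdfs:range for the predicate, rewrite it to the
    // POS predicate so the query reads only the smaller POS split file.
    String rewritePredicate(String predicate) {
        String range = rdfsRange.get(predicate);
        return range == null ? predicate : predicate + "_" + range;
    }

    public static void main(String[] args) {
        QueryRewriterSketch r = new QueryRewriterSketch(
                Map.of("ub:subOrganizationOf", "University",
                       "ub:worksFor", "Department"));
        System.out.println(r.rewritePredicate("ub:subOrganizationOf"));
        // -> ub:subOrganizationOf_University
    }
}
```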

Page 20: Data Intensive Query Processing for Semantic Web Data Using Hadoop and MapReduce

Inside Hadoop MapReduce Job

[Figure: data flow of the MapReduce join for the rewritten query.

INPUT – two split files:
File subOrganizationOf_University:
Department1 http://University0.edu
Department2 http://University1.edu
File worksFor_Department:
Professor1 Department1
Professor2 Department2

MAP – mappers key each tuple by the shared variable (the department) and tag the value with its relation: WF#Professor1 and WF#Professor2 from worksFor_Department; SO#http://University0.edu from subOrganizationOf_University, after applying the filter Object == http://University0.edu.

SHUFFLE & SORT – values are grouped by department:
Department1 → SO#http://University0.edu, WF#Professor1
Department2 → WF#Professor2

REDUCE – each reducer joins the grouped values; only Department1 has both an SO# and a WF# value, so the output is Professor1.]
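To make the data flow concrete, here is a minimal sketch of such a join job in Hadoop's Java MapReduce API. The WF#/SO# tags and split-file names mirror the figure; the class names and the hard-coded filter are illustrative assumptions, not the project's actual code:

```java
import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileSplit;

public class JoinJobSketch {

    /** Tags each input line with its relation and keys it by the join variable. */
    public static class JoinMapper extends Mapper<LongWritable, Text, Text, Text> {
        @Override
        protected void map(LongWritable key, Text value, Context ctx)
                throws IOException, InterruptedException {
            String file = ((FileSplit) ctx.getInputSplit()).getPath().getName();
            String[] parts = value.toString().split("\\s+");
            if (file.startsWith("lehigh_worksFor_Department")) {
                // parts = {professor, department}; key by department
                ctx.write(new Text(parts[1]), new Text("WF#" + parts[0]));
            } else if (file.startsWith("lehigh_subOrganizationOf_University")) {
                // parts = {department, university}; filter on the bound object
                if (parts[1].equals("http://University0.edu")) {
                    ctx.write(new Text(parts[0]), new Text("SO#" + parts[1]));
                }
            }
        }
    }

    /** Emits professors only for departments that also joined with University0. */
    public static class JoinReducer extends Reducer<Text, Text, Text, Text> {
        @Override
        protected void reduce(Text dept, Iterable<Text> values, Context ctx)
                throws IOException, InterruptedException {
            boolean matched = false;
            StringBuilder profs = new StringBuilder();
            for (Text v : values) {
                String s = v.toString();
                if (s.startsWith("SO#")) matched = true;
                else if (s.startsWith("WF#")) profs.append(s.substring(3)).append(' ');
            }
            if (matched && profs.length() > 0) {
                ctx.write(dept, new Text(profs.toString().trim()));
            }
        }
    }
}
```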

Page 21: Data Intensive Query Processing for Semantic Web Data Using Hadoop and MapReduce

Outline
•Semantic Web Technologies & Cloud Computing Frameworks
•Goal & Motivation
•Current Approaches
•System Architecture & Storage Schema
•SPARQL Query by MapReduce
•Query Plan Generation
•Experiment
•Future Works

Page 22: Data Intensive Query Processing for Semantic Web Data Using Hadoop and MapReduce

Query Plan Generation
•Challenge
▫One Hadoop job may not be sufficient to answer a query
 In a single Hadoop job, a single triple pattern cannot take part in joins on more than one variable simultaneously

Page 23: Data Intensive Query Processing for Semantic Web Data Using Hadoop and MapReduce

Example
•Example query:

SELECT ?X ?Y ?Z WHERE {
  ?X pred1 obj1 .
  subj2 ?Z obj2 .
  subj3 ?X ?Z .
  ?Y pred4 obj4 .
  ?Y pred5 ?X
}

•Simplified view (join variables of each triple pattern):
1. X
2. Z
3. XZ
4. Y
5. XY

Page 24: Data Intensive Query Processing for Semantic Web Data Using Hadoop and MapReduce

Join Graph & Hadoop Jobs

[Figure: join graph of the five triple patterns. Edges connect patterns sharing a variable: 1–3, 1–5, and 3–5 on X; 2–3 on Z; 4–5 on Y. Highlighted edge sets illustrate two valid jobs, in which each pattern joins on at most one variable, and one invalid job, in which a pattern would have to join on two different variables (e.g. pattern 3 on both X and Z) simultaneously.]

Page 25: Data Intensive Query Processing for Semantic Web Data Using Hadoop and MapReduce

Possible Query Plans
•A. job1: (x, xz, xy) = yz; job2: (yz, y) = z; job3: (z, z) = done

[Figure: job 1 joins patterns 1, 3, and 5 on X, leaving an intermediate result with variables Y and Z alongside patterns 2 and 4; job 2 joins that intermediate with pattern 4 on Y, leaving Z; job 3 joins with pattern 2 on Z, producing the final result.]

Page 26: Data Intensive Query Processing for Semantic Web Data Using Hadoop and MapReduce

Possible Query Plans
•B. job1: (y, xy) = x and (z, xz) = x; job2: (x, x, x) = done

[Figure: job 1 performs two joins, patterns 4 and 5 on Y and patterns 2 and 3 on Z, each producing an intermediate result on X; job 2 joins the two intermediates with pattern 1 on X, producing the final result.]

Page 27: Data Intensive Query Processing for Semantic Web Data Using Hadoop and MapReduce

Query Plan Generation
•Algorithm for query plan generation
▫A query plan is a sequence of Hadoop jobs which answers the query
•Exploits the fact that in a single Hadoop job, a single triple pattern can take part in more than one join on a single variable simultaneously
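The Contributions slide mentions both heuristics-based greedy algorithms and an exhaustive graph-coloring search. As a rough illustration of the greedy idea only, the following sketch repeatedly joins, in one job, all patterns sharing the most frequent join variable; it is a simplified assumption, not the published algorithm (real plans, like Plan B above, may run several joins on different variables in one job):

```java
import java.util.*;

public class GreedyPlannerSketch {

    /** One Hadoop job: joins every pattern containing `var` into one result. */
    static Set<Character> join(List<Set<Character>> group, char var) {
        Set<Character> result = new HashSet<>();
        for (Set<Character> p : group) result.addAll(p);
        result.remove(var);                    // the join variable is consumed
        return result;
    }

    public static void plan(List<Set<Character>> patterns) {
        List<Set<Character>> current = new ArrayList<>(patterns);
        int job = 1;
        while (current.size() > 1) {
            // Greedy choice: the variable occurring in the most patterns.
            Map<Character, Integer> freq = new HashMap<>();
            for (Set<Character> p : current)
                for (char v : p) freq.merge(v, 1, Integer::sum);
            char best = Collections.max(freq.entrySet(),
                    Map.Entry.comparingByValue()).getKey();

            List<Set<Character>> group = new ArrayList<>();
            List<Set<Character>> rest = new ArrayList<>();
            for (Set<Character> p : current)
                (p.contains(best) ? group : rest).add(p);
            if (group.size() < 2) break;       // no join possible on any variable

            System.out.println("Job " + job++ + ": join " + group + " on " + best);
            rest.add(join(group, best));
            current = rest;
        }
    }

    public static void main(String[] args) {
        // The five simplified patterns from the example: X, Z, XZ, Y, XY
        plan(List.of(Set.of('x'), Set.of('z'), Set.of('x', 'z'),
                     Set.of('y'), Set.of('x', 'y')));
    }
}
```

On the five example patterns this greedy pass produces a three-job plan along the lines of Plan A.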

Page 28: Data Intensive Query Processing for Semantic Web Data Using Hadoop and MapReduce

Experiment

•Dataset and queries
•Cluster description
•Comparison with Jena In-Memory, SDB and BigOWLIM frameworks
•Comparison with RDF-3X
•Experiments with number of reducers
•Algorithm runtimes: greedy vs. exhaustive
•Some query results

Page 29: Data Intensive Query Processing for Semantic Web Data Using Hadoop and MapReduce

Dataset And Queries

•LUBM
▫Dataset generator
▫14 benchmark queries
▫Generates data about imaginary universities
▫Used for query execution performance comparison by many researchers

Page 30: Data Intensive Query Processing for Semantic Web Data Using Hadoop and MapReduce

Our Clusters
•10-node cluster in SAIAL lab
▫4 GB main memory
▫Intel Pentium IV 3.0 GHz processor
▫640 GB hard drive
•OpenCirrus HP Labs test bed

Page 31: Data Intensive Query Processing for Semantic Web Data Using Hadoop and MapReduce

Comparison: LUBM Query 2

Page 32: Data Intensive Query Processing for Semantic Web Data Using Hadoop and MapReduce

Comparison: LUBM Query 9

Page 33: Data Intensive Query Processing for Semantic Web Data Using Hadoop and MapReduce

Comparison: LUBM Query 12

Page 34: Data Intensive Query Processing for Semantic Web Data Using Hadoop and MapReduce

Comparison with RDF-3X

Page 35: Data Intensive Query Processing for Semantic Web Data Using Hadoop and MapReduce

Experiment with Number of Reducers

Page 36: Data Intensive Query Processing for Semantic Web Data Using Hadoop and MapReduce

Greedy vs. Exhaustive Plan Generation

Page 37: Data Intensive Query Processing for Semantic Web Data Using Hadoop and MapReduce

Some Query Results

[Chart: query runtimes in seconds against dataset size in millions of triples.]

Page 38: Data Intensive Query Processing for Semantic Web Data Using Hadoop and MapReduce

Access Control: Motivation

•It is important to keep the data safe from unwanted access.
•Encryption can be used, but it carries little or no semantic value.
•By issuing and manipulating different levels of access control, an agent can access the data intended for it or make permitted inferences.

Page 39: Data Intensive Query Processing for Semantic Web Data Using Hadoop and MapReduce

Access Control in Our Architecture

[Figure: the MapReduce Framework (Query Rewriter, Query Plan Generator, Plan Executor) with an Access Control module linked to all of its components.]

Page 40: Data Intensive Query Processing for Semantic Web Data Using Hadoop and MapReduce

Access Control Terminology

•Access Tokens (AT): Denoted by integers; they allow agents to access security-relevant data.
•Access Token Tuples (ATT): Have the form <AccessToken, Element, ElementType, ElementName>, where Element can be Subject, Object, or Predicate, and ElementType can be URI, DataType, Literal, Model (Subject), or BlankNode.
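One way to encode this structure, as a minimal Java sketch (type and field names are illustrative, not taken from the system's code):

```java
public record AccessTokenTuple(
        int accessToken,          // AT: an integer token id
        Element element,          // Subject, Object, or Predicate
        ElementType elementType,  // URI, DataType, Literal, Model, BlankNode
        String elementName) {     // e.g. "isPaid" or "MichaelScott"

    public enum Element { SUBJECT, OBJECT, PREDICATE }
    public enum ElementType { URI, DATA_TYPE, LITERAL, MODEL, BLANK_NODE }

    public static void main(String[] args) {
        // The ATT <1, Predicate, isPaid, _> from the next slide
        System.out.println(new AccessTokenTuple(
                1, Element.PREDICATE, ElementType.URI, "isPaid"));
    }
}
```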

Page 41: Data Intensive Query Processing for Semantic Web Data Using Hadoop and MapReduce

Six Access Control Levels
•Predicate Data Access: Defined for a particular predicate. An agent can access the predicate file. For example, an agent possessing ATT <1, Predicate, isPaid, _> can access the entire predicate file isPaid.
•Predicate and Subject Data Access: More restrictive than the previous one. Combining a Subject ATT with a Predicate Data Access ATT having the same AT grants the agent access to a specific subject of a specific predicate. For example, having the ATTs <1, Predicate, isPaid, _> and <1, Subject, URI, MichaelScott> permits an agent with AT 1 to access the subject with URI MichaelScott of predicate isPaid.

Page 42: Data Intensive Query Processing for Semantic Web Data Using Hadoop and MapReduce

Access Control Levels (Cont.)

•Predicate and Object: This access level permits a principal to extract the names of subjects satisfying a particular predicate and object.
•Subject Access: One of the less restrictive access control levels. The subject can be a URI, DataType, or BlankNode.
•Object Access: The object can be a URI, DataType, Literal, or BlankNode.

Page 43: Data Intensive Query Processing for Semantic Web Data Using Hadoop and MapReduce

Access Control Levels (Cont.)

•Subject Model Level Access: This permits an agent to read all necessary predicate files to obtain all objects of a given subject. The objects that are URIs are then treated as subjects in turn, to extract their respective predicates and objects. This iterative process continues until all objects are blank nodes or literals. Agents may thus generate models on a given subject.
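A hedged sketch of this iterative expansion over a simplified in-memory triple store (the store, the Triple type, and the sample data are illustrative stand-ins for reading the predicate files):

```java
import java.util.*;

public class SubjectModelAccess {

    /** A triple in the simplified in-memory store used for this sketch. */
    record Triple(String s, String p, String o, boolean objectIsUri) {}

    static List<Triple> buildModel(List<Triple> store, String subject) {
        List<Triple> model = new ArrayList<>();
        Deque<String> frontier = new ArrayDeque<>(List.of(subject));
        Set<String> visited = new HashSet<>();
        while (!frontier.isEmpty()) {
            String s = frontier.pop();
            if (!visited.add(s)) continue;                 // guard against cycles
            for (Triple t : store) {
                if (!t.s().equals(s)) continue;
                model.add(t);
                if (t.objectIsUri()) frontier.push(t.o()); // recurse into URI objects
            }
        }
        return model;
    }

    public static void main(String[] args) {
        List<Triple> store = List.of(
                new Triple("MichaelScott", "worksFor", "SomeCompany", true),
                new Triple("SomeCompany", "locatedIn", "\"Scranton\"", false));
        buildModel(store, "MichaelScott").forEach(System.out::println);
    }
}
```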

Page 44: Data Intensive Query Processing for Semantic Web Data Using Hadoop and MapReduce

Access Token Assignment

•Each agent has an Access Token list (AT-list), which contains zero or more ATs assigned to the agent, along with their issuing timestamps.
•These timestamps are used to resolve conflicts (explained later).
•The set of triples accessible by an agent is the union of the result sets of the ATs in the agent's AT-list.

Page 45: Data Intensive Query Processing for Semantic Web Data Using Hadoop and MapReduce

Conflict
•A conflict arises when the following three conditions occur:
▫an agent possesses two ATs, 1 and 2,
▫the result set of AT 2 is a proper subset of that of AT 1, and
▫the timestamp of AT 1 is earlier than the timestamp of AT 2
•The later, more specific AT supersedes the earlier one, so AT 1 is discarded from the AT-list to resolve the conflict.

Page 46: Data Intensive Query Processing for Semantic Web Data Using Hadoop and MapReduce

Conflict Type

•Subset Conflict: Occurs when AT 2 (issued later) is a conjunction of ATTs that refine AT 1. For example, AT 1 is defined by the ATT <1, Subject, URI, Sam>, and AT 2 is defined by the ATTs <2, Subject, URI, Sam> and <2, Predicate, HasAccounts, _>. If AT 2 is issued to the possessor of AT 1 at a later time, a conflict occurs and AT 1 is discarded from the agent's AT-list.

Page 47: Data Intensive Query Processing for Semantic Web Data Using Hadoop and MapReduce

Conflict Type

•Subtype Conflict: Occurs when the ATTs in AT 2 involve data types that are subtypes of those in AT 1. The data types can be those of subjects, objects, or both.

Page 48: Data Intensive Query Processing for Semantic Web Data Using Hadoop and MapReduce

Conflict Resolution Algorithm
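The algorithm itself appears as a figure in the original slides and is not captured in this transcript. The following is a hedged reconstruction in Java of the rule stated on the previous slides, namely that a later, more specific AT supersedes an earlier, more general one; how result sets are computed is abstracted away, and all names are illustrative:

```java
import java.time.Instant;
import java.util.*;

public class ConflictResolver {

    record AtEntry(int at, Instant issued, Set<String> resultSet) {}

    /** Removes every AT whose result set properly contains that of a later AT. */
    static List<AtEntry> resolve(List<AtEntry> atList) {
        List<AtEntry> kept = new ArrayList<>();
        for (AtEntry a : atList) {
            boolean superseded = false;
            for (AtEntry b : atList) {
                boolean bIsProperSubset = a.resultSet().containsAll(b.resultSet())
                        && a.resultSet().size() > b.resultSet().size();
                if (bIsProperSubset && a.issued().isBefore(b.issued())) {
                    superseded = true;   // b is later and more specific: drop a
                    break;
                }
            }
            if (!superseded) kept.add(a);
        }
        return kept;
    }

    public static void main(String[] args) {
        AtEntry at1 = new AtEntry(1, Instant.parse("2010-01-01T00:00:00Z"),
                Set.of("t1", "t2", "t3"));
        AtEntry at2 = new AtEntry(2, Instant.parse("2010-06-01T00:00:00Z"),
                Set.of("t1", "t2"));
        System.out.println(resolve(List.of(at1, at2)));  // keeps only AT 2
    }
}
```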

Page 49: Data Intensive Query Processing for Semantic Web Data Using Hadoop and MapReduce

Experiment

•Policy enforcement stages
▫Embedded enforcement
▫Enforcement as post-processing

Page 50: Data Intensive Query Processing for Semantic Web Data Using Hadoop and MapReduce

Results

Scenario 1: "takesCourse" – A list of sensitive courses cannot be viewed by a normal user for any student.

Page 51: Data Intensive Query Processing for Semantic Web Data Using Hadoop and MapReduce

Results

Scenario 2: "displayTeachers" – A normal user is allowed to view information about the lecturers only.