Linked and graph data
introduction & lecture 1: what’s the big deal about big data?
George Fletcher
Eindhoven University of Technology, The Netherlands
Short course in the International Doctoral School in ICT, University of Modena and Reggio Emilia, Italy
22 October 2013
introduction
Goals of this course
Participants will
1. acquire an overview of recent research developments in the management of “big” data on the web (i.e., linked (open) and graph data);
2. gain insight into challenges posed for querying in this context and current solutions in indexing and query processing for overcoming these challenges; and,
3. develop a perspective on interesting research directions in web data management.
Syllabus of the course: Tuesday, 22 October
Introduction and overview of linked and graph data, and discussion of related research in the big data field.
lecture 1. what’s the big deal about big data? (10:00-12:00)
I big data ∪ web data
I big data ∩ web data
I our focus: modeling and querying linked and graph data
lecture 2. models for web data (14:00-16:00)
I RDF and RDFS data model
I Updates on RDFS graphs
Syllabus of the course: Wednesday, 23 October
Foundations of query processing over graph and linked data.
lecture 3. querying graph data (11:00-13:00)
I graphs and the macro-structure of the web
I graph database models
I query languages for graphs
lecture 4. querying web data (14:30-16:30)
I SPARQL 1.1
I RDF & SPARQL exercises
Syllabus of the course: Thursday, 24 October
Engineering solutions for efficient query processing over graph and linked data.
lecture 5. querying web data, cont. (11:00-13:00)
I models for linked data
I querying linked data
lecture 6. indexing web data and query processing (14:30-16:30)
I value-based indexing of web data
I structural indexing of web data
Syllabus of the course: Tuesday, 29 October
Advanced topics and concluding discussion; final exam.
lecture 7. uncertain graphs & wrap up (11:00-13:00)
I uncertain graph data & query model
I annotated graphs
I recap of lectures & open discussion
final session. written exam (14:30-16:30)
Syllabus of the course
course contact
I email: g.h.l.“my family name”@tue.nl
I homepage: google me (george fletcher eindhoven)
I course homepage: found on my “teaching” page, http://wwwis.win.tue.nl/~gfletcher/modena/
course materials
I Online links to all background and lecture material will be provided as the course progresses.
I There are no specific prerequisites for the course participants, other than a general Computer Science background. All other necessary details and concepts will be introduced as needed in each lecture.
Who are we?
my background
I Assistant professor in the Web Engineering group at TU Eindhoven
I PhD in Computer Science from Indiana University, Bloomington, USA
I Bachelor’s degrees in Mathematics and Cognitive Science from the Univ. of North Florida, USA
my research
I in general: data management systems
I PhD thesis in query learning for data integration
I current research focus on the theory, engineering, and applications of query languages for graph and web data
and you?
Thank you
My warmest thanks to:
I dr. Federica Mandreoli,
I The International Doctoral School in ICT,
I The University of Modena and Reggio Emilia, and
I all of the participants in this course!
Five minute paper
What are the top two or three things you would like to take away from this course?
lecture 1
what’s the big deal about big data?
Overview of this lecture
I What does “big” data mean?
I What does “web” data mean?
I Refresher on first order logic
Big Data
Our ability to generate and capture data only continues to explode
Indeed, we live in a world that is
I increasingly driven by this generation and consumption of massive, ever-growing data sets (leading to an emerging “second economy”),
I which in turn is driven by both technology and a rapidly increasing “datafication” of all aspects of our public and private lives, and, well, of most everything.
Managing this Big Data is now central to commerce, entertainment, government, education, ...
Big Data: the hype
Funny Dilbert comic strip removed. See http://dilbert.com/strips/comic/2012-07-29/
Big Data: the hype
Of course, the data management community has been intensively studying data for decades
I the prestigious Very Large Data Bases (VLDB) conference had its first edition in 1975
I “big data” < “very large data”?
(But of course, the symbiosis of technological trends and datafication does indeed raise many novel and deep challenges)
Big Data: the hype
Recently, a critical mass of awareness of the ubiquity and the enormous potential value of data has been reached, as articulated in two landmark essays:
I The end of theory (2008)
I The unreasonable effectiveness of data (2009)
This led to an explosion of interest and energy cutting across the scientific and business worlds, e.g.,
I journal of Science, special issue on “Dealing with Data” (2011)
I Harvard Business Review, special issue on “Spotlight on Big Data” (2012)
I Big Data: The Management Revolution
I Data Scientist: The Sexiest Job of the 21st Century
I National Research Council (USA), special report on “Frontiers in massive data analysis” (2013)
Big Data: the reality
Big Data is now experiencing some backlash, tempering and nuancing this enthusiasm
I e.g., On being a data skeptic (2013)
I “Observation always involves theory.” – Edwin Hubble
There is, of course, still much truth in all of the enthusiasm about Big Data
Due to datafication and technology trends, applications now must face data scale, heterogeneity, and distribution as the norm, rather than as exceptions, as in the past
I often captured as the four V’s of “volume”, “variety”, “veracity”, and “velocity”
Furthermore, there is indeed increased potential for extracting significant untapped value from data
I e.g., the successes of Google and Facebook are essentially built on extracting value from data
Big Data: the reality
However, there is increased difficulty in extracting “value” (a fifth V), due to the other four V’s
In data management, embracing and confronting these challenges has led to an explosion of NoSQL systems, as alternatives to the dominant paradigm of structured data (i.e., “SQL” stores)
Big Data: NoSQL systems
Serious relaxations in “traditional” data management assumptions include
I uncertain data
I structural heterogeneity
I semi-structured, denormalized data
I query paradigms for such loosely structured data
I looser data consistency
I semantic heterogeneity
NoSQL systems typically embrace one or more of these
I Note, though, that many of these relaxations predate NoSQL and even “SQL” systems (e.g., CODASYL and the network data model) ...
Big Data: NoSQL systems
Let’s consider data structural heterogeneity, driven by the need for so-called “polyglot persistence”
Much modern data doesn’t cleanly map to a tight relational structure
I missing or multi-valued attributes
I e.g., not all site visitors have exactly one phone number
I nested/hierarchical structure
I ontologies, folksonomies
I JSON, XML
I loose structure
I social networks
I chem-/bio- networks
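As a small illustration (a Python sketch with made-up records): documents in one collection need not share a schema, so missing, multi-valued, and nested attributes are all representable side by side.

```python
# Two "site visitor" documents with heterogeneous structure:
# a multi-valued attribute (phones), a missing one, and nesting.
visitor_a = {"name": "Anna", "phones": ["+39-059-111", "+39-059-222"]}
visitor_b = {"name": "Ben", "address": {"city": "Modena", "country": "IT"}}

collection = [visitor_a, visitor_b]

# No fixed schema: each document carries its own attribute set.
for doc in collection:
    print(sorted(doc.keys()))
```

A relational design would need nullable columns or a separate phone table; here each record simply describes itself.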
Big Data: NoSQL systems
Four broad classes of approaches taken by “NoSQL” systems:
1. key-value databases (essentially scalable hash tables)
I example systems: BerkeleyDB, Tokyo Cabinet
2. document databases (generalization of key-value model to include nested structure, e.g., JSON and XML)
I example systems: MongoDB, CouchDB, Couchbase
3. column-family stores (hybrid of key-value and relational model, where rows can have different schemas)
I example systems: Cassandra, Amazon SimpleDB
4. graph databases
I example systems: Neo4j, IBM DB2 RDF GraphStore
Document and graph stores are firmly rooted in applications arising in web data management ...
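For the first class, a minimal in-memory sketch (Python; all names here are illustrative): the key-value model is essentially a hash table mapping keys to opaque values, with systems such as BerkeleyDB adding persistence, transactions, and replication on top.

```python
# Minimal key-value store sketch: a dict behind a put/get/delete API.
# Values are opaque to the store (here, raw bytes).
class KVStore:
    def __init__(self):
        self._data = {}  # key -> opaque value

    def put(self, key, value):
        self._data[key] = value

    def get(self, key, default=None):
        return self._data.get(key, default)

    def delete(self, key):
        self._data.pop(key, None)

store = KVStore()
store.put("user:42", b'{"name": "Anna"}')
print(store.get("user:42"))
```

Note that the store cannot query *inside* a value; a document database is exactly this model plus visibility into the nested structure of values.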
Web Data
The explosion of the Web was both a precursor to and accelerant for the rise of Big Data.
Historically, web data has been modeled as
I hypertext and SGML
I text and HTML
I XML and JSON
I ...
Primarily focusing on the metaphor of the “document” ...
But early “web” scientists had even broader visions for human knowledge ...
Web Data
[Figure 2, “Le livre et la représentation du monde” (“The book and the representation of the world”): Paul Otlet’s diagram “L’Univers, l’Intelligence, la Science, le Livre”, relating things (reality, the Cosmos), the intelligences that think them, science, books, bibliography, the encyclopedia, and classification.]
The Mundaneum, as conceived by the Belgian author and peace activist Paul Otlet (1868-1944)
Web Data: linked data
This vision has resurfaced recently in the Linked Data initiative, which is essentially a vision for modeling and sharing graph data on the web, using web standards
Linked data principles:
1. Use URIs as names for things
2. Use HTTP URIs so that people can look up those names
3. When someone looks up a URI, provide useful information, using the standards (RDF, SPARQL)
4. Include links to other URIs so that they can discover more things
e.g., BBC, NYTimes, Wikipedia, Uniprot, PubMed, DBLP, ...
Linked open data: public sector data as linked data
I e.g., dati.gov.it, data.gov.uk, data.gouv.fr, data.overheid.nl, data.gov, data.eu, datameti.go.jp, ...
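As a toy illustration of the principles (Python; the example.org URIs and the population figure are made up, and real deployments use RDF and SPARQL rather than plain tuples), linked data can be pictured as a set of (subject, predicate, object) triples, where dereferencing a URI yields its description and links lead to further resources.

```python
# Linked data as a set of (subject, predicate, object) triples.
EX = "http://example.org/"  # hypothetical namespace for this sketch

triples = {
    (EX + "Modena", EX + "locatedIn", EX + "Italy"),
    (EX + "Modena", EX + "population", "184000"),
    (EX + "Italy",  EX + "capital",   EX + "Rome"),
}

def describe(uri):
    """All triples with the given URI as subject: the 'useful
    information' a lookup should return (principle 3)."""
    return {t for t in triples if t[0] == uri}

# Following links (principle 4): from Modena we discover Italy.
for s, p, o in describe(EX + "Modena"):
    if o.startswith("http://"):
        print("linked resource:", o, "->", describe(o))
```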
Our focus
In this course we study the intersection of big data and web data, namely linked data and graph data
Our focus will be on linked and graph data management (and not information retrieval, although the distinction is increasingly fuzzy)
I i.e., scalably implementing variations of first order logic and their application
Specifically, we will investigate the challenges of modeling and querying linked and graph data
(Equally) important issues not touched upon in this course include: data integration, security, privacy, user interfaces, data integrity and consistency, ...
a quick refresher on first order logic
First order logic: ∃, ∀, ¬, ∧, ∨, R(x), x = y, x = a
Most of the query languages we will consider are fragments or extensions of first order logic (FO)
that is, the “queries” (i.e., computable mappings from database instances to database instances) expressible in these languages are equivalently expressible as formulas in FO
we next review some basics of FO
FO review: data
Fix some universe U of atomic values.
I A relation schema consists of a name R and a finite set of attribute names attributes(R) = {A1, ..., Ak}. The arity of R is arity(R) = k.
I A fact over relation schema R of arity k is a term of the form R(a1, ..., ak), where a1, ..., ak ∈ U.
I alternatively, a tuple over R is a function from attributes(R) to U.
I An instance of relation schema R is a finite set of facts/tuples over R.
I A database schema (aka “signature”) is a finite set D of relation schemas.
I An instance of database schema D is a set of relation instances, one for each R ∈ D.
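These definitions transcribe almost directly into code; a Python sketch with a made-up author instance (the names and dates are invented sample data):

```python
# A relation schema: a name plus an attribute list; its arity is
# the number of attributes.
author_attrs = ("authorID", "name", "birthdate", "language")
arity = len(author_attrs)  # arity(author) = 4

# An instance: a finite set of facts author(a1, ..., a4), ai in U.
author = {
    (1, "Eco",     1932, "Italian"),
    (2, "Calvino", 1923, "Italian"),
}

# Every fact must match the arity of its schema.
assert all(len(fact) == arity for fact in author)

# The alternative "tuple" view: a function from attribute names to values.
as_function = [dict(zip(author_attrs, fact)) for fact in author]
```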
FO review: syntax
Fix some database schema D.
SQL
SELECT A1,...,Ak
FROM S1,...,Sm
WHERE Cond
where S1, ..., Sm ∈ D, Cond is a well-formed selection condition over A, and A1, ..., Ak ∈ A, for A = ⋃S∈{S1,...,Sm} attributes(S).
Relational Algebra (RA)
πA1,...,Ak(σCond(S1 × · · · × Sm))
where S1, ..., Sm ∈ D, Cond is a well-formed selection condition over A, and A1, ..., Ak ∈ A, for A = ⋃S∈{S1,...,Sm} attributes(S).
FO review: syntax
Datalog
result(A1, ..., Ak) ← S1(Ā1), ..., Sm(Ām), C1, ..., Cj
where S1, ..., Sm ∈ D, each Āi is a list of variables of length arity(Si) (for 1 ≤ i ≤ m), each Ci is a well-formed selection condition over A (for 1 ≤ i ≤ j), and A1, ..., Ak ∈ A, for the set A of all variables appearing in Ā1, ..., Ām.
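To make the rule form concrete, the following sketch (Python; the relations author and wrote and their contents are made up for illustration, not part of the course schema) evaluates one conjunctive rule by nested loops over the body atoms:

```python
# Evaluate a single conjunctive Datalog rule by nested loops:
#   result(N) <- author(A, N), wrote(A, B)
# The shared variable A forces a join between the two body atoms.
author = {(1, "Eco"), (2, "Calvino")}
wrote = {(1, "b10"), (1, "b11")}

result = set()
for (a1, n) in author:
    for (a2, b) in wrote:
        if a1 == a2:          # unify the shared variable A
            result.add((n,))  # project onto the head variables

print(result)  # {("Eco",)}
```

Real engines replace the nested loops with join algorithms, but the semantics is exactly this: all valuations of the body variables that satisfy every atom, projected onto the head.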
FO review: syntax
author(authorID, name, birthdate, language)
book(bookID, title, authorID, publisher, language, year)
store(storeID, address, phone)
sells(storeID, bookID)
List the titles of all books written by Italian speakers.
FO review: syntax
author(authorID, name, birthdate, language)
book(bookID, title, authorID, publisher, language, year)
store(storeID, address, phone)
sells(storeID, bookID)
In SQL
SELECT book.title
FROM author, book
WHERE book.authorID = author.authorID
AND
author.language = ‘Italian’
FO review: syntax
author(authorID, name, birthdate, language)
book(bookID, title, authorID, publisher, language, year)
store(storeID, address, phone)
sells(storeID, bookID)
In RA
πbook.title(σC(author × book))
where C is
book.authorID = author.authorID ∧ author.language = ‘Italian’
FO review: syntax
author(authorID, name, birthdate, language)
book(bookID, title, authorID, publisher, language, year)
store(storeID, address, phone)
sells(storeID, bookID)
In Datalog
result(T) ← author(A, N, D, LA), book(B, T, A, P, LB, Y), LA = ‘Italian’
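All three formulations compute the same join; as a sketch, the SQL version can be executed with Python’s sqlite3 over a tiny made-up instance (the titles, publishers, and IDs below are invented for illustration):

```python
import sqlite3

# Tiny made-up instance of the author/book schema.
con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE author(authorID, name, birthdate, language);
    CREATE TABLE book(bookID, title, authorID, publisher, language, year);
    INSERT INTO author VALUES (1, 'Eco', 1932, 'Italian'),
                              (2, 'Joyce', 1882, 'English');
    INSERT INTO book VALUES
        (10, 'Il nome della rosa', 1, 'Bompiani', 'Italian', 1980),
        (20, 'Ulysses', 2, 'Shakespeare and Co.', 'English', 1922);
""")

# The query from the slide: titles of books by Italian speakers.
rows = con.execute("""
    SELECT book.title
    FROM author, book
    WHERE book.authorID = author.authorID
      AND author.language = 'Italian'
""").fetchall()

print(rows)  # [('Il nome della rosa',)]
```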
FO review: syntax
author(authorID, name, birthdate, language)
book(bookID, title, authorID, publisher, language, year)
store(storeID, address, phone)
sells(storeID, bookID)
List the names of all authors writing in Italian or born in1985.
FO review: syntax
author(authorID, name, birthdate, language)
book(bookID, title, authorID, publisher, language, year)
store(storeID, address, phone)
sells(storeID, bookID)
SELECT A.Name
FROM Author A
WHERE A.Language = ‘Italian’
UNION
SELECT A.Name
FROM Author A
WHERE A.Birthdate = 1985
πname(σlanguage=‘I...’(author)) ∪ πname(σbirthdate=1985(author))
result(N) ← author(A,N,B, L), L = ‘Italian’
result(N) ← author(A,N,B, L),B = 1985
FO review: syntax
author(authorID, name, birthdate, language)
book(bookID, title, authorID, publisher, language, year)
store(storeID, address, phone)
sells(storeID, bookID)
List the names of all authors writing in Italian and bornin 1985.
FO review: syntax
author(authorID, name, birthdate, language)
book(bookID, title, authorID, publisher, language, year)
store(storeID, address, phone)
sells(storeID, bookID)
SELECT A.Name
FROM Author A
WHERE A.Language = ‘Italian’
INTERSECT
SELECT A.Name
FROM Author A
WHERE A.Birthdate = 1985
πname(σlanguage=‘I...’(author)) ∩ πname(σbirthdate=1985(author))
I (N) ← author(A,N,B, L), L = ‘Italian’
B(N) ← author(A,N,B, L),B = 1985
result(N) ← I (N),B(N)
FO review: syntax
author(authorID, name, birthdate, language)
book(bookID, title, authorID, publisher, language, year)
store(storeID, address, phone)
sells(storeID, bookID)
List the names of all authors writing in Italian not born in1985.
FO review: syntax
author(authorID, name, birthdate, language)
book(bookID, title, authorID, publisher, language, year)
store(storeID, address, phone)
sells(storeID, bookID)
SELECT A.Name
FROM Author A
WHERE A.Language = ‘Italian’
EXCEPT
SELECT A.Name
FROM Author A
WHERE A.Birthdate = 1985
πname(σlanguage=‘I...’(author))− πname(σbirthdate=1985(author))
I (N) ← author(A,N,B, L), L = ‘Italian’
B(N) ← author(A,N,B, L),B = 1985
result(N) ← I (N), not B(N)
FO review: syntax
author(authorID, name, birthdate, language)
book(bookID, title, authorID, publisher, language, year)
store(storeID, address, phone)
sells(storeID, bookID)
List the IDs of all books which are sold in every store.
FO review: syntax
author(authorID, name, birthdate, language)
book(bookID, title, authorID, publisher, language, year)
store(storeID, address, phone)
sells(storeID, bookID)
In RA
πbookID(book) − πbookID((πstoreID(store) × πbookID(book)) − sells)
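The double complement in this expression can be traced with plain set operations; a Python sketch over a made-up instance (the store and book IDs are invented):

```python
# RA division via double complement, mirroring
#   pi_bookID(book) - pi_bookID((pi_storeID(store) x pi_bookID(book)) - sells)
store_ids = {"s1", "s2"}
book_ids = {"b1", "b2"}
sells = {("s1", "b1"), ("s2", "b1"), ("s1", "b2")}  # only b1 is in every store

all_pairs = {(s, b) for s in store_ids for b in book_ids}  # cross product
missing = {b for (s, b) in (all_pairs - sells)}            # books absent somewhere
everywhere = book_ids - missing                            # books sold in every store

print(everywhere)  # {'b1'}
```

The intuition: a book is sold in every store exactly when no (store, book) pair for it is missing from sells.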
FO review: syntax
author(authorID, name, birthdate, language)
book(bookID, title, authorID, publisher, language, year)
store(storeID, address, phone)
sells(storeID, bookID)
In SQL
SELECT B.bookID
FROM book B
WHERE NOT EXISTS
      (SELECT S.storeID, B.bookID
       FROM store S
       EXCEPT
       SELECT storeID, bookID
       FROM sells)
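A runnable sketch of this division query, using Python’s sqlite3 over a made-up instance (IDs and titles invented): the NOT EXISTS ... EXCEPT pattern checks that no (store, book) pair for the candidate book is missing from sells.

```python
import sqlite3

# Made-up instance: book 1 is sold in both stores, book 2 in only one.
con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE book(bookID, title);
    CREATE TABLE store(storeID, address);
    CREATE TABLE sells(storeID, bookID);
    INSERT INTO book VALUES (1, 'b1'), (2, 'b2');
    INSERT INTO store VALUES (10, 's1'), (20, 's2');
    INSERT INTO sells VALUES (10, 1), (20, 1), (10, 2);
""")

# Books sold in every store: no required (store, book) pair is missing.
rows = con.execute("""
    SELECT B.bookID
    FROM book B
    WHERE NOT EXISTS
          (SELECT S.storeID, B.bookID
           FROM store S
           EXCEPT
           SELECT storeID, bookID FROM sells)
""").fetchall()

print(rows)
```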
FO review: syntax
author(authorID, name, birthdate, language)
book(bookID, title, authorID, publisher, language, year)
store(storeID, address, phone)
sells(storeID, bookID)
In Datalog
all(S, B) ← book(B, T, A, P, L, Y), store(S, Ad, Ph)
missing(B) ← all(S, B), not sells(S, B)
result(B) ← book(B, T, A, P, L, Y), not missing(B)
FO review: syntax
author(authorID, name, birthdate, language)
book(bookID, title, authorID, publisher, language, year)
store(storeID, address, phone)
sells(storeID, bookID)
in TRC, the Tuple Relational Calculus (i.e., essentially straight FO)
{t | ∃b ∈ book (t.bookID = b.bookID ∧ ∀s ∈ store ∃ℓ ∈ sells (ℓ.storeID = s.storeID ∧ ℓ.bookID = b.bookID))}
FO review: complexity of query evaluation
Fact. SQL (without aggregation and other bells-and-whistles), RA, TRC, and (safe non-recursive) Datalog are all equivalent in expressive power.
Let
I FO denote the full TRC
I i.e., any of the above languages
I Conj denote the TRC using only ∃ and ∧
I i.e., the “conjunctive” queries
I corresponds in SQL to basic SELECT-FROM-WHERE blocks
I corresponds to the {σ, π, ×} fragment of RA
I corresponds to single positive datalog rules
I AConj denote the “acyclic” conjunctive queries
I queries with join trees
I full def in a later lecture
FO review: complexity of query evaluation
Fact. The complexity of query evaluation is as follows

           FO               Conj           AConj
combined   PSPACE-complete  NP-complete    LOGCFL-complete
data       Logspace         Logspace       Linear time

where combined means in the size of the query and the database, and data means in the size of the database (i.e., for some fixed query).
FO review: exercise
author(authorID, name, birthdate, language)
book(bookID, title, authorID, publisher, language, year)
store(storeID, address, phone)
sells(storeID, bookID)
In Datalog, express the following.
1. Retrieve the IDs of all books sold by stores in Modena.
2. Retrieve the IDs of all books written by an Italian speaker.
3. Retrieve the IDs of all books published in a language different from that of the book’s author.
4. Retrieve the IDs of all authors who have books published in every language (known in the database).
References & Credits, 1/2
There is no “required” reading for this lecture. The following are just recommended background reading and credits.
I The end of theory: the data deluge makes the scientific method obsolete. Chris Anderson. Wired Magazine 16(7), 2008. http://www.wired.com/science/discoveries/magazine/16-07/pb_theory
I The second economy. W. Brian Arthur. McKinsey Quarterly, 2011. http://www.mckinsey.com/insights/strategy/the_second_economy
I The rise of big data. Kenneth Cukier and Viktor Mayer-Schoenberger. Foreign Affairs, May/June 2013. http://www.foreignaffairs.com/articles/139104/kenneth-neil-cukier-and-viktor-mayer-schoenberger/the-rise-of-big-data
I The unreasonable effectiveness of data. Alon Halevy, Peter Norvig, and Fernando Pereira. IEEE Intelligent Systems, March/April 2009. http://static.googleusercontent.com/external_content/untrusted_dlcp/research.google.com/en//pubs/archive/35179.pdf
References & Credits, 2/2
I Towards a Big Data reference architecture. Markus Maier. MSc thesis, Eindhoven University of Technology, 2013. http://www.win.tue.nl/~gfletche/Maier_MSc_thesis.pdf
I Frontiers in massive data analysis. National Research Council of the National Academies. Washington, D.C., 2013. http://www.nap.edu/catalog.php?record_id=18374 (register for free PDF)
I On being a data skeptic. Cathy O’Neil. O’Reilly Media, 2013. http://cdn.oreillystatic.com/oreilly/radarreport/0636920032328/On_Being_a_Data_Skeptic.pdf
I NoSQL distilled: a brief guide to the emerging world of polyglot persistence. Pramod J. Sadalage and Martin Fowler. Addison-Wesley, 2013. http://martinfowler.com/nosql.html