CompSci516 Data Intensive Computing Systems · The lecture slides are based on Ch. 1, 3, 5 of this...

64
CompSci 516 Data Intensive Computing Systems Lecture 23 Data Integration Instructor: Sudeepa Roy Duke CS, Fall 2016 CompSci 516: Data Intensive Computing Systems 1

Transcript of CompSci516 Data Intensive Computing Systems · The lecture slides are based on Ch. 1, 3, 5 of this...

Page 1: CompSci516 Data Intensive Computing Systems · The lecture slides are based on Ch. 1, 3, 5 of this book • Data integration PODS 2005 tutorial by PhokionKolaitis (more on the theoretical

CompSci 516DataIntensiveComputingSystems

Lecture23DataIntegration

Instructor:Sudeepa Roy

DukeCS,Fall2016 CompSci516:DataIntensiveComputingSystems

1

Page 2: CompSci516 Data Intensive Computing Systems · The lecture slides are based on Ch. 1, 3, 5 of this book • Data integration PODS 2005 tutorial by PhokionKolaitis (more on the theoretical

Announcements

• Noclassnextweek– thanksgivingrecess!– Wemeetagainon11/30(Wed)

• Finalreportfirstdraftdueon11/28(Mon)night– butcanupdateuntilFriday12/2night– sendmeanemailifyouupdate

• Iwillpostamessageonpiazzalookingforthreegroupswhowillpresenton11/30(Wed)– theremainingsevengroupspresenton12/2(Fri)– 10minutestalk/demoforeachgroup(8minstalk+2minsquestions)

DukeCS,Fall2016 CompSci516:DataIntensiveComputingSystems 2

Page 3: CompSci516 Data Intensive Computing Systems · The lecture slides are based on Ch. 1, 3, 5 of this book • Data integration PODS 2005 tutorial by PhokionKolaitis (more on the theoretical

Today’stopic

• Anoverviewofdataintegration

• Someoptionaladditionalslidesattheend

DukeCS,Fall2016 CompSci516:DataIntensiveComputingSystems 3

Page 4: CompSci516 Data Intensive Computing Systems · The lecture slides are based on Ch. 1, 3, 5 of this book • Data integration PODS 2005 tutorial by PhokionKolaitis (more on the theoretical

ReadingMaterialOptionalReading:

• The“PrinciplesofDataIntegration”bookbyAnHai Doan,Alon Halevy,ZackIvesThelectureslidesarebasedonCh.1,3,5ofthisbook

• DataintegrationPODS2005tutorialbyPhokion Kolaitis(moreonthetheoreticalaspects)

4DukeCS,Fall2016 CompSci516:DataIntensiveComputingSystems

Page 5: CompSci516 Data Intensive Computing Systems · The lecture slides are based on Ch. 1, 3, 5 of this book • Data integration PODS 2005 tutorial by PhokionKolaitis (more on the theoretical

WhatisDataIntegration?1/2

• InternetandWWWhaverevolutionizedpeople’saccesstodigitaldata

• Wetakeitforgrantedthatasearchqueryintoabrowsertapsintomillionsondocumentsanddatabasesandreturnswhatwearelookingfor

• SystemsontheInternetmustefficientlyandaccuratelyprocessandservealargeamountpfdata

DukeCS,Fall2016 CompSci516:DataIntensiveComputingSystems 5

Page 6: CompSci516 Data Intensive Computing Systems · The lecture slides are based on Ch. 1, 3, 5 of this book • Data integration PODS 2005 tutorial by PhokionKolaitis (more on the theoretical

WhatisDataIntegration?2/2

• UnliketraditionalRDBMS,thenewservicesneedtheabilityto– Sharedataamongmultipleorganizations– Integratedataonaflexibleandefficientfashion

• Dataintegration:– Asetoftechniquesthatenablebuildingsystemsgearedforflexiblesharingandintegrationofdataacrossmultipleautonomousdataproviders.

DukeCS,Fall2016 CompSci516:DataIntensiveComputingSystems 6

Page 7: CompSci516 Data Intensive Computing Systems · The lecture slides are based on Ch. 1, 3, 5 of this book • Data integration PODS 2005 tutorial by PhokionKolaitis (more on the theoretical

Whydataintegration?1/2

• Withissueslikenormalizations,andtrade-offsindesignchoices,differentpeopledesigndifferentschemasforthesamedata

• Sometimesdifferentneedsaswell– notallattributesareneededbyallpeople

• Sometimespeoplewanttosharetheirdata– collaborators– researcherswhowanttopublishdataforothers’use

DukeCS,Fall2016 CompSci516:DataIntensiveComputingSystems 7

Page 8: CompSci516 Data Intensive Computing Systems · The lecture slides are based on Ch. 1, 3, 5 of this book • Data integration PODS 2005 tutorial by PhokionKolaitis (more on the theoretical

Whydataintegration?2/2

• IntheWeb,– Manywebsitespostingjobapplications,hotelorflightdeals,movieinformation

– Tokeepupwithnewinformationandfornewneed,youmayhavetolookatallofthem

– Butnowtherearewebsiteswhereyoucanaccessall– e.g.TripAdvisorhelpsyouseethepriceofthesamehotelonthehotelwebsite,hotels.com,booking.com,expedia,….

• Butthistypeofdataintegrationhasitschallengestoo

DukeCS,Fall2016 CompSci516:DataIntensiveComputingSystems 8

Page 9: CompSci516 Data Intensive Computing Systems · The lecture slides are based on Ch. 1, 3, 5 of this book • Data integration PODS 2005 tutorial by PhokionKolaitis (more on the theoretical

Whydataintegration?2/2

• IntheWeb,– Manywebsitespostingjobapplications,hotelorflightdeals,movieinformation

– Tokeepupwithnewinformationandfornewneed,youmayhavetolookatallofthem

– Butnowtherearewebsiteswhereyoucanaccessall– e.g.TripAdvisorhelpsyouseethepriceofthesamehotelonthehotelwebsite,hotels.com,booking.com,expedia,….

• Butthistypeofdataintegrationhasitschallengestoo

DukeCS,Fall2016 CompSci516:DataIntensiveComputingSystems 9

Page 10: CompSci516 Data Intensive Computing Systems · The lecture slides are based on Ch. 1, 3, 5 of this book • Data integration PODS 2005 tutorial by PhokionKolaitis (more on the theoretical

ChallengesinDataIntegration:1.Query

• Offeruniformaccesstoasetofautonomousandheterogeneousdatasources

• Query:– querydisparatedatasources,sometimesupdatethem

DukeCS,Fall2016 CompSci516:DataIntensiveComputingSystems 10

Page 11: CompSci516 Data Intensive Computing Systems · The lecture slides are based on Ch. 1, 3, 5 of this book • Data integration PODS 2005 tutorial by PhokionKolaitis (more on the theoretical

Challenges inDataIntegration:2.Numberofsources

• Numberofsources:– challengingevenfor10or2datasources– amplifiedforhundredsofsourcessayinWeb-scale

DukeCS,Fall2016 CompSci516:DataIntensiveComputingSystems 11

Page 12: CompSci516 Data Intensive Computing Systems · The lecture slides are based on Ch. 1, 3, 5 of this book • Data integration PODS 2005 tutorial by PhokionKolaitis (more on the theoretical

Challenges inDataIntegration:3.Heterogeneity

• Heterogeneity:– datasourcesweredevelopedindependentlyofeachother

– databases,files,html– differentschemaandreferences– somestructuredsomeunstructured

DukeCS,Fall2016 CompSci516:DataIntensiveComputingSystems 12

Page 13: CompSci516 Data Intensive Computing Systems · The lecture slides are based on Ch. 1, 3, 5 of this book • Data integration PODS 2005 tutorial by PhokionKolaitis (more on the theoretical

Challenges inDataIntegration:4.Autonomy

• Autonomy:– thesourcesmaynotbelongtothesameadministrativeentity

– eventhenmayberunbydifferentorganizations– maynothavefullaccesstothedata– theremaybeprivacyconcerns– thesourcesmaychangetheirformatsandaccesspatternsatanytimewithoutnotifying

DukeCS,Fall2016 CompSci516:DataIntensiveComputingSystems 13

Page 14: CompSci516 Data Intensive Computing Systems · The lecture slides are based on Ch. 1, 3, 5 of this book • Data integration PODS 2005 tutorial by PhokionKolaitis (more on the theoretical

VirtualDataIntegrationArchitecture

• Threecomponents– Datasources– Wrappers– MediatedSchema

DukeCS,Fall2016 CompSci516:DataIntensiveComputingSystems 14

MediatedSchema

Wrapper

Wrapper Wrapper

Wrapper

RDBMS1 RDBMS2

Page 15: CompSci516 Data Intensive Computing Systems · The lecture slides are based on Ch. 1, 3, 5 of this book • Data integration PODS 2005 tutorial by PhokionKolaitis (more on the theoretical

VirtualDataIntegrationArchitecture

• Datasources– canbeanydatamodellikerelationaldbmswithSQLinterface

– XML withXqueryinterface

– HTML

DukeCS,Fall2016 CompSci516:DataIntensiveComputingSystems 15

MediatedSchema

Wrapper

Wrapper Wrapper

Wrapper

RDBMS1 RDBMS2

Page 16: CompSci516 Data Intensive Computing Systems · The lecture slides are based on Ch. 1, 3, 5 of this book • Data integration PODS 2005 tutorial by PhokionKolaitis (more on the theoretical

VirtualDataIntegrationArchitecture

• Wrappers– programsthatsend

queriestoadatasource

– receivesanswers– applysomebasic

transformations• e.g.toawebform

source– translatequerytoa

httprequestwithaurl– whentheanswer

comesbackasanhtmlfile,extracttuples

DukeCS,Fall2016 CompSci516:DataIntensiveComputingSystems 16

MediatedSchema

Wrapper

Wrapper Wrapper

Wrapper

RDBMS1 RDBMS2

Page 17: CompSci516 Data Intensive Computing Systems · The lecture slides are based on Ch. 1, 3, 5 of this book • Data integration PODS 2005 tutorial by PhokionKolaitis (more on the theoretical

VirtualDataIntegrationArchitecture

• Mediatedschema– builtforthedata

integrationapplication– containsonlythe

aspectsthatarerelevant

– maynotcontainallattrbutes

– doesnotstoreanydatatypically

– logicalschemaforposingqueriesbytheusers

DukeCS,Fall2016 CompSci516:DataIntensiveComputingSystems 17

MediatedSchema

Wrapper

Wrapper Wrapper

Wrapper

RDBMS1 RDBMS2

Page 18: CompSci516 Data Intensive Computing Systems · The lecture slides are based on Ch. 1, 3, 5 of this book • Data integration PODS 2005 tutorial by PhokionKolaitis (more on the theoretical

SourceDescriptions• Specifythepropertyofthe

sourcesthatthesystemneedstoknow

• maincomponentsaresemanticmappings– relatetheschemaofthe

sourcestotheattributesinthemediatedschema

• specifieddeclaratively• betweendatasourcesand

mediatedschema– notbetweentwosources

• alsospecifies– whethersourcesare

completeornot– limitedaccesspatternsto

sources

DukeCS,Fall2016 CompSci516:DataIntensiveComputingSystems 18

MediatedSchema

Wrapper

Wrapper Wrapper

Wrapper

RDBMS1 RDBMS2

Page 19: CompSci516 Data Intensive Computing Systems · The lecture slides are based on Ch. 1, 3, 5 of this book • Data integration PODS 2005 tutorial by PhokionKolaitis (more on the theoretical

Example

DukeCS,Fall2016 CompSci516:DataIntensiveComputingSystems 19

Movies

name,actors,director,genre

Cinemas

place,movie,start

CinemasinNYC

cinema,title,startTime

CinemasinSF

location,movie,startingTime

Reviews

title,date,grade,review

Movie:Title,director,year,genreActors:title,actorPlays:movie,location,startTimeReviews:title,rating,description

S1 S2 S3 S4 S5

• sourceS2maynotcontainallthemovieshowingtimesintheentirecountry

• sourceS3maybeknowntocontainallmovieshowingtimesinNewYork

• inordertogetananswerfromthesourceS1,thereneedstobeaninputforatleastoneofitsattributes

Page 20: CompSci516 Data Intensive Computing Systems · The lecture slides are based on Ch. 1, 3, 5 of this book • Data integration PODS 2005 tutorial by PhokionKolaitis (more on the theoretical

Example:QueryontheMediatedSchema

DukeCS,Fall2016 CompSci516:DataIntensiveComputingSystems 20

Movies

name,actors,director,genre

Cinemas

place,movie,start

CinemasinNYC

cinema,title,startTime

CinemasinSF

location,movie,startingTime

Reviews

title,date,grade,review

Movie:Title,director,year,genreActors:title,actorPlays:movie,location,startTimeReviews:title,rating,description

S1 S2 S3 S4 S5

• showtimesofmoviesinNYCdirectedbyWoodyAllen

SELECT title, startTimeFROM Movie, PlaysWHERE Movie.title = Plays.movie ANDlocation="New York" ANDdirector="Woody Allen"

Page 21: CompSci516 Data Intensive Computing Systems · The lecture slides are based on Ch. 1, 3, 5 of this book • Data integration PODS 2005 tutorial by PhokionKolaitis (more on the theoretical

Example:Reformulationonsourcedatabases:1/5

DukeCS,Fall2016 CompSci516:DataIntensiveComputingSystems 21

Movies

name,actors,director,genre

Cinemas

place,movie,start

CinemasinNYC

cinema,title,startTime

CinemasinSF

location,movie,startingTime

Reviews

title,date,grade,review

Movie:Title,director,year,genreActors:title,actorPlays:movie,location,startTimeReviews:title,rating,description

S1 S2 S3 S4 S5

SELECT title, startTimeFROM Movie, PlaysWHERE Movie.title = Plays.movie ANDlocation="New York" ANDdirector="Woody Allen"

• TuplesforMoviecanbeobtainedfromsourceS1– buttheattributetitleneedstobereformulated

toname

• S2:maynotcontainallthemovieshowingtimesintheentirecountry

• S3:knowntocontainallmovietimesinNYC

• S1: togetananswerthereneedstobeaninputforatleastoneofitsattributes

Page 22: CompSci516 Data Intensive Computing Systems · The lecture slides are based on Ch. 1, 3, 5 of this book • Data integration PODS 2005 tutorial by PhokionKolaitis (more on the theoretical

Example:Reformulationonsourcedatabases:2/5

DukeCS,Fall2016 CompSci516:DataIntensiveComputingSystems 22

Movies

name,actors,director,genre

Cinemas

place,movie,start

CinemasinNYC

cinema,title,startTime

CinemasinSF

location,movie,startingTime

Reviews

title,date,grade,review

Movie:Title,director,year,genreActors:title,actorPlays:movie,location,startTimeReviews:title,rating,description

S1 S2 S3 S4 S5

SELECT title, startTimeFROM Movie, PlaysWHERE Movie.title = Plays.movie ANDlocation="New York" ANDdirector="Woody Allen"

• TuplesforPlayscanbeobtainedfromeithersourceS2orS3– Sincethelatteriscompleteforshowings

inNYC,wechooseitoverS2

• S2:maynotcontainallthemovieshowingtimesintheentirecountry

• S3:knowntocontainallmovietimesinNYC

• S1: togetananswerthereneedstobeaninputforatleastoneofitsattributes

Page 23: CompSci516 Data Intensive Computing Systems · The lecture slides are based on Ch. 1, 3, 5 of this book • Data integration PODS 2005 tutorial by PhokionKolaitis (more on the theoretical

Example:Reformulationonsourcedatabases:3/5

DukeCS,Fall2016 CompSci516:DataIntensiveComputingSystems 23

Movies

name,actors,director,genre

Cinemas

place,movie,start

CinemasinNYC

cinema,title,startTime

CinemasinSF

location,movie,startingTime

Reviews

title,date,grade,review

Movie:Title,director,year,genreActors:title,actorPlays:movie,location,startTimeReviews:title,rating,description

S1 S2 S3 S4 S5

SELECT title, startTimeFROM Movie, PlaysWHERE Movie.title = Plays.movie ANDlocation="New York" ANDdirector="Woody Allen"

• SourceS3requiresthetitleofamovieasinput– butsuchatitleisnotspecifiedinthequery– thequeryplanmustfirstaccesssourceS1– thenfeedthemovietitlesreturnedfromS1as

inputstoS3

• S2:maynotcontainallthemovieshowingtimesintheentirecountry

• S3:knowntocontainallmovietimesinNYC

• S1: togetananswerthereneedstobeaninputforatleastoneofitsattributes

Page 24: CompSci516 Data Intensive Computing Systems · The lecture slides are based on Ch. 1, 3, 5 of this book • Data integration PODS 2005 tutorial by PhokionKolaitis (more on the theoretical

Example:Reformulationonsourcedatabases:4/5

DukeCS,Fall2016 CompSci516:DataIntensiveComputingSystems 24

Movies

name,actors,director,genre

Cinemas

place,movie,start

CinemasinNYC

cinema,title,startTime

CinemasinSF

location,movie,startingTime

Reviews

title,date,grade,review

Movie:Title,director,year,genreActors:title,actorPlays:movie,location,startTimeReviews:title,rating,description

S1 S2 S3 S4 S5

SELECT title, startTimeFROM Movie, PlaysWHERE Movie.title = Plays.movie ANDlocation="New York" ANDdirector="Woody Allen"

• Optionsoflogicalqueryplan:– accessS1,S3– couldaccessS1thenS2aswell(possiblynotcomplete)

• Thenqueryoptimization– asintraditionaldatabasesystem– takealogicalplanoutputaphysicalplan

• S2:maynotcontainallthemovieshowingtimesintheentirecountry

• S3:knowntocontainallmovietimesinNYC

• S1: togetananswerthereneedstobeaninputforatleastoneofitsattributes

Page 25: CompSci516 Data Intensive Computing Systems · The lecture slides are based on Ch. 1, 3, 5 of this book • Data integration PODS 2005 tutorial by PhokionKolaitis (more on the theoretical

Example:Reformulationonsourcedatabases:5/5

DukeCS,Fall2016 CompSci516:DataIntensiveComputingSystems 25

Movies

name,actors,director,genre

Cinemas

place,movie,start

CinemasinNYC

cinema,title,startTime

CinemasinSF

location,movie,startingTime

Reviews

title,date,grade,review

Movie:Title,director,year,genreActors:title,actorPlays:movie,location,startTimeReviews:title,rating,description

S1 S2 S3 S4 S5

SELECT title, startTimeFROM Movie, PlaysWHERE Movie.title = Plays.movie ANDlocation="New York" ANDdirector="Woody Allen"

• Thenqueryexecution– executethephysicalqueryplan– Mayasktheoptimizertoreconsidertheplan(unlike

RDBMS),e.g.ifS3istooslow

• sometimescontingenciesareincludedinoriginalplan– tradeoffbetweencomplexityofplanandabilityto

respondtounexpectedevents

• S2:maynotcontainallthemovieshowingtimesintheentirecountry

• S3:knowntocontainallmovietimesinNYC

• S1: togetananswerthereneedstobeaninputforatleastoneofitsattributes

Page 26: CompSci516 Data Intensive Computing Systems · The lecture slides are based on Ch. 1, 3, 5 of this book • Data integration PODS 2005 tutorial by PhokionKolaitis (more on the theoretical

SchemaMappingshouldhandlethediscrepanciesbetweensourceandthemediatedschema:1/4

• Relationandattributenames– “description”inthemediatedschema(MS)thesameas“review”inS5

– “name”ofActorsinMSandisS3inNYCCinemas arenotthesame

DukeCS,Fall2016 CompSci516:DataIntensiveComputingSystems 26

S1:Actor(AID,firstName,lastName,nationality,yearof Birth)Movie(MID,title),AcrtorPlays(AID,MID)MovieDetail (MID,directorm genre,year)

S2:Cinemas(place,movie,start)S3:NYCCinemas(name,title,startTime)S4:Reviews(title,date,grade,review)

S5:MovieGenres(title,genre)

Movie:Title,director,year,genreActor:title,namePlays:movie,location,startTimeReviews:title,rating,description

Mediatedschema

S7:MovieYears(title,year)

S6:MovieDirectors(title,dir)

Page 27: CompSci516 Data Intensive Computing Systems · The lecture slides are based on Ch. 1, 3, 5 of this book • Data integration PODS 2005 tutorial by PhokionKolaitis (more on the theoretical

SchemaMappingshouldhandlethediscrepanciesbetweensourceandthemediatedschema:2/4

• Tabularorganization– InMS,Actorstoresmovietitleandactorname– InS1,ajoinisneededthathastobespecifiedbythemapping

DukeCS,Fall2016 CompSci516:DataIntensiveComputingSystems 27

S1:Actor(AID,firstName,lastName,nationality,yearof Birth)Movie(MID,title),AcrtorPlays(AID,MID)MovieDetail (MID,directorm genre,year)

S2:Cinemas(place,movie,start)S3:NYCCinemas(name,title,startTime)S4:Reviews(title,date,grade,review)

S5:MovieGenres(title,genre)

Movie:Title,director,year,genreActor:title,namePlays:movie,location,startTimeReviews:title,rating,description

Mediatedschema

S7:MovieYears(title,year)

S6:MovieDirectors(title,dir)

Page 28: CompSci516 Data Intensive Computing Systems · The lecture slides are based on Ch. 1, 3, 5 of this book • Data integration PODS 2005 tutorial by PhokionKolaitis (more on the theoretical

SchemaMappingshouldhandlethediscrepanciesbetweensourceandthemediatedschema:3/4

• Domaincoverage– thecoverageandlevelofdetailmaydiffer– S1storesmoreinfoaboutactorsthaninMS

DukeCS,Fall2016 CompSci516:DataIntensiveComputingSystems 28

S1:Actor(AID,firstName,lastName,nationality,yearof Birth)Movie(MID,title),AcrtorPlays(AID,MID)MovieDetail (MID,directorm genre,year)

S2:Cinemas(place,movie,start)S3:NYCCinemas(name,title,startTime)S4:Reviews(title,date,grade,review)

S5:MovieGenres(title,genre)

Movie:Title,director,year,genreActor:title,namePlays:movie,location,startTimeReviews:title,rating,description

Mediatedschema

S7:MovieYears(title,year)

S6:MovieDirectors(title,dir)

Page 29: CompSci516 Data Intensive Computing Systems · The lecture slides are based on Ch. 1, 3, 5 of this book • Data integration PODS 2005 tutorial by PhokionKolaitis (more on the theoretical

SchemaMappingshouldhandlethediscrepanciesbetweensourceandthemediatedschema:3/4

• Datalevelvariations– GPAasalettergradevs.anumericscoreof4.0scale– S1storesactornamesintwocolumns,MSstoresinone

DukeCS,Fall2016 CompSci516:DataIntensiveComputingSystems 29

S1:Actor(AID,firstName,lastName,nationality,yearof Birth)Movie(MID,title),AcrtorPlays(AID,MID)MovieDetail (MID,directorm genre,year)

S2:Cinemas(place,movie,start)S3:NYCCinemas(name,title,startTime)S4:Reviews(title,date,grade,review)

S5:MovieGenres(title,genre)

Movie:Title,director,year,genreActor:title,namePlays:movie,location,startTimeReviews:title,rating,description

Mediatedschema

S7:MovieYears(title,year)

S6:MovieDirectors(title,dir)

Page 30: CompSci516 Data Intensive Computing Systems · The lecture slides are based on Ch. 1, 3, 5 of this book • Data integration PODS 2005 tutorial by PhokionKolaitis (more on the theoretical

ThreeDesiredPropertiesofSchemaMappingLanguages1/3

• Flexibility– significantdifferencesbetweendisparateschemas– thelanguagesshouldbeveryflexible– shouldbeabletoexpressawidevarietyofrelationshipsbetweenschemas

DukeCS,Fall2016 CompSci516:DataIntensiveComputingSystems 30

Page 31: CompSci516 Data Intensive Computing Systems · The lecture slides are based on Ch. 1, 3, 5 of this book • Data integration PODS 2005 tutorial by PhokionKolaitis (more on the theoretical

ThreeDesiredPropertiesofSchemaMappingLanguages2/3

• Efficientreformulation– ourgoalistousetheschemamappingtoreformulatequeries

– weshouldbeabletodevelopreformulationalgorithmswhosepropertiesarewellunderstoodandareefficientinpractice

– oftencompeteswithflexibility,becausemoreexpressivelanguagesaretypicallyhardertoreasonabout

DukeCS,Fall2016 CompSci516:DataIntensiveComputingSystems 31

Page 32: CompSci516 Data Intensive Computing Systems · The lecture slides are based on Ch. 1, 3, 5 of this book • Data integration PODS 2005 tutorial by PhokionKolaitis (more on the theoretical

ThreeDesiredPropertiesofSchemaMappingLanguages3/3

• Easyupdate– foraformalismtobeusefulinpractice,itneedstobeeasytoaddandremovesources

– Ifaddinganewdatasourcepotentiallyrequiresinspectingallothersources,theresultingsystemwillbehardtomanageforalargenumberofsources

DukeCS,Fall2016 CompSci516:DataIntensiveComputingSystems 32

Page 33: CompSci516 Data Intensive Computing Systems · The lecture slides are based on Ch. 1, 3, 5 of this book • Data integration PODS 2005 tutorial by PhokionKolaitis (more on the theoretical

Threestandardschemamappinglanguages

1. Global-as-View(GAV)2. Local-as-View(LAV)3. Global-Local-as-View(GLAV)

DukeCS,Fall2016 CompSci516:DataIntensiveComputingSystems 33

Page 34: CompSci516 Data Intensive Computing Systems · The lecture slides are based on Ch. 1, 3, 5 of this book • Data integration PODS 2005 tutorial by PhokionKolaitis (more on the theoretical

Global-as-View(GAV)

• GAVdefinesthemediatedschema(MS)asasetofviewsoverthedatasources– MediatedSchema=Globalschema– Anintuitiveapproach

• Mediatedschema(MS)G– Gi =somerelationinG– Xi denotesattributesinGi

• SourceschemaS1,S2,…,Sn

DukeCS,Fall2016 CompSci516:DataIntensiveComputingSystems 34

Page 35: CompSci516 Data Intensive Computing Systems · The lecture slides are based on Ch. 1, 3, 5 of this book • Data integration PODS 2005 tutorial by PhokionKolaitis (more on the theoretical

GAVDefinition• AGAVschemamappingMisasetofexpressionsoftheform:

• Gi(Xi)⊇ Q(S1,S2,…,Sn)– openworldassumption– instancescomputedforMSareassumedtobeincomplete

• or,Gi(Xi)=Q(S1,S2,…,Sn)– closedworldassumption– instancescomputedforMSareassumedtobecomplete

DukeCS,Fall2016 CompSci516:DataIntensiveComputingSystems 35

Page 36: CompSci516 Data Intensive Computing Systems · The lecture slides are based on Ch. 1, 3, 5 of this book • Data integration PODS 2005 tutorial by PhokionKolaitis (more on the theoretical

GAVExample• Movie(title,director,year,genre)⊇ S1.Movie(MID,title),

S1.MovieDetail(MID,director,genre,year)• Movie(title,director,year,genre)⊇ S5.MovieGenres(title,genre),

S6.MovieDirectors(title,director),S7.MovieYears(title,year)

• Plays(movie,location,startTime)⊇ S2.Cinemas(location,movie,startTime)• Plays(movie,location,startTime)⊇ S3.NYCCinemas(location,movie,startTime)

DukeCS,Fall2016 CompSci516:DataIntensiveComputingSystems 36

S1:Actor(AID,firstName,lastName,nationality,yearof Birth)Movie(MID,title),AcrtorPlays(AID,MID)MovieDetail (MID,directorm genre,year)

S2:Cinemas(place,movie,start)S3:NYCCinemas(name,title,startTime)S4:Reviews(title,date,grade,review)

S5:MovieGenres(title,genre)

Movie:Title,director,year,genreActor:title,namePlays:movie,location,startTimeReviews:title,rating,description

Mediatedschema

S7:MovieYears(title,year)

S6:MovieDirectors(title,dir)

Page 37: CompSci516 Data Intensive Computing Systems · The lecture slides are based on Ch. 1, 3, 5 of this book • Data integration PODS 2005 tutorial by PhokionKolaitis (more on the theoretical

Discussions:GAV1/2

• SupposewehaveadatasourceS8thatstoredpairsof(actor,director)whoworkedtogetheronmovies.

• TheonlywaytomodelthissourceinGAViswiththefollowingtwodescriptionsthatuseNULL:– Actors(NULL,actor)⊇ S8(actor,director)– Movie(NULL,director,NULL,NULL)⊇ S8(actor,director)

• ThesedescriptionscreatetuplesinthemediatedschemathatincludeNULLsinallcolumnsexceptone

DukeCS,Fall2016 CompSci516:DataIntensiveComputingSystems 37

Movie:Title,director,year,genreActor:title,namePlays:movie,location,startTimeReviews:title,rating,description

S8:ActTogether(actor,director)

Page 38: CompSci516 Data Intensive Computing Systems · The lecture slides are based on Ch. 1, 3, 5 of this book • Data integration PODS 2005 tutorial by PhokionKolaitis (more on the theoretical

Discussions:GAV2/2

• IfthesourceS8includesthetuples(Keaton,Allen)and(Pacino,Coppolag),thenthetuplescomputedforthemediatedschemawouldbe:– Actors(NULL,Keaton),Actors(NULL,Pacino)– Movie(NULL,Allen,NULL,NULL),Movie(NULL,Coppola,NULL,NULL)

• NowsupposewehavethefollowingquerythatrecreatesS8:– Q(actor,director):- Actors(title,actor),Movie(title,director,genre,

year)• WewouldnotbeabletoretrievethetuplesfromS8becausethe

sourcedescriptionslosttherelationshipbetweenactoranddirector.

DukeCS,Fall2016 CompSci516:DataIntensiveComputingSystems 38

Movie:Title,director,year,genreActor:title,namePlays:movie,location,startTimeReviews:title,rating,description

S8:ActTogether(actor,director)

Page 39: CompSci516 Data Intensive Computing Systems · The lecture slides are based on Ch. 1, 3, 5 of this book • Data integration PODS 2005 tutorial by PhokionKolaitis (more on the theoretical

LocalAsView(LAV)

• describeseachdatasourceaspreciselyaspossibleandindependentlyofanyothersources– oppositeapproachtoGAV

• Mediatedschema(MS)G• SourceschemaS1,S2,…,Sn

– Xi denotesattributesinSi

DukeCS,Fall2016 CompSci516:DataIntensiveComputingSystems 39

Page 40: CompSci516 Data Intensive Computing Systems · The lecture slides are based on Ch. 1, 3, 5 of this book • Data integration PODS 2005 tutorial by PhokionKolaitis (more on the theoretical

LAVDefinition• ALAVschemamappingMisasetofexpressionsoftheform:

• Si(Xi)⊆ Qi(G)– openworldassumption

• or,Si(Xi)=Qi(G)– closedworldassumption– butcompletenessaboutdatasources,notabouttheMS

DukeCS,Fall2016 CompSci516:DataIntensiveComputingSystems 40

Page 41: CompSci516 Data Intensive Computing Systems · The lecture slides are based on Ch. 1, 3, 5 of this book • Data integration PODS 2005 tutorial by PhokionKolaitis (more on the theoretical

LAVExample• S5.MovieGenres(title,genre)⊆Movie(title,director,year,genre)• S6.MovieDirectors(title,director)⊆Movie(title,director,year,genre)• S7.MovieYears(title,year)⊆Movie(title,director,year,genre)• S8(actor,dir)⊆Movie(title,director,year,genre),Actors(title,actor)• Canalsospecifyconstraintsonthecontents• S9(title,year,“comedy”)⊆Movie(title,director,year,“comedy”),year≥1970

DukeCS,Fall2016 CompSci516:DataIntensiveComputingSystems 41

S1:Actor(AID,firstName,lastName,nationality,yearof Birth)Movie(MID,title),AcrtorPlays(AID,MID)MovieDetail (MID,directorm genre,year)

S2:Cinemas(place,movie,start)S3:NYCCinemas(name,title,startTime)S4:Reviews(title,date,grade,review)

S5:MovieGenres(title,genre)

Movie:Title,director,year,genreActor:title,namePlays:movie,location,startTimeReviews:title,rating,description

Mediatedschema

S7:MovieYears(title,year)

S6:MovieDirectors(title,dir)

Page 42: CompSci516 Data Intensive Computing Systems · The lecture slides are based on Ch. 1, 3, 5 of this book • Data integration PODS 2005 tutorial by PhokionKolaitis (more on the theoretical

Global-Local-As-View(GLAV)

• GAVandLAVcanbecombinedintoGLAV• Hastheexpressivepowerofboth• Theexpressionsintheschemamappinginclude

– aqueryoverthedatasourcesonthelefthandside– aqueryonthemediatedschemaontheright-handside

• Mediatedschema(MS)G• SourceschemaS1,S2,…,Sn

DukeCS,Fall2016 CompSci516:DataIntensiveComputingSystems 42

Page 43: CompSci516 Data Intensive Computing Systems · The lecture slides are based on Ch. 1, 3, 5 of this book • Data integration PODS 2005 tutorial by PhokionKolaitis (more on the theoretical

GLAVDefinition• AGLAVschemamappingMisasetofexpressionsoftheform:

• QS(X)⊆ QG(X)– openworldassumption

• or,QS(X)=QG(X)– closedworldassumption

• QG isaqueryoverGwhoseheadvariablesareX• QS isaqueryoverdatasourcesS1,S2,…,Sn wherethehead

variablesarealsoS

DukeCS,Fall2016 CompSci516:DataIntensiveComputingSystems 43

Page 44: CompSci516 Data Intensive Computing Systems · The lecture slides are based on Ch. 1, 3, 5 of this book • Data integration PODS 2005 tutorial by PhokionKolaitis (more on the theoretical

GLAVExample• SupposeS1isknowntohavecomediesproducedafter1970only

• S1.Movie(MID,title),S1.MovieDetail(MID,director,genre,year)⊆Movie(title,director,“comedy",year),year≥ 1970

DukeCS,Fall2016 CompSci516:DataIntensiveComputingSystems 44

S1:Actor(AID,firstName,lastName,nationality,yearof Birth)Movie(MID,title),AcrtorPlays(AID,MID)MovieDetail (MID,directorm genre,year)

S2:Cinemas(place,movie,start)S3:NYCCinemas(name,title,startTime)S4:Reviews(title,date,grade,review)

S5:MovieGenres(title,genre)

Movie:Title,director,year,genreActor:title,namePlays:movie,location,startTimeReviews:title,rating,description

Mediatedschema

S7:MovieYears(title,year)

S6:MovieDirectors(title,dir)

Page 45: CompSci516 Data Intensive Computing Systems · The lecture slides are based on Ch. 1, 3, 5 of this book • Data integration PODS 2005 tutorial by PhokionKolaitis (more on the theoretical

Optional/AdditionalSlides

DukeCS,Fall2016 CompSci516:DataIntensiveComputingSystems 45

Page 46: CompSci516 Data Intensive Computing Systems · The lecture slides are based on Ch. 1, 3, 5 of this book • Data integration PODS 2005 tutorial by PhokionKolaitis (more on the theoretical

SchemaMatchingsandMappings

• Specify“matches”,e.g.– attribute“name”inonesourcecorrespondstoattribute“title”inanother

– “location”isaconcatenationof“city,state,zipcode”

• Elaboratematchesintosemantic“mappings”– usingquerieslikeSQL

DukeCS,Fall2016 CompSci516:DataIntensiveComputingSystems 46

optionalslide

Page 47: CompSci516 Data Intensive Computing Systems · The lecture slides are based on Ch. 1, 3, 5 of this book • Data integration PODS 2005 tutorial by PhokionKolaitis (more on the theoretical

Challenges

• Thetasksofcreatingthematchesandmappingsareoftendifficult– they requireadeepunderstandingofthesemanticsoftheschemasofthedatasourcesandofthe mediatedschema

– Thisknowledgeistypicallydistributedamongmultiplepeople

– thesepeoplearenotnecessarilydatabaseexperts andmayneedhelp

• Thereisnoalgorithmthatwilltaketwoarbitrarydatabaseschemasandflawlesslyproducecorrectmatchesandmappings– goalistocreatetoolsthateducethetimebygivingsuggestionstothedesigner

DukeCS,Fall2016 CompSci516:DataIntensiveComputingSystems 47

optionalslide

Page 48: CompSci516 Data Intensive Computing Systems · The lecture slides are based on Ch. 1, 3, 5 of this book • Data integration PODS 2005 tutorial by PhokionKolaitis (more on the theoretical

Twodatabaseschemas

• Attributesandtablesinaschemaarecalleditselements• Theaggregatorisnotinterestedinallthedetailsoftheproduct,but

onlyintheattributesthatareshowntoitscustomers• SchemaDVD-VENDORhas14elements

– 11attributes(e.g.,id,title,andyear)andthreetables(e.g.,Movies)• SchemaAGGREGATORhasfiveelements

– fourattributesandonetable.

DukeCS,Fall2016 CompSci516:DataIntensiveComputingSystems 48

DVD-VENDOR

Movies(id,title,year)Products(mid,releaseDate,releaseCompany,basePrice,rating,saleLocID)Locations(lid,name,taxRate)

AGGREGATOR

Items(name,releaseInfo,classification,price)

optionalslide

Page 49: CompSci516 Data Intensive Computing Systems · The lecture slides are based on Ch. 1, 3, 5 of this book • Data integration PODS 2005 tutorial by PhokionKolaitis (more on the theoretical

Semanticmapping

• AqueryexpressionthatrelatesaschemaSwithaschemaT– recallGAV,LAV,GLAV

DukeCS,Fall2016 CompSci516:DataIntensiveComputingSystems 49

DVD-VENDOR

Movies(id,title,year)Products(mid,releaseDate,releaseCompany,basePrice,rating,saleLocID)Locations(lid,name,taxRate)

AGGREGATOR

Items(name,releaseInfo,classification,price)

optionalslide

Page 50: CompSci516 Data Intensive Computing Systems · The lecture slides are based on Ch. 1, 3, 5 of this book • Data integration PODS 2005 tutorial by PhokionKolaitis (more on the theoretical

Semanticmapping- Example1

• “thetitleofMoviesintheDVD-VENDORschemaisthenameattributeinItemsintheAGGREGATORschema.”

• SELECT name as title• FROM Items

DukeCS,Fall2016 CompSci516:DataIntensiveComputingSystems 50

DVD-VENDOR

Movies(id,title,year)Products(mid,releaseDate,releaseCompany,basePrice,rating,saleLocID)Locations(lid,name,taxRate)

AGGREGATOR

Items(name,releaseInfo,classification,price)

DVD-VENDORFROMAGGREGATOR

optionalslide

Page 51: CompSci516 Data Intensive Computing Systems · The lecture slides are based on Ch. 1, 3, 5 of this book • Data integration PODS 2005 tutorial by PhokionKolaitis (more on the theoretical

Semanticmapping- Example2

• “getthepriceattributeoftheItemsrelationintheAGGREGATORschemabyjoiningtheProductsandLocationstablesintheDVD-VENDORschema..”

• SELECT ( basePrice * (1 + taxRate )) AS price• FROM Products , Locations• WHERE Products . saleLocID = Locations .lid

DukeCS,Fall2016 CompSci516:DataIntensiveComputingSystems 51

DVD-VENDOR

Movies(id,title,year)Products(mid,releaseDate,releaseCompany,basePrice,rating,saleLocID)Locations(lid,name,taxRate)

AGGREGATOR

Items(name,releaseInfo,classification,price)

AGGREGATORFROMDVD-VENDOR

optionalslide

Page 52: CompSci516 Data Intensive Computing Systems · The lecture slides are based on Ch. 1, 3, 5 of this book • Data integration PODS 2005 tutorial by PhokionKolaitis (more on the theoretical

Semanticmapping- Example3

• “GettheentiretupleinItemstablefromDVD-VENDOR”

• SELECT title AS name , releaseDate AS releaseInfo , rating AS• classification , basePrice * (1 + taxRate ) AS price• FROM Movies , Products , Locations• WHERE Movies .id = Products .mid AND Products . saleLocID =• Locations .lid

DukeCS,Fall2016 CompSci516:DataIntensiveComputingSystems 52

DVD-VENDOR

Movies(id,title,year)Products(mid,releaseDate,releaseCompany,basePrice,rating,saleLocID)Locations(lid,name,taxRate)

AGGREGATOR

Items(name,releaseInfo,classification,price)

AGGREGATORFROMDVD-VENDOR

optionalslide

Page 53: CompSci516 Data Intensive Computing Systems · The lecture slides are based on Ch. 1, 3, 5 of this book • Data integration PODS 2005 tutorial by PhokionKolaitis (more on the theoretical

SemanticMatches

• RelatesasetofelementsinschemaStoasetofelementsinschemaT– withoutspecifyingthedetailsofthenatureofrelationship(asSQLqueries)

DukeCS,Fall2016 CompSci516:DataIntensiveComputingSystems 53

optionalslide

Page 54: CompSci516 Data Intensive Computing Systems · The lecture slides are based on Ch. 1, 3, 5 of this book • Data integration PODS 2005 tutorial by PhokionKolaitis (more on the theoretical

Whyarematchingandmappingdifficult?

• Thesemanticsmaynotbefullycapturedintheschemas– “rating”mayimplymovierating,customerrating,etc– sometimesaccompaniedbyEnglishtext,hardforsystemstoparseand

understand• Schemacluescanbeunreliable

– twoelementsmayhavethesamenamebutdifferentmeaning,like“name”or“title”

• Semanticscanbesubjective– what“plot-summary”means– sometimesacommitteeofexpertsvote

• Combiningdatamaybedifficult– needtofigureoutajoinpath– full/left/rightouterjoinorinnerjoin– mayneedfilterconditions– thedesignerhasfiguretheseoutinspectingalargeamountofdata– erroneousandlaborprone

DukeCS,Fall2016 CompSci516:DataIntensiveComputingSystems 54

optionalslide

Page 55: CompSci516 Data Intensive Computing Systems · The lecture slides are based on Ch. 1, 3, 5 of this book • Data integration PODS 2005 tutorial by PhokionKolaitis (more on the theoretical

Componentsinaschemamatchingsystem

• Matchers• Combiners• ConstraintEnforcers• MatchSelectors

DukeCS,Fall2016 CompSci516:DataIntensiveComputingSystems 55

optionalslide

Page 56: CompSci516 Data Intensive Computing Systems · The lecture slides are based on Ch. 1, 3, 5 of this book • Data integration PODS 2005 tutorial by PhokionKolaitis (more on the theoretical

1.Matchers

• schemas→similaritymatrix• takestwoschemasSandTasinput• outputsasimilaritymatrix• assignstoeachelementpairsofSandtofTanumberbetween0and1– higherthenumber,sandtaremoresimilar

• e.g.– name≈<name:1,title:0:5>– releaseInfo ≈<releaseDate :0:6,releaseCompany:0.4>– price≈<basePrice :0:5>

DukeCS,Fall2016 CompSci516:DataIntensiveComputingSystems 56

optionalslide

Page 57: CompSci516 Data Intensive Computing Systems · The lecture slides are based on Ch. 1, 3, 5 of this book • Data integration PODS 2005 tutorial by PhokionKolaitis (more on the theoretical

TypesofMatchers

• Name-basedmatchers– comparesthenamesofelements– butalmostneverwrittenthesameway– usestechniquesforstringmatchingaseditdistance,Jaccard measureetc.;synonyms;normalization(capitalletters);hyphens;etc

• Instance-basedmatchers– Lookatdatainstances,buildsrecognizers(dictionaries),computesoverlaps,classification

DukeCS,Fall2016 CompSci516:DataIntensiveComputingSystems 57

optionalslide

Page 58: CompSci516 Data Intensive Computing Systems · The lecture slides are based on Ch. 1, 3, 5 of this book • Data integration PODS 2005 tutorial by PhokionKolaitis (more on the theoretical

2.Combiners

• matrix⨉ ….⨉matrix→matrix

• mergesthesimilaritymatricesoutputbythematchersintoasingleone

• cantaketheaverage,minimum,maximum,oraweightedsumofthesimilarityscores

DukeCS,Fall2016 CompSci516:DataIntensiveComputingSystems 58

optionalslide

Page 59: CompSci516 Data Intensive Computing Systems · The lecture slides are based on Ch. 1, 3, 5 of this book • Data integration PODS 2005 tutorial by PhokionKolaitis (more on the theoretical

TypesofCombiners

• Averagecombiners:– Supposekmatcherstopredictthescoresbetweentheelementsi ofschemaSandtheelementtj ofschemaT

– thenanaveragecombinerwillcomputethescorebetweenthesetwoelementsastheaveragefromthesekmatchers

• Hand-craftedscripts– e.g.ifsi =address,returnscoreofnaïve-bayesclassifer,elseaverage

• Weightedcombiner– givesweightstoeachmatcher

DukeCS,Fall2016 CompSci516:DataIntensiveComputingSystems 59

optionalslide

Page 60: CompSci516 Data Intensive Computing Systems · The lecture slides are based on Ch. 1, 3, 5 of this book • Data integration PODS 2005 tutorial by PhokionKolaitis (more on the theoretical

3.ConstraintEnforcers

• matrix⨉ constraints→matrix

• Inadditiontocluesandheuristics,domainknowledgeplaysanimportantroleinpruningcandidatematches– e.g.knowingthatmanymovietitlescontainfourwordsormore,butmostlocationnamesdonot,canhelpusguessthatItems.name ismorelikelytomatchMovies.title thanLocations.name

• anenforcerenforces suchconstraintsonthecandidatematches– ittransformsthesimilaritymatrixproducedbythecombinerintoanotheronethatbetterreflects thetruesimilarities

DukeCS,Fall2016 CompSci516:DataIntensiveComputingSystems 60

optionalslide

Page 61: CompSci516 Data Intensive Computing Systems · The lecture slides are based on Ch. 1, 3, 5 of this book • Data integration PODS 2005 tutorial by PhokionKolaitis (more on the theoretical

TypesofConstraintEnforcers:1/2

• Theremaybehardorsoftdomainintegrityconstraints• Hardconstraintsmustbeapplied

– Theenforcerwillnotoutputanymatchcombinationthatviolatesthem

• Softconstraintsareofmoreheuristicnature,andmayactuallybeviolatedincorrectmatchcombinations– theenforcerwilltrytominimizethenumber(andweight)ofthe

softconstraintsbeingviolated– butmaystilloutputamatchcombinationthatviolatesoneor

moreofthem• Formally,weattachacosttoeachconstraint

– Forhardconstraints,thecostis1– forsoftconstraints,thecostcanbeanypositivenumber.

DukeCS,Fall2016 CompSci516:DataIntensiveComputingSystems 61

optionalslide

Page 62: CompSci516 Data Intensive Computing Systems · The lecture slides are based on Ch. 1, 3, 5 of this book • Data integration PODS 2005 tutorial by PhokionKolaitis (more on the theoretical

TypesofConstraintEnforcers:2/2

• e.g.– c1:IfAItems.code,thenAisakey(weight=infinity)

– c2IfAItems.desc,thenanyrandomsampleof100datainstancesofAmusthaveanaveragelengthofatleast20words (weight=1.5)

– c3:IfmorethanhalfoftheattributesofTableUmatchesthoseofTableV,thenUissimilartoV(weight=1)

DukeCS,Fall2016 CompSci516:DataIntensiveComputingSystems 62

optionalslide

Page 63: CompSci516 Data Intensive Computing Systems · The lecture slides are based on Ch. 1, 3, 5 of this book • Data integration PODS 2005 tutorial by PhokionKolaitis (more on the theoretical

4.MatchSelectors

• matrix→matches

• Producesmatchesfromthesimilaritymatrixoutputbytheconstraintenforcer

• name≈<title:0:5>• releaseInfo ≈<releaseDate :0:6>• classication ≈ <rating:0:3>• price≈<basePrice :0:5>• Giventhethreshold0.5,thematchselectorproducesmatches:

– name≈ title,releaseInfo ≈ releaseDate,andprice≈ basePrice

DukeCS,Fall2016 CompSci516:DataIntensiveComputingSystems 63

optionalslide

Page 64: CompSci516 Data Intensive Computing Systems · The lecture slides are based on Ch. 1, 3, 5 of this book • Data integration PODS 2005 tutorial by PhokionKolaitis (more on the theoretical

TypesofMatchSelectors

• Thesimplestselectionstrategyisthresholding– allpairsofschemaelementswithsimilarityscoreexceedingagiventhresholdarereturnedasmatches

• Morecomplexstrategiesincludeformulatingtheselectionasanoptimizationproblemoveraweightedbipartitegraph

DukeCS,Fall2016 CompSci516:DataIntensiveComputingSystems 64

optionalslide