CompSci516 Data Intensive Computing Systems · The lecture slides are based on Ch. 1, 3, 5 of this...
CompSci 516 Data Intensive Computing Systems
Lecture 23: Data Integration
Instructor: Sudeepa Roy
Duke CS, Fall 2016 CompSci 516: Data Intensive Computing Systems
Announcements
• No class next week (Thanksgiving recess)
  – We meet again on 11/30 (Wed)
• Final report first draft due on 11/28 (Mon) night
  – but you can update it until Friday 12/2 night
  – send me an email if you update
• I will post a message on Piazza looking for three groups who will present on 11/30 (Wed)
  – the remaining seven groups present on 12/2 (Fri)
  – 10-minute talk/demo for each group (8 mins talk + 2 mins questions)
Today's topic
• An overview of data integration
• Some optional additional slides at the end
Reading Material
Optional reading:
• The "Principles of Data Integration" book by AnHai Doan, Alon Halevy, Zack Ives
  – The lecture slides are based on Ch. 1, 3, 5 of this book
• Data integration PODS 2005 tutorial by Phokion Kolaitis (more on the theoretical aspects)
What is Data Integration? 1/2
• The Internet and the WWW have revolutionized people's access to digital data
• We take it for granted that a search query typed into a browser taps into millions of documents and databases and returns what we are looking for
• Systems on the Internet must efficiently and accurately process and serve a large amount of data
What is Data Integration? 2/2
• Unlike a traditional RDBMS, the new services need the ability to
  – Share data among multiple organizations
  – Integrate data in a flexible and efficient fashion
• Data integration:
  – A set of techniques that enable building systems geared for flexible sharing and integration of data across multiple autonomous data providers
Why data integration? 1/2
• With issues like normalization and trade-offs in design choices, different people design different schemas for the same data
• Sometimes there are different needs as well
  – not all attributes are needed by all people
• Sometimes people want to share their data
  – collaborators
  – researchers who want to publish data for others' use
Why data integration? 2/2
• On the Web,
  – Many websites post job applications, hotel or flight deals, movie information
  – To keep up with new information and new needs, you may have to look at all of them
  – But now there are websites where you can access all of them
    – e.g. TripAdvisor helps you see the price of the same hotel on the hotel website, hotels.com, booking.com, expedia, …
• But this type of data integration has its challenges too
Challenges in Data Integration: 1. Query
• Offer uniform access to a set of autonomous and heterogeneous data sources
• Query:
  – query disparate data sources, sometimes update them
Challenges in Data Integration: 2. Number of sources
• Number of sources:
  – challenging even for 10 or even 2 data sources
  – amplified for hundreds of sources, say at Web scale
Challenges in Data Integration: 3. Heterogeneity
• Heterogeneity:
  – data sources were developed independently of each other
  – databases, files, HTML
  – different schemas and references
  – some structured, some unstructured
Challenges in Data Integration: 4. Autonomy
• Autonomy:
  – the sources may not belong to the same administrative entity
  – even when they do, they may be run by different organizations
  – we may not have full access to the data
  – there may be privacy concerns
  – the sources may change their formats and access patterns at any time without notice
Virtual Data Integration Architecture
• Three components
  – Data sources
  – Wrappers
  – Mediated schema
[Figure: Mediated Schema on top; Wrappers in the middle; data sources such as RDBMS1 and RDBMS2 at the bottom]
Virtual Data Integration Architecture
• Data sources
  – can be of any data model, e.g. a relational DBMS with a SQL interface
  – XML with an XQuery interface
  – HTML
Virtual Data Integration Architecture
• Wrappers
  – programs that send queries to a data source
  – receive answers
  – apply some basic transformations
• e.g. for a web form source
  – translate the query into an HTTP request with a URL
  – when the answer comes back as an HTML file, extract tuples
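The web form wrapper described above can be sketched roughly as follows. This is only an illustration under stated assumptions: the base URL, the parameter name `title`, and the table layout of the returned HTML are hypothetical, and no real HTTP request is sent.

```python
from html.parser import HTMLParser
from urllib.parse import urlencode


def build_request_url(base_url, bindings):
    """Translate the attribute bindings of a query into an HTTP GET URL."""
    return base_url + "?" + urlencode(bindings)


class RowExtractor(HTMLParser):
    """Collect the text of <td> cells, producing one tuple per <tr>."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._row = None
        self._in_cell = False

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag == "td":
            self._in_cell = True

    def handle_endtag(self, tag):
        if tag == "td":
            self._in_cell = False
        elif tag == "tr" and self._row is not None:
            if self._row:
                self.rows.append(tuple(self._row))
            self._row = None

    def handle_data(self, data):
        if self._in_cell and self._row is not None:
            self._row.append(data.strip())


def extract_tuples(html):
    """Turn the HTML answer page into a list of tuples."""
    parser = RowExtractor()
    parser.feed(html)
    parser.close()
    return parser.rows


# Hypothetical source URL and answer page, for illustration only.
url = build_request_url("http://example.org/showtimes", {"title": "Annie Hall"})
page = "<table><tr><td>Film Forum</td><td>Annie Hall</td><td>19:30</td></tr></table>"
tuples = extract_tuples(page)
```

A real wrapper would also fetch the page and cope with layout changes at the source, which is where most of the maintenance effort tends to go.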
Virtual Data Integration Architecture
• Mediated schema
  – built for the data integration application
  – contains only the aspects that are relevant
  – may not contain all attributes
  – typically does not store any data
  – logical schema used by the users for posing queries
Source Descriptions
• Specify the properties of the sources that the system needs to know
• main components are semantic mappings
  – relate the schemas of the sources to the attributes in the mediated schema
• specified declaratively
• between data sources and the mediated schema
  – not between two sources
• also specify
  – whether sources are complete or not
  – limited access patterns to sources
Example
Mediated schema:
  Movie(title, director, year, genre)
  Actors(title, actor)
  Plays(movie, location, startTime)
  Reviews(title, rating, description)
Sources:
  S1: Movies(name, actors, director, genre)
  S2: Cinemas(place, movie, start)
  S3: Cinemas in NYC(cinema, title, startTime)
  S4: Cinemas in SF(location, movie, startingTime)
  S5: Reviews(title, date, grade, review)
• source S2 may not contain all the movie showing times in the entire country
• source S3 may be known to contain all movie showing times in New York
• in order to get an answer from the source S1, there needs to be an input for at least one of its attributes
Example: Query on the Mediated Schema
• show times of movies in NYC directed by Woody Allen
  SELECT title, startTime
  FROM Movie, Plays
  WHERE Movie.title = Plays.movie AND
        location = 'New York' AND
        director = 'Woody Allen'
Example: Reformulation on source databases: 1/5
• Tuples for Movie can be obtained from source S1
  – but the attribute title needs to be reformulated to name
Example: Reformulation on source databases: 2/5
• Tuples for Plays can be obtained from either source S2 or S3
  – Since the latter is complete for showings in NYC, we choose it over S2
Example: Reformulation on source databases: 3/5
• Source S3 requires the title of a movie as input
  – but such a title is not specified in the query
  – the query plan must first access source S1
  – then feed the movie titles returned from S1 as inputs to S3
Example: Reformulation on source databases: 4/5
• Options for the logical query plan:
  – access S1, then S3
  – could access S1 then S2 as well (possibly not complete)
• Then query optimization
  – as in a traditional database system
  – takes a logical plan and outputs a physical plan
Example: Reformulation on source databases: 5/5
• Then query execution
  – execute the physical query plan
  – may ask the optimizer to reconsider the plan (unlike an RDBMS), e.g. if S3 is too slow
• sometimes contingencies are included in the original plan
  – trade-off between the complexity of the plan and the ability to respond to unexpected events
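The reformulated plan (bind the director at S1, rename name to title, then feed each title into S3) can be sketched in a few lines. The in-memory lists below stand in for the remote sources, and their contents are made-up toy data, not part of the lecture.

```python
# Toy stand-ins for the sources; the tuples are illustrative assumptions.
S1_MOVIES = [  # S1: Movies(name, actors, director, genre)
    ("Annie Hall", "Diane Keaton", "Woody Allen", "comedy"),
    ("Manhattan", "Diane Keaton", "Woody Allen", "comedy"),
    ("The Godfather", "Al Pacino", "Francis Ford Coppola", "crime"),
]
S3_NYC = [  # S3: Cinemas in NYC(cinema, title, startTime)
    ("Film Forum", "Annie Hall", "19:30"),
    ("Film Forum", "The Godfather", "21:00"),
]


def query_s1(director):
    """S1 needs at least one bound attribute; here the director is bound."""
    return [name for name, _actors, d, _genre in S1_MOVIES if d == director]


def query_s3(title):
    """S3 requires a movie title as input."""
    return [(t, start) for _cinema, t, start in S3_NYC if t == title]


def showtimes_in_nyc(director):
    results = []
    for title in query_s1(director):      # step 1: access S1 (name plays the role of title)
        results.extend(query_s3(title))   # step 2: feed each title into S3
    return results


answer = showtimes_in_nyc("Woody Allen")
```

Swapping `query_s3` for a lookup on S2 would give the alternative plan, with the caveat that S2 may be incomplete.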
Schema mapping should handle the discrepancies between the sources and the mediated schema: 1/4
• Relation and attribute names
  – "description" in the mediated schema (MS) is the same as "review" in S4
  – "name" of Actor in MS and "name" in S3 (NYCCinemas) are not the same
Sources:
  S1: Actor(AID, firstName, lastName, nationality, yearOfBirth)
      Movie(MID, title)
      ActorPlays(AID, MID)
      MovieDetail(MID, director, genre, year)
  S2: Cinemas(place, movie, start)
  S3: NYCCinemas(name, title, startTime)
  S4: Reviews(title, date, grade, review)
  S5: MovieGenres(title, genre)
  S6: MovieDirectors(title, dir)
  S7: MovieYears(title, year)
Mediated schema:
  Movie(title, director, year, genre)
  Actor(title, name)
  Plays(movie, location, startTime)
  Reviews(title, rating, description)
Schema mapping should handle the discrepancies between the sources and the mediated schema: 2/4
• Tabular organization
  – In MS, Actor stores movie title and actor name
  – In S1, a join is needed, which has to be specified by the mapping
Schema mapping should handle the discrepancies between the sources and the mediated schema: 3/4
• Domain coverage
  – the coverage and level of detail may differ
  – S1 stores more information about actors than MS does
Schema mapping should handle the discrepancies between the sources and the mediated schema: 4/4
• Data-level variations
  – GPA as a letter grade vs. a numeric score on a 4.0 scale
  – S1 stores actor names in two columns, MS stores them in one
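As a rough illustration of the tabular and data-level discrepancies, the sketch below expresses the mediated relation Actor(title, name) as a view over S1: the view joins three S1 tables and concatenates the two name columns. The table prefix `S1_` and the sample row are assumptions made for this example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE S1_Actor(AID INTEGER, firstName TEXT, lastName TEXT);
CREATE TABLE S1_Movie(MID INTEGER, title TEXT);
CREATE TABLE S1_ActorPlays(AID INTEGER, MID INTEGER);
INSERT INTO S1_Actor VALUES (1, 'Diane', 'Keaton');
INSERT INTO S1_Movie VALUES (10, 'Annie Hall');
INSERT INTO S1_ActorPlays VALUES (1, 10);

-- Mediated-schema relation Actor(title, name) expressed over S1:
-- the join fixes the tabular mismatch, the concatenation the data-level one.
CREATE VIEW Actor AS
SELECT m.title AS title, a.firstName || ' ' || a.lastName AS name
FROM S1_Actor a
JOIN S1_ActorPlays ap ON a.AID = ap.AID
JOIN S1_Movie m ON m.MID = ap.MID;
""")
actor_rows = conn.execute("SELECT title, name FROM Actor").fetchall()
```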
Three Desired Properties of Schema Mapping Languages 1/3
• Flexibility
  – there are significant differences between disparate schemas
  – the languages should be very flexible
  – they should be able to express a wide variety of relationships between schemas
Three Desired Properties of Schema Mapping Languages 2/3
• Efficient reformulation
  – our goal is to use the schema mapping to reformulate queries
  – we should be able to develop reformulation algorithms whose properties are well understood and that are efficient in practice
  – often competes with flexibility, because more expressive languages are typically harder to reason about
Three Desired Properties of Schema Mapping Languages 3/3
• Easy update
  – for a formalism to be useful in practice, it needs to be easy to add and remove sources
  – if adding a new data source potentially requires inspecting all other sources, the resulting system will be hard to manage for a large number of sources
Three standard schema mapping languages
1. Global-as-View (GAV)
2. Local-as-View (LAV)
3. Global-Local-as-View (GLAV)
Global-as-View (GAV)
• GAV defines the mediated schema (MS) as a set of views over the data sources
  – Mediated schema = global schema
  – An intuitive approach
• Mediated schema (MS) G
  – Gi = some relation in G
  – Xi denotes the attributes in Gi
• Source schemas S1, S2, …, Sn
GAV Definition
• A GAV schema mapping M is a set of expressions of the form:
• Gi(Xi) ⊇ Q(S1, S2, …, Sn)
  – open world assumption
  – instances computed for MS are assumed to be incomplete
• or, Gi(Xi) = Q(S1, S2, …, Sn)
  – closed world assumption
  – instances computed for MS are assumed to be complete
GAV Example
• Movie(title, director, year, genre) ⊇ S1.Movie(MID, title), S1.MovieDetail(MID, director, genre, year)
• Movie(title, director, year, genre) ⊇ S5.MovieGenres(title, genre), S6.MovieDirectors(title, director), S7.MovieYears(title, year)
• Plays(movie, location, startTime) ⊇ S2.Cinemas(location, movie, startTime)
• Plays(movie, location, startTime) ⊇ S3.NYCCinemas(location, movie, startTime)
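One way to read the two GAV expressions for Movie is as a SQL view that unions the two source queries. The sketch below does this in SQLite with toy rows. The `S1_`, `S5_` table prefixes and the data are assumptions for illustration, and a plain view only models the tuples contributed by these sources (under the open world assumption the real relation may contain more).

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE S1_Movie(MID INTEGER, title TEXT);
CREATE TABLE S1_MovieDetail(MID INTEGER, director TEXT, genre TEXT, year INTEGER);
CREATE TABLE S5_MovieGenres(title TEXT, genre TEXT);
CREATE TABLE S6_MovieDirectors(title TEXT, dir TEXT);
CREATE TABLE S7_MovieYears(title TEXT, year INTEGER);

INSERT INTO S1_Movie VALUES (1, 'Annie Hall');
INSERT INTO S1_MovieDetail VALUES (1, 'Woody Allen', 'comedy', 1977);
INSERT INTO S5_MovieGenres VALUES ('The Godfather', 'crime');
INSERT INTO S6_MovieDirectors VALUES ('The Godfather', 'Francis Ford Coppola');
INSERT INTO S7_MovieYears VALUES ('The Godfather', 1972);

-- Mediated relation Movie defined as a view over the sources (GAV).
CREATE VIEW Movie AS
  SELECT m.title AS title, d.director AS director, d.year AS year, d.genre AS genre
  FROM S1_Movie m JOIN S1_MovieDetail d ON m.MID = d.MID
  UNION
  SELECT g.title, r.dir, y.year, g.genre
  FROM S5_MovieGenres g
  JOIN S6_MovieDirectors r ON r.title = g.title
  JOIN S7_MovieYears y ON y.title = g.title;
""")
movie_rows = conn.execute(
    "SELECT title, director, year, genre FROM Movie ORDER BY title"
).fetchall()
```

Because the mediated relation is literally a view, reformulating a query over Movie amounts to view unfolding, which is what makes GAV reformulation comparatively simple.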
Discussions: GAV 1/2
• Suppose we have a data source S8 that stores pairs of (actor, director) who worked together on movies
  – S8: ActTogether(actor, director)
• The only way to model this source in GAV is with the following two descriptions that use NULL:
  – Actors(NULL, actor) ⊇ S8(actor, director)
  – Movie(NULL, director, NULL, NULL) ⊇ S8(actor, director)
• These descriptions create tuples in the mediated schema that include NULLs in all columns except one
Discussions: GAV 2/2
• If the source S8 includes the tuples (Keaton, Allen) and (Pacino, Coppola), then the tuples computed for the mediated schema would be:
  – Actors(NULL, Keaton), Actors(NULL, Pacino)
  – Movie(NULL, Allen, NULL, NULL), Movie(NULL, Coppola, NULL, NULL)
• Now suppose we have the following query that recreates S8:
  – Q(actor, director) :- Actors(title, actor), Movie(title, director, year, genre)
• We would not be able to retrieve the tuples from S8, because the source descriptions lost the relationship between actor and director
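The loss of the actor/director relationship can be checked directly. In this small sketch, Python's `None` plays the role of NULL and the join follows SQL semantics, where a NULL never equals another NULL. The function name `recreate_s8` is only illustrative.

```python
s8 = [("Keaton", "Allen"), ("Pacino", "Coppola")]

# Tuples produced in the mediated schema by the two GAV descriptions.
actors = [(None, actor) for actor, _director in s8]
movie = [(None, director, None, None) for _actor, director in s8]


def recreate_s8(actors, movie):
    """Q(actor, director) :- Actors(title, actor), Movie(title, director, year, genre)."""
    return [
        (actor, director)
        for a_title, actor in actors
        for m_title, director, _year, _genre in movie
        if a_title is not None and a_title == m_title  # NULL joins with nothing
    ]


recovered = recreate_s8(actors, movie)
```

The join comes back empty: the two descriptions keep the actors and the directors but not which actor worked with which director.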
Local-as-View (LAV)
• describes each data source as precisely as possible and independently of any other sources
  – opposite approach to GAV
• Mediated schema (MS) G
• Source schemas S1, S2, …, Sn
  – Xi denotes the attributes in Si
LAV Definition
• A LAV schema mapping M is a set of expressions of the form:
• Si(Xi) ⊆ Qi(G)
  – open world assumption
• or, Si(Xi) = Qi(G)
  – closed world assumption
  – but completeness is about the data sources, not about the MS
LAV Example
• S5.MovieGenres(title, genre) ⊆ Movie(title, director, year, genre)
• S6.MovieDirectors(title, director) ⊆ Movie(title, director, year, genre)
• S7.MovieYears(title, year) ⊆ Movie(title, director, year, genre)
• S8(actor, director) ⊆ Movie(title, director, year, genre), Actors(title, actor)
• Can also specify constraints on the contents
• S9(title, year, "comedy") ⊆ Movie(title, director, year, "comedy"), year ≥ 1970
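To make the containment reading of a LAV description concrete, the sketch below evaluates the right-hand side of the S8 description over a hypothetical mediated instance and compares it with the source contents. The instance data is invented for illustration. In a real system the mediated instance is not materialized, so this is a check of the semantics, not a query answering algorithm.

```python
def s8_view(movie, actors):
    """Evaluate Movie(title, director, year, genre), Actors(title, actor),
    projected on (actor, director)."""
    return {
        (actor, director)
        for (m_title, director, _year, _genre) in movie
        for (a_title, actor) in actors
        if m_title == a_title
    }


# Hypothetical instance of the mediated schema.
movie = {
    ("Annie Hall", "Allen", 1977, "comedy"),
    ("The Godfather", "Coppola", 1972, "crime"),
}
actors = {
    ("Annie Hall", "Keaton"),
    ("The Godfather", "Pacino"),
    ("The Godfather", "Brando"),
}
s8 = {("Keaton", "Allen"), ("Pacino", "Coppola")}

view = s8_view(movie, actors)
holds_open_world = s8 <= view     # S8 ⊆ Q(G): the source may be incomplete
holds_closed_world = s8 == view   # S8 = Q(G): the source would have to be complete
```

Unlike the GAV modeling of S8, the description keeps actor and director connected through the shared title variable.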
Global-Local-as-View (GLAV)
• GAV and LAV can be combined into GLAV
• Has the expressive power of both
• The expressions in the schema mapping include
  – a query over the data sources on the left-hand side
  – a query on the mediated schema on the right-hand side
• Mediated schema (MS) G
• Source schemas S1, S2, …, Sn
GLAV Definition
• A GLAV schema mapping M is a set of expressions of the form:
• QS(X) ⊆ QG(X)
  – open world assumption
• or, QS(X) = QG(X)
  – closed world assumption
• QG is a query over G whose head variables are X
• QS is a query over the data sources S1, S2, …, Sn whose head variables are also X
GLAV Example
• Suppose S1 is known to have only comedies produced after 1970
• S1.Movie(MID, title), S1.MovieDetail(MID, director, genre, year) ⊆ Movie(title, director, year, "comedy"), year ≥ 1970
Optional / Additional Slides
Schema Matchings and Mappings
• Specify "matches", e.g.
  – attribute "name" in one source corresponds to attribute "title" in another
  – "location" is a concatenation of "city, state, zipcode"
• Elaborate matches into semantic "mappings"
  – using queries, e.g. in SQL
Challenges
• The tasks of creating the matches and mappings are often difficult
  – they require a deep understanding of the semantics of the schemas of the data sources and of the mediated schema
  – this knowledge is typically distributed among multiple people
  – these people are not necessarily database experts and may need help
• There is no algorithm that will take two arbitrary database schemas and flawlessly produce correct matches and mappings
  – the goal is to create tools that reduce the time needed by giving suggestions to the designer
Two database schemas
DVD-VENDOR:
  Movies(id, title, year)
  Products(mid, releaseDate, releaseCompany, basePrice, rating, saleLocID)
  Locations(lid, name, taxRate)
AGGREGATOR:
  Items(name, releaseInfo, classification, price)
• Attributes and tables in a schema are called its elements
• The aggregator is not interested in all the details of the product, but only in the attributes that are shown to its customers
• Schema DVD-VENDOR has 14 elements
  – 11 attributes (e.g., id, title, and year) and three tables (e.g., Movies)
• Schema AGGREGATOR has five elements
  – four attributes and one table
Semantic mapping
• A query expression that relates a schema S with a schema T
  – recall GAV, LAV, GLAV
Semantic mapping - Example 1 (DVD-VENDOR from AGGREGATOR)
• "the title of Movies in the DVD-VENDOR schema is the name attribute in Items in the AGGREGATOR schema."
  SELECT name AS title
  FROM Items
Semantic mapping - Example 2 (AGGREGATOR from DVD-VENDOR)
• "get the price attribute of the Items relation in the AGGREGATOR schema by joining the Products and Locations tables in the DVD-VENDOR schema."
  SELECT (basePrice * (1 + taxRate)) AS price
  FROM Products, Locations
  WHERE Products.saleLocID = Locations.lid
Semantic mapping - Example 3 (AGGREGATOR from DVD-VENDOR)
• "Get the entire tuple in the Items table from DVD-VENDOR"
  SELECT title AS name, releaseDate AS releaseInfo, rating AS classification,
         basePrice * (1 + taxRate) AS price
  FROM Movies, Products, Locations
  WHERE Movies.id = Products.mid AND Products.saleLocID = Locations.lid
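The mapping of Example 3 can be tried out on a toy DVD-VENDOR instance. The sketch below loads three made-up rows into an in-memory SQLite database and runs the mapping query. All data values are assumptions chosen so that the price arithmetic is exact.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Movies(id INTEGER, title TEXT, year INTEGER);
CREATE TABLE Products(mid INTEGER, releaseDate TEXT, releaseCompany TEXT,
                      basePrice REAL, rating TEXT, saleLocID INTEGER);
CREATE TABLE Locations(lid INTEGER, name TEXT, taxRate REAL);

-- Toy rows, for illustration only.
INSERT INTO Movies VALUES (1, 'Annie Hall', 1977);
INSERT INTO Products VALUES (1, '2000-07-04', 'MGM', 20.0, 'PG', 7);
INSERT INTO Locations VALUES (7, 'Durham', 0.25);
""")

# Semantic mapping: AGGREGATOR.Items computed from DVD-VENDOR.
items = conn.execute("""
SELECT title AS name, releaseDate AS releaseInfo, rating AS classification,
       basePrice * (1 + taxRate) AS price
FROM Movies, Products, Locations
WHERE Movies.id = Products.mid AND Products.saleLocID = Locations.lid
""").fetchall()
```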
Semantic Matches
• A match relates a set of elements in schema S to a set of elements in schema T
  – without specifying the details of the nature of the relationship (as SQL queries do)
Why are matching and mapping difficult?
• The semantics may not be fully captured in the schemas
  – "rating" may imply movie rating, customer rating, etc.
  – sometimes accompanied by English text, which is hard for systems to parse and understand
• Schema clues can be unreliable
  – two elements may have the same name but different meanings, like "name" or "title"
• Semantics can be subjective
  – what "plot-summary" means
  – sometimes a committee of experts votes
• Combining data may be difficult
  – need to figure out a join path
  – full/left/right outer join or inner join
  – may need filter conditions
  – the designer has to figure these out by inspecting a large amount of data
  – error-prone and labor-intensive
Components in a schema matching system
• Matchers
• Combiners
• Constraint Enforcers
• Match Selectors
1. Matchers
• schemas → similarity matrix
• takes two schemas S and T as input
• outputs a similarity matrix
• assigns to each pair of elements s of S and t of T a number between 0 and 1
  – the higher the number, the more similar s and t are
• e.g.
  – name ≈ <name: 1, title: 0.5>
  – releaseInfo ≈ <releaseDate: 0.6, releaseCompany: 0.4>
  – price ≈ <basePrice: 0.5>
Types of Matchers
• Name-based matchers
  – compare the names of elements
  – but names are almost never written the same way
  – use string matching techniques such as edit distance and the Jaccard measure, plus synonyms, normalization (capital letters, hyphens), etc.
• Instance-based matchers
  – look at data instances, build recognizers (dictionaries), compute overlaps, use classification
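A minimal name-based matcher might look like the sketch below. It normalizes names and scores each pair with the Jaccard measure over character bigrams, which is one choice among many; the slides do not prescribe a particular measure. Its scores will not reproduce the example numbers on the Matchers slide: for instance it gives name vs. title a score of 0, since capturing that pair needs synonym knowledge.

```python
import re


def normalize(name):
    """Lower-case and drop hyphens, underscores and other non-alphanumerics."""
    return re.sub(r"[^a-z0-9]", "", name.lower())


def bigrams(name):
    s = normalize(name)
    return {s[i:i + 2] for i in range(len(s) - 1)}


def jaccard(a, b):
    """Jaccard measure |X ∩ Y| / |X ∪ Y| over the bigram sets of two names."""
    x, y = bigrams(a), bigrams(b)
    return len(x & y) / len(x | y) if (x | y) else 0.0


def name_matcher(schema_s, schema_t):
    """Return a similarity matrix as a dict keyed by (s, t) element pairs."""
    return {(s, t): jaccard(s, t) for s in schema_s for t in schema_t}


sim = name_matcher(["name", "releaseInfo"], ["name", "title", "releaseDate"])
```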
2. Combiners
• matrix ⨉ … ⨉ matrix → matrix
• merges the similarity matrices output by the matchers into a single one
• can take the average, minimum, maximum, or a weighted sum of the similarity scores
Types of Combiners
• Average combiners:
  – suppose k matchers are used to predict the score between element si of schema S and element tj of schema T
  – then an average combiner computes the score between these two elements as the average of the scores from these k matchers
• Hand-crafted scripts
  – e.g. if si = address, return the score of the naïve Bayes classifier, else the average
• Weighted combiner
  – gives a weight to each matcher
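The average, minimum, maximum and weighted combiners can be sketched with one small function. Similarity matrices are represented here as dicts keyed by element pairs, which is an assumption of this sketch; the weighted variant normalizes by the sum of the weights.

```python
def combine(matrices, how="average", weights=None):
    """Merge several similarity matrices (dicts keyed by (s, t)) into one."""
    combined = {}
    for pair in matrices[0]:
        scores = [m[pair] for m in matrices]
        if how == "average":
            combined[pair] = sum(scores) / len(scores)
        elif how == "min":
            combined[pair] = min(scores)
        elif how == "max":
            combined[pair] = max(scores)
        elif how == "weighted":
            total = sum(weights)
            combined[pair] = sum(w * s for w, s in zip(weights, scores)) / total
        else:
            raise ValueError(f"unknown combiner: {how}")
    return combined


# Two hypothetical matcher outputs for the same element pair.
name_based = {("name", "title"): 0.5}
instance_based = {("name", "title"): 1.0}

avg = combine([name_based, instance_based], "average")
wtd = combine([name_based, instance_based], "weighted", weights=[1, 3])
```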
3. Constraint Enforcers
• matrix ⨉ constraints → matrix
• In addition to clues and heuristics, domain knowledge plays an important role in pruning candidate matches
  – e.g. knowing that many movie titles contain four words or more, but most location names do not, can help us guess that Items.name is more likely to match Movies.title than Locations.name
• an enforcer enforces such constraints on the candidate matches
  – it transforms the similarity matrix produced by the combiner into another one that better reflects the true similarities
Types of Constraint Enforcers: 1/2
• There may be hard or soft domain integrity constraints
• Hard constraints must be applied
  – the enforcer will not output any match combination that violates them
• Soft constraints are of a more heuristic nature, and may actually be violated in correct match combinations
  – the enforcer will try to minimize the number (and weight) of the soft constraints being violated
  – but may still output a match combination that violates one or more of them
• Formally, we attach a cost to each constraint
  – for hard constraints, the cost is ∞
  – for soft constraints, the cost can be any positive number
Types of Constraint Enforcers: 2/2
• e.g.
  – c1: If A ≈ Items.code, then A is a key (weight = ∞)
  – c2: If A ≈ Items.desc, then any random sample of 100 data instances of A must have an average length of at least 20 words (weight = 1.5)
  – c3: If more than half of the attributes of Table U match those of Table V, then U is similar to V (weight = 1)
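The cost-based view of hard and soft constraints can be sketched as follows: every violated constraint adds its cost, hard constraints cost infinity, and the enforcer picks the cheapest candidate with finite cost. The candidate match combinations, the key set and the average word counts below are hypothetical metadata invented for this example; only c1 and c2 from the slide are modeled.

```python
import math

# Hypothetical metadata about source attributes (assumed, not from the lecture).
KEYS = {"Movies.id"}
AVG_WORDS = {"Movies.title": 3}

constraints = [
    # c1 (hard): whatever matches Items.code must be a key.
    (lambda m: m["Items.code"] not in KEYS, math.inf),
    # c2 (soft): whatever matches Items.desc should average at least 20 words.
    (lambda m: AVG_WORDS.get(m["Items.desc"], 0) < 20, 1.5),
]


def violation_cost(match_combination, constraints):
    """Sum the costs of all constraints that the combination violates."""
    return sum(cost for violates, cost in constraints if violates(match_combination))


def enforce(candidates, constraints):
    """Return the cheapest candidate that violates no hard constraint, else None."""
    costed = [(violation_cost(c, constraints), c) for c in candidates]
    feasible = [(cost, c) for cost, c in costed if cost != math.inf]
    return min(feasible, key=lambda pair: pair[0])[1] if feasible else None


cand_a = {"Items.code": "Products.rating", "Items.desc": "Movies.title"}
cand_b = {"Items.code": "Movies.id", "Items.desc": "Movies.title"}
best = enforce([cand_a, cand_b], constraints)
```

Here `cand_b` is returned even though it violates the soft constraint c2, which mirrors the point that soft constraints may be violated in the output.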
4. Match Selectors
• matrix → matches
• Produces matches from the similarity matrix output by the constraint enforcer
• name ≈ <title: 0.5>
• releaseInfo ≈ <releaseDate: 0.6>
• classification ≈ <rating: 0.3>
• price ≈ <basePrice: 0.5>
• Given the threshold 0.5, the match selector produces the matches:
  – name ≈ title, releaseInfo ≈ releaseDate, and price ≈ basePrice
Types of Match Selectors
• The simplest selection strategy is thresholding
  – all pairs of schema elements with a similarity score exceeding a given threshold are returned as matches
• More complex strategies include formulating the selection as an optimization problem over a weighted bipartite graph
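Thresholding is short enough to sketch directly, using the similarity values from the match selector example. One assumption is made explicit: the comparison is `>=`, because that example returns pairs whose score equals the threshold 0.5.

```python
def select_matches(similarity, threshold):
    """Return all (s, t) pairs whose score reaches the threshold (inclusive)."""
    return sorted(
        (s, t)
        for s, candidates in similarity.items()
        for t, score in candidates.items()
        if score >= threshold
    )


similarity = {
    "name": {"title": 0.5},
    "releaseInfo": {"releaseDate": 0.6},
    "classification": {"rating": 0.3},
    "price": {"basePrice": 0.5},
}
matches = select_matches(similarity, 0.5)
```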