CompSci516 Data Intensive Computing Systems · 2016-08-31 · CompSci516 Data Intensive Computing...

Click here to load reader

  • date post

    14-Mar-2020
  • Category

    Documents

  • view

    0
  • download

    0

Embed Size (px)

Transcript of CompSci516 Data Intensive Computing Systems · 2016-08-31 · CompSci516 Data Intensive Computing...

  • CompSci 516DataIntensiveComputingSystems

    Lecture1Introduction

    andDataModels

    Instructor:Sudeepa Roy

    1DukeCS,Fall2016 CompSci 516:DataIntensiveComputingSystems

  • CourseWebsite

    • http://www.cs.duke.edu/courses/fall16/compsci516/

    • Pleasecheckfrequentlyforupdates

    DukeCS,Fall2016 CompSci 516:DataIntensiveComputingSystems 2

  • Instructor• Sudeepa Roy

    [email protected]– https://users.cs.duke.edu/~sudeepa/– officehour:Mondays1:30-2:30pm,LSRCD325

    • Aboutmyself– AssistantProfessorinCS– PhD:UPenn,Postdoc:Univ.ofWashington– JoinedDukeCSinFall2015– Researchinterests:

    • Databases(theoryandapplications)• DataAnalysis,causality,explaininganswers• Uncertaindata,dataprovenance,crowdsourcing

    3DukeCS,Fall2016 CompSci516:DataIntensiveComputingSystems

  • TA

    • Junghoon Kang– [email protected]– officehour:Thursdays1–2pm,NorthN303B

    • Noofficehourthisweekandon09/05,Mon(Memorialday)

    • AdditionalofficehourbyJungnextweekTuesday1pm– 2pm,NorthN303B

    4DukeCS,Fall2016 CompSci516:DataIntensiveComputingSystems

  • Logistics

    • Homeworksubmission:Sakai– Allenrolledstudentsarealreadythere

    • Discussionforum:Piazza– Allenrolledstudentsarealreadythere– SendmeanemailifyouhavenotreceivedawelcomeemailfromPiazza

    • Lectureslideswillbeuploadedbeforetheclass– butwillbeupdatedaftertheclass

    DukeCS,Fall2016 CompSci516:DataIntensiveComputingSystems 5

  • Grading

    • ThreeHomework:30%• Project:20%• Midterm:20%• Final:30%

    6DukeCS,Fall2016 CompSci 516:DataIntensiveComputingSystems

  • GradingStrategy• Relativegrading

    – TopperoftheclassgetsA+irrespectiveofthenumber,andallandonly“aboveexpectation”performancesgetA+

    – Nolowestgradeorfixeddistribution– soeveryonecangetverygoodgradesbyworkinghard!

    – Theactualgradedistributionattheendwilldependontheperformanceoftheentireclassonallthecomponents

    – Ifyouareaborderlinecasefortwogrades,yourclassparticipationandeffortthroughoutthesemesterwillputyouinthehighergrade

    • Activelyparticipateintheclass!• Askquestionsinclassandonpiazza• Answereachother’squestionsonpiazza• Send(anonymousornot)feedback,suggestions,orconcernsonPiazza

    7DukeCS,Fall2016 CompSci 516:DataIntensiveComputingSystems

  • Homework• Duein~3weeksaftertheyareposted/previoushw isdue

    – 2weeksshouldbeenough– Startearly

    • Nolatedays– contacttheinstructorifyouhavea*valid*reasontobelate– Anotherexam,project,hw isnotavalidreason– wewillalwaysbefair

    toall– Computercrash/suddeninterviewtrips/medicalissues(following

    officialprocedures)maycountasvalidreasons– Noguaranteethatyourrequestwillbegranted– again,startearly!

    • Tobedoneindividually

    DukeCS,Fall2016 CompSci516:DataIntensiveComputingSystems 8

  • HomeworkOverview• Youwilllearnhowtousetraditionalandnewdatabasesystemsinthe

    homework– Havetolearnthemmostlyonyourownfollowingtutorialsavailableonline

    andwithsomehelpfromtheTA

    • HW1andHW2alreadyonsakai!Havefun!– Workonthematyourownpace

    • HW1:TraditionalDBMS– SQLandPostgres– Dueon09/16(Fri)

    • HW2:Distributeddataprocessing– SparkandAWS– WillreceiveinstructionsforAWS(secondpartofthehw)– Dueon10/12(Wed)

    • HW3:NOSQL– e.g.MongoDBorDynamoDB– Willbepostedlater(afterTransactions)

    DukeCS,Fall2016 CompSci516:DataIntensiveComputingSystems 9

  • Exams

    • Midterm– Oct5(Wed)• Final– TBD(byuniv schedule)

    • Inclass• Closedbook,closednotes,noelectronicdevices• Totalweight:20+30%=50%• Examswilltestyourunderstandingofthematerial

    DukeCS,Fall2016 CompSci516:DataIntensiveComputingSystems 10

  • Projects• 20%weight• Ingroupsofatmost3

    – YoucanlookforgroupmembersthroughPiazzabyannouncingyourgeneralareaofinterestorifyouhaveaprobleminmind

    – Eachgroupmembershoulddoapprox.equalwork• Workdoneshouldbe(atleast)equivalenttotwoHWs,i.e.

    – theworkforonehw *2*#groupmembers• Takeitveryseriously!

    – showyourcreativityandresearcher-side• ThereisanACMSIGMODStudentResearchCompetition

    – Withpublicationand$asreward,andwinnersgotoACM-widecompetition– Separatecategoriesforundergradsandgrads– Deadline:November18,2016(justtheabstract)– http://sigmod2017.org/student-research-competition/

    DukeCS,Fall2016 CompSci516:DataIntensiveComputingSystems 11

  • ProjectTopics• Anythingrelatedto“Data”

    – Datamanagement/processing/cleaning– Datavisualization– Dataexplorationoranalysis– Applicationsofdata(toanyfield)– Theoreticalfindingswithdata– Newtoolfordataanalysis

    • Chooseaprojectaccordingtoyourresearchinterest

    • Youcancheckoutmajordatabaseconferencesforideas,e.g.– Demonstrations (buildaprototypesolvingaproblemorimprovingUI)

    • SIGMOD’16:http://sigmod2016.org/sigmod_demo_list.shtml• VLDB’16:http://vldb2016.persistent.com/demonstrations.php

    – Researchpapers(solveaproblem,doexperimentswithdata)• SIGMOD’16:http://sigmod2016.org/sigmod_research_list.shtmlvldb 16• VLDB’16:http://www.vldb.org/pvldb/vol9.html

    – Youcancheckoutpreviousyearstoo,andconferencesfromyourownresearcharea

    DukeCS,Fall2016 CompSci516:DataIntensiveComputingSystems 12

  • ProjectDeliverables1. Projectproposal(due:9/21,1-3pages)

    – problemselectionispartoftheproject– 3weeksfromnow– butstartasap,lookforproblems,dorelatedworkstudy,findan

    interestingquestion,doafewiterationswithme(trytomeetmeonce),allbythedeadline

    2. Midtermprogressreport(due:10/21,3-5pages)3. Finalprojectreport(due:11/28,4-8pages)4. Afinal~10minsprojectpresentationand/ordemonstration

    (inthelast1-2classes)

    • Thesamedocumentwillbeupdated– Templateandhighlevelideaswillbepostedonsakai soon

    13

  • ProjectEvaluationCriteriaScaleof100:1. Well-motivated?102. Novel?103. Comprehensiverelatedworksurvey?104. Qualityofwriting?10

    – shouldreflectallotherfactorstooexceptclasspresentation5. Classpresentation/demo?15

    – shouldreflectallotherfactorstooexceptwriting6. Technicalcontributions?45

    – Problemformulation/Algorithms/Experiments/Theory/System/Userinterface/Efficiency/Usability/Datasetexplorationetc.

    DukeCS,Fall2016 CompSci516:DataIntensiveComputingSystems 14

  • ReadingMaterial

    • Willmostlyfollowthe”cowbook”byRamakrishnan-Gehrke– Thechapternumberswillbeposted

    • Youdonothavetobuythebooks,butitwillbegoodtoconsultthemfromtimetotime

    • Youshouldbepreparedtodoquiteabitofreadingfromvariousbooksandpapers

    15DukeCS,Fall2016 CompSci516:DataIntensiveComputingSystems

  • Whatisthiscourseabout?

    • Thisisagraduate-leveldatabasecourseinCS

    • Wewillcoverprinciples,internals,andapplicationsofdatabasesystemsindepth

    • Wewillalsohaveanintroductiontoafewadvancedresearchtopicsindatabases(laterinthecourse)

    16DukeCS,Fall2016 CompSci516:DataIntensiveComputingSystems

  • AQuickSurvey• Haveyoutakenanundergraddatabasecourseearlier

    – CS316/equivalent?

    • Areyoufamiliarwith– SQL?– RA?(σ, Π, ´, ⨝, r, È, Ç, -)– Keys, foreign keys?– Indexindatabases?– Logic:∧,∨,∀,∃,¬,∈, =>– Transactions?– Map-reduce/Spark?

    • Haveyoueverworkedwithadataset?– relationaldatabase,text,csv,XML

    • Haveyoueverusedadatabasesystem?– PostGres,MySQL,SQLServer,SQLAzure

    17

  • Whatwillbecovered?• Databaseconcepts

    – DataModels,SQL,Views,Constraints,RA,Normalization

    • Principlesandinternalsofdatabasemanagementsystems(DBMS)– Indexing,QueryExecution-Algorithms-Optimization,Transactions,

    ParallelandDistributedQueryProcessing,MapReduce

    • Advancedandresearchtopicsindatabases– e.g.Datalog,NOSQL,Datamining,Datawarehouse– Morewillbeaddedinthe“TBD”lectures

    • Wewillgofastforsomebasictopicsindatabases– Datamodel,SQL,RA

    18DukeCS,Fall2016 CompSci516:DataIntensiveComputingSystems

  • Background• Youshouldhavesomeunderstanding(attheCS

    undergraduatelevel)– datastructure,discretemaths,algorithms– databases– orhavetolearntheseyourselfasnecessary

    • Needtopickupnewcodingframeworkandprogramminglanguagesonyourown– andhowtoprocessdatausingthem– Homeworkassignmentswillmostlybeself-taught– …withhelpfromtheTA

    • Willinvolvesomemathematicalandanalyticalreasoningtoo

    DukeCS,Fall2016 CompSci516:DataIntensiveComputingSystems 19

  • Whyshouldwecareaboutdatabases?

    • Weareinadata-drivenworld

    • “BigData”issupposedtochangethemodeofoperationforalmosteverysinglefield– Science,Technology,Healthcare,Business,Manufacturing,Journalism,Government,Education,…

    • Wemustknowhowtocollect,store,process,andanalyzesuchdata

    20DukeCS,Fall2016 CompSci516:DataIntensiveComputingSystems

  • Whyshouldwecareaboutdatabases?

    • From“BigData”wiki:“TheLargeHadronColliderexperimentsrepresentabout150millionsensorsdeliveringdata40 milliontimespersecond.Therearenearly600 millioncollisionspersecond.IfallsensordatawererecordedinLHC,….thisisequivalentto500quintillion(5×1020)bytesperday,almost200timesmorethanalltheothersourcescombinedintheworld.”

    21

    Science

    DukeCS,Fall2016 CompSci516:DataIntensiveComputingSystems

  • Whyshouldwecareaboutdatabases?

    • From“BigData”wiki:– eBay.com usestwodatawarehousesat7.5PB(x1012)and40PBaswellasa40PBHadoopclusterforsearch,consumerrecommendations,andmerchandising

    – Facebookhandles50 billionphotosfromitsuserbase– AsofAugust2012,Googlewashandlingroughly100 billionsearchespermonth

    22

    Technology

    DukeCS,Fall2016 CompSci516:DataIntensiveComputingSystems

  • Whyshouldwecareaboutdatabases?

    • From“BigData”wiki:– Healthcare:digitizationofpatient’sdata,prescriptiveanalytics

    – Media:Tailorarticlesandadvertisementsthatreachtargetedpeople,validateclaims

    • “ComputationalJournalism”projectinDukeDBgroup

    – Manufacturing:supplyplanning– Sports:improvetraining,understandingcompetitors

    23

    HealthcareMediaManufacturingSports…..

    DukeCS,Fall2016 CompSci516:DataIntensiveComputingSystems

  • Whyshouldwecareaboutdatabases?

    • Simplystoringsuchlargedatasetsinaflatfilestopsworkingatsomepoint– Needefficientmodel,storage,andprocessing

    • ADBMStakescareofsuchissues– theuseronlyhastorunqueriestoprocesssuchdatasets– muchsimplerthanwritinglowlevelcode

    24DukeCS,Fall2016 CompSci516:DataIntensiveComputingSystems

  • Today

    • DBMS• DataModels

    • [RG]1.1,1.3-1.5

    25DukeCS,Fall2016 CompSci516:DataIntensiveComputingSystems

  • WhatisaDatabase?

    • Adatabaseisacollectionofdata– typicallyrelatedanddescribingactivitiesofanorganization

    • Adatabasemaycontaininformationabout– Entities

    • students,faculty,courses,classroom– Relationshipsbetweenentities

    • students’enrollment,facultyteachingcourses,roomsforcourses

    26DukeCS,Fall2016 CompSci516:DataIntensiveComputingSystems

  • WhyuseaDBMS• i.e.whynotusefilesystemandaprogramminglanguage?

    • Supposeacompanyhasalargecollectionofdataonemployees,departments,products,salesetc.

    • Requirements:– Quicklyanswerquestionsondata

    • Notethatallthedatamaynotfitinmainmemory– Concurrentaccess:applychangesconsistently– Restrictedaccess(e.g.salary)

    27DukeCS,Fall2016 CompSci516:DataIntensiveComputingSystems

  • WhyuseaDBMS?

    • ADBMSisapieceofsoftware(i.e.abigprogramwrittenbysomeoneelse)thatmakesthesetaskseasier– Quickaccess– Robustaccess– Safeaccess– Simpleraccess

    • Next:somenicepropertiesofaDBMS

    28DukeCS,Fall2016 CompSci516:DataIntensiveComputingSystems

  • WhyuseaDBMS?

    1. DataIndependence– Applicationprogramsshouldnotbeexposedtothedata

    representationandstorage– DBMSprovidesanabstractviewofthedata

    2. EfficientDataAccess– ADBMSutilizesavarietyofsophisticatedtechniquesto

    storeandretrievedata(fromdisk)efficiently

    29DukeCS,Fall2016 CompSci516:DataIntensiveComputingSystems

  • WhyuseaDBMS?

    3. DataIntegrityandSecurity– DBMSenforces“integrityconstraints”– e.g.check

    whethertotalsalaryislessthanthebudget– DBMSenforces“accesscontrols”– whethersalary

    informationcanbeaccessesbyaparticularuser

    4. DataAdministration– Centralizedprofessionaldataadministrationby

    experienceduserscanmanagedataaccess,organizedatarepresentationtominimizeredundancy,andfinetunethestorage

    30DukeCS,Fall2016 CompSci516:DataIntensiveComputingSystems

  • WhyuseaDBMS?

    5. ConcurrentAccessandCrashRecovery– DBMSschedulesconcurrentaccessestothedatasuch

    thattheusersthinkthatthedataisbeingaccessedbyonlyoneuseratatime

    – DBMSprotectsdatafromsystemfailures

    6. ReducedApplicationDevelopmentTime– Supportsmanyfunctionsthatarecommontoanumber

    ofapplicationsaccessingdata– Provideshigh-levelinterface– Facilitatesquickandrobustapplicationdevelopment

    31DukeCS,Fall2016 CompSci 516:DataIntensiveComputingSystems

  • WhenNOTtouseaDBMS?• DBMSisoptimizedforcertainkindofworkloadsand

    manipulations

    • Theremaybeapplicationswithtightreal-timeconstraintsorafewwell-definedcriticaloperations

    • AbstractviewofthedataprovidedbyDBMSmaynotsuffice

    • Toruncomplex,statistical/MLanalyticsonlargedatasets

    32DukeCS,Fall2016 CompSci 516:DataIntensiveComputingSystems

  • DataModel• Applicationsneedtomodelsomerealworldunits• Entities:

    – Students,Departments,Courses,Faculty,Organization,Employee,…

    • Relationships:– Courseenrollmentsbystudents,Productsalesbyanorganization

    • Adatamodelisacollectionofhigh-leveldatadescriptionconstructsthathidemanylow-levelstoragedetails

    33DukeCS,Fall2016 CompSci516:DataIntensiveComputingSystems

  • DataModelCanSpecify:

    1. Structureofthedata– likearraysorstructs inaprogramminglanguage– butatahigherlevel(conceptualmodel)

    2. Operationsonthedata– unlikeaprogramminglanguage,notanyoperationcanbeperformed– allowlimitedsetsofqueriesandmodifications– astrength,notaweakness!

    3. Constraintsonthedata– whatthedatacanbe– e.g.amoviehasexactlyonetitle

    34DukeCS,Fall2016 CompSci516:DataIntensiveComputingSystems

  • ImportantDataModels

    • StructuredData• Semi-structuredData• UnstructuredData

    Whatarethese?

    35DukeCS,Fall2016 CompSci516:DataIntensiveComputingSystems

  • ImportantDataModels• StructuredData

    – Allelementshaveafixedformat– RelationalModel(table)

    • Semi-structuredData– Somestructurebutnotfixed– Hierarchicallynestedtagged-elementsintreestructure– XML

    • UnstructuredData– Nostructure– text,image,audio,video

    36DukeCS,Fall2016 CompSci516:DataIntensiveComputingSystems

  • RelationalDataModel

    • ProposedbyEdward(Ted)Codd in1970– wonTuringawardforit!

    • Motivation:– Simplicity– Betterlogicalandphysicaldataindependence

    DukeCS,Fall2016 CompSci516:DataIntensiveComputingSystems 37

  • RelationalDataModel

    • ThedatadescriptionconstructisaRelation– Representedasa“table”– Basicallya“set”ofrecords(setsemantic)

    • orderdoesnotmatter• andallrecordsaredistinct

    • however,itistruefortherelationalmodel,notforstandardDBM– allowduplicaterows(bagsemantic)– unlessrestrictedbykeyconstraints.Why?

    38DukeCS,Fall2016 CompSci516:DataIntensiveComputingSystems

    sid name login age gpa53666 Jones [email protected] 18 3.453688 Smith [email protected] 18 3.253650 Smith [email protected] 19 3.853831 Madayan [email protected] 11 1.853832 Guldu [email protected] 12 2.0

    Students

  • RelationalDataModel

    • ThedatadescriptionconstructisaRelation– Representedasa“table”– Basicallya“set”ofrecords(setsemantic)– orderdoesnotmatter– andallrecordsaredistinct

    • however,itistruefortherelationalmodel,notforstandardDBM– allowduplicaterows(bagsemantic)– unlessrestrictedbykeyconstraints.Why?

    39DukeCS,Fall2016 CompSci516:DataIntensiveComputingSystems

    sid name login age gpa53666 Jones [email protected] 18 3.453688 Smith [email protected] 18 3.253650 Smith [email protected] 19 3.853831 Madayan [email protected] 11 1.853832 Guldu [email protected] 12 2.0

    Students

  • Bagvs.Set

    • Bag:{1,1,2,2,3,2,1,5,6,1} Set:{1,2,3,5,6}• Why“bagsemantic”andnot“setsemantic”instandardDBMSs?

    – Primarilyperformancereasons– Duplicateeliminationisexpensive(requiressorting)– Someoperationslike“projection”s aremuchmoreefficientonbagsthan

    sets

    40DukeCS,Fall2016 CompSci516:DataIntensiveComputingSystems

    sid name login age gpa53666 Jones [email protected] 18 3.453688 Smith [email protected] 18 3.253650 Smith [email protected] 19 3.853831 Madayan [email protected] 11 1.853832 Guldu [email protected] 12 2.0

    Students

  • RelationalDataModel

    41DukeCS,Fall2016 CompSci516:DataIntensiveComputingSystems

    sid name login age gpa53666 Jones [email protected] 18 3.453688 Smith [email protected] 18 3.253650 Smith [email protected] 19 3.853831 Madayan [email protected] 11 1.853832 Guldu [email protected] 12 2.0

    Students Attribute/Column/Field

    Tuple/Row/Record

    Value

    Whatisapoorlychosenattributeinthisrelation?

    • Relationaldatabase=asetofrelations• ARelation:madeupoftwoparts

    1. Schema2. Instance

  • SchemaandInstance• Oneschemacanhavemultipleinstances

    • Schema:– Atemplatefordescribinganentity/relationship(e.g.students)– specifiesnameofrelation+nameandtypeofeachcolumne.g.Students(sid:string,name:string,login:string,age:integer,gpa:real).

    • Instance:– Whenwefillinactualdatavaluesinaschema– atable,hasrowsandcolumns– eachrow/tuplefollowstheschemaanddomainconstraints– #Rows=cardinality,#fields=degree/arity– examplebelow

    DukeCS,Fall2016 CompSci516:DataIntensiveComputingSystems 42

    Cardinality = 3, degree = 5sid name login age gpa

    53666 Jones [email protected] 18 3.4

    53688 Smith [email protected] 18 3.2

    53650 Smith [email protected] 19 3.8

  • RelationalDatabase:Definitions• Relationaldatabase=asetofrelations• Relations:madeupof2parts:

    • Schema– Atemplatefordescribinganentity/relationship(e.g.students)– specifiesnameofrelation+nameandtypeofeachcolumn– Students(sid:string,name:string,login:string,age:integer,gpa:

    real).– Instance :atable,hasrowsandcolumns

    • eachrow/tuplefollowstheschemaanddomainconstraints• #Rows=cardinality,#fields=degree/arity.

    • Canthinkofarelationasaset ofrowsortuples,i.e.,allrowsaredistinct– however,itistruefortherelationalmodel,notforstandardDBMSthat

    allowduplicaterows.Why?DukeCS,Fall2016 CompSci516:DataIntensiveComputingSystems 43

    Cardinality = 3, degree = 5, all rows distinct

  • LevelsofAbstractionsinaDBMS

    • Physicalschema– Storageasfiles,rowvs.

    columnstore,indexes– willdiscussthesein

    laterlectures

    DukeCS,Fall2016 CompSci516:DataIntensiveComputingSystems 44

    Disk

    PhysicalSchema

    LogicalSchema

    ExternalSchema External Schema ExternalSchema

  • LevelsofAbstractionsinaDBMS

    • Logical/Conceptualschema– describesthestoreddatainthe

    physicalschema

    • Decidedbyconceptualschemadesign

    – e.g.ERDiagram• notcoveredinthiscourse

    – Normalization• willbecovered

    Students(sid:string,name:string,login:string,age:integer,gpa:real)

    DukeCS,Fall2016 CompSci516:DataIntensiveComputingSystems 45

    Disk

    PhysicalSchema

    LogicalSchema

    ExternalSchema External Schema ExternalSchema

  • LevelsofAbstractionsinaDBMS

    • Externalschema– different“views”ofthe

    databasetodifferentusers

    – willdiscussviewslater

    • Onephysicalandlogicalschemabuttherecanbemultipleexternalschemas

    DukeCS,Fall2016 CompSci516:DataIntensiveComputingSystems 46

    Disk

    PhysicalSchema

    LogicalSchema

    ExternalSchema External Schema ExternalSchema

  • DataIndependence

    • Applicationprogramsareinsulatedfromchangesinthewaythedataisstructuredandstored

    • AveryimportantpropertyofaDBMS

    • LogicalandPhysical

    DukeCS,Fall2016 CompSci516:DataIntensiveComputingSystems 47

  • LogicalDataIndependence• Userscanbeshieldedfromchangesinthelogical

    structureofdata• e.g.Students:

    Students(sid:string,name:string,login:string,age:integer,gpa:real)• Divideintotworelations

    Students_public(sid:string,name:string,login:string)Students_private(sid:string,age:integer,gpa:real)

    • Stilla“view”Studentscanbeobtainedusingtheabovenewrelations– by“joining”themwithsid

    • AuserwhoqueriesthisviewStudentswillgetthesameanswerasbefore

    DukeCS,Fall2016 CompSci516:DataIntensiveComputingSystems 48

  • PhysicalDataIndependence

    • Thelogical/conceptualschemainsulatesusersfromchangesinphysicalstoragedetails– howthedataisstoredondisk– thefilestructure– thechoiceofindexes

    • Theapplicationremainsunaltered– Buttheperformancemaybeaffectedbysuchchanges

    DukeCS,Fall2016 CompSci516:DataIntensiveComputingSystems 49

  • Semi-structuredDataandXML• XML:ExtensibleMarkupLanguage

    • Willnotbecoveredindetailinclass,butmanydatasetsavailabletodownloadareinthisform– YouwilldownloadtheDBLPdatasetinXMLformatandtransformintorelationalform(inHW1)

    • Datadoesnothaveafixedschema– “Attributes”arepartofthedata– Thedatais“self-describing”– Tree-structured

    DukeCS,Fall2016 CompSci516:DataIntensiveComputingSystems 50

  • XML:Example

  • Attributevs.Elements

    • Elementscanberepeatedandnested• Attributesareuniqueandatomic

    DukeCS,Fall2016 CompSci516:DataIntensiveComputingSystems 52

  • WhyXML?+ Servesasamodelsuitableforintegrationofdatabasescontainingsimilardatawithdifferentschemas

    – e.g.trytointegratetwostudentdatabases:S1(sid,name,gpa)andS2(sid,dept,year)

    – Manynullsifdoneinrelationalmodel,veryeasyinXML

    +Flexible– easytochangetheschemaanddata

    - Makesqueryprocessingmoredifficult

    Whichoneiseasier?• XML(semi-structured)torelational(structured)or• relational(structured)toXML(semi-structured)?

    DukeCS,Fall2016 CompSci516:DataIntensiveComputingSystems 53

  • XMLtoRelationalModel• Problem1:Repeatedattributes

    RamakrishnanGehrkeDatabaseManagementSystemsMcGrawHill

    Whatisagoodrelationalschema?

    DukeCS,Fall2016 CompSci516:DataIntensiveComputingSystems 54

  • XMLtoRelationalModel• Problem1:Repeatedattributes

    RamakrishnanGehrkeDatabaseManagementSystemsMcGrawHill

    DukeCS,Fall2016 CompSci516:DataIntensiveComputingSystems 55

    Title Publisher Author1 Author2

  • XMLtoRelationalModel• Problem1:Repeatedattributes

    Garcia-MolinaUllmanWidomDatabaseSystems– TheCompleteBookPrenticeHall

    DukeCS,Fall2016 CompSci516:DataIntensiveComputingSystems 56

    Title Publisher Author1 Author2

    Doesnotwork

  • XMLtoRelationalModel

    DukeCS,Fall2016 CompSci516:DataIntensiveComputingSystems 57

    BookId Title Publisher

    b1 DatabaseManagementSystems

    McGrawHill

    b2 DatabaseSystems– TheCompleteBook

    PrenticeHall

    BookBookId Author

    b1 Ramakrishnan

    b1 Gehrke

    b2 Garcia-Molina

    b2 Ullman

    b2 Widom

    BookAuthoredBy

  • XMLtoRelationalModel• Problem2:Missingattributes

    RamakrishnanGehrkeDatabaseManagementSystemsMcGrawHillThird

    Garcia-MolinaUllmanWidomDatabaseSystems– TheComplete

    BookPrenticeHall

    DukeCS,Fall2016 CompSci516:DataIntensiveComputingSystems 58

    BookId

    Title Publisher Edition

    b1 DatabaseManagementSystems

    McGrawHill

    Third

    b2 DatabaseSystems–TheCompleteBook

    PrenticeHall

    null

  • Summary• Relationaldatamodelisthemoststandardfordatabasemanagements– semi-structuredmodel/XMLisalsousedinpractice– youwillusetheminhw assignments

    – unstructureddata(text/photo/video)isunavoidable,butwon’tbecoveredinthisclass

    • ADBMSprovidesdataindependenceandinsulatestheapplicationprogrammerfrommanylowleveldetails

    • Wewilllearnaboutthoselowleveldetailsaswellashighleveldatamanagementinthiscourse

    DukeCS,Fall2016 CompSci516:DataIntensiveComputingSystems 59

  • Veryimportant

    UnderstandtheCourse-Policy

    See“whatisallowed/notallowed”

    willberemindedineveryhwassignmenttoo

    DukeCS,Fall2016 CompSci516:DataIntensiveComputingSystems 60