CompSci516 Data Intensive Computing Systems · 2016-08-31 · CompSci516 Data Intensive Computing...

60
CompSci 516 Data Intensive Computing Systems Lecture 1 Introduction and Data Models Instructor: Sudeepa Roy 1 Duke CS, Fall 2016 CompSci 516: Data Intensive Computing Systems

Transcript of CompSci516 Data Intensive Computing Systems · 2016-08-31 · CompSci516 Data Intensive Computing...

Page 1: CompSci516 Data Intensive Computing Systems · 2016-08-31 · CompSci516 Data Intensive Computing Systems Lecture 1 Introduction and Data Models Instructor: Sudeepa Roy Duke CS, Fall

CompSci 516DataIntensiveComputingSystems

Lecture1Introduction

andDataModels

Instructor:Sudeepa Roy

1DukeCS,Fall2016 CompSci 516:DataIntensiveComputingSystems

Page 2: CompSci516 Data Intensive Computing Systems · 2016-08-31 · CompSci516 Data Intensive Computing Systems Lecture 1 Introduction and Data Models Instructor: Sudeepa Roy Duke CS, Fall

CourseWebsite

• http://www.cs.duke.edu/courses/fall16/compsci516/

• Pleasecheckfrequentlyforupdates

DukeCS,Fall2016 CompSci 516:DataIntensiveComputingSystems 2

Page 3: CompSci516 Data Intensive Computing Systems · 2016-08-31 · CompSci516 Data Intensive Computing Systems Lecture 1 Introduction and Data Models Instructor: Sudeepa Roy Duke CS, Fall

Instructor• Sudeepa Roy

[email protected]– https://users.cs.duke.edu/~sudeepa/– officehour:Mondays1:30-2:30pm,LSRCD325

• Aboutmyself– AssistantProfessorinCS– PhD:UPenn,Postdoc:Univ.ofWashington– JoinedDukeCSinFall2015– Researchinterests:

• Databases(theoryandapplications)• DataAnalysis,causality,explaininganswers• Uncertaindata,dataprovenance,crowdsourcing

3DukeCS,Fall2016 CompSci516:DataIntensiveComputingSystems

Page 4: CompSci516 Data Intensive Computing Systems · 2016-08-31 · CompSci516 Data Intensive Computing Systems Lecture 1 Introduction and Data Models Instructor: Sudeepa Roy Duke CS, Fall

TA

• Junghoon Kang– [email protected]– officehour:Thursdays1–2pm,NorthN303B

• Noofficehourthisweekandon09/05,Mon(Memorialday)

• AdditionalofficehourbyJungnextweekTuesday1pm– 2pm,NorthN303B

4DukeCS,Fall2016 CompSci516:DataIntensiveComputingSystems

Page 5: CompSci516 Data Intensive Computing Systems · 2016-08-31 · CompSci516 Data Intensive Computing Systems Lecture 1 Introduction and Data Models Instructor: Sudeepa Roy Duke CS, Fall

Logistics

• Homeworksubmission:Sakai– Allenrolledstudentsarealreadythere

• Discussionforum:Piazza– Allenrolledstudentsarealreadythere– SendmeanemailifyouhavenotreceivedawelcomeemailfromPiazza

• Lectureslideswillbeuploadedbeforetheclass– butwillbeupdatedaftertheclass

DukeCS,Fall2016 CompSci516:DataIntensiveComputingSystems 5

Page 6: CompSci516 Data Intensive Computing Systems · 2016-08-31 · CompSci516 Data Intensive Computing Systems Lecture 1 Introduction and Data Models Instructor: Sudeepa Roy Duke CS, Fall

Grading

• ThreeHomework:30%• Project:20%• Midterm:20%• Final:30%

6DukeCS,Fall2016 CompSci 516:DataIntensiveComputingSystems

Page 7: CompSci516 Data Intensive Computing Systems · 2016-08-31 · CompSci516 Data Intensive Computing Systems Lecture 1 Introduction and Data Models Instructor: Sudeepa Roy Duke CS, Fall

GradingStrategy• Relativegrading

– TopperoftheclassgetsA+irrespectiveofthenumber,andallandonly“aboveexpectation”performancesgetA+

– Nolowestgradeorfixeddistribution– soeveryonecangetverygoodgradesbyworkinghard!

– Theactualgradedistributionattheendwilldependontheperformanceoftheentireclassonallthecomponents

– Ifyouareaborderlinecasefortwogrades,yourclassparticipationandeffortthroughoutthesemesterwillputyouinthehighergrade

• Activelyparticipateintheclass!• Askquestionsinclassandonpiazza• Answereachother’squestionsonpiazza• Send(anonymousornot)feedback,suggestions,orconcernsonPiazza

7DukeCS,Fall2016 CompSci 516:DataIntensiveComputingSystems

Page 8: CompSci516 Data Intensive Computing Systems · 2016-08-31 · CompSci516 Data Intensive Computing Systems Lecture 1 Introduction and Data Models Instructor: Sudeepa Roy Duke CS, Fall

Homework• Duein~3weeksaftertheyareposted/previoushw isdue

– 2weeksshouldbeenough– Startearly

• Nolatedays– contacttheinstructorifyouhavea*valid*reasontobelate– Anotherexam,project,hw isnotavalidreason– wewillalwaysbefair

toall– Computercrash/suddeninterviewtrips/medicalissues(following

officialprocedures)maycountasvalidreasons– Noguaranteethatyourrequestwillbegranted– again,startearly!

• Tobedoneindividually

DukeCS,Fall2016 CompSci516:DataIntensiveComputingSystems 8

Page 9: CompSci516 Data Intensive Computing Systems · 2016-08-31 · CompSci516 Data Intensive Computing Systems Lecture 1 Introduction and Data Models Instructor: Sudeepa Roy Duke CS, Fall

HomeworkOverview• Youwilllearnhowtousetraditionalandnewdatabasesystemsinthe

homework– Havetolearnthemmostlyonyourownfollowingtutorialsavailableonline

andwithsomehelpfromtheTA

• HW1andHW2alreadyonsakai!Havefun!– Workonthematyourownpace

• HW1:TraditionalDBMS– SQLandPostgres– Dueon09/16(Fri)

• HW2:Distributeddataprocessing– SparkandAWS– WillreceiveinstructionsforAWS(secondpartofthehw)– Dueon10/12(Wed)

• HW3:NOSQL– e.g.MongoDBorDynamoDB– Willbepostedlater(afterTransactions)

DukeCS,Fall2016 CompSci516:DataIntensiveComputingSystems 9

Page 10: CompSci516 Data Intensive Computing Systems · 2016-08-31 · CompSci516 Data Intensive Computing Systems Lecture 1 Introduction and Data Models Instructor: Sudeepa Roy Duke CS, Fall

Exams

• Midterm– Oct5(Wed)• Final– TBD(byuniv schedule)

• Inclass• Closedbook,closednotes,noelectronicdevices• Totalweight:20+30%=50%• Examswilltestyourunderstandingofthematerial

DukeCS,Fall2016 CompSci516:DataIntensiveComputingSystems 10

Page 11: CompSci516 Data Intensive Computing Systems · 2016-08-31 · CompSci516 Data Intensive Computing Systems Lecture 1 Introduction and Data Models Instructor: Sudeepa Roy Duke CS, Fall

Projects• 20%weight• Ingroupsofatmost3

– YoucanlookforgroupmembersthroughPiazzabyannouncingyourgeneralareaofinterestorifyouhaveaprobleminmind

– Eachgroupmembershoulddoapprox.equalwork

• Workdoneshouldbe(atleast)equivalenttotwoHWs,i.e.– theworkforonehw *2*#groupmembers

• Takeitveryseriously!– showyourcreativityandresearcher-side

• ThereisanACMSIGMODStudentResearchCompetition– Withpublicationand$asreward,andwinnersgotoACM-widecompetition– Separatecategoriesforundergradsandgrads– Deadline:November18,2016(justtheabstract)– http://sigmod2017.org/student-research-competition/

DukeCS,Fall2016 CompSci516:DataIntensiveComputingSystems 11

Page 12: CompSci516 Data Intensive Computing Systems · 2016-08-31 · CompSci516 Data Intensive Computing Systems Lecture 1 Introduction and Data Models Instructor: Sudeepa Roy Duke CS, Fall

ProjectTopics• Anythingrelatedto“Data”

– Datamanagement/processing/cleaning– Datavisualization– Dataexplorationoranalysis– Applicationsofdata(toanyfield)– Theoreticalfindingswithdata– Newtoolfordataanalysis

• Chooseaprojectaccordingtoyourresearchinterest

• Youcancheckoutmajordatabaseconferencesforideas,e.g.– Demonstrations (buildaprototypesolvingaproblemorimprovingUI)

• SIGMOD’16:http://sigmod2016.org/sigmod_demo_list.shtml• VLDB’16:http://vldb2016.persistent.com/demonstrations.php

– Researchpapers(solveaproblem,doexperimentswithdata)• SIGMOD’16:http://sigmod2016.org/sigmod_research_list.shtmlvldb 16• VLDB’16:http://www.vldb.org/pvldb/vol9.html

– Youcancheckoutpreviousyearstoo,andconferencesfromyourownresearcharea

DukeCS,Fall2016 CompSci516:DataIntensiveComputingSystems 12

Page 13: CompSci516 Data Intensive Computing Systems · 2016-08-31 · CompSci516 Data Intensive Computing Systems Lecture 1 Introduction and Data Models Instructor: Sudeepa Roy Duke CS, Fall

ProjectDeliverables1. Projectproposal(due:9/21,1-3pages)

– problemselectionispartoftheproject– 3weeksfromnow– butstartasap,lookforproblems,dorelatedworkstudy,findan

interestingquestion,doafewiterationswithme(trytomeetmeonce),allbythedeadline

2. Midtermprogressreport(due:10/21,3-5pages)3. Finalprojectreport(due:11/28,4-8pages)4. Afinal~10minsprojectpresentationand/ordemonstration

(inthelast1-2classes)

• Thesamedocumentwillbeupdated– Templateandhighlevelideaswillbepostedonsakai soon

13

Page 14: CompSci516 Data Intensive Computing Systems · 2016-08-31 · CompSci516 Data Intensive Computing Systems Lecture 1 Introduction and Data Models Instructor: Sudeepa Roy Duke CS, Fall

ProjectEvaluationCriteriaScaleof100:1. Well-motivated?102. Novel?103. Comprehensiverelatedworksurvey?104. Qualityofwriting?10

– shouldreflectallotherfactorstooexceptclasspresentation

5. Classpresentation/demo?15– shouldreflectallotherfactorstooexceptwriting

6. Technicalcontributions?45– Problemformulation/Algorithms/Experiments/Theory/System/

Userinterface/Efficiency/Usability/Datasetexplorationetc.

DukeCS,Fall2016 CompSci516:DataIntensiveComputingSystems 14

Page 15: CompSci516 Data Intensive Computing Systems · 2016-08-31 · CompSci516 Data Intensive Computing Systems Lecture 1 Introduction and Data Models Instructor: Sudeepa Roy Duke CS, Fall

ReadingMaterial

• Willmostlyfollowthe”cowbook”byRamakrishnan-Gehrke– Thechapternumberswillbeposted

• Youdonothavetobuythebooks,butitwillbegoodtoconsultthemfromtimetotime

• Youshouldbepreparedtodoquiteabitofreadingfromvariousbooksandpapers

15DukeCS,Fall2016 CompSci516:DataIntensiveComputingSystems

Page 16: CompSci516 Data Intensive Computing Systems · 2016-08-31 · CompSci516 Data Intensive Computing Systems Lecture 1 Introduction and Data Models Instructor: Sudeepa Roy Duke CS, Fall

Whatisthiscourseabout?

• Thisisagraduate-leveldatabasecourseinCS

• Wewillcoverprinciples,internals,andapplicationsofdatabasesystemsindepth

• Wewillalsohaveanintroductiontoafewadvancedresearchtopicsindatabases(laterinthecourse)

16DukeCS,Fall2016 CompSci516:DataIntensiveComputingSystems

Page 17: CompSci516 Data Intensive Computing Systems · 2016-08-31 · CompSci516 Data Intensive Computing Systems Lecture 1 Introduction and Data Models Instructor: Sudeepa Roy Duke CS, Fall

AQuickSurvey• Haveyoutakenanundergraddatabasecourseearlier

– CS316/equivalent?

• Areyoufamiliarwith– SQL?– RA?(σ, Π, ´, ⨝, r, È, Ç, -)– Keys, foreign keys?– Indexindatabases?– Logic:∧,∨,∀,∃,¬,∈, =>

– Transactions?– Map-reduce/Spark?

• Haveyoueverworkedwithadataset?– relationaldatabase,text,csv,XML

• Haveyoueverusedadatabasesystem?– PostGres,MySQL,SQLServer,SQLAzure

17

Page 18: CompSci516 Data Intensive Computing Systems · 2016-08-31 · CompSci516 Data Intensive Computing Systems Lecture 1 Introduction and Data Models Instructor: Sudeepa Roy Duke CS, Fall

Whatwillbecovered?• Databaseconcepts

– DataModels,SQL,Views,Constraints,RA,Normalization

• Principlesandinternalsofdatabasemanagementsystems(DBMS)– Indexing,QueryExecution-Algorithms-Optimization,Transactions,

ParallelandDistributedQueryProcessing,MapReduce

• Advancedandresearchtopicsindatabases– e.g.Datalog,NOSQL,Datamining,Datawarehouse– Morewillbeaddedinthe“TBD”lectures

• Wewillgofastforsomebasictopicsindatabases– Datamodel,SQL,RA

18DukeCS,Fall2016 CompSci516:DataIntensiveComputingSystems

Page 19: CompSci516 Data Intensive Computing Systems · 2016-08-31 · CompSci516 Data Intensive Computing Systems Lecture 1 Introduction and Data Models Instructor: Sudeepa Roy Duke CS, Fall

Background• Youshouldhavesomeunderstanding(attheCS

undergraduatelevel)– datastructure,discretemaths,algorithms– databases– orhavetolearntheseyourselfasnecessary

• Needtopickupnewcodingframeworkandprogramminglanguagesonyourown– andhowtoprocessdatausingthem– Homeworkassignmentswillmostlybeself-taught– …withhelpfromtheTA

• Willinvolvesomemathematicalandanalyticalreasoningtoo

DukeCS,Fall2016 CompSci516:DataIntensiveComputingSystems 19

Page 20: CompSci516 Data Intensive Computing Systems · 2016-08-31 · CompSci516 Data Intensive Computing Systems Lecture 1 Introduction and Data Models Instructor: Sudeepa Roy Duke CS, Fall

Whyshouldwecareaboutdatabases?

• Weareinadata-drivenworld

• “BigData”issupposedtochangethemodeofoperationforalmosteverysinglefield– Science,Technology,Healthcare,Business,Manufacturing,Journalism,Government,Education,…

• Wemustknowhowtocollect,store,process,andanalyzesuchdata

20DukeCS,Fall2016 CompSci516:DataIntensiveComputingSystems

Page 21: CompSci516 Data Intensive Computing Systems · 2016-08-31 · CompSci516 Data Intensive Computing Systems Lecture 1 Introduction and Data Models Instructor: Sudeepa Roy Duke CS, Fall

Whyshouldwecareaboutdatabases?

• From“BigData”wiki:“TheLargeHadronColliderexperimentsrepresentabout150millionsensorsdeliveringdata40 milliontimespersecond.Therearenearly600 millioncollisionspersecond.IfallsensordatawererecordedinLHC,….thisisequivalentto500quintillion(5×1020)bytesperday,almost200timesmorethanalltheothersourcescombinedintheworld.”

21

Science

DukeCS,Fall2016 CompSci516:DataIntensiveComputingSystems

Page 22: CompSci516 Data Intensive Computing Systems · 2016-08-31 · CompSci516 Data Intensive Computing Systems Lecture 1 Introduction and Data Models Instructor: Sudeepa Roy Duke CS, Fall

Whyshouldwecareaboutdatabases?

• From“BigData”wiki:– eBay.com usestwodatawarehousesat7.5PB(x1012)and40PBaswellasa40PBHadoopclusterforsearch,consumerrecommendations,andmerchandising

– Facebookhandles50 billionphotosfromitsuserbase– AsofAugust2012,Googlewashandlingroughly100 billionsearchespermonth

22

Technology

DukeCS,Fall2016 CompSci516:DataIntensiveComputingSystems

Page 23: CompSci516 Data Intensive Computing Systems · 2016-08-31 · CompSci516 Data Intensive Computing Systems Lecture 1 Introduction and Data Models Instructor: Sudeepa Roy Duke CS, Fall

Whyshouldwecareaboutdatabases?

• From“BigData”wiki:– Healthcare:digitizationofpatient’sdata,prescriptiveanalytics

– Media:Tailorarticlesandadvertisementsthatreachtargetedpeople,validateclaims

• “ComputationalJournalism”projectinDukeDBgroup

– Manufacturing:supplyplanning– Sports:improvetraining,understandingcompetitors

23

HealthcareMediaManufacturingSports…..

DukeCS,Fall2016 CompSci516:DataIntensiveComputingSystems

Page 24: CompSci516 Data Intensive Computing Systems · 2016-08-31 · CompSci516 Data Intensive Computing Systems Lecture 1 Introduction and Data Models Instructor: Sudeepa Roy Duke CS, Fall

Whyshouldwecareaboutdatabases?

• Simplystoringsuchlargedatasetsinaflatfilestopsworkingatsomepoint– Needefficientmodel,storage,andprocessing

• ADBMStakescareofsuchissues– theuseronlyhastorunqueriestoprocesssuchdatasets– muchsimplerthanwritinglowlevelcode

24DukeCS,Fall2016 CompSci516:DataIntensiveComputingSystems

Page 25: CompSci516 Data Intensive Computing Systems · 2016-08-31 · CompSci516 Data Intensive Computing Systems Lecture 1 Introduction and Data Models Instructor: Sudeepa Roy Duke CS, Fall

Today

• DBMS• DataModels

• [RG]1.1,1.3-1.5

25DukeCS,Fall2016 CompSci516:DataIntensiveComputingSystems

Page 26: CompSci516 Data Intensive Computing Systems · 2016-08-31 · CompSci516 Data Intensive Computing Systems Lecture 1 Introduction and Data Models Instructor: Sudeepa Roy Duke CS, Fall

WhatisaDatabase?

• Adatabaseisacollectionofdata– typicallyrelatedanddescribingactivitiesofanorganization

• Adatabasemaycontaininformationabout– Entities

• students,faculty,courses,classroom

– Relationshipsbetweenentities• students’enrollment,facultyteachingcourses,roomsforcourses

26DukeCS,Fall2016 CompSci516:DataIntensiveComputingSystems

Page 27: CompSci516 Data Intensive Computing Systems · 2016-08-31 · CompSci516 Data Intensive Computing Systems Lecture 1 Introduction and Data Models Instructor: Sudeepa Roy Duke CS, Fall

WhyuseaDBMS• i.e.whynotusefilesystemandaprogramminglanguage?

• Supposeacompanyhasalargecollectionofdataonemployees,departments,products,salesetc.

• Requirements:– Quicklyanswerquestionsondata

• Notethatallthedatamaynotfitinmainmemory– Concurrentaccess:applychangesconsistently– Restrictedaccess(e.g.salary)

27DukeCS,Fall2016 CompSci516:DataIntensiveComputingSystems

Page 28: CompSci516 Data Intensive Computing Systems · 2016-08-31 · CompSci516 Data Intensive Computing Systems Lecture 1 Introduction and Data Models Instructor: Sudeepa Roy Duke CS, Fall

WhyuseaDBMS?

• ADBMSisapieceofsoftware(i.e.abigprogramwrittenbysomeoneelse)thatmakesthesetaskseasier– Quickaccess– Robustaccess– Safeaccess– Simpleraccess

• Next:somenicepropertiesofaDBMS

28DukeCS,Fall2016 CompSci516:DataIntensiveComputingSystems

Page 29: CompSci516 Data Intensive Computing Systems · 2016-08-31 · CompSci516 Data Intensive Computing Systems Lecture 1 Introduction and Data Models Instructor: Sudeepa Roy Duke CS, Fall

WhyuseaDBMS?

1. DataIndependence– Applicationprogramsshouldnotbeexposedtothedata

representationandstorage– DBMSprovidesanabstractviewofthedata

2. EfficientDataAccess– ADBMSutilizesavarietyofsophisticatedtechniquesto

storeandretrievedata(fromdisk)efficiently

29DukeCS,Fall2016 CompSci516:DataIntensiveComputingSystems

Page 30: CompSci516 Data Intensive Computing Systems · 2016-08-31 · CompSci516 Data Intensive Computing Systems Lecture 1 Introduction and Data Models Instructor: Sudeepa Roy Duke CS, Fall

WhyuseaDBMS?

3. DataIntegrityandSecurity– DBMSenforces“integrityconstraints”– e.g.check

whethertotalsalaryislessthanthebudget– DBMSenforces“accesscontrols”– whethersalary

informationcanbeaccessesbyaparticularuser

4. DataAdministration– Centralizedprofessionaldataadministrationby

experienceduserscanmanagedataaccess,organizedatarepresentationtominimizeredundancy,andfinetunethestorage

30DukeCS,Fall2016 CompSci516:DataIntensiveComputingSystems

Page 31: CompSci516 Data Intensive Computing Systems · 2016-08-31 · CompSci516 Data Intensive Computing Systems Lecture 1 Introduction and Data Models Instructor: Sudeepa Roy Duke CS, Fall

WhyuseaDBMS?

5. ConcurrentAccessandCrashRecovery– DBMSschedulesconcurrentaccessestothedatasuch

thattheusersthinkthatthedataisbeingaccessedbyonlyoneuseratatime

– DBMSprotectsdatafromsystemfailures

6. ReducedApplicationDevelopmentTime– Supportsmanyfunctionsthatarecommontoanumber

ofapplicationsaccessingdata– Provideshigh-levelinterface– Facilitatesquickandrobustapplicationdevelopment

31DukeCS,Fall2016 CompSci 516:DataIntensiveComputingSystems

Page 32: CompSci516 Data Intensive Computing Systems · 2016-08-31 · CompSci516 Data Intensive Computing Systems Lecture 1 Introduction and Data Models Instructor: Sudeepa Roy Duke CS, Fall

WhenNOTtouseaDBMS?• DBMSisoptimizedforcertainkindofworkloadsand

manipulations

• Theremaybeapplicationswithtightreal-timeconstraintsorafewwell-definedcriticaloperations

• AbstractviewofthedataprovidedbyDBMSmaynotsuffice

• Toruncomplex,statistical/MLanalyticsonlargedatasets

32DukeCS,Fall2016 CompSci 516:DataIntensiveComputingSystems

Page 33: CompSci516 Data Intensive Computing Systems · 2016-08-31 · CompSci516 Data Intensive Computing Systems Lecture 1 Introduction and Data Models Instructor: Sudeepa Roy Duke CS, Fall

DataModel• Applicationsneedtomodelsomerealworldunits• Entities:

– Students,Departments,Courses,Faculty,Organization,Employee,…

• Relationships:– Courseenrollmentsbystudents,Productsalesbyanorganization

• Adatamodelisacollectionofhigh-leveldatadescriptionconstructsthathidemanylow-levelstoragedetails

33DukeCS,Fall2016 CompSci516:DataIntensiveComputingSystems

Page 34: CompSci516 Data Intensive Computing Systems · 2016-08-31 · CompSci516 Data Intensive Computing Systems Lecture 1 Introduction and Data Models Instructor: Sudeepa Roy Duke CS, Fall

DataModelCanSpecify:

1. Structureofthedata– likearraysorstructs inaprogramminglanguage– butatahigherlevel(conceptualmodel)

2. Operationsonthedata– unlikeaprogramminglanguage,notanyoperationcanbeperformed– allowlimitedsetsofqueriesandmodifications– astrength,notaweakness!

3. Constraintsonthedata– whatthedatacanbe– e.g.amoviehasexactlyonetitle

34DukeCS,Fall2016 CompSci516:DataIntensiveComputingSystems

Page 35: CompSci516 Data Intensive Computing Systems · 2016-08-31 · CompSci516 Data Intensive Computing Systems Lecture 1 Introduction and Data Models Instructor: Sudeepa Roy Duke CS, Fall

ImportantDataModels

• StructuredData• Semi-structuredData• UnstructuredData

Whatarethese?

35DukeCS,Fall2016 CompSci516:DataIntensiveComputingSystems

Page 36: CompSci516 Data Intensive Computing Systems · 2016-08-31 · CompSci516 Data Intensive Computing Systems Lecture 1 Introduction and Data Models Instructor: Sudeepa Roy Duke CS, Fall

ImportantDataModels• StructuredData

– Allelementshaveafixedformat– RelationalModel(table)

• Semi-structuredData– Somestructurebutnotfixed– Hierarchicallynestedtagged-elementsintreestructure– XML

• UnstructuredData– Nostructure– text,image,audio,video

36DukeCS,Fall2016 CompSci516:DataIntensiveComputingSystems

Page 37: CompSci516 Data Intensive Computing Systems · 2016-08-31 · CompSci516 Data Intensive Computing Systems Lecture 1 Introduction and Data Models Instructor: Sudeepa Roy Duke CS, Fall

RelationalDataModel

• ProposedbyEdward(Ted)Codd in1970– wonTuringawardforit!

• Motivation:– Simplicity– Betterlogicalandphysicaldataindependence

DukeCS,Fall2016 CompSci516:DataIntensiveComputingSystems 37

Page 38: CompSci516 Data Intensive Computing Systems · 2016-08-31 · CompSci516 Data Intensive Computing Systems Lecture 1 Introduction and Data Models Instructor: Sudeepa Roy Duke CS, Fall

RelationalDataModel

• ThedatadescriptionconstructisaRelation– Representedasa“table”– Basicallya“set”ofrecords(setsemantic)

• orderdoesnotmatter• andallrecordsaredistinct

• however,itistruefortherelationalmodel,notforstandardDBM– allowduplicaterows(bagsemantic)– unlessrestrictedbykeyconstraints.Why?

38DukeCS,Fall2016 CompSci516:DataIntensiveComputingSystems

sid name login age gpa53666 Jones jones@cs 18 3.453688 Smith smith@ee 18 3.253650 Smith smith1@math 19 3.853831 Madayan madayan@music 11 1.853832 Guldu guldu@music 12 2.0

Students

Page 39: CompSci516 Data Intensive Computing Systems · 2016-08-31 · CompSci516 Data Intensive Computing Systems Lecture 1 Introduction and Data Models Instructor: Sudeepa Roy Duke CS, Fall

RelationalDataModel

• ThedatadescriptionconstructisaRelation– Representedasa“table”– Basicallya“set”ofrecords(setsemantic)– orderdoesnotmatter– andallrecordsaredistinct

• however,itistruefortherelationalmodel,notforstandardDBM– allowduplicaterows(bagsemantic)– unlessrestrictedbykeyconstraints.Why?

39DukeCS,Fall2016 CompSci516:DataIntensiveComputingSystems

sid name login age gpa53666 Jones jones@cs 18 3.453688 Smith smith@ee 18 3.253650 Smith smith1@math 19 3.853831 Madayan madayan@music 11 1.853832 Guldu guldu@music 12 2.0

Students

Page 40: CompSci516 Data Intensive Computing Systems · 2016-08-31 · CompSci516 Data Intensive Computing Systems Lecture 1 Introduction and Data Models Instructor: Sudeepa Roy Duke CS, Fall

Bagvs.Set

• Bag:{1,1,2,2,3,2,1,5,6,1} Set:{1,2,3,5,6}• Why“bagsemantic”andnot“setsemantic”instandardDBMSs?

– Primarilyperformancereasons– Duplicateeliminationisexpensive(requiressorting)– Someoperationslike“projection”s aremuchmoreefficientonbagsthan

sets

40DukeCS,Fall2016 CompSci516:DataIntensiveComputingSystems

sid name login age gpa53666 Jones jones@cs 18 3.453688 Smith smith@ee 18 3.253650 Smith smith1@math 19 3.853831 Madayan madayan@music 11 1.853832 Guldu guldu@music 12 2.0

Students

Page 41: CompSci516 Data Intensive Computing Systems · 2016-08-31 · CompSci516 Data Intensive Computing Systems Lecture 1 Introduction and Data Models Instructor: Sudeepa Roy Duke CS, Fall

RelationalDataModel

41DukeCS,Fall2016 CompSci516:DataIntensiveComputingSystems

sid name login age gpa53666 Jones jones@cs 18 3.453688 Smith smith@ee 18 3.253650 Smith smith1@math 19 3.853831 Madayan madayan@music 11 1.853832 Guldu guldu@music 12 2.0

Students Attribute/Column/Field

Tuple/Row/Record

Value

Whatisapoorlychosenattributeinthisrelation?

• Relationaldatabase=asetofrelations• ARelation:madeupoftwoparts

1. Schema2. Instance

Page 42: CompSci516 Data Intensive Computing Systems · 2016-08-31 · CompSci516 Data Intensive Computing Systems Lecture 1 Introduction and Data Models Instructor: Sudeepa Roy Duke CS, Fall

SchemaandInstance• Oneschemacanhavemultipleinstances

• Schema:– Atemplatefordescribinganentity/relationship(e.g.students)– specifiesnameofrelation+nameandtypeofeachcolumne.g.Students(sid:string,name:string,login:string,age:integer,gpa:real).

• Instance:– Whenwefillinactualdatavaluesinaschema– atable,hasrowsandcolumns– eachrow/tuplefollowstheschemaanddomainconstraints– #Rows=cardinality,#fields=degree/arity– examplebelow

DukeCS,Fall2016 CompSci516:DataIntensiveComputingSystems 42

Cardinality = 3, degree = 5sid name login age gpa

53666 Jones jones@cs 18 3.4

53688 Smith smith@ee 18 3.2

53650 Smith smith1@math 19 3.8

Page 43: CompSci516 Data Intensive Computing Systems · 2016-08-31 · CompSci516 Data Intensive Computing Systems Lecture 1 Introduction and Data Models Instructor: Sudeepa Roy Duke CS, Fall

RelationalDatabase:Definitions• Relationaldatabase=asetofrelations• Relations:madeupof2parts:

• Schema– Atemplatefordescribinganentity/relationship(e.g.students)– specifiesnameofrelation+nameandtypeofeachcolumn– Students(sid:string,name:string,login:string,age:integer,gpa:

real).– Instance :atable,hasrowsandcolumns

• eachrow/tuplefollowstheschemaanddomainconstraints• #Rows=cardinality,#fields=degree/arity.

• Canthinkofarelationasaset ofrowsortuples,i.e.,allrowsaredistinct– however,itistruefortherelationalmodel,notforstandardDBMSthat

allowduplicaterows.Why?DukeCS,Fall2016 CompSci516:DataIntensiveComputingSystems 43

Cardinality = 3, degree = 5, all rows distinct

Page 44: CompSci516 Data Intensive Computing Systems · 2016-08-31 · CompSci516 Data Intensive Computing Systems Lecture 1 Introduction and Data Models Instructor: Sudeepa Roy Duke CS, Fall

LevelsofAbstractionsinaDBMS

• Physicalschema– Storageasfiles,rowvs.

columnstore,indexes– willdiscussthesein

laterlectures

DukeCS,Fall2016 CompSci516:DataIntensiveComputingSystems 44

Disk

PhysicalSchema

LogicalSchema

ExternalSchema External Schema ExternalSchema

Page 45: CompSci516 Data Intensive Computing Systems · 2016-08-31 · CompSci516 Data Intensive Computing Systems Lecture 1 Introduction and Data Models Instructor: Sudeepa Roy Duke CS, Fall

LevelsofAbstractionsinaDBMS

• Logical/Conceptualschema– describesthestoreddatainthe

physicalschema

• Decidedbyconceptualschemadesign

– e.g.ERDiagram• notcoveredinthiscourse

– Normalization• willbecovered

Students(sid:string,name:string,login:string,age:integer,gpa:real)

DukeCS,Fall2016 CompSci516:DataIntensiveComputingSystems 45

Disk

PhysicalSchema

LogicalSchema

ExternalSchema External Schema ExternalSchema

Page 46: CompSci516 Data Intensive Computing Systems · 2016-08-31 · CompSci516 Data Intensive Computing Systems Lecture 1 Introduction and Data Models Instructor: Sudeepa Roy Duke CS, Fall

LevelsofAbstractionsinaDBMS

• Externalschema– different“views”ofthe

databasetodifferentusers

– willdiscussviewslater

• Onephysicalandlogicalschemabuttherecanbemultipleexternalschemas

DukeCS,Fall2016 CompSci516:DataIntensiveComputingSystems 46

Disk

PhysicalSchema

LogicalSchema

ExternalSchema External Schema ExternalSchema

Page 47: CompSci516 Data Intensive Computing Systems · 2016-08-31 · CompSci516 Data Intensive Computing Systems Lecture 1 Introduction and Data Models Instructor: Sudeepa Roy Duke CS, Fall

DataIndependence

• Applicationprogramsareinsulatedfromchangesinthewaythedataisstructuredandstored

• AveryimportantpropertyofaDBMS

• LogicalandPhysical

DukeCS,Fall2016 CompSci516:DataIntensiveComputingSystems 47

Page 48: CompSci516 Data Intensive Computing Systems · 2016-08-31 · CompSci516 Data Intensive Computing Systems Lecture 1 Introduction and Data Models Instructor: Sudeepa Roy Duke CS, Fall

LogicalDataIndependence• Userscanbeshieldedfromchangesinthelogical

structureofdata• e.g.Students:

Students(sid:string,name:string,login:string,age:integer,gpa:real)• Divideintotworelations

Students_public(sid:string,name:string,login:string)Students_private(sid:string,age:integer,gpa:real)

• Stilla“view”Studentscanbeobtainedusingtheabovenewrelations– by“joining”themwithsid

• AuserwhoqueriesthisviewStudentswillgetthesameanswerasbefore

DukeCS,Fall2016 CompSci516:DataIntensiveComputingSystems 48

Page 49: CompSci516 Data Intensive Computing Systems · 2016-08-31 · CompSci516 Data Intensive Computing Systems Lecture 1 Introduction and Data Models Instructor: Sudeepa Roy Duke CS, Fall

PhysicalDataIndependence

• Thelogical/conceptualschemainsulatesusersfromchangesinphysicalstoragedetails– howthedataisstoredondisk– thefilestructure– thechoiceofindexes

• Theapplicationremainsunaltered– Buttheperformancemaybeaffectedbysuchchanges

DukeCS,Fall2016 CompSci516:DataIntensiveComputingSystems 49

Page 50: CompSci516 Data Intensive Computing Systems · 2016-08-31 · CompSci516 Data Intensive Computing Systems Lecture 1 Introduction and Data Models Instructor: Sudeepa Roy Duke CS, Fall

Semi-structuredDataandXML• XML:ExtensibleMarkupLanguage

• Willnotbecoveredindetailinclass,butmanydatasetsavailabletodownloadareinthisform– YouwilldownloadtheDBLPdatasetinXMLformatandtransformintorelationalform(inHW1)

• Datadoesnothaveafixedschema– “Attributes”arepartofthedata– Thedatais“self-describing”– Tree-structured

DukeCS,Fall2016 CompSci516:DataIntensiveComputingSystems 50

Page 51: CompSci516 Data Intensive Computing Systems · 2016-08-31 · CompSci516 Data Intensive Computing Systems Lecture 1 Introduction and Data Models Instructor: Sudeepa Roy Duke CS, Fall

XML:Example<articlemdate="2011-01-11”key="journals/acta/Saxena96">

<author>SanjeevSaxena</author><title>ParallelIntegerSortingandSimulationAmongstCRCW

Models.</title><pages>607-619</pages><year>1996</year><volume>33</volume><journal>Acta Inf.</journal><number>7</number><url>db/journals/acta/acta33.html#Saxena96</url><ee>http://dx.doi.org/10.1007/BF03036466</ee>

</article>

DukeCS,Fall2016 CompSci516:DataIntensiveComputingSystems 51

Attributes

Elements

Page 52: CompSci516 Data Intensive Computing Systems · 2016-08-31 · CompSci516 Data Intensive Computing Systems Lecture 1 Introduction and Data Models Instructor: Sudeepa Roy Duke CS, Fall

Attributevs.Elements

• Elementscanberepeatedandnested• Attributesareuniqueandatomic

DukeCS,Fall2016 CompSci516:DataIntensiveComputingSystems 52

Page 53: CompSci516 Data Intensive Computing Systems · 2016-08-31 · CompSci516 Data Intensive Computing Systems Lecture 1 Introduction and Data Models Instructor: Sudeepa Roy Duke CS, Fall

WhyXML?+ Servesasamodelsuitableforintegrationofdatabasescontainingsimilardatawithdifferentschemas

– e.g.trytointegratetwostudentdatabases:S1(sid,name,gpa)andS2(sid,dept,year)

– Manynullsifdoneinrelationalmodel,veryeasyinXML

+Flexible– easytochangetheschemaanddata

- Makesqueryprocessingmoredifficult

Whichoneiseasier?• XML(semi-structured)torelational(structured)or• relational(structured)toXML(semi-structured)?

DukeCS,Fall2016 CompSci516:DataIntensiveComputingSystems 53

Page 54: CompSci516 Data Intensive Computing Systems · 2016-08-31 · CompSci516 Data Intensive Computing Systems Lecture 1 Introduction and Data Models Instructor: Sudeepa Roy Duke CS, Fall

XMLtoRelationalModel• Problem1:Repeatedattributes<book>

<author>Ramakrishnan</author><author>Gehrke</author><title>DatabaseManagementSystems</title><pubisher>McGrawHill

</book>

Whatisagoodrelationalschema?

DukeCS,Fall2016 CompSci516:DataIntensiveComputingSystems 54

Page 55: CompSci516 Data Intensive Computing Systems · 2016-08-31 · CompSci516 Data Intensive Computing Systems Lecture 1 Introduction and Data Models Instructor: Sudeepa Roy Duke CS, Fall

XMLtoRelationalModel• Problem1:Repeatedattributes<book>

<author>Ramakrishnan</author><author>Gehrke</author><title>DatabaseManagementSystems</title><pubisher>McGrawHill</publisher>

</book>

DukeCS,Fall2016 CompSci516:DataIntensiveComputingSystems 55

Title Publisher Author1 Author2

Page 56: CompSci516 Data Intensive Computing Systems · 2016-08-31 · CompSci516 Data Intensive Computing Systems Lecture 1 Introduction and Data Models Instructor: Sudeepa Roy Duke CS, Fall

XMLtoRelationalModel• Problem1:Repeatedattributes<book>

<author>Garcia-Molina</author><author>Ullman</author><author>Widom</author><title>DatabaseSystems– TheCompleteBook</title><pubisher>PrenticeHall</publisher>

</book>

DukeCS,Fall2016 CompSci516:DataIntensiveComputingSystems 56

Title Publisher Author1 Author2

Doesnotwork

Page 57: CompSci516 Data Intensive Computing Systems · 2016-08-31 · CompSci516 Data Intensive Computing Systems Lecture 1 Introduction and Data Models Instructor: Sudeepa Roy Duke CS, Fall

XMLtoRelationalModel

DukeCS,Fall2016 CompSci516:DataIntensiveComputingSystems 57

BookId Title Publisher

b1 DatabaseManagementSystems

McGrawHill

b2 DatabaseSystems– TheCompleteBook

PrenticeHall

BookBookId Author

b1 Ramakrishnan

b1 Gehrke

b2 Garcia-Molina

b2 Ullman

b2 Widom

BookAuthoredBy

Page 58: CompSci516 Data Intensive Computing Systems · 2016-08-31 · CompSci516 Data Intensive Computing Systems Lecture 1 Introduction and Data Models Instructor: Sudeepa Roy Duke CS, Fall

XMLtoRelationalModel• Problem2:Missingattributes

<book><author>Ramakrishnan</author><author>Gehrke</author><title>DatabaseManagementSystems</title><pubisher>McGrawHill<edition>Third</edition>

</book><book>

<author>Garcia-Molina</author><author>Ullman</author><author>Widom</author><title>DatabaseSystems– TheComplete

Book</title><pubisher>PrenticeHall</publisher>

</book>

DukeCS,Fall2016 CompSci516:DataIntensiveComputingSystems 58

BookId

Title Publisher Edition

b1 DatabaseManagementSystems

McGrawHill

Third

b2 DatabaseSystems–TheCompleteBook

PrenticeHall

null

Page 59: CompSci516 Data Intensive Computing Systems · 2016-08-31 · CompSci516 Data Intensive Computing Systems Lecture 1 Introduction and Data Models Instructor: Sudeepa Roy Duke CS, Fall

Summary• Relationaldatamodelisthemoststandardfordatabasemanagements– semi-structuredmodel/XMLisalsousedinpractice– youwillusetheminhw assignments

– unstructureddata(text/photo/video)isunavoidable,butwon’tbecoveredinthisclass

• ADBMSprovidesdataindependenceandinsulatestheapplicationprogrammerfrommanylowleveldetails

• Wewilllearnaboutthoselowleveldetailsaswellashighleveldatamanagementinthiscourse

DukeCS,Fall2016 CompSci516:DataIntensiveComputingSystems 59

Page 60: CompSci516 Data Intensive Computing Systems · 2016-08-31 · CompSci516 Data Intensive Computing Systems Lecture 1 Introduction and Data Models Instructor: Sudeepa Roy Duke CS, Fall

Veryimportant

UnderstandtheCourse-Policy

See“whatisallowed/notallowed”

willberemindedineveryhwassignmenttoo

DukeCS,Fall2016 CompSci516:DataIntensiveComputingSystems 60