CompSci516 Data Intensive Computing Systems · 2016-11-02

CompSci 516: Data Intensive Computing Systems

Lecture 18: NoSQL and Column Store

Instructor: Sudeepa Roy

Duke CS, Fall 2016 CompSci 516: Data Intensive Computing Systems


Announcements

• HW3 (last HW) has been posted on Sakai
• Same problems as in HW1, but in MongoDB (NOSQL)
• Due two weeks after today’s lecture (~11/16)


Reading Material

NOSQL:
• “Scalable SQL and NoSQL Data Stores”, Rick Cattell, SIGMOD Record, December 2010 (Vol. 39, No. 4)
• see webpage http://cattell.net/datastores/ for updates and more pointers

Column Store:
• D. Abadi, P. Boncz, S. Harizopoulos, S. Idreos and S. Madden. The Design and Implementation of Modern Column-Oriented Database Systems. Foundations and Trends in Databases, vol. 5, no. 3, pp. 197–280, 2012.
• See VLDB 2009 tutorial: http://nms.csail.mit.edu/~stavros/pubs/tutorial2009-column_stores.pdf

Optional:
• “Dynamo: Amazon’s Highly Available Key-value Store”, Giuseppe DeCandia et al., SOSP 2007
• “Bigtable: A Distributed Storage System for Structured Data”, Fay Chang et al., OSDI 2006


NoSQL


So far -- RDBMS

• Relational Data Model
• Relational Database Systems (RDBMS)
• RDBMSs have
  – a complete pre-defined fixed schema
  – a SQL interface
  – and ACID transactions


Today

• NoSQL: “new” database systems
  – not typically RDBMS
  – relax some requirements to gain efficiency and scalability
• New systems choose to use or not use several concepts we have learned so far
  – e.g. System X does not use locks but uses multi-version concurrency control (MVCC), or
  – System Y uses asynchronous replication
• Therefore, it is important to understand the basics (Lectures 1-17) even if they are not used in some new systems!


Warnings!

• Material from Cattell’s paper (2010-11): some info will be outdated
  – see webpage http://cattell.net/datastores/ for updates and more pointers
• We will focus on the basic ideas of NoSQL systems
• Optional reading slides at the end
  – there are also comparison tables in Cattell’s paper if you are interested


OLAP vs. OLTP

• OLTP (OnLine Transaction Processing)
  – Recall transactions!
  – Multiple concurrent read-write requests
  – Commercial applications (banking, online shopping)
  – Data changes frequently
  – ACID properties, concurrency control, recovery
• OLAP (OnLine Analytical Processing)
  – Many aggregate/group-by queries
  – Multidimensional data
  – Data mostly static
  – Will study OLAP Cube soon


New Systems

• We will examine a number of SQL and so-called “NoSQL” systems or “data stores”
• Designed to scale simple OLTP-style application loads
  – to do updates as well as reads
  – in contrast to traditional DBMSs and data warehouses
  – to provide good horizontal scalability for simple read/write database operations distributed over many servers
• Originally motivated by Web 2.0 applications
  – these systems are designed to scale to thousands or millions of users


New Systems vs. RDBMS

• When you study a new system, compare it with RDBMSs on its
  – data model
  – consistency mechanisms
  – storage mechanisms
  – durability guarantees
  – availability
  – query support
• These systems typically sacrifice some of these dimensions
  – e.g. database-wide transaction consistency, in order to achieve others, e.g. higher availability and scalability


NoSQL

• Many of the new systems are referred to as “NoSQL” data stores
• NoSQL stands for “Not Only SQL” or “Not Relational”
  – not entirely agreed upon
• Next: six key features of NoSQL systems


NoSQL: Six Key Features

1. the ability to horizontally scale “simple operations” throughput over many servers
2. the ability to replicate and to distribute (partition) data over many servers
3. a simple call-level interface or protocol (in contrast to a SQL binding)
4. a weaker concurrency model than the ACID transactions of most relational (SQL) database systems
5. efficient use of distributed indexes and RAM for data storage
6. the ability to dynamically add new attributes to data records


Important Examples of New Systems

• Three systems provided a “proof of concept” and inspired many other data stores

1. Memcached
2. Amazon’s Dynamo
3. Google’s BigTable


1. Memcached: main features

• popular open source cache
• supports distributed hashing (later)
• demonstrated that in-memory indexes can be highly scalable, distributing and replicating objects over multiple nodes


2. Dynamo: main features

• pioneered the idea of eventual consistency as a way to achieve higher availability and scalability
• data fetched are not guaranteed to be up-to-date
• but updates are guaranteed to be propagated to all nodes eventually
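
The eventual-consistency behaviour described above can be illustrated with a small toy model. This is only a sketch under stated assumptions: the class names `Replica` and `EventuallyConsistentStore`, the single-replica acknowledgement, and the explicit `propagate()` step are all hypothetical and are not Dynamo's actual protocol.

```python
class Replica:
    """One copy of the data (a hypothetical stand-in for a storage node)."""

    def __init__(self):
        self.data = {}


class EventuallyConsistentStore:
    """Toy model: a write is acknowledged after reaching one replica."""

    def __init__(self, n_replicas=3):
        self.replicas = [Replica() for _ in range(n_replicas)]
        self.pending = []  # updates not yet applied at every replica

    def write(self, key, value):
        # Apply the update at replica 0 and queue it for the others.
        self.replicas[0].data[key] = value
        for replica in self.replicas[1:]:
            self.pending.append((replica, key, value))

    def read(self, key, replica_index):
        # A read served by a lagging replica may be stale (or miss the key).
        return self.replicas[replica_index].data.get(key)

    def propagate(self):
        # Deliver every queued update, in the order it was written.
        for replica, key, value in self.pending:
            replica.data[key] = value
        self.pending.clear()
```

Before `propagate()` runs, a read at replica 2 can miss a value that replica 0 already has; afterwards all replicas agree, which is the "eventually" in eventual consistency.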


3. BigTable: main features

• demonstrated that persistent record storage could be scaled to thousands of nodes


BASE (not ACID ☺)

• Recall ACID for RDBMS, the desired properties of transactions:
  – Atomicity, Consistency, Isolation, and Durability
• NOSQL systems typically do not provide ACID
• Basically Available
• Soft state
• Eventually consistent


ACID vs. BASE

• The idea is that by giving up ACID constraints, one can achieve much higher performance and scalability
• The systems differ in how much they give up
  – e.g. most of the systems call themselves “eventually consistent”, meaning that updates are eventually propagated to all nodes
  – but many of them provide mechanisms for some degree of consistency, such as multi-version concurrency control (MVCC)


“CAP” Theorem

• Eric Brewer’s CAP theorem is often cited for NoSQL
• A system can have only two out of three of the following properties:
  – Consistency
  – Availability
  – Partition-tolerance
• The NoSQL systems generally give up consistency
  – However, the trade-offs are complex


Two foci for NoSQL systems

1. “Simple” operations
2. Horizontal Scalability


1. “Simple” Operations

• Reading or writing a small number of related records in each operation
  – e.g. key lookups
  – reads and writes of one record or a small number of records
• This is in contrast to complex queries, joins, or read-mostly access
• Inspired by the web, where millions of users may both read and write data in simple operations
  – e.g. search and update multi-server databases of electronic mail, personal profiles, web postings, wikis, customer records, online dating records, classified ads, and many other kinds of data


2. Horizontal Scalability

• Shared-Nothing Horizontal Scaling
• The ability to distribute both the data and the load of these simple operations over many servers
  – with no RAM or disk shared among the servers
• Not “vertical” scaling
  – where a database system utilizes many cores and/or CPUs that share RAM and disks
• Some of the systems we describe provide both vertical and horizontal scalability
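
A minimal sketch of shared-nothing distribution, assuming the simplest possible placement rule (hash of the key modulo the number of servers). The names `shard_for` and `ShardedStore` are hypothetical, and each "server" is just a private Python dict so that nothing is shared between them.

```python
import hashlib


def shard_for(key: str, n_servers: int) -> int:
    """Map a key to a server index using a stable hash (assumed rule)."""
    digest = hashlib.md5(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % n_servers


class ShardedStore:
    """Each server holds its own private dict: no shared RAM or disk."""

    def __init__(self, n_servers: int):
        self.servers = [dict() for _ in range(n_servers)]

    def put(self, key, value):
        self.servers[shard_for(key, len(self.servers))][key] = value

    def get(self, key):
        return self.servers[shard_for(key, len(self.servers))].get(key)
```

Every simple read or write touches exactly one server, which is why both data and load spread across the machines.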


2. Horizontal vs. Vertical Scaling

• Effective use of multiple cores (vertical scaling) is important
  – but the number of cores that can share memory is limited
• Horizontal scaling generally is less expensive
  – can use commodity servers
• Note: horizontal and vertical partitioning are not related to horizontal and vertical scaling
  – except that they are both useful for horizontal scaling (Lecture 17)


What is different in NOSQL systems

• When you study a new NOSQL system, notice how it differs from RDBMS in terms of

1. Concurrency Control
2. Data Storage Medium
3. Replication
4. Transactions


Choices in NOSQL systems: 1. Concurrency Control

a) Locks
  – some systems provide one-user-at-a-time read or update locks
  – MongoDB provides locking at a field level
b) MVCC
c) None
  – do not provide atomicity
  – multiple users can edit in parallel
  – no guarantee which version you will read
d) ACID
  – pre-analyze transactions to avoid conflicts
  – no deadlocks and no waits on locks
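
To make option (b) concrete, here is a minimal sketch of multi-version concurrency control: every write appends a new version, and a reader sees the newest version no later than its snapshot. The class `MVCCStore` and its logical clock are hypothetical simplifications; real systems also handle garbage collection and write conflicts, which are omitted here.

```python
class MVCCStore:
    """Toy multi-version store: writers never block readers."""

    def __init__(self):
        self.versions = {}  # key -> list of (commit_ts, value), ascending
        self.clock = 0      # logical commit timestamp

    def write(self, key, value):
        # Append a new version instead of overwriting the old one.
        self.clock += 1
        self.versions.setdefault(key, []).append((self.clock, value))
        return self.clock

    def snapshot(self):
        # A reader remembers the clock value at the time it starts.
        return self.clock

    def read(self, key, snapshot_ts):
        # Return the newest version committed at or before the snapshot.
        result = None
        for commit_ts, value in self.versions.get(key, []):
            if commit_ts > snapshot_ts:
                break
            result = value
        return result
```

A reader that took its snapshot before a later write keeps seeing the older value, without taking any lock.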


Choices in NOSQL systems: 2. Data Storage Medium

a) Storage in RAM
  – snapshots or replication to disk
  – poor performance when data overflows RAM
b) Disk storage
  – caching in RAM


Choices in NOSQL systems: 3. Replication

• whether mirror copies are always in sync
a) Synchronous
b) Asynchronous
  – faster, but updates may be lost in a crash
c) Both
  – local copies synchronously, remote copies asynchronously
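
The trade-off between choices (a) and (b) can be sketched with a toy primary and a single mirror. Everything here is a hypothetical illustration (the class `Primary`, its `backlog`, and the `crash()` method are not taken from any particular system): a synchronous write reaches the mirror before it is acknowledged, while an asynchronous write is acknowledged first and shipped later.

```python
class Primary:
    """Toy primary with one mirror copy (hypothetical model)."""

    def __init__(self, synchronous: bool):
        self.synchronous = synchronous
        self.local = {}
        self.mirror = {}   # stands in for a remote copy
        self.backlog = []  # updates acknowledged but not yet shipped

    def write(self, key, value):
        self.local[key] = value
        if self.synchronous:
            # Update the mirror before acknowledging the write.
            self.mirror[key] = value
        else:
            # Acknowledge immediately; ship the update later.
            self.backlog.append((key, value))

    def flush(self):
        # Ship all queued updates to the mirror.
        for key, value in self.backlog:
            self.mirror[key] = value
        self.backlog.clear()

    def crash(self):
        # The primary and its unshipped backlog are lost; the mirror survives.
        self.backlog.clear()
        return dict(self.mirror)
```

With asynchronous replication, a crash before `flush()` loses the acknowledged update, which is the risk the slide refers to.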


Choices in NOSQL systems: 4. Transaction Mechanisms

a) support
b) do not support
c) in between
  – support local transactions only within a single object or “shard”
  – shard = a horizontal partition of data in a database


Comparison from Cattell’s paper (2011)


Data Model Terminology for NoSQL

• Unlike SQL/RDBMS, the terminology for NoSQL is often inconsistent
  – we are following the notation in Cattell’s paper
• All systems provide a way to store scalar values
  – e.g. numbers and strings
• Some of them also provide a way to store more complex nested or reference values


Data Model Terminology for NoSQL

• The systems all store sets of attribute-value pairs
  – but use four different data structures

1. Tuple
2. Document
3. Extensible Record
4. Object


1. Tuple

• Same as before
• A “tuple” is a row in a relational table
  – attribute names are pre-defined in a schema
  – the values must be scalar
  – the values are referenced by attribute name
  – in contrast to an array or list, where they are referenced by ordinal position


2. Document

• Allows values to be nested documents or lists as well as scalar values
• The attribute names are dynamically defined for each document at runtime
• A document differs from a tuple in that the attributes are not defined in a global schema
  – and a wider range of values are permitted


3. Extensible Record

• A hybrid between a tuple and a document
• families of attributes are defined in a schema
• but new attributes can be added (within an attribute family) on a per-record basis
• Attributes may be list-valued


4. Object

• Analogous to an object in programming languages
  – but without the procedural methods
• Values may be references or nested objects
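
The four data structures can be contrasted side by side in plain Python. This is only an illustrative sketch: the driver/vehicle fields, the family names `core` and `activity`, and the helper functions are made-up examples, and ordinary Python containers stand in for what a real data store would provide.

```python
from collections import namedtuple
from dataclasses import dataclass, field

# 1. Tuple: attribute names fixed in advance by a schema, scalar values only.
Driver = namedtuple("Driver", ["license_no", "name", "birth_year"])
driver_tuple = Driver(license_no="D123", name="Ann", birth_year=1980)

# 2. Document: attributes chosen per document; values may be nested or lists.
driver_doc = {
    "name": "Ann",
    "vehicles": [{"plate": "ABC-1", "year": 2012}],
}

# 3. Extensible record: attribute families come from a schema, but new
#    attributes can be added inside a family on a per-record basis.
ATTRIBUTE_FAMILIES = ("core", "activity")


def new_record():
    return {family: {} for family in ATTRIBUTE_FAMILIES}


def set_attribute(record, family, attribute, value):
    if family not in record:
        raise KeyError(f"unknown attribute family: {family}")
    record[family][attribute] = value


# 4. Object: values may be references to other objects; no methods stored.
@dataclass
class Vehicle:
    plate: str


@dataclass
class Owner:
    name: str
    vehicles: list = field(default_factory=list)
```

Note how the tuple rejects unknown attributes by construction, the document accepts anything, and the extensible record sits in between by checking only the family.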


Data Store Categories

• The data stores are grouped according to their data model
• Key-value Stores:
  – store values and an index to find them
  – based on a programmer-defined key
• Document Stores:
  – store documents
  – the documents are indexed and a simple query mechanism is provided
• Extensible Record Stores:
  – store extensible records that can be partitioned vertically and horizontally across nodes
  – some papers call these “wide column stores”
• Relational Databases:
  – store (and index and query) tuples
  – e.g. the new RDBMSs that provide horizontal scaling


Example NOSQL systems

• Key-value Stores:
  – Project Voldemort, Riak, Redis, Scalaris, Tokyo Cabinet, Memcached/Membrain/Membase
• Document Stores:
  – Amazon SimpleDB, CouchDB, MongoDB, Terrastore
• Extensible Record Stores:
  – HBase, HyperTable, Cassandra, Yahoo’s PNUTS
• Relational Databases:
  – MySQL Cluster, VoltDB, Clustrix, ScaleDB, ScaleBase, NimbusDB, Google Megastore (a layer on BigTable)


Key-value store: 1/2

• The simplest data stores
• data model similar to the memcached distributed in-memory cache
  – with a single key-value index for all the data
  – does not provide secondary indices or keys
• but unlike memcached, generally provide
  – a persistence mechanism
  – additional functionality like replication, versioning, locking, transactions, sorting, etc.
• The client interface provides inserts, deletes, and index lookups


Key-value store: 2/2

• All key-value stores provide scalability through key distribution over nodes
• Voldemort, Riak, Tokyo Cabinet, and enhanced memcached systems can store data in RAM or on disk
  – the others store data in RAM, and provide disk as backup, or rely on replication and recovery so that a backup is not needed
• Scalaris and enhanced memcached systems use synchronous replication
  – the rest use asynchronous
• Scalaris and Tokyo Cabinet implement transactions
  – the others do not
• Voldemort and Riak use multi-version concurrency control
  – the others use locks


Use Case: Key-value store

• if you have a simple application with only one kind of object, and you only need to look up objects based on one attribute
• Suppose you have a web application
  – that does many RDBMS queries to create a tailored page when a user logs in
  – Suppose it takes several seconds to execute those queries, and the user’s data is rarely changed
  – you might want to store the user’s tailored page as a single object in a key-value store
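
The tailored-page scenario can be sketched as a cache lookup in front of the slow queries. This is a hedged illustration: a plain dict stands in for the key-value store client, and `build_page_from_rdbms`, the `page:<id>` key format, and the `stats` counter are hypothetical names introduced only for this example.

```python
cache = {}               # stands in for a key-value store client (assumption)
stats = {"queries": 0}   # counts how often the slow path runs


def build_page_from_rdbms(user_id):
    """Hypothetical placeholder for the several seconds of RDBMS queries."""
    stats["queries"] += 1
    return f"<html>Welcome back, user {user_id}</html>"


def get_tailored_page(user_id):
    """Look the page up by a single key; rebuild it only on a miss."""
    key = f"page:{user_id}"
    page = cache.get(key)
    if page is None:
        page = build_page_from_rdbms(user_id)
        cache[key] = page
    return page


def invalidate(user_id):
    """Drop the stored page when the user's (rarely changed) data changes."""
    cache.pop(f"page:{user_id}", None)
```

Repeated logins by the same user hit the stored object, so the expensive queries run again only after `invalidate` is called.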


Document store: 1/3

• Document stores support more complex data than the key-value stores
• “document store” may be confusing
  – these systems could store “documents” in the traditional sense (articles, Microsoft Word files, etc.)
  – but a document in these systems can be any kind of “pointerless object”
• Unlike the key-value stores, these systems generally support
  – secondary indexes
  – multiple types of documents (objects) per database, and
  – nested documents or lists
• Like other NoSQL systems, the document stores do not provide ACID transactional properties


Document store: 2/3

• The document stores are schema-less, except for
  – attributes (which are simply a name, and are not pre-specified)
  – collections (which are simply a grouping of documents), and
  – indexes defined on collections (explicitly defined, except in SimpleDB)
  – There are some differences in their data models, e.g. SimpleDB does not allow nested documents
• The document stores are very similar but use different terminology
  – e.g. a SimpleDB Domain = CouchDB Database = MongoDB Collection (= Terrastore Bucket)
  – SimpleDB calls documents “items”
  – an attribute is a field in CouchDB, or a key in MongoDB (or Terrastore)


Document store: 3/3

• Unlike the key-value stores, the document stores “typically” provide a mechanism to query collections based on multiple attribute value constraints
• do not provide explicit locks
  – have weaker concurrency and atomicity properties than traditional ACID-compliant databases
• Documents can be distributed over nodes in all of the systems
  – All of the systems can achieve scalability by reading (potentially) out-of-date replicas


Use case: Document Store

• application with multiple different kinds of objects
  – e.g. in a Department of Motor Vehicles application, with vehicles and drivers
• where you need to look up objects based on multiple fields
  – e.g., a driver’s name, license number, owned vehicle, or birth date
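
A lookup on multiple fields, backed by secondary indexes, might look like the following sketch. The `Collection` class and its `find` method are hypothetical and do not mirror the API of MongoDB, CouchDB, or any other real system; indexed fields are assumed to hold hashable scalar values.

```python
from collections import defaultdict


class Collection:
    """Toy schema-less collection with explicit secondary indexes."""

    def __init__(self, indexed_fields):
        self.docs = {}  # doc_id -> document (a dict)
        self.indexes = {f: defaultdict(set) for f in indexed_fields}
        self.next_id = 0

    def insert(self, doc):
        doc_id = self.next_id
        self.next_id += 1
        self.docs[doc_id] = doc
        # Only documents that actually carry the field are indexed.
        for field_name, index in self.indexes.items():
            if field_name in doc:
                index[doc[field_name]].add(doc_id)
        return doc_id

    def find(self, **constraints):
        # Narrow the candidates with any available index ...
        candidates = None
        for field_name, value in constraints.items():
            if field_name in self.indexes:
                ids = self.indexes[field_name].get(value, set())
                candidates = set(ids) if candidates is None else candidates & ids
        if candidates is None:
            candidates = set(self.docs)
        # ... then check every constraint, indexed or not.
        return [
            self.docs[i]
            for i in sorted(candidates)
            if all(self.docs[i].get(f) == v for f, v in constraints.items())
        ]
```

For example, a `drivers` collection indexed on `name` and `license_no` can answer `find(name="Ann", birth_year=1990)` even though documents need not share the same set of attributes.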


Extensible Record Stores: 1/1

• Motivated by Google’s success with BigTable
  – still, the recent extensible record stores cannot come close to BigTable’s scalability
• Basic data model is rows and columns
• Basic scalability model is splitting both rows and columns over multiple nodes
• Rows are split across nodes through sharding on the primary key
  – They typically split by range rather than a hash function
• Columns of a table are distributed over multiple nodes by using “column groups”
  – a way for the customer to indicate which columns are best stored together
• Both horizontal and vertical partitioning can be used simultaneously on the same table
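
Combining range sharding on the row key with column groups can be sketched as below. This is an assumption-laden toy: `RangeShardedTable`, the split points, and the group names are hypothetical, and nested dicts stand in for separate nodes and separate storage files.

```python
import bisect


class RangeShardedTable:
    """Rows split by key range (horizontal), columns by group (vertical)."""

    def __init__(self, split_points, column_groups):
        # Shard i holds row keys in [split_points[i-1], split_points[i]);
        # the first shard has no lower bound, the last no upper bound.
        self.split_points = sorted(split_points)
        self.column_groups = column_groups  # group name -> set of columns
        n_shards = len(self.split_points) + 1
        # storage[shard][group] = {row_key: {column: value}}
        self.storage = [{g: {} for g in column_groups} for _ in range(n_shards)]

    def shard_of(self, row_key):
        return bisect.bisect_right(self.split_points, row_key)

    def _group_of(self, column):
        for group, columns in self.column_groups.items():
            if column in columns:
                return group
        raise KeyError(f"column {column!r} belongs to no column group")

    def put(self, row_key, column, value):
        group = self._group_of(column)
        rows = self.storage[self.shard_of(row_key)][group]
        rows.setdefault(row_key, {})[column] = value

    def get(self, row_key, column):
        group = self._group_of(column)
        rows = self.storage[self.shard_of(row_key)][group]
        return rows.get(row_key, {}).get(column)
```

Because shards are key ranges rather than hash buckets, neighbouring row keys land on the same shard, which keeps range scans local.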


Use case: Extensible Record Store

• use cases similar to those for document stores:
  – multiple kinds of objects, with lookups based on any field
• However, aimed at higher throughput, and may provide stronger concurrency guarantees,
  – at the cost of slightly more complexity than the document stores
• Suppose you are storing customer information for an eBay-style application, and you want to partition your data both horizontally and vertically:
  – cluster customers by country, so that you can efficiently search all of the customers in one country
  – separate the rarely-changed “core” customer information such as customer addresses and email addresses in one place, and
  – put certain frequently-updated customer information (such as current bids in progress) in a different place, to improve performance


Scalable RDBMS: 1/1

• Some RDBMSs are expected to provide scalability comparable with NoSQL data stores
• But, with two provisos:
  – Use small-scope operations: operations that span many nodes, e.g. joins over many tables, will not scale well with sharding
  – Use small-scope transactions: likewise, transactions that span many nodes are going to be very inefficient, with the communication and 2PC overhead
• Typical NOSQL systems make these two impossible
• Scalable RDBMSs allow them, but penalize a customer for these operations
• Scalable RDBMSs have a higher-level SQL language and ACID properties
  – but pay a price when operations span nodes


Use case: Scalable RDBMS

• If your application requires many tables with different types of data
  – a relational schema centralizes and simplifies data definition and SQL simplifies operations
  – or for projects with many programmers
• However, more useful if the application does not require
  – updates or joins that span many nodes
  – transaction coordination
  – or, data movement


Consistent Hashing in DynamoDB


Consistent Hashing (CH)

• Recall dynamic hashing schemes
• If the # of slots (directory size) changes, then almost all keys have to be remapped
• In consistent hashing (CH), with # keys = K and # slots = N, only K/N keys need to be remapped on average
• Applies to the design of Distributed Hash Tables (DHTs) for Uniform Load Distribution
  – partition a keyspace among a set of sites/nodes
  – additionally provide an overlay network that connects nodes such that the nodes responsible for any key can be efficiently located
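
To see why "almost all keys" move when the slot count changes, the sketch below measures the fraction of keys that change slot under a simple `hash(key) mod N` placement. That placement rule is an assumed baseline chosen for illustration (the slides do not name one), and the helper names are hypothetical.

```python
import hashlib


def stable_hash(key: str) -> int:
    """A hash that is stable across runs (unlike Python's built-in hash)."""
    return int.from_bytes(hashlib.md5(key.encode("utf-8")).digest()[:8], "big")


def fraction_remapped(keys, n_before: int, n_after: int) -> float:
    """Fraction of keys whose slot changes under hash(key) mod N placement."""
    moved = sum(
        1 for k in keys
        if stable_hash(k) % n_before != stable_hash(k) % n_after
    )
    return moved / len(keys)


keys = [f"key-{i}" for i in range(10_000)]
print(f"10 -> 11 slots: {fraction_remapped(keys, 10, 11):.1%} of keys move")
```

Going from 10 to 11 slots, a key keeps its slot only when its hash leaves the same remainder modulo 10 and modulo 11, which happens for about 1 key in 11; consistent hashing aims instead to move only about K/N keys.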


DynamoDB: CH 1/2

• [ref. the DynamoDB paper, sec 4.3]
• Must scale incrementally
• Consistent hashing is used to dynamically distribute data around a “ring” of nodes (= sites)
• The output of a hash function is treated as a circular ring
• Each node is assigned a random value in this space
  – represents the “position” on the ring


DynamoDB: CH 2/2

• Data item identified by a key
• Assign to a node by hashing the key to yield its position on the ring
• Walk the ring clockwise to find the first node with a position larger than the item’s position
• Each node is responsible for the region in the ring between it and its predecessor node on the ring
• Note:
  – departure or arrival of a node only affects its immediate neighbor
  – the other nodes remain unaffected
  – K/N on average!
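
The ring walk described above can be sketched with a sorted list of positions and a binary search. This is a simplified illustration, not Dynamo's implementation: the class `HashRing` is hypothetical, and node positions are derived by hashing the node name (so runs are repeatable) rather than drawn at random as on the slide.

```python
import bisect
import hashlib


def ring_position(name: str) -> int:
    """Position on the ring: the first 64 bits of an MD5 digest."""
    return int.from_bytes(hashlib.md5(name.encode("utf-8")).digest()[:8], "big")


class HashRing:
    """Basic consistent hashing with one ring position per node."""

    def __init__(self, nodes=()):
        self._positions = []  # sorted ring positions
        self._owner = {}      # position -> node name
        for node in nodes:
            self.add_node(node)

    def add_node(self, node):
        pos = ring_position(node)
        bisect.insort(self._positions, pos)
        self._owner[pos] = node

    def remove_node(self, node):
        pos = ring_position(node)
        self._positions.remove(pos)
        del self._owner[pos]

    def node_for(self, key):
        # First node clockwise with a position larger than the key's;
        # wrap around to the start of the ring if there is none.
        i = bisect.bisect_right(self._positions, ring_position(key))
        if i == len(self._positions):
            i = 0
        return self._owner[self._positions[i]]
```

When a node joins, the only keys that change owner are the ones it takes over from its clockwise neighbor; every other key stays where it was, and removing the node restores the previous assignment.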


DynamoDB: Challenges in CH

• However, this basic CH algorithm poses some challenges
  1. Random position assignment of each node on the ring leads to non-uniform data and load distribution
  2. The basic algorithm is oblivious to the heterogeneity in the performance of the nodes
• Solution: Dynamo uses a variant of CH
  – Each node gets assigned to multiple points in the ring
  – called “virtual nodes”
  – one node takes care of multiple virtual nodes


DynamoDB: Virtual Nodes

• Using virtual nodes has advantages
  1. If a node becomes unavailable (due to failures or routine maintenance), the load handled by this node is evenly dispersed across the remaining available nodes
  2. When a node becomes available again, or a new node is added to the system, the newly available node accepts a roughly equivalent amount of load from each of the other available nodes
  3. The number of virtual nodes that a node is responsible for can be decided based on its capacity, accounting for heterogeneity in the physical infrastructure
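
A sketch of the virtual-node variant follows, assuming each physical node is placed at several ring points obtained by hashing `"<node>#<i>"`. That labelling scheme, the class `VirtualNodeRing`, and the point counts are hypothetical choices made for illustration only.

```python
import bisect
import hashlib


def ring_position(label: str) -> int:
    """Position on the ring: the first 64 bits of an MD5 digest."""
    return int.from_bytes(hashlib.md5(label.encode("utf-8")).digest()[:8], "big")


class VirtualNodeRing:
    """Consistent hashing where one physical node owns many ring points."""

    def __init__(self):
        self._positions = []  # sorted positions of all virtual nodes
        self._owner = {}      # position -> physical node name

    def add_node(self, node, n_virtual):
        # A higher-capacity node can be given more virtual nodes.
        for i in range(n_virtual):
            pos = ring_position(f"{node}#{i}")
            bisect.insort(self._positions, pos)
            self._owner[pos] = node

    def remove_node(self, node):
        self._positions = [p for p in self._positions if self._owner[p] != node]
        self._owner = {p: n for p, n in self._owner.items() if n != node}

    def points_of(self, node):
        return sum(1 for n in self._owner.values() if n == node)

    def node_for(self, key):
        # First virtual node clockwise from the key; wrap around if needed.
        i = bisect.bisect_right(self._positions, ring_position(key))
        if i == len(self._positions):
            i = 0
        return self._owner[self._positions[i]]
```

Because a departing node's points are scattered around the ring, its keys fall to several different successors instead of a single neighbor, while keys owned by other nodes do not move.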


DynamoDB: Replication

• Dynamo replicates its data on multiple (N) hosts for high availability and durability
• Each key k is assigned to a coordinator which is in charge of replication
  – coordinator handles all keys in its range
• Coordinator replicates each key it is in charge of
  – by storing it locally
  – replicating it at the N-1 clockwise successor nodes in the ring
• Each node is in charge of the region of the ring between it and its N-th predecessor

Example (from the slide figure, N = 3):
• Node B replicates key K at nodes C and D
• Node D will store keys in the ranges (A, B], (B, C], (C, D]
• Note: there may be < N “physical” nodes
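
The example above can be reproduced with a small helper that returns the coordinator and its N-1 clockwise successors. The ring positions (10, 20, ...) and the function name `preference_list` are made up for illustration; the sketch assumes one position per node and N no larger than the number of nodes.

```python
import bisect


def preference_list(ring, key_pos, n):
    """Return the n nodes that store a key hashed to key_pos.

    ring is a list of (position, node) pairs sorted by position. The
    coordinator is the first node at or after key_pos, so node X covers
    the range (predecessor, X]; the other n-1 copies go to its
    clockwise successors.
    """
    positions = [pos for pos, _ in ring]
    start = bisect.bisect_left(positions, key_pos) % len(ring)
    return [ring[(start + i) % len(ring)][1] for i in range(n)]


# Hypothetical ring with nodes A..E at made-up positions.
ring = [(10, "A"), (20, "B"), (30, "C"), (40, "D"), (50, "E")]

# A key K hashed into (A, B] is coordinated by B and replicated at C and D.
print(preference_list(ring, 15, 3))
```

With N = 3, node D appears in the list for every key position in (A, B], (B, C], and (C, D], matching the ranges stated in the example.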


CH History

• Proposed by CS theoreticians from MIT:
  – Karger-Lehman-Leighton-Panigrahy-Levine-Lewin
  – “Consistent Hashing and Random Trees: Distributed Caching Protocols for Relieving Hot Spots on the World Wide Web”
  – STOC 1997
• Consistent hashing gave birth to Akamai Technologies
  – Founded by Danny Lewin and Tom Leighton in 1998
  – Akamai’s content delivery network is one of the largest distributed computing platforms
  – Now market cap $12B and 6200 employees
  – Managing web presence of many major companies
• 2001: The concept of Distributed Hash Table (DHT) is proposed (how to look for a file) and CH was re-purposed
• Now used in Dynamo, Couchbase, Cassandra, Voldemort, Riak, ..


SQLvs.NOSQL

DukeCS,Fall2016 CompSci516:DataIntensiveComputingSystems 58

Argumentsforbothsidesstillacontroversialtopic

Why choose RDBMS over NoSQL: 1/3
1. If new relational systems can do everything that a NoSQL system can, with analogous performance and scalability (?), and with the convenience of transactions and SQL, NoSQL is not needed
2. Relational DBMSs have taken and retained majority market share over other competitors in the past 30 years
  – (network, object, and XML DBMSs)

Why choose RDBMS over NoSQL: 2/3
3. Successful relational DBMSs have been built to handle other specific application loads in the past:
  – read-only or read-mostly data warehousing
  – OLTP on multi-core multi-disk CPUs
  – in-memory databases
  – distributed databases, and
  – now horizontally scaled databases

Why choose RDBMS over NoSQL: 3/3

4. While there is no “one size fits all” in the SQL products themselves, there is a common interface with SQL, transactions, and relational schema that gives advantages in training, continuity, and data interchange

Why choose NoSQL over RDBMS: 1/3
1. We haven’t yet seen good benchmarks showing that RDBMSs can achieve scaling comparable with NoSQL systems like Google’s BigTable
2. If you only require a lookup of objects based on a single key
  – then a key-value store is adequate and probably easier to understand than a relational DBMS
  – Likewise for a document store on a simple application: you only pay the learning curve for the level of complexity you require

Why choose NoSQL over RDBMS: 2/3

3. Some applications require a flexible schema
  – allowing each object in a collection to have different attributes
  – While some RDBMSs allow efficient “packing” of tuples with missing attributes, and some allow adding new attributes at runtime, this is uncommon

Why choose NoSQL over RDBMS: 3/3

4. A relational DBMS makes “expensive” (multi-node multi-table) operations “too easy”
  – NoSQL systems make them impossible or obviously expensive for programmers
5. While RDBMSs have maintained majority market share over the years, other products have established smaller but non-trivial markets in areas where there is a need for particular capabilities
  – e.g. indexed objects with products like BerkeleyDB, or graph-following operations with object-oriented DBMSs

Column Store

Row vs. Column Store

• Row store
  – store all attributes of a tuple together
  – storage like “row-major order” in a matrix
• Column store
  – store all rows for an attribute (column) together
  – storage like “column-major order” in a matrix
• e.g.
  – MonetDB, Vertica (earlier, C-store), SAP/Sybase IQ, Google Bigtable (with column groups)
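As a rough illustration of the two layouts, the sketch below stores the same small table both ways using plain Python lists. The table contents, the schema, and the function names are made up for this example; real column stores add compression, vectorized execution, and on-disk formats that are not modeled here.

```python
schema = ("id", "name", "age", "salary")
rows = [
    (1, "alice", 30, 5200.0),
    (2, "bob", 25, 4100.0),
    (3, "carol", 41, 6100.0),
]

# Row store: all attributes of a tuple are kept together.
row_store = list(rows)

# Column store: all values of one attribute are kept together.
column_store = {
    attr: [row[i] for row in rows] for i, attr in enumerate(schema)
}


def avg_salary_row(store):
    """Aggregate over one attribute; every full tuple is visited."""
    idx = schema.index("salary")
    return sum(row[idx] for row in store) / len(store)


def avg_salary_col(store):
    """Aggregate over one attribute; only the salary column is read."""
    column = store["salary"]
    return sum(column) / len(column)


def fetch_tuple_col(store, i):
    """Rebuild tuple i from a column store by touching every column."""
    return tuple(store[attr][i] for attr in schema)


print(avg_salary_row(row_store), avg_salary_col(column_store))
print(fetch_tuple_col(column_store, 1))
```

The aggregate reads a single list in the column layout, while reconstructing a whole tuple has to visit every column. This is the usual trade-off between analytic scans and single-record access.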

[Slides 67-71: figures only]

Ack: Slides from VLDB 2009 tutorial on Column store

Additional and Optional Slides on MongoDB

(May be useful for HW3)
https://docs.mongodb.com

MongoDB

• MongoDB is an open source document store written in C++
• provides indexes on collections
• lockless
• provides a document query mechanism
• supports automatic sharding
• Replication is mostly used for failover
• does not provide the global consistency of a traditional DBMS
  – but you can get local consistency on the up-to-date primary copy of a document
• supports dynamic queries with automatic use of indices, like RDBMSs
• also supports map-reduce
  – helps complex aggregations across docs
• provides atomic operations on fields

Optional slide: read yourself

MongoDB: Atomic Ops on Fields
• The update command supports “modifiers” that facilitate atomic changes to individual values
  – $set sets a value
  – $inc increments a value
  – $push appends a value to an array
  – $pushAll appends several values to an array (deprecated in later MongoDB versions in favor of $push with $each)
  – $pull removes a value from an array, and $pullAll removes several values from an array
• Since these updates normally occur “in place”, they avoid the overhead of a return trip to the server
• There is an “update if current” convention for changing a document only if field values match a given previous value
• MongoDB supports a findAndModify command to perform an atomic update and immediately return the updated document
  – useful for implementing queues and other data structures requiring atomicity

Optional slide: read yourself
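To make the modifier semantics concrete, here is a toy in-memory emulation in Python. It is not MongoDB and does not use a MongoDB driver: the class `ToyCollection`, its methods, and the `expected` argument are hypothetical names, and a single lock stands in for whatever the server does internally. It only mimics the behaviour described above for $set, $inc, $push, $pull, “update if current”, and findAndModify.

```python
import copy
import threading


def _apply(doc, modifiers):
    """Apply a subset of update modifiers to doc in place."""
    for field, value in modifiers.get("$set", {}).items():
        doc[field] = value
    for field, amount in modifiers.get("$inc", {}).items():
        doc[field] = doc.get(field, 0) + amount
    for field, value in modifiers.get("$push", {}).items():
        doc.setdefault(field, []).append(value)
    for field, value in modifiers.get("$pull", {}).items():
        if field in doc:
            doc[field] = [item for item in doc[field] if item != value]


class ToyCollection:
    """Documents keyed by id; one lock makes each update atomic."""

    def __init__(self):
        self._docs = {}
        self._lock = threading.Lock()

    def insert(self, doc_id, doc):
        with self._lock:
            self._docs[doc_id] = copy.deepcopy(doc)

    def update(self, doc_id, modifiers, expected=None):
        """Apply modifiers; with expected, act only if those fields match.

        Returns True if the update was applied ("update if current").
        """
        with self._lock:
            doc = self._docs[doc_id]
            if expected is not None:
                if any(doc.get(f) != v for f, v in expected.items()):
                    return False
            _apply(doc, modifiers)
            return True

    def find_and_modify(self, doc_id, modifiers):
        """Update and return a copy of the updated document in one step."""
        with self._lock:
            doc = self._docs[doc_id]
            _apply(doc, modifiers)
            return copy.deepcopy(doc)


jobs = ToyCollection()
jobs.insert("q", {"n": 1, "tags": ["a"]})
jobs.update("q", {"$inc": {"n": 2}, "$push": {"tags": "b"}})
print(jobs.update("q", {"$set": {"n": 10}}, expected={"n": 1}))
print(jobs.find_and_modify("q", {"$pull": {"tags": "a"}}))
```

The conditional update is refused because `n` is no longer 1, which is the point of the “update if current” convention: a writer that read a stale value does not overwrite a newer one.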

MongoDB: Index
• MongoDB indices are explicitly defined using an ensureIndex call (named createIndex in MongoDB 3.0 and later)
  – any existing indices are automatically used for query processing
• To find all products released last year (2015) or later costing $100 or less you could write:
• db.products.find({released: {$gte: new Date(2015, 0, 1)}, price: {$lte: 100}})
  – note that months in the JavaScript Date constructor are zero-based, so 0 is January

Optional slide: read yourself

MongoDB: Data

• MongoDB stores data in a binary JSON-like format called BSON
  – BSON supports boolean, integer, float, date, string and binary types
  – MongoDB can also support large binary objects, e.g. images and videos
  – These are stored in chunks that can be streamed back to the client for efficient delivery

Optional slide: read yourself

MongoDB: Replication

• MongoDB supports master-slave replication with automatic failover and recovery
  – Replication (and recovery) is done at the level of shards
  – Replication is asynchronous for higher performance, so some updates may be lost on a crash

Optional slide: read yourself
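The last point, that asynchronous replication can lose acknowledged updates, can be seen in a toy model. The sketch below is a simplified illustration and not MongoDB's replication protocol: the class `AsyncReplicatedShard` and its methods are hypothetical, and replication is modeled as an explicit `ship` step rather than a background process.

```python
class AsyncReplicatedShard:
    """Toy primary/secondary pair with asynchronous replication."""

    def __init__(self):
        self.primary = {}
        self.secondary = {}
        self.unshipped = []  # acknowledged writes not yet on the secondary

    def write(self, key, value):
        """Apply the write on the primary and acknowledge immediately."""
        self.primary[key] = value
        self.unshipped.append((key, value))
        return "ack"

    def ship(self, max_ops=None):
        """Replication step: copy up to max_ops pending writes."""
        count = len(self.unshipped) if max_ops is None else max_ops
        batch = self.unshipped[:count]
        self.unshipped = self.unshipped[count:]
        for key, value in batch:
            self.secondary[key] = value

    def crash_and_fail_over(self):
        """The primary crashes and the secondary takes over.

        Returns the acknowledged writes that never reached the secondary.
        """
        lost = self.unshipped
        self.unshipped = []
        self.primary = dict(self.secondary)
        return lost


shard = AsyncReplicatedShard()
shard.write("a", 1)
shard.write("b", 2)
shard.ship(max_ops=1)   # only the first write reaches the secondary
shard.write("c", 3)
print(shard.crash_and_fail_over())
print(shard.primary)
```

The client received an acknowledgement for all three writes, yet only the one shipped before the crash survives failover. Acknowledging before replication is what buys the higher performance.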