The Hadoop Stack, Part 2 Introduction to HBase
CSE 40822 – Cloud Computing – Fall 2018
Prof. Douglas Thain
University of Notre Dame
Three Case Studies
• Workflow: Pig Latin
  • A dataflow language and execution system that provides an SQL-like way of composing workflows of multiple Map-Reduce jobs.
• Storage: HBase
  • A NoSQL storage system that brings a higher degree of structure to the flat-file nature of HDFS.
• Execution: Spark
  • An in-memory data analysis system that can use Hadoop as a persistence layer, enabling algorithms that are not easily expressed in Map-Reduce.
References
• Fay Chang et al., Bigtable: A Distributed Storage System for Structured Data, OSDI 2006
  • http://research.google.com/archive/bigtable.html
• Amandeep Khurana, Introduction to HBase Schema Design, ;login: Magazine, 2012.
  • https://www.usenix.org/publications/login/october-2012-volume-37-number-5/introduction-hbase-schema-design
• Apache HBase Documentation
  • http://hbase.apache.org
From HDFS to HBase
• HDFS provides us with a filesystem consisting of arbitrarily large files that can only be written once and should be read sequentially, end to end. This works fine for sequential OLAP queries like Pig.
• OLTP workloads want to read and write individual cells in a large table. (e.g. update inventory and price as orders come in.)
• HBase implements OLTP interactions on top of HDFS by using additional storage and memory to organize the tables, and writing them back to HDFS as needed.
• HBase is an open source implementation of the BigTable design published by Google.
HBase Data Model
• A database consists of multiple tables.
• Each table consists of multiple rows, sorted by row key.
• Each row contains a row key and one or more column families.
• Each column family is defined when the table is created.
• Column families can contain multiple columns. (family:column)
• A cell is uniquely identified by (table, row, family:column).
• A cell contains an uninterpreted array of bytes and a timestamp.
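The cell addressing described above can be sketched as a sparse map from coordinates to values. This is only an illustration under stated assumptions: `CellKey` and `ToyStore` are hypothetical names, not HBase classes, and the wall-clock timestamp stands in for whatever versioning a real system uses.

```python
import time
from typing import Dict, Optional, Tuple

# Hypothetical names for illustration only; these are not HBase classes.
CellKey = Tuple[str, str, str]  # (table, row key, "family:column")


class ToyStore:
    """Sparse map from cell coordinates to (bytes value, timestamp)."""

    def __init__(self) -> None:
        self.cells: Dict[CellKey, Tuple[bytes, float]] = {}

    def put(self, table: str, row: str, column: str, value: bytes) -> None:
        # The store never interprets the value; it is just an array of bytes.
        self.cells[(table, row, column)] = (value, time.time())

    def get(self, table: str, row: str, column: str) -> Optional[bytes]:
        cell = self.cells.get((table, row, column))
        return cell[0] if cell is not None else None


store = ToyStore()
store.put("People", "101", "Name:First", b"Florian")
store.put("People", "102", "Home:Phone", b"555-1213")
```

Only the cells that were written occupy any space; a lookup for a cell that was never written, such as `store.get("People", "101", "Home:Phone")`, simply returns `None`.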
Data in Tabular Form
     Name                Home                       Office
Key  First    Last       Phone     Email            Phone     Email
101  Florian  Krepsbach  555-1212  [email protected]  666-1212
102  Marilyn  Tollerud   555-1213                   666-1213
103  Pastor   Inqvist    555-1214  [email protected]
Data in Tabular Form
     Name                          Home             Office           Social
Key  First    Middle    Last       Phone     Email  Phone     Email  FacebookID
101  Florian  Garfield  Krepsbach  555-1212         666-1212
102  Marilyn            Tollerud   555-1213         666-1213
103  Pastor             Inqvist    555-1214

New columns can be added at runtime.
Column families cannot be added at runtime.
Don’t be fooled by the picture: HBase is really a sparse table.
Nested Data Representation
Table People(Name, Home, Office) {
  101: {
    Timestamp: T403;
    Name: {First=“Florian”, Middle=“Garfield”, Last=“Krepsbach”},
    Home: {Phone=“555-1212”, Email=“[email protected]”},
    Office: {Phone=“666-1212”, Email=“[email protected]”}
  },
  102: {
    Timestamp: T593;
    Name: {First=“Marilyn”, Last=“Tollerud”},
    Home: {Phone=“555-1213”},
    Office: {Phone=“666-1213”}
  },
  …
}
Nested Data Representation
GET People:101
  101: {
    Timestamp: T403;
    Name: {First=“Florian”, Middle=“Garfield”, Last=“Krepsbach”},
    Home: {Phone=“555-1212”, Email=“[email protected]”},
    Office: {Phone=“666-1212”, Email=“[email protected]”}
  }
GET People:101:Name
  People:101:Name: {First=“Florian”, Middle=“Garfield”, Last=“Krepsbach”}
GET People:101:Name:First
  People:101:Name:First=“Florian”
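One way to picture the three GET granularities is as a walk down nested dictionaries. The sketch below is an assumption-laden toy, not how HBase stores data: the `tables` dictionary and the `get` helper are hypothetical, and the Email fields are left out.

```python
from typing import Any, Dict

# Toy data modeled on the People example; Email fields are omitted.
people: Dict[str, Dict[str, Any]] = {
    "101": {
        "Timestamp": "T403",
        "Name": {"First": "Florian", "Middle": "Garfield", "Last": "Krepsbach"},
        "Home": {"Phone": "555-1212"},
        "Office": {"Phone": "666-1212"},
    },
    "102": {
        "Timestamp": "T593",
        "Name": {"First": "Marilyn", "Last": "Tollerud"},
        "Home": {"Phone": "555-1213"},
        "Office": {"Phone": "666-1213"},
    },
}
tables = {"People": people}


def get(path: str) -> Any:
    """Resolve 'Table:row', 'Table:row:Family', or 'Table:row:Family:Column'."""
    table, *rest = path.split(":")
    node: Any = tables[table]
    for part in rest:
        node = node[part]  # descend one level per path component
    return node


whole_row = get("People:101")
name_family = get("People:101:Name")
first_name = get("People:101:Name:First")
```

Each extra path component narrows the result: a whole row, then one column family, then a single cell.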
Fundamental Operations
• CREATE table, families
• PUT table, rowid, family:column, value
• PUT table, rowid, whole-row
• GET table, rowid
• SCAN table (WITH filters)
• DROP table
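To make the shape of these operations concrete, here is a minimal in-memory sketch. `ToyHBase` is a hypothetical stand-in and not the HBase client API; it covers the single-cell PUT only, and it assumes families are fixed at CREATE time, as the data model slide states.

```python
from typing import Callable, Dict, Iterator, List, Optional, Tuple

Row = Dict[str, bytes]  # maps "family:column" to an uninterpreted value


class ToyHBase:
    """Hypothetical in-memory stand-in for the operations; not the HBase API."""

    def __init__(self) -> None:
        self.families: Dict[str, List[str]] = {}
        self.rows: Dict[str, Dict[str, Row]] = {}

    def create(self, table: str, families: List[str]) -> None:
        self.families[table] = list(families)
        self.rows[table] = {}

    def put(self, table: str, rowid: str, column: str, value: bytes) -> None:
        family = column.split(":", 1)[0]
        if family not in self.families[table]:
            # Columns may be new, but the family must exist already.
            raise KeyError(f"unknown column family: {family}")
        self.rows[table].setdefault(rowid, {})[column] = value

    def get(self, table: str, rowid: str) -> Optional[Row]:
        return self.rows[table].get(rowid)

    def scan(
        self,
        table: str,
        keep: Optional[Callable[[str, Row], bool]] = None,
    ) -> Iterator[Tuple[str, Row]]:
        # Rows come back sorted by row key; the optional filter drops rows.
        for rowid in sorted(self.rows[table]):
            row = self.rows[table][rowid]
            if keep is None or keep(rowid, row):
                yield rowid, row

    def drop(self, table: str) -> None:
        del self.families[table]
        del self.rows[table]


db = ToyHBase()
db.create("People", ["Name", "Home", "Office"])
db.put("People", "102", "Name:First", b"Marilyn")
db.put("People", "101", "Name:First", b"Florian")
db.put("People", "101", "Home:Phone", b"555-1212")
with_phone = [rowid for rowid, _ in db.scan("People", lambda _, row: "Home:Phone" in row)]
```

Note that a PUT to `Name:Middle` would succeed even though no such column was declared, while a PUT to `Social:FacebookID` would be rejected because the `Social` family was not part of the CREATE.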
Consistency Model
• Atomicity: Entire rows are updated atomically or not at all.
• Consistency:
  • A GET is guaranteed to return a complete row that existed at some point in the table’s history. (Check the timestamp to be sure!)
  • A SCAN must include all data written prior to the scan, and may include updates since it started.
• Isolation: Not guaranteed outside a single row.
• Durability: All successful writes have been made durable on disk.
The Implementation
• I’ll give you the idea in several steps:
  • Idea 1: Put an entire table in one file. (It doesn’t work.)
  • Idea 2: Log + one file. (Better, but doesn’t scale to large data.)
  • Idea 3: Log + one file per column family. (Getting better!)
  • Idea 4: Partition table into regions by key.
• Keep in mind that “one file” means one gigantic file stored in HDFS. We don’t have to worry about the details of how the file is split into blocks.
Idea 1: Put the Table in a Single File
• How do we do the following operations?
  • CREATE
  • DELETE
  • SCAN
  • GET
  • PUT

File “People”:
101: {Timestamp: T403; Name: {First=“Florian”, Middle=“Garfield”, Last=“Krepsbach”}, Home: {Phone=“555-1212”, Email=“[email protected]”}, Office: {Phone=“666-1212”, Email=“[email protected]”}},
102: {Timestamp: T593; Name: {First=“Marilyn”, Last=“Tollerud”}, Home: {Phone=“555-1213”}, Office: {Phone=“666-1213”}},
...
Variable-Length Data is Fundamentally Hard!
Fun reading on the topic: http://www.joelonsoftware.com/articles/fog0000000319.html

SQL Table: People(ID: Integer, FirstName: CHAR[20], LastName: CHAR[20], Phone: CHAR[8])
UPDATE People SET Phone=“555-3434” WHERE ID=403;

ID   FirstName  LastName   Phone
101  Florian    Krepsbach  555-3434
102  Marilyn    Tollerud   555-1213
103  Pastor     Inqvist    555-1214

Each row is exactly 52 bytes long. To move to the next row, just fseek(file,+52); To get to Row 401, fseek(file,401*52); Overwrite the data in place.

HBase Table People(ID, Name, Home, Office)
PUT People, 403, Home:Phone, 555-3434

101: {Timestamp: T403; Name: {First=“Florian”, Middle=“Garfield”, Last=“Krepsbach”}, Home: {Phone=“555-1212”, Email=“[email protected]”}, Office: {Phone=“666-1212”, Email=“[email protected]”}},
102: {Timestamp: T593; Name: {First=“Marilyn”, Last=“Tollerud”}, Home: {Phone=“555-1213”}, Office: {Phone=“666-1213”}},
...

?
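The fixed-width case can be sketched in a few lines, which also shows why the variable-length case has no equally simple answer. The layout below follows the SQL schema on this slide (a 4-byte integer plus 20 + 20 + 8 bytes of text gives 52 bytes); the helper names and the use of an in-memory buffer in place of a local file are assumptions for illustration.

```python
import io
import struct
from typing import BinaryIO, Tuple

# Integer + CHAR[20] + CHAR[20] + CHAR[8], packed with no padding.
RECORD = struct.Struct("<i20s20s8s")
assert RECORD.size == 52


def write_row(f: BinaryIO, index: int, row_id: int, first: str, last: str, phone: str) -> None:
    f.seek(index * RECORD.size)  # jump straight to the row's byte offset
    f.write(RECORD.pack(row_id, first.encode(), last.encode(), phone.encode()))


def read_row(f: BinaryIO, index: int) -> Tuple[int, str, str, str]:
    f.seek(index * RECORD.size)
    row_id, first, last, phone = RECORD.unpack(f.read(RECORD.size))

    def text(raw: bytes) -> str:
        return raw.rstrip(b"\x00").decode()  # strip the fixed-width padding

    return row_id, text(first), text(last), text(phone)


f = io.BytesIO()  # stands in for a file on a local disk
write_row(f, 0, 101, "Florian", "Krepsbach", "555-1212")
write_row(f, 1, 102, "Marilyn", "Tollerud", "555-1213")
write_row(f, 0, 101, "Florian", "Krepsbach", "555-3434")  # overwrite in place
```

The overwrite works only because every row has the same size and the file can be modified in place. A nested record whose length changes on update has no fixed offset to seek to, and HDFS files cannot be rewritten in place at all.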
Idea 2: One Tablet + Transaction Log
Table for People:
101: {Timestamp: T403; Name: {First=“Florian”, Middle=“Garfield”, Last=“Krepsbach”}, Home: {Phone=“555-1212”, Email=“[email protected]”}, Office: {Phone=“666-1212”, Email=“[email protected]”}},
102: {Timestamp: T593; Name: {First=“Marilyn”, Last=“Tollerud”}, Home: {Phone=“555-1213”}, Office: {Phone=“666-1213”}},
...

Transaction Log for Table People:
PUT 101:Office:Phone=“555-3434”
PUT 102:Home:[email protected]
….

Memory Cache for Table People:
101 102

Changes are applied only to the log file, then the resulting record is cached in memory. Reads must consult both memory and disk.

GET People:101
GET People:103
PUT People:101:Office:Phone=“555-3434”
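The write and read paths of Idea 2 can be sketched as follows. This is a toy under stated assumptions: `LoggedTable` is a hypothetical class, plain dictionaries and lists stand in for the table file and log in HDFS, and durability of the log is not modeled.

```python
from typing import Dict, List, Optional, Tuple

Row = Dict[str, str]  # maps "family:column" to a value


class LoggedTable:
    """Hypothetical sketch: read-only base table + append-only log + memory cache."""

    def __init__(self, base: Dict[str, Row]) -> None:
        self.base = base                           # stands in for the table file
        self.log: List[Tuple[str, str, str]] = []  # append-only transaction log
        self.cache: Dict[str, Row] = {}            # rows changed since the base was written

    def put(self, rowid: str, column: str, value: str) -> None:
        self.log.append((rowid, column, value))    # 1. record the change in the log
        if rowid not in self.cache:                # 2. cache the resulting record
            self.cache[rowid] = dict(self.base.get(rowid, {}))
        self.cache[rowid][column] = value

    def get(self, rowid: str) -> Optional[Row]:
        if rowid in self.cache:                    # newest version lives in memory
            return self.cache[rowid]
        return self.base.get(rowid)                # otherwise fall back to the table file


table = LoggedTable({
    "101": {"Office:Phone": "666-1212"},
    "103": {"Home:Phone": "555-1214"},
})
table.put("101", "Office:Phone", "555-3434")
```

After the PUT, a GET of row 101 is answered from the memory cache, a GET of row 103 falls through to the base table, and the base table itself is never modified.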
Idea 2 Requires Periodic Log Compression
Table for People on Disk: (Old)
101: {Timestamp: T403; Name: {First=“Florian”, Middle=“Garfield”, Last=“Krepsbach”}, Home: {Phone=“555-1212”, Email=“[email protected]”}, Office: {Phone=“666-1212”, Email=“[email protected]”}},
102: {Timestamp: T593; Name: {First=“Marilyn”, Last=“Tollerud”}, Home: {Phone=“555-1213”}, Office: {Phone=“666-1213”}},
...

Transaction Log for Table People:
PUT 101:Office:Phone=“555-3434”
PUT 102:Home:[email protected]
….

Table for People on Disk: (New)
101: {Timestamp: T403; Name: {First=“Florian”, Middle=“Garfield”, Last=“Krepsbach”}, Home: {Phone=“555-1212”, Email=“[email protected]”}, Office: {Phone=“555-3434”, Email=“[email protected]”}},
102: {Timestamp: T593; Name: {First=“Marilyn”, Last=“Tollerud”}, Home: {Phone=“555-1213”, Email=“[email protected]”},
...

Write out a new copy of the table, with all of the changes applied. Delete the log and memory cache, and start over.
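The compression step amounts to replaying the log over a copy of the old table. The sketch below assumes a hypothetical `compact` function and represents the table as a dictionary of rows; in the real setting the output would be a fresh file written to HDFS.

```python
from typing import Dict, List, Tuple

Row = Dict[str, str]  # maps "family:column" to a value


def compact(base: Dict[str, Row], log: List[Tuple[str, str, str]]) -> Dict[str, Row]:
    """Return a new table with every logged PUT applied; the old table is untouched."""
    new_base = {rowid: dict(row) for rowid, row in base.items()}
    for rowid, column, value in log:  # replay in log order so later writes win
        new_base.setdefault(rowid, {})[column] = value
    return dict(sorted(new_base.items()))  # keep rows sorted by row key


old_table = {
    "101": {"Office:Phone": "666-1212"},
    "102": {"Home:Phone": "555-1213"},
}
log = [
    ("101", "Office:Phone", "555-3434"),
    ("102", "Office:Phone", "666-1213"),
]
new_table = compact(old_table, log)
log = []  # once the new copy is safely written, the log and cache can be discarded
```

Every compression rewrites the whole table, even when the log touches only a handful of rows, which is why this idea does not scale to large data.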
Idea 3: Partition by Column Family
Tablets for People on Disk: (Old)
• Data for Column Family Name
• Data for Column Family Home
• Data for Column Family Office

Transaction Log for Table People:
PUT 101:Office:Phone=“555-3434”
PUT 102:Home:[email protected]
….

Tablets for People on Disk: (New)
• Data for Column Family Name
• Data for Column Family Home (Changed)
• Data for Column Family Office (Changed)

Write out a new copy of the tablet, with all of the changes applied. Delete the log and memory cache, and start over.
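With one tablet per column family, compression only needs to rewrite the families that the log actually touched. The following is a sketch under assumptions: `compact_by_family` is a hypothetical name, tablets are modeled as nested dictionaries (family, then row, then column), and log entries for a family that does not exist are ignored because families are fixed when the table is created.

```python
from typing import Dict, List, Tuple

Tablet = Dict[str, Dict[str, str]]  # rowid -> column -> value, for one family


def compact_by_family(
    tablets: Dict[str, Tablet],
    log: List[Tuple[str, str, str]],
) -> Dict[str, Tablet]:
    """Rewrite only the tablets whose column family appears in the log."""
    changed = {column.split(":", 1)[0] for _, column, _ in log}
    new_tablets: Dict[str, Tablet] = {}
    for family, tablet in tablets.items():
        if family not in changed:
            new_tablets[family] = tablet  # reuse the old tablet as-is
            continue
        copy = {rowid: dict(cols) for rowid, cols in tablet.items()}
        for rowid, column, value in log:
            log_family, name = column.split(":", 1)
            if log_family == family:
                copy.setdefault(rowid, {})[name] = value
        new_tablets[family] = copy
    return new_tablets


old = {
    "Name": {"101": {"First": "Florian"}, "102": {"First": "Marilyn"}},
    "Home": {"101": {"Phone": "555-1212"}, "102": {"Phone": "555-1213"}},
    "Office": {"101": {"Phone": "666-1212"}, "102": {"Phone": "666-1213"}},
}
log = [("101", "Office:Phone", "555-3434"), ("102", "Home:Phone", "555-9999")]
new = compact_by_family(old, log)
```

In this run the Name tablet is carried over untouched, while Home and Office are written out as new copies, matching the "(Changed)" labels on the slide. The second log entry uses a made-up phone number purely for illustration.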
Idea 4: Split Into Regions
HBase Client to Region Master: Where are the servers for table People?
HBase Client to Region Servers: Access data in tables

Regions, each handled by a Region Server:
• Region 1: Keys 100-200
• Region 2: Keys 200-300
• Region 3: Keys 300-400
• Region 4: Keys 400-500

(Detail of One Region Server)
• Transaction Log
• Memory Cache
• Tablet: Column Family Name
• Tablet: Column Family Home
• Tablet: Column Family Office
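The lookup a client performs before talking to a region server can be sketched as a binary search over region start keys. The server names are hypothetical, and since the key ranges on the slide share their boundaries, this sketch assumes each region covers a half-open range (start inclusive, end exclusive).

```python
import bisect
from typing import List, Tuple

# (start key, server name); region i covers [start_i, start_{i+1}).
# Server names are made up for illustration.
REGIONS: List[Tuple[int, str]] = [
    (100, "server-1"),
    (200, "server-2"),
    (300, "server-3"),
    (400, "server-4"),
]
END_KEY = 500  # exclusive upper bound of the last region
STARTS = [start for start, _ in REGIONS]


def locate(key: int) -> str:
    """Return the server responsible for a row key."""
    if not STARTS[0] <= key < END_KEY:
        raise KeyError(key)
    index = bisect.bisect_right(STARTS, key) - 1  # last region starting at or before key
    return REGIONS[index][1]


server_for_101 = locate(101)
server_for_200 = locate(200)
```

Because rows are sorted by key and regions are contiguous key ranges, the mapping stays small no matter how many rows the table holds, and a client can cache it after asking the master once.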
Table People
                         Column Family Name   Column Family Home   Column Family Office
Region 1 (Keys 101-200)
Region 2 (Keys 201-300)
Region 3 (Keys 301-400)
The consistency model is tightly coupled to the scalability of the implementation!
Consistency Model
• Atomicity: Entire rows are updated atomically or not at all.
• Consistency:
  • A GET is guaranteed to return a complete row that existed at some point in the table’s history. (Check the timestamp to be sure!)
  • A SCAN must include all data written prior to the scan, and may include updates since it started.
• Isolation: Not guaranteed outside a single row.
• Durability: All successful writes have been made durable on disk.
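Row-level atomicity is cheap to provide because a row lives in exactly one place. The sketch below is a toy illustration of that guarantee, not HBase's mechanism: `AtomicRows` is a hypothetical class, and a single lock guards the whole row map for brevity.

```python
import threading
from typing import Dict, Optional

Row = Dict[str, str]  # maps "family:column" to a value


class AtomicRows:
    """Hypothetical sketch: each row update is published as one complete version."""

    def __init__(self) -> None:
        self._rows: Dict[str, Row] = {}
        self._lock = threading.Lock()

    def put_row(self, rowid: str, changes: Row) -> None:
        with self._lock:
            new_row = dict(self._rows.get(rowid, {}))
            new_row.update(changes)
            self._rows[rowid] = new_row  # readers see the old or the new row, never a mix

    def get(self, rowid: str) -> Optional[Row]:
        with self._lock:
            row = self._rows.get(rowid)
        return dict(row) if row is not None else None


store = AtomicRows()
store.put_row("101", {"Home:Phone": "555-1212", "Office:Phone": "666-1212"})
snapshot = store.get("101")
```

Updating two different rows takes two separate `put_row` calls, so a reader can observe the first change without the second. That is the sense in which isolation is not guaranteed outside a single row, and extending the guarantee across rows would require coordination between regions that may sit on different servers.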
Questions for Discussion
• What are the tradeoffs between having a very tall vertical table versus having a very wide horizontal table?
• Suppose a table starts off small, and then a large number of rows are added to the table. What happens and what should the system do?
• Using the People example, come up with some examples of simple queries that are very fast, and some that are very slow. Is there anything that can be done about the slow queries?
• Would it be possible to have SCAN return a consistent snapshot of the system? How would that work?