the hadoop stack, part 2 introducon to hbasedthain/courses/cse40822/... · the hadoop stack, part 2...

23
The Hadoop Stack, Part 2 Introduc5on to HBase CSE 40822 – Cloud Compu0ng – Spring 2016 Prof. Douglas Thain University of Notre Dame

Upload: others

Post on 03-Aug-2020

2 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: The Hadoop Stack, Part 2 Introducon to HBasedthain/courses/cse40822/... · The Hadoop Stack, Part 2 Introducon to HBase CSE 40822 – Cloud Compu0ng – Spring 2016 ... • An in-memory

The Hadoop Stack, Part 2 Introduc5on to HBase

CSE40822–CloudCompu0ng–Spring2016Prof.DouglasThain

UniversityofNotreDame

Page 2: The Hadoop Stack, Part 2 Introducon to HBasedthain/courses/cse40822/... · The Hadoop Stack, Part 2 Introducon to HBase CSE 40822 – Cloud Compu0ng – Spring 2016 ... • An in-memory

Three Case Studies

• Workflow:PigLa0n• Adataflowlanguageandexecu0onsystemthatprovidesanSQL-likewayofcomposingworkflowsofmul0pleMap-Reducejobs.

• Storage:HBase• ANoSQlstoragesystemthatbringsahigherdegreeofstructuretotheflat-filenatureofHDFS.

• Execu0on:Spark• Anin-memorydataanalysissystemthatcanuseHadoopasapersistencelayer,enablingalgorithmsthatarenoteasilyexpressedinMap-Reduce.

Page 3: The Hadoop Stack, Part 2 Introducon to HBasedthain/courses/cse40822/... · The Hadoop Stack, Part 2 Introducon to HBase CSE 40822 – Cloud Compu0ng – Spring 2016 ... • An in-memory

References

•  FayChangetal,Bigtable:ADistributedStorageSystemforStructuredData,OSDI2006•  h[p://research.google.com/archive/bigtable.html

•  Introduc0ontoHbaseSchemaDesign,AmandeepKhurana,Login;Magazine,2012.•  h[ps://www.usenix.org/publica0ons/login/october-2012-volume-37-number-5/introduc0on-hbase-schema-design

• ApacheHBaseDocumenta0on•  h[p://hbase.apache.org

Page 4: The Hadoop Stack, Part 2 Introducon to HBasedthain/courses/cse40822/... · The Hadoop Stack, Part 2 Introducon to HBase CSE 40822 – Cloud Compu0ng – Spring 2016 ... • An in-memory

From HDFS to HBase

• HDFSprovidesuswithafilesystemconsis0ngofarbitrarilylargefilesthatcanonlybewri[enonceandareshouldbereadsequen0ally,endtoend.Thisworksfineforsequen0alOLAPquerieslikePig.• OLTPworkloadswanttoreadandwriteindividualcellsinalargetable.(e.g.updateinventoryandpriceasorderscomein.)• HBaseimplementsOLTPinterac0onsontopofHDFSbyusingaddi0onalstorageandmemorytoorganizethetables,andwri0ngthembacktoHDFSasneeded.• HBaseisanopensourceimplementa0onoftheBigTabledesignpublishedbyGoogle.

Page 5: The Hadoop Stack, Part 2 Introducon to HBasedthain/courses/cse40822/... · The Hadoop Stack, Part 2 Introducon to HBase CSE 40822 – Cloud Compu0ng – Spring 2016 ... • An in-memory

HBase Data Model

• Adatabaseconsistsofmul0pletables.•  Eachtableconsistsofmul0plerows,sortedbyrowkey.•  Eachrowcontainsarowkeyandoneormorecolumnfamilies.•  Eachcolumnfamilyisdefinedwhenthetableiscreated.• Columnfamiliescancontainmul0plecolumns.(family:column)• Acellisuniquelyiden0fiedby(table,row,family:column).• Acellcontainsanuninterpretedarrayofbytesanda0mestamp.

Page 6: The Hadoop Stack, Part 2 Introducon to HBasedthain/courses/cse40822/... · The Hadoop Stack, Part 2 Introducon to HBase CSE 40822 – Cloud Compu0ng – Spring 2016 ... • An in-memory

Data in Tabular Form

Name Home Office

Key First Last Phone Email Phone Email

101 Florian Krepsbach 555-1212 [email protected]

666-1212

[email protected]

102 Marilyn Tollerud 555-1213 666-1213

103 Pastor Inqvist 555-1214 [email protected]

Page 7: The Hadoop Stack, Part 2 Introducon to HBasedthain/courses/cse40822/... · The Hadoop Stack, Part 2 Introducon to HBase CSE 40822 – Cloud Compu0ng – Spring 2016 ... • An in-memory

Data in Tabular Form

Name Home Office Social

Key First Middle Last Phone Email Phone Email FacebookID

101 Florian Garfield Krepsbach 555-1212

[email protected]

666-1212

[email protected]

102 Marilyn Tollerud 555-1213

666-1213

103 Pastor Inqvist 555-1214

[email protected]

Newcolumnscanbeaddedatrun0me.

Columnfamiliescannotbeaddedatrun0me.

Page 8: The Hadoop Stack, Part 2 Introducon to HBasedthain/courses/cse40822/... · The Hadoop Stack, Part 2 Introducon to HBase CSE 40822 – Cloud Compu0ng – Spring 2016 ... • An in-memory

Don’t be fooled by the picture: Hbase is really a sparse table.

Page 9: The Hadoop Stack, Part 2 Introducon to HBasedthain/courses/cse40822/... · The Hadoop Stack, Part 2 Introducon to HBase CSE 40822 – Cloud Compu0ng – Spring 2016 ... • An in-memory

Nested Data Representa5on TablePeople(Name,Home,Office){

101:{ Timestamp:T403; Name:{First=“Florian”,Middle=“Garfield”,Last=“Krepsbach”}, Home:{Phone=“555-1212”,Email=“[email protected]”}, Office:{Phone=“666-1212”,Email=“[email protected]”}},102:{ Timestamp:T593; Name:{First=“Marilyn”,Last=“Tollerud”}, Home:{Phone=“555-1213”}, Office:{Phone=“666-1213”}},…

}

Page 10: The Hadoop Stack, Part 2 Introducon to HBasedthain/courses/cse40822/... · The Hadoop Stack, Part 2 Introducon to HBase CSE 40822 – Cloud Compu0ng – Spring 2016 ... • An in-memory

Nested Data Representa5on GETPeople:101

101:{ Timestamp:T403; Name:{First=“Florian”,Middle=“Garfield”,Last=“Krepsbach”}, Home:{Phone=“555-1212”,Email=“[email protected]”}, Office:{Phone=“666-1212”,Email=“[email protected]”}}

GETPeople:101:Name

People:101:Name:{First=“Florian”,Middle=“Garfield”,Last=“Krepsbach”}GETPeople:101:Name:First

People:101:Name:First=“Florian”

Page 11: The Hadoop Stack, Part 2 Introducon to HBasedthain/courses/cse40822/... · The Hadoop Stack, Part 2 Introducon to HBase CSE 40822 – Cloud Compu0ng – Spring 2016 ... • An in-memory

Fundamental Opera5ons

• CREATEtable,families• PUTtable,rowid,family:column,value• PUTtable,rowid,whole-row• GETtable,rowid•  SCANtable(WITHfilters)• DROPtable

Page 12: The Hadoop Stack, Part 2 Introducon to HBasedthain/courses/cse40822/... · The Hadoop Stack, Part 2 Introducon to HBase CSE 40822 – Cloud Compu0ng – Spring 2016 ... • An in-memory

Consistency Model

• Atomicity:En0rerowsareupdatedatomicallyornotatall.• Consistency:

•  AGETisguaranteedtoreturnacompleterowthatexistedatsomepointinthetable’shistory.(Checkthe0mestamptobesure!)•  ASCANmustincludealldatawri[enpriortothescan,andmayincludeupdatessinceitstarted.

•  Isola0on:Notguaranteedoutsideasinglerow.• Durability:Allsuccessfulwriteshavebeenmadedurableondisk.

Page 13: The Hadoop Stack, Part 2 Introducon to HBasedthain/courses/cse40822/... · The Hadoop Stack, Part 2 Introducon to HBase CSE 40822 – Cloud Compu0ng – Spring 2016 ... • An in-memory

The Implementa5on

•  I’llgiveyoutheideainseveralsteps:•  Idea1:Putanen0retableinonefile.(Itdoesn’twork.)•  Idea2:Log+onefile.(Be[er,butdoesn’tscaletolargedata.)•  Idea3:Log+onefilepercolumnfamily.(Getngbe[er!)•  Idea4:Par00ontableintoregionsbykey.

• Keepinmindthat“onefile”meansonegiganFcfilestoredinHDFS.Wedon’thavetoworryaboutthedetailsofhowthefileissplitintoblocks.

Page 14: The Hadoop Stack, Part 2 Introducon to HBasedthain/courses/cse40822/... · The Hadoop Stack, Part 2 Introducon to HBase CSE 40822 – Cloud Compu0ng – Spring 2016 ... • An in-memory

Idea 1: Put the Table in a Single File

• Howdowedothefollowingopera0ons?•  CREATE•  DELETE•  SCAN•  GET•  PUT

101:{Timestamp:T403;Name:{First=“Florian”,Middle=“Garfield”,Last=“Krepsbach”},Home:{Phone=“555-1212”,Email=“[email protected]”},Office:{Phone=“666-1212”,Email=“[email protected]”}},102:{Timestamp:T593;Name:{First=“Marilyn”,Last=“Tollerud”},Home:{Phone=“555-1213”},Office:{Phone=“666-1213”}},...

File“People”

Page 15: The Hadoop Stack, Part 2 Introducon to HBasedthain/courses/cse40822/... · The Hadoop Stack, Part 2 Introducon to HBase CSE 40822 – Cloud Compu0ng – Spring 2016 ... • An in-memory

Variable-Length Data is Fundamentally Hard!

Funreadingonthetopic:h[p://www.joelonsovware.com/ar0cles/fog0000000319.html

ID FirstName LastName Phone101 Florian Krepsbach 555-3434102 Marilyn Tollerud 555-1213103 Pastor Ingvist 555-1214

101:{Timestamp:T403;Name:{First=“Florian”,Middle=“Garfield”,Last=“Krepsbach”},Home:{Phone=“555-1212”,Email=“[email protected]”},Office:{Phone=“666-1212”,Email=“[email protected]”}},102:{Timestamp:T593;Name:{First=“Marilyn”,Last=“Tollerud”},Home:{Phone=“555-1213”},Office:{Phone=“666-1213”}},...

SQLTable:People(ID:Integer,FirstName:CHAR[20],LastName:Char[20],Phone:CHAR[8])UPDATEPeopleSETPhone=“555-3434”WHEREID=403;

HBaseTablePeople(ID,Name,Home,Office)PUTPeople,403,Home:Phone,555-3434

Eachrowisexactly52byteslong.Tomovetothenextrow,justfseek(file,+52);TogettoRow401,fseek(file,401*52);Overwritethedatainplace.

?

Page 16: The Hadoop Stack, Part 2 Introducon to HBasedthain/courses/cse40822/... · The Hadoop Stack, Part 2 Introducon to HBase CSE 40822 – Cloud Compu0ng – Spring 2016 ... • An in-memory

Idea 2: One Tablet + Transac5on Log

101:{Timestamp:T403;Name:{First=“Florian”,Middle=“Garfield”,Last=“Krepsbach”},Home:{Phone=“555-1212”,Email=“[email protected]”},Office:{Phone=“666-1212”,Email=“[email protected]”}},102:{Timestamp:T593;Name:{First=“Marilyn”,Last=“Tollerud”},Home:{Phone=“555-1213”},Office:{Phone=“666-1213”}},...

TableforPeople

PUT101:Office:Phone=“555-3434”PUT102:Home:[email protected]….

TransacFonLogforTablePeople:

Changesareappliedonlytothelogfile,Thentheresul0ngrecordiscachedinmemory.Readsmustconsultbothmemoryanddisk.

MemoryCacheforTablePeople:

101 102

GETPeople:101 GETPeople:103PUTPeople:101:Office:Phone=“555-3434”

Page 17: The Hadoop Stack, Part 2 Introducon to HBasedthain/courses/cse40822/... · The Hadoop Stack, Part 2 Introducon to HBase CSE 40822 – Cloud Compu0ng – Spring 2016 ... • An in-memory

Idea 2 Requires Periodic Log Compression

101:{Timestamp:T403;Name:{First=“Florian”,Middle=“Garfield”,Last=“Krepsbach”},Home:{Phone=“555-1212”,Email=“[email protected]”},Office:{Phone=“666-1212”,Email=“[email protected]”}},102:{Timestamp:T593;Name:{First=“Marilyn”,Last=“Tollerud”},Home:{Phone=“555-1213”},Office:{Phone=“666-1213”}},...

TableforPeopleonDisk:(Old)

PUT101:Office:Phone=“555-3434”PUT102:Home:[email protected]….

TransacFonLogforTablePeople:

101:{Timestamp:T403;Name:{First=“Florian”,Middle=“Garfield”,Last=“Krepsbach”},Home:{Phone=“555-1212”,Email=“[email protected]”},Office:{Phone=“555-3434”,Email=“[email protected]”}},102:{Timestamp:T593;Name:{First=“Marilyn”,Last=“Tollerud”},Home:{Phone=“555-1213”,Email=“[email protected]”},...

TableforPeopleonDisk:(New)

Writeoutanewcopyofthetable,withallofthechangesapplied.Deletethelogandmemorycache,andstartover.

Page 18: The Hadoop Stack, Part 2 Introducon to HBasedthain/courses/cse40822/... · The Hadoop Stack, Part 2 Introducon to HBase CSE 40822 – Cloud Compu0ng – Spring 2016 ... • An in-memory

Idea 3: Par55on by Column Family

DataforColumnFamilyName

TabletsforPeopleonDisk:(Old)

PUT101:Office:Phone=“555-3434”PUT102:Home:[email protected]….

TransacFonLogforTablePeople:

TabletsforPeopleonDisk:(New)

Writeoutanewcopyofthetablet,withallofthechangesapplied.Deletethelogandmemorycache,andstartover.

DataforColumnFamilyHome

DataforColumnFamilyOffice

DataforColumnFamilyHome(Changed)

DataforColumnFamilyOffice(Changed)

DataforColumnFamilyName

Page 19: The Hadoop Stack, Part 2 Introducon to HBasedthain/courses/cse40822/... · The Hadoop Stack, Part 2 Introducon to HBase CSE 40822 – Cloud Compu0ng – Spring 2016 ... • An in-memory

Idea 4: Split Into Regions

Region1:Keys100-200

Region2:Keys200-300

Region3:Keys300-400

Region4:Keys400-500

RegionServer

RegionMaster

RegionServer

RegionServer

RegionServer

Transac0onLog

MemoryCache

Tablet

Tablet

Tablet

HbaseClient

(DetailofOneRegionServer)

ColumnFamilyName

ColumnFamilyHome

ColumnFamilyOffice

WherearetheserversfortablePeople?

Accessdataintables

Page 20: The Hadoop Stack, Part 2 Introducon to HBasedthain/courses/cse40822/... · The Hadoop Stack, Part 2 Introducon to HBase CSE 40822 – Cloud Compu0ng – Spring 2016 ... • An in-memory

ColumnFamilyName ColumnFamilyHome ColumnFamilyOffice

Region1Keys101-200

Region2Keys201-300

Region3Keys301-400

TablePeople

Page 21: The Hadoop Stack, Part 2 Introducon to HBasedthain/courses/cse40822/... · The Hadoop Stack, Part 2 Introducon to HBase CSE 40822 – Cloud Compu0ng – Spring 2016 ... • An in-memory

The consistency model is 5ghtly coupled to the scalability of the implementa5on!

Page 22: The Hadoop Stack, Part 2 Introducon to HBasedthain/courses/cse40822/... · The Hadoop Stack, Part 2 Introducon to HBase CSE 40822 – Cloud Compu0ng – Spring 2016 ... • An in-memory

Consistency Model

• Atomicity:En0rerowsareupdatedatomicallyornotatall.• Consistency:

•  AGETisguaranteedtoreturnacompleterowthatexistedatsomepointinthetable’shistory.(Checkthe0mestamptobesure!)•  ASCANmustincludealldatawri[enpriortothescan,andmayincludeupdatessinceitstarted.

•  Isola0on:Notguaranteedoutsideasinglerow.• Durability:Allsuccessfulwriteshavebeenmadedurableondisk.

Page 23: The Hadoop Stack, Part 2 Introducon to HBasedthain/courses/cse40822/... · The Hadoop Stack, Part 2 Introducon to HBase CSE 40822 – Cloud Compu0ng – Spring 2016 ... • An in-memory

Ques5ons for Discussion

• Whatarethetradeoffsbetweenhavingaverytallver0caltableversushavingaverywidehorizontaltable?•  Supposeatablestartsoffsmall,andthenalargenumberofrowsareaddedtothetable.Whathappensandwhatshouldthesystemdo?• UsingthePeopleexample,comeupwithsomeexamplesofsimplequeriesthatareveryfast,andsomethatareveryslow.Isthereanythingthatcanbedoneabouttheslowqueries?• WoulditbepossibletohaveSCANreturnaconsistentsnapshotofthesystem?Howwouldthatwork?