l23: nosql (continued) - northeastern university · • redis-next slides and in our...

56
16 L23: NoSQL (continued) CS3200 Database design (sp18 s2) https://course.ccs.neu.edu/cs3200sp18s2/ 4/9/2018

Upload: others

Post on 27-Jun-2020

6 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: L23: NoSQL (continued) - Northeastern University · • Redis-Next slides and in our Jupyternotebooks • Riak-Focuses on high availability, BASE-“As long as your Riakclient can

16

L23:NoSQL(continued)

CS3200 Databasedesign(sp18 s2)https://course.ccs.neu.edu/cs3200sp18s2/4/9/2018

Page 2: L23: NoSQL (continued) - Northeastern University · • Redis-Next slides and in our Jupyternotebooks • Riak-Focuses on high availability, BASE-“As long as your Riakclient can

17

Announcements!

• Pleasepickupyourexamifyouhavenotyet• HW6:startearly• NoSQL:focusonbigpicture- weseehowthingsfromearlierinclasscometogether

Page 3: L23: NoSQL (continued) - Northeastern University · • Redis-Next slides and in our Jupyternotebooks • Riak-Focuses on high availability, BASE-“As long as your Riakclient can

18

DatabaseReplication

• Datareplication:storingthesamedataonseveralmachines(“nodes”)• Usefulfor:- Availability (parallelrequestsaremadeagainstreplicas)- Reliability(datacansurvivehardwarefaults)- Faulttolerance(systemstaysalivewhennodes/networkfail)

• Typicalarchitecture:master-slave

Replication example in MySQL (dev.mysql.com)

Page 4: L23: NoSQL (continued) - Northeastern University · • Redis-Next slides and in our Jupyternotebooks • Riak-Focuses on high availability, BASE-“As long as your Riakclient can

19

OpenSource

• Freesoftware,sourceprovided- Usershavetherighttouse,modifyanddistributethesoftware- Butrestrictionsmaystillapply,e.g.,adaptationsneedtobeopensource

• Idea:communitydevelopment- Developersfixbugs,addfeatures,...

• Howcanthatwork?- See[Bonaccorsi,Rossi,2003.Whyopensourcesoftwarecansucceed.Researchpolicy,

32(7),pp.1243-1258]

• AmajordriverofOpenSource isApache

Page 5: L23: NoSQL (continued) - Northeastern University · • Redis-Next slides and in our Jupyternotebooks • Riak-Focuses on high availability, BASE-“As long as your Riakclient can

20

ApacheSoftwareFoundation

• Non-profitorganization• Hostscommunitiesofdevelopers- Individualsandsmall/largecompanies

• Producesopen-sourcesoftware• Fundingfromgrantsandcontributions• Hostsverysignificantprojects- ApacheWebServer,Hadoop,Zookeeper,Cassandra,Lucene,OpenOffice,Struts,Tomcat,Subversion,Tcl,UIMA,...

Page 6: L23: NoSQL (continued) - Northeastern University · • Redis-Next slides and in our Jupyternotebooks • Riak-Focuses on high availability, BASE-“As long as your Riakclient can

21

WeWillLookat4DataModels

Column-Family Store

Key/Value Store

Document Store Graph Databases

Source:BennyKimelfeld

Page 7: L23: NoSQL (continued) - Northeastern University · • Redis-Next slides and in our Jupyternotebooks • Riak-Focuses on high availability, BASE-“As long as your Riakclient can

22

Databaseenginesrankingby"popularity"

Source:https://db-engines.com/en/ranking ,4/2018

Page 8: L23: NoSQL (continued) - Northeastern University · • Redis-Next slides and in our Jupyternotebooks • Riak-Focuses on high availability, BASE-“As long as your Riakclient can

23

Databaseenginesrankingby"popularity"

Source:https://db-engines.com/en/ranking_trend ,4/2018

Page 9: L23: NoSQL (continued) - Northeastern University · • Redis-Next slides and in our Jupyternotebooks • Riak-Focuses on high availability, BASE-“As long as your Riakclient can

24

HighlightedDatabaseFeatures

• Datamodel- Whatdataisbeingstored?

• CRUD interface- APIforCreate,Read,Update,Delete

• 4basicfunctionsofpersistentstorage(insert,select,update,delete)

- SometimesprecedingSforSearch

• Transactionconsistency guarantees

• Replication andsharding model- What’sautomatedandwhat’smanual?

Page 10: L23: NoSQL (continued) - Northeastern University · • Redis-Next slides and in our Jupyternotebooks • Riak-Focuses on high availability, BASE-“As long as your Riakclient can

25

TrueandFalseConceptions

• True:- SQLdoesnoteffectivelyhandlecommonWebneedsofmassive(datacenter)data

- SQLhasguaranteesthatcansometimesbecompromisedforthesakeofscaling

- Joinsarenotforfree,sometimesundoable

• False:- NoSQL saysNOtoSQL- NowadaysNoSQL istheonlywaytogo- Joinscanalwaysbeavoidedbystructureredesign

Page 11: L23: NoSQL (continued) - Northeastern University · • Redis-Next slides and in our Jupyternotebooks • Riak-Focuses on high availability, BASE-“As long as your Riakclient can

26

• Introduction• TransactionConsistency• 4maindatamodels﹣ Key-ValueStores(e.g.,Redis)﹣ Column-FamilyStores(e.g.,Cassandra)﹣ DocumentStores(e.g.,MongoDB)﹣ GraphDatabases(e.g.,Neo4j)

• ConcludingRemarks

Outline

Page 12: L23: NoSQL (continued) - Northeastern University · • Redis-Next slides and in our Jupyternotebooks • Riak-Focuses on high availability, BASE-“As long as your Riakclient can

27

Transaction

• Asequenceofoperations(overdata)viewedasasinglehigher-leveloperation- Transfermoneyfromaccount1toaccount2

• DBMSsexecutetransactionsinparallel- Noproblemapplyingtwo“disjoint”transactions- Butwhatiftherearedependencies?

• Transactionscaneithercommit (succeed)orabort (fail)- Failureduetoviolationofprogramlogic,networkfailures,credit-cardrejection,etc.

• DBMSshouldnotexpecttransactionstosucceed

Page 13: L23: NoSQL (continued) - Northeastern University · • Redis-Next slides and in our Jupyternotebooks • Riak-Focuses on high availability, BASE-“As long as your Riakclient can

28

ExamplesofTransactions

• Airlineticketing- Verifythattheseatisvacant,withthepricequoted,thenchargecreditcard,thenreserve

• Onlinepurchasing- Similar

• “Transactionalfilesystems”(MSNTFS)- Movingafilefromonedirectorytoanother:verifyfileexists,copy,delete

• Textbookexample:bankmoneytransfer- Readfromacct#1,verifyfunds,updateacct#1,updateacct#2

Page 14: L23: NoSQL (continued) - Northeastern University · • Redis-Next slides and in our Jupyternotebooks • Riak-Focuses on high availability, BASE-“As long as your Riakclient can

29

TransferExample

Begin

Read(A,v)

v = v-100

Write(A,v)

Read(B,w)

w=w+100

Write(B,w)

Commit

Begin

Read(A,v)

v = v-100

Write(A,v)

Read(B,w)

w=w+100

Write(B,w)

Commit

Begin

Read(A,x)

x = x-100

Write(A,x)

Read(C,y)

y=y+100

Write(C,y)

Commit

txn1 txn2

• Scheduling is the operation of interleaving transactions• Why is it good?

• A serial schedule executes transactions one at a time, from beginning to end

• A good (“serializable”) scheduling is one that behaves like some serial scheduling (typically by locking protocols)

Page 15: L23: NoSQL (continued) - Northeastern University · • Redis-Next slides and in our Jupyternotebooks • Riak-Focuses on high availability, BASE-“As long as your Riakclient can

30

SchedulingExample1

Begin

Read(A,x)

x = x-100

Write(A,x)

Read(C,y)

y=y+100

Write(C,y)

Commit

txn1 txn2

Begin

Read(A,v)

v = v-100

Write(A,v)

Read(B,w)

w=w+100

Write(B,w)

Commit

Read(A,v)

v = v-100

Write(A,v)

Read(B,w)

w=w+100

Write(B,w)

Read(A,x)

x = x-100

Write(A,x)

Read(C,y)

y=y+100

Write(C,y)

Page 16: L23: NoSQL (continued) - Northeastern University · • Redis-Next slides and in our Jupyternotebooks • Riak-Focuses on high availability, BASE-“As long as your Riakclient can

31

SchedulingExample2

Begin

Read(A,x)

x = x-100

Write(A,x)

Read(C,y)

y=y+100

Write(C,y)

Commit

txn1 txn2

Begin

Read(A,v)

v = v-100

Write(A,v)

Read(B,w)

w=w+100

Write(B,w)

Commit

Read(A,v)

v = v-100

Write(A,v)

Read(B,w)

w=w+100

Write(B,w)

Read(A,x)

x = x-100

Write(A,x)

Read(C,y)

y=y+100

Write(C,y)

Page 17: L23: NoSQL (continued) - Northeastern University · • Redis-Next slides and in our Jupyternotebooks • Riak-Focuses on high availability, BASE-“As long as your Riakclient can

32

ACID

• Atomicity- Eitheralloperationsappliedornoneare(hence,weneednotworryabouttheeffectof

incomplete/failedtransactions)• Consistency- Eachtransactioncanstartwithaconsistentdatabaseandisrequiredtoleavethe

databaseconsistent(bringtheDBfromonetoanotherconsistentstate)• Isolation- Theeffectofatransactionshouldbeasifitistheonlytransactioninexecution(in

particular,changesmadebyothertransactionsarenotvisibleuntilcommitted)• Durability- Oncethesysteminformsatransactionsuccess,theeffectshouldholdwithoutregret,

evenifthedatabasecrashes(beforemakingallchangestodisk)

Page 18: L23: NoSQL (continued) - Northeastern University · • Redis-Next slides and in our Jupyternotebooks • Riak-Focuses on high availability, BASE-“As long as your Riakclient can

33

ACIDMayBeOverlyExpensive

• Inquiteafewmodernapplications:- ACIDcontrastswithkeydesiderata:highvolume,highavailability-Wecanlivewithsomeerrors,tosomeextent- Ormoreaccurately,weprefertosuffererrorsthantobesignificantlylessfunctional

• Canthispointbemademore“formal”?

Page 19: L23: NoSQL (continued) - Northeastern University · • Redis-Next slides and in our Jupyternotebooks • Riak-Focuses on high availability, BASE-“As long as your Riakclient can

34

SimpleModelofaDistributedService

• Context:distributedservice- e.g.,socialnetwork

• Clientsmakeget/setrequests- e.g.,setLike(user,post),getLikes(post)

- Eachclientcantalktoanyserver

• Serversreturnresponses- e.g.,ack,{user1,....,userk}

• Failure:thenetworkmayoccasionallydisconnectduetofailures(e.g.,switchdown)

• Desiderata:Consistency,Availability,Partitiontolerance

Page 20: L23: NoSQL (continued) - Northeastern University · • Redis-Next slides and in our Jupyternotebooks • Riak-Focuses on high availability, BASE-“As long as your Riakclient can

35

CAPServiceProperties

• Consistency:- everyread(toanynode)getsaresponsethatreflectsthemostrecentversionofthedata• Moreaccurately,atransactionshouldbehaveasifitchangestheentirestatecorrectlyinaninstant,Ideasimilartoserializability

• Availability:- everyrequest(toalivingnode)getsananswer:setsucceeds,getretunesavalue(ifyoucantalktoanodeinthecluster,itcanreadandwritedata)

• Partitiontolerance:- servicecontinuestofunctiononnetworkfailures(clustercansurvive

• Aslongasclientscanreachservers

Page 21: L23: NoSQL (continued) - Northeastern University · • Redis-Next slides and in our Jupyternotebooks • Riak-Focuses on high availability, BASE-“As long as your Riakclient can

36

SimpleIllustration

set(x,1)

set(x,1)

ok

ok

get(x)

1

CAConsistency,Availability

set(x,2)

set(x,2)

wait...

get(x) CPConsistency,Partitiontolerance

set(x,2)

set(x,2)

ok

get(x) APAvailability,Partitiontolerance

1

1

Availability

Consistency

OurRelationalDatabaseworldsofar…

Inasystemthatmaysufferpartitions,youhavetotradeoffconsistencyvs.availability

Page 22: L23: NoSQL (continued) - Northeastern University · • Redis-Next slides and in our Jupyternotebooks • Riak-Focuses on high availability, BASE-“As long as your Riakclient can

37

TheCAPTheorem

EricBrewer’sCAPTheorem:

Adistributedservicecansupportatmosttwo

outofC,A andP

Page 23: L23: NoSQL (continued) - Northeastern University · • Redis-Next slides and in our Jupyternotebooks • Riak-Focuses on high availability, BASE-“As long as your Riakclient can

38

HistoricalNote

• Brewerpresenteditasthe CAPprinciple ina1999article- ThenasaninformalconjectureinhiskeynoteatthePODC2000conference

• In2002aformalproofwasgivenbyGilbertandLynch,makingCAPatheorem- [SethGilbert,NancyA.Lynch:Brewer'sconjectureandthefeasibilityofconsistent,available,partition-

tolerantwebservices.SIGACTNews33(2):51-59(2002)]

- Itismainlyaboutmakingthestatementformal;theproofisstraightforward

Page 24: L23: NoSQL (continued) - Northeastern University · • Redis-Next slides and in our Jupyternotebooks • Riak-Focuses on high availability, BASE-“As long as your Riakclient can

39Source:http://blog.nahurst.com/visual-guide-to-nosql-systems ,2010

Page 25: L23: NoSQL (continued) - Northeastern University · • Redis-Next slides and in our Jupyternotebooks • Riak-Focuses on high availability, BASE-“As long as your Riakclient can

40

CAPtheorem

Source:http://guide.couchdb.org

Page 26: L23: NoSQL (continued) - Northeastern University · • Redis-Next slides and in our Jupyternotebooks • Riak-Focuses on high availability, BASE-“As long as your Riakclient can

41

TheBASEModel

• AppliestodistributedsystemsoftypeAP• BasicAvailability- Providehighavailabilitythroughdistribution:Therewillbearesponsetoanyrequest.

Responsecouldbea‘failure’toobtaintherequesteddata,orthedatamaybeinaninconsistentorchangingstate.

• Softstate- Inconsistency(staleanswers)allowed:Stateofthesystemcanchangeovertime,so

evenduringtimeswithoutinput,changescanhappendueto‘eventualconsistency’

• Eventualconsistency- Ifupdatesstop,thenaftersometimeconsistencywillbeachieved

• Achievedbyprotocolstopropagateupdatesandverifycorrectnessofpropagation(gossipprotocols)

• Philosophy:besteffort,optimistic,stalenessandapproximationallowed

Page 27: L23: NoSQL (continued) - Northeastern University · • Redis-Next slides and in our Jupyternotebooks • Riak-Focuses on high availability, BASE-“As long as your Riakclient can

42

• Introduction• TransactionConsistency• 4maindatamodels﹣ Key-ValueStores(e.g.,Redis)﹣ Column-FamilyStores(e.g.,Cassandra)﹣ DocumentStores(e.g.,MongoDB)﹣ GraphDatabases(e.g.,Neo4j)

• ConcludingRemarks

Outline

Page 28: L23: NoSQL (continued) - Northeastern University · • Redis-Next slides and in our Jupyternotebooks • Riak-Focuses on high availability, BASE-“As long as your Riakclient can

43

Key-ValueStores

• Essentially,bigdistributedhashmaps• OriginattributedtoDynamo– Amazon’sDBforworld-scalecatalog/cartcollections- ButBerkeleyDBhasbeenherefor>20years

• Storepairs⟨key,opaque-value⟩- OpaquemeansthatDBdoesnotassociateanystructure/semanticswiththevalue;oblivioustovalues

- Thismaymeanmoreworkfortheuser:retrievingalargevalueandparsingtoextractanitemofinterest

• Sharding viapartitioningofthekeyspace- Hashing,gossipandremappingprotocolsforloadbalancingandfaulttolerance

Page 29: L23: NoSQL (continued) - Northeastern University · • Redis-Next slides and in our Jupyternotebooks • Riak-Focuses on high availability, BASE-“As long as your Riakclient can

44

Hashing(Hashtables,dictionaries)

0

m–1

h(k1)

h(k4)

h(k2)

h(k3)

U(universe of keys)

K(actualkeys)

k1

k2

k3

k5

k4

h(k2)

n = |K| << |U|.

key k “hashes” to slot T[h[k]]

hash table T [0…m–1]h : U ® {0,1,…, m–1}

Page 30: L23: NoSQL (continued) - Northeastern University · • Redis-Next slides and in our Jupyternotebooks • Riak-Focuses on high availability, BASE-“As long as your Riakclient can

45

Hashing(Hashtables,dictionaries)

h(k1)

h(k4)

h(k2)

h(k3)

k1

k2

k3

k5

k4

collisionh(k2)=h(k5)

0

m–1

U(universe of keys)

K(actualkeys)

hash table T [0…m–1]

Page 31: L23: NoSQL (continued) - Northeastern University · • Redis-Next slides and in our Jupyternotebooks • Riak-Focuses on high availability, BASE-“As long as your Riakclient can

46

ExampleDatabases

• Amazon’sDynamoDB- OriginallydesignedforAmazon’sworkloadatpeaks- OfferedaspartofAmazon’sWebservices

• Redis- NextslidesandinourJupyter notebooks

• Riak- Focusesonhighavailability,BASE- “AslongasyourRiak clientcanreachoneRiak server,itshouldbeabletowritedata.”

• FoundationDB- Focusontransactions,ACID

• BerkeleyDB(andOracleNoSQLDatabase)- Firstrelease1994,byBerkeley,acquiredbyOracle- ACID,replication

Page 32: L23: NoSQL (continued) - Northeastern University · • Redis-Next slides and in our Jupyternotebooks • Riak-Focuses on high availability, BASE-“As long as your Riakclient can

47

Redis

• Basicallyadatastructureforstrings,numbers,hashes,lists,sets• Simplistic“transaction”management- Queuingofcommandsasblocks,really- AmongACID,onlyIsolationguaranteed

• Ablockofcommandsthatisexecutedsequentially;notransactioninterleaving;norollbackonerrors

• In-memory store- Persistencebyperiodicalsavestodisk

• Comeswith- Acommand-lineAPI- Clientsfordifferentprogramminglanguages

• Perl,PHP,Rubi,Tcl,C,C++,C#,Java,R,…

Page 33: L23: NoSQL (continued) - Northeastern University · • Redis-Next slides and in our Jupyternotebooks • Riak-Focuses on high availability, BASE-“As long as your Riakclient can

48

key value

set x 10 x 10

hset h y 5 h yà5

hset h1 name twohset h1 value 2_ h1

nameàtwovalueà2

hmset p:22 name Alice age 25 p:22 nameàAliceageà25

sadd s 20___sadd s Alicesadd s Alice s {20,Alice}

rpush l arpush l blpush l c l (c,a,b)

(simple value)

(hash table)

(set)

(list)

key maps to:

ExampleofRedis Commands

get x>> 10

hget h y>> 5

hkeys p:22>> name , age

smembers s>> 20 , Alice

scard s>> 2

llen l>> 3

lrange l 1 2 >> a , b

lindex l 2>> b

lpop l >> c

rpop l >> b

Page 34: L23: NoSQL (continued) - Northeastern University · • Redis-Next slides and in our Jupyternotebooks • Riak-Focuses on high availability, BASE-“As long as your Riakclient can

49

key value

set x 10 x 10

hset h y 5 h yà5

hset h1 name twohset h1 value 2_ h1

nameàtwovalueà2

hmset p:22 name Alice age 25 p:22 nameàAliceageà25

sadd s 20___sadd s Alicesadd s Alice s {20,Alice}

rpush l arpush l blpush l c l (c,a,b)

(simple value)

(set)

(list)

key maps to:

ExampleofRedis Commands

get x>> 10

hget h y>> 5

hkeys p:22>> name , age

smembers s>> 20 , Alice

scard s>> 2

llen l>> 3

lrange l 1 2 >> a , b

lindex l 2>> b

lpop l >> c

rpop l >> b

(hash table)

Page 35: L23: NoSQL (continued) - Northeastern University · • Redis-Next slides and in our Jupyternotebooks • Riak-Focuses on high availability, BASE-“As long as your Riakclient can

50

AdditionalNotes

• Akeycanbeany<256MBbinarystring- Forexample,JPEGimage

• Somekeyoperations:- Listallkeys:keys *- Removeallkeys:flushall- Checkifakeyexists:exists k

• Youcanconfigurethepersistencymodel- save m k meanssaveeverym secondsifatleastk keyshavechanged

Page 36: L23: NoSQL (continued) - Northeastern University · • Redis-Next slides and in our Jupyternotebooks • Riak-Focuses on high availability, BASE-“As long as your Riakclient can

51

Redis Cluster

• Add-onmoduleformanagingmulti-nodeapplicationsoverRedis• Master-slavearchitectureforsharding +replication- Multiplemastersholdingpairwisedisjointsetsofkeys,everymasterhasasetofslavesforreplicationandshardingMaster and Slave nodesNodes are all connected and functionally equivalent, but actually there are two kind of nodes: slave and master nodes:

RedundancyIn the example there are two replicas per every master node, so up to two random nodes can go down without issues.

Working with two nodes down is guaranteed, but in the best case the cluster will continue to work as long as there is at least one node for every hash slot.

Source:http://redis.io/presentation/Redis_Cluster.pdf

Blue… master,Yellow… replicasUpto2randomnodescangodownwithoutissues becauseofredudancy

Page 37: L23: NoSQL (continued) - Northeastern University · • Redis-Next slides and in our Jupyternotebooks • Riak-Focuses on high availability, BASE-“As long as your Riakclient can

52

Whentouseit

• Useit:- Allaccesstothedatabasesisviaprimarykey- Storingsessioninformation(websession)- userorproductprofiles(singleGEToperation)- shoppingcardinformation(basedonuserid)

• Don'tuseit:- relationshipsbetweendifferentsetsofdata- querybydata(basedonvalues)- operationsonmultiplekeysatatime

Page 38: L23: NoSQL (continued) - Northeastern University · • Redis-Next slides and in our Jupyternotebooks • Riak-Focuses on high availability, BASE-“As long as your Riakclient can

53

• Introduction• TransactionConsistency• 4maindatamodels﹣ Key-ValueStores(e.g.,Redis)﹣ Column-FamilyStores(e.g.,Cassandra)﹣ DocumentStores(e.g.,MongoDB)﹣ GraphDatabases(e.g.,Neo4j)

• ConcludingRemarks

Outline

Page 39: L23: NoSQL (continued) - Northeastern University · • Redis-Next slides and in our Jupyternotebooks • Riak-Focuses on high availability, BASE-“As long as your Riakclient can

54

keyspace

2TypesofColumnStores

sid name address year faculty861 Alice Haifa 2 NULL753 Amir London NULL CS955 Ahuva NULL 2 IE

StandardRDB

id sid1 8612 7533 955

id name1 Alice2 Amir3 Ahuva

id address1 Haifa2 London

id year1 23 2

id faculty2 CS3 IE

Eachcolumnstoredseparately.Why?Efficiency (fetchonlyrequiredcolumns),

compression,sparse dataforfree

1 sid:861 name:Aliceaddress:Haifa ts:20

2 sid:753 name:Amiraddress:London ts:22

3 sid:955name:Ahuva ts:32

1 year:2 ts:26

2 faculty:CS ts:25email:{prime:c@d ext:c@e}

3 year:2 faculty:IE ts:32email:{prime:a@b ext:a@c}

columnfamily

columnfamily

“column”

“supercolumn”

Column-FamilyStore (NoSQL)

timestampforconflicts

Columnstore (stillSQL)

Cassandradatamodel

Page 40: L23: NoSQL (continued) - Northeastern University · • Redis-Next slides and in our Jupyternotebooks • Riak-Focuses on high availability, BASE-“As long as your Riakclient can

55

ColumnStores

• Thetwooftenmixedas“columnstore”à confusion- SeeDanielAbadi’sblog:http://dbmsmusings.blogspot.com/2010/03/distinguishing-two-major-types-of_29.html

• Commonidea:don’tkeeparowinaconsecutiveblock,splitviaprojection- Columnstore:eachcolumnisindependent- Column-familystore:eachcolumnfamilyisindependent

• Bothprovidesomemajorefficiencybenefitsincommonread-mainlyworkloads- Givenaquery,loadtomemoryonlytherelevantcolumns- Columnscanoftenbehighlycompressedduetovaluesimilarity- Effectiveformforsparseinformation(noNULLs,nospace)

• Column-family storeishandleddifferentlyfromRDBs,oftenrequiringadesignatedquerylanguage

Page 41: L23: NoSQL (continued) - Northeastern University · • Redis-Next slides and in our Jupyternotebooks • Riak-Focuses on high availability, BASE-“As long as your Riakclient can

56

ExamplesSystems

• Columnstore(SQL):- MonetDB (started2002,Univ.Amsterdam)- VectorWise (spawnedfromMonetDB)- Vertica (M.Stonebraker)- SAP SybaseIQ- Infobright

• Column-familystore (NoSQL):- Google’sBigTable (maininspirationtocolumnfamilies)- ApacheHBase (usedbyFacebook,LinkedIn,Netflix...,CPinCAP)- Hypertable- ApacheCassandra (APinCAP)

Page 42: L23: NoSQL (continued) - Northeastern University · • Redis-Next slides and in our Jupyternotebooks • Riak-Focuses on high availability, BASE-“As long as your Riakclient can

57

Example:ApacheCassandra

• InitiallydevelopedbyFacebook- Open-sourcedin2008

• Usedby1500+businesses,e.g.,Comcast,eBay,GitHub,Hulu,Instagram,Netflix,BestBuy,...

• Column-familystore- Supportskey-valueinterface- ProvidesaSQL-likeCRUDinterface:CQL

• UsesBloomfilters- Aninterestingmembershiptestthatcanhavefalsepositivesbutneverfalsenegatives,

behaveswellstatistically

• BASEconsistencymodel(APinCAP)- Gossipprotocol(constantcommunication)toestablishconsistency- Ring-basedreplicationmodel

Page 43: L23: NoSQL (continued) - Northeastern University · • Redis-Next slides and in our Jupyternotebooks • Riak-Focuses on high availability, BASE-“As long as your Riakclient can

58

ExampleBloomFilterk=3

Insert(x,H)

Member(y,H)

y1 = is not in H (why ?)

y2 may be in H (why ?)

Page 44: L23: NoSQL (continued) - Northeastern University · • Redis-Next slides and in our Jupyternotebooks • Riak-Focuses on high availability, BASE-“As long as your Riakclient can

59

3

2

4

88

3

2

4

Cassandra’sRingModel

1

5

7

6

write(k,t) hash(k)=2

write(k,t)

ReplicationFactor=3

Advantage:Flexibility/easeofclusterredesign

Coordinatornode

Primaryresponsible

Additionalreplicas

Seemore:https://www.hakkalabs.co/articles/how-cassandra-stores-data

Page 45: L23: NoSQL (continued) - Northeastern University · • Redis-Next slides and in our Jupyternotebooks • Riak-Focuses on high availability, BASE-“As long as your Riakclient can

60

Whentouseit(e.g.Cassandra)

• Useit:- Eventlogging(multipleapplicationscanwriteindifferentcolumnsandrow-key:appname:timestamp)

- CMS:Storeblogentrieswithtags,categories,linksindifferentcolumns- Counters:e.g.visitorsofapage

• Don'tuseit:- ifyourequireACID,consistency- ifyouhavetoaggregatesacrossalltherows- ifyouchangequerypatternsoften(inRDMS schemachangesarecostly,inCassandrda querychangesare)

Page 46: L23: NoSQL (continued) - Northeastern University · • Redis-Next slides and in our Jupyternotebooks • Riak-Focuses on high availability, BASE-“As long as your Riakclient can

61

• Introduction• TransactionConsistency• 4maindatamodels﹣ Key-ValueStores(e.g.,Redis)﹣ Column-FamilyStores(e.g.,Cassandra)﹣ DocumentStores(e.g.,MongoDB)﹣ GraphDatabases(e.g.,Neo4j)

• ConcludingRemarks

Outline

Page 47: L23: NoSQL (continued) - Northeastern University · • Redis-Next slides and in our Jupyternotebooks • Riak-Focuses on high availability, BASE-“As long as your Riakclient can

62

DocumentStores

• Similarinnaturetokey-valuestore,butvalueistreestructured asadocument

• Motivation:avoidjoins;ideally,allrelevantjoinsalreadyencapsulatedinthedocumentstructure

• Adocumentisanatomicobjectthatcannotbesplitacrossservers- Butadocumentcollectionwillbesplit

• Moreover,transactionatomicity istypicallyguaranteedwithinasingledocument

• Modelgeneralizescolumn-familyandkey-valuestores

Page 48: L23: NoSQL (continued) - Northeastern University · • Redis-Next slides and in our Jupyternotebooks • Riak-Focuses on high availability, BASE-“As long as your Riakclient can

63

ExampleDatabases

• MongoDB- Nextslides

• ApacheCouchDB- EmphasizesWebaccess

• RethinkDB- Optimizedforhighlydynamicapplicationdata

• RavenDB- Deignedfor.NET,ACID

• Clusterpoint Server- XMLandJSON,acombinedSQL/JavaScriptQL

Page 49: L23: NoSQL (continued) - Northeastern University · • Redis-Next slides and in our Jupyternotebooks • Riak-Focuses on high availability, BASE-“As long as your Riakclient can

64

MongoDB

• Opensource,1strelease2009,documentstore- Actually,anextendedformatcalledBSON (BinaryJSON =JavaScriptObjectNotation)for

typingandbettercompression

• Supportsreplication (master/slave),sharding (horizontalpartitioning)- Developerprovidesthe"shardkey"– collectionispartitionedbyrangesofvaluesofthis

key

• Consistency guarantees,CPofCAP• UsedbyAdobe(experiencetracking),Craigslist,eBay,FIFA(videogame),LinkedIn,McAfee

• ProvidesconnectortoHadoop- ClouderaprovidestheMongoDBconnectorindistributions

Page 50: L23: NoSQL (continued) - Northeastern University · • Redis-Next slides and in our Jupyternotebooks • Riak-Focuses on high availability, BASE-“As long as your Riakclient can

65

DataExample:High-level

{name:"Alice",age:21,status:"A",groups:["algorithms","theory"]

}

Document

Source:Modifiedfromhttps://docs.mongodb.com/v3.0/core/crud-introduction/

Collection{

name:"Alice",age:21,status:"A",groups:["algorithms","theory"]

}

{name:"Bob",age:18,status:"B",groups:["database","cooking"]

}

{name:"Charly",age:22,status:"A",groups:["database","cars"]

}

{name:"Dorothee",age:16,status:"A",groups:["cars","sports"]

}

Page 51: L23: NoSQL (continued) - Northeastern University · • Redis-Next slides and in our Jupyternotebooks • Riak-Focuses on high availability, BASE-“As long as your Riakclient can

66

MongoDBTerminology

RDBMS• Database• Table• Record/Row/Tuple• Column• Primarykey• Foreignkey

MongoDB• Database• Collection• Document• Field• _id

Page 52: L23: NoSQL (continued) - Northeastern University · • Redis-Next slides and in our Jupyternotebooks • Riak-Focuses on high availability, BASE-“As long as your Riakclient can

67

MongoDB DataModel

• JavaScriptObjectNotation(JSON)model• Database =setofnamedcollections• Collection =sequenceofdocuments• Document ={attribute1:value1,...,attributek:valuek}• Attribute=string(attributei≠attributej)• Value =primitive value(string,number,date,...),oradocument,oranarray- Array =[value1,...,valuen]

• Keyproperties:hierarchical(likeXML),noschema- Collectiondocsmayhavedifferentattributes

generalizes relation

generalizes tuple

Page 53: L23: NoSQL (continued) - Northeastern University · • Redis-Next slides and in our Jupyternotebooks • Riak-Focuses on high availability, BASE-“As long as your Riakclient can

68

DataExample

{item:"ABC2",details:{model:"14Q3",manufacturer:"M1Corporation"},stock:[{size:"M",qty:50}],category:"clothing”

}

{item:"MNO2",details:{model:"14Q3",manufacturer:"ABCCompany"},stock:[{size:"S",qty:5},{size:"M",qty:5},{size:"L",qty:1}],category:"clothing”

}

Collectioninventory

db.inventory.insert({item:"ABC1",details:{model:"14Q3",manufacturer:"XYZCompany"},stock:[{size:"S",qty:25},{size:"M",qty:50}],category:"clothing"}

) Document insertionSource:Modifiedfromhttps://docs.mongodb.com/v3.0/core/crud-introduction/

Page 54: L23: NoSQL (continued) - Northeastern University · • Redis-Next slides and in our Jupyternotebooks • Riak-Focuses on high availability, BASE-“As long as your Riakclient can

69

ExampleofaSimpleQuery

{_id:"a",cust_id:"abc123",status:"A",price:25,items:[{sku:"mmm",qty:5,price:3},

{sku:"nnn",qty:5,price:2}]}{

_id:"b",cust_id:"abc124",status:"B",price:12,items:[{sku:"nnn",qty:2,price:2},

{sku:"ppp",qty:2,price:4}]}

Collectionordersdb.orders.find({status:"A"},{cust_id:1,price:1,_id:0}

)

InSQLitwouldlooklikethis:SELECTcust_id,priceFROMordersWHEREstatus="A"

{cust_id:"abc123",price:25

}

selection

projection

Findallordersandpricewithwithstatus"A"

Page 55: L23: NoSQL (continued) - Northeastern University · • Redis-Next slides and in our Jupyternotebooks • Riak-Focuses on high availability, BASE-“As long as your Riakclient can

70

Whentouseit

• Useit:- Eventlogging:differenttypesofeventsacrossanenterprise- CMS:usercomments,registration,profiles,web-facingdocuments- E-commerce:flexibleschemaforproducts,evolvedatamodels

• Don'tuseit:- ifyourequireatomiccross-documentoperations- queriesagainstvaryingaggregatestructures

Page 56: L23: NoSQL (continued) - Northeastern University · • Redis-Next slides and in our Jupyternotebooks • Riak-Focuses on high availability, BASE-“As long as your Riakclient can

71

• Introduction• TransactionConsistency• 4maindatamodels﹣ Key-ValueStores(e.g.,Redis)﹣ Column-FamilyStores(e.g.,Cassandra)﹣ DocumentStores(e.g.,MongoDB)﹣ GraphDatabases(e.g.,Neo4j)

• ConcludingRemarks

Outline