CS 4604: Introduction to Database Management Systems B. Aditya Prakash Lecture #12: NoSQL and MapReduce


Post on 25-Jul-2020


Page 1: CS 4604: Introduction to Database Management Systemscourses.cs.vt.edu/~cs4604/Fall18/lectures/lecture-12.pdf · § simple API Prakash 2018 VT CS 4604 10 Four (emerging) NoSQL Categories

CS 4604: Introduction to Database Management Systems
B. Aditya Prakash
Lecture #12: NoSQL and MapReduce

NOSQL (some slides from Xiao Yu)

Prakash2018 VTCS4604 2

Why NoSQL?

RDBMS
§ The predominant choice for storing data
– Not so true for data miners, since we keep much of our data in text files.
§ First formulated in 1969 by Codd
– We are using RDBMS everywhere

Slide from neotechnology, "A NoSQL Overview and the Benefits of Graph Databases"

When RDBMS met Web 2.0
Slide from Lorenzo Alberton, "NoSQL Databases: Why, what and when"

What to do if data is really large?
§ Petabytes (exabytes, zettabytes, ...)
§ Google processed 24 PB of data per day (2009)
§ FB adds 0.5 PB per day

BIG data

What's Wrong with Relational DB?
§ Nothing is wrong. You just need to use the right tool.
§ Relational is hard to scale.
– Easy to scale reads
– Hard to scale writes

What's NoSQL?
§ The misleading term "NoSQL" is short for "Not Only SQL".
§ non-relational, schema-free, non-(quite)-ACID
– More on ACID transactions later in class
§ horizontally scalable, distributed, easy replication support
§ simple API

Four (emerging) NoSQL Categories
§ Key-value (K-V) stores
– Based on Distributed Hash Tables / Amazon's Dynamo paper*
– Data model: (global) collection of K-V pairs
– Example: Voldemort
§ Column Families
– BigTable clones**
– Data model: big table, column families
– Example: HBase, Cassandra, Hypertable
* G. DeCandia et al., Dynamo: Amazon's Highly Available Key-value Store, SOSP 07
** F. Chang et al., Bigtable: A Distributed Storage System for Structured Data, OSDI 06

Four (emerging) NoSQL Categories
§ Document databases
– Inspired by Lotus Notes
– Data model: collections of K-V collections
– Example: CouchDB, MongoDB
§ Graph databases
– Inspired by Euler & graph theory
– Data model: nodes, relations, K-V on both
– Example: AllegroGraph, VertexDB, Neo4j

Focus of Different Data Models
Slide from neotechnology, "A NoSQL Overview and the Benefits of Graph Databases"

C-A-P "theorem"
[Figure: triangle with corners Consistency, Availability, and Partition Tolerance, with RDBMS and NoSQL (most) marked on it]

When to use NoSQL?
§ Bigness
§ Massive write performance
– Twitter generates 7 TB per day (2010)
§ Fast key-value access
§ Flexible schema or data types
§ Schema migration
§ Write availability
– Writes need to succeed no matter what (CAP, partitioning)
§ Easier maintainability, administration and operations
§ No single point of failure
§ Generally available parallel computing
§ Programmer ease of use
§ Use the right data model for the right problem
§ Avoid hitting the wall
§ Distributed systems support
§ Tunable CAP tradeoffs
from http://highscalability.com/

Key-Value Stores
[Figure: the same data shown as a table in a relational db and as a store/domain in a key-value db]

id    hair_color  age  height
1923  Red         18   6'0"
3371  Blue        34   NA
…     …           …    …

§ Find users whose age is above 18?
§ Find all attributes of user 1923?
§ Find users whose hair color is Red and age is 19? (Join operation)
§ Calculate average age of all grad students?
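The questions on this slide can be tried against a toy store. The sketch below is only an illustration: it models a key-value store as a plain Python dict keyed by user id, and the function names are made up for this example. It suggests why a lookup by key is cheap while a predicate or an aggregate over attributes needs a scan of every pair.

```python
# Hypothetical in-memory key-value store: key = user id, value = attribute dict.
store = {
    1923: {"hair_color": "Red", "age": 18, "height": "6'0\""},
    3371: {"hair_color": "Blue", "age": 34},  # no height stored (NA in the table)
}

def get_user(store, user_id):
    """Point lookup by key: the access pattern a K-V store is built for."""
    return store.get(user_id)

def users_older_than(store, min_age):
    """Predicate on an attribute: needs a scan over every pair."""
    return sorted(k for k, v in store.items() if v.get("age", 0) > min_age)

def average_age(store):
    """Aggregate: again a full scan, done by the client rather than the store."""
    ages = [v["age"] for v in store.values() if "age" in v]
    return sum(ages) / len(ages)
```

A real key-value system would distribute the pairs over many machines, so the two scanning functions would touch every node, while `get_user` would touch one.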

Voldemort in LinkedIn
Sid Anand, LinkedIn Data Infrastructure (QCon London 2012)

Voldemort vs MySQL
Sid Anand, LinkedIn Data Infrastructure (QCon London 2012)

Column Families – BigTable like
F. Chang et al., Bigtable: A Distributed Storage System for Structured Data, OSDI 06

BigTable Data Model
[Figure: example Bigtable row that stores a web page]
The row name is a reversed URL. The contents column family contains the page contents, and the anchor column family contains the text of any anchors that reference the page.
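One way to picture this data model is as a nested map from row key to column to timestamp to value. The sketch below is an assumption-laden illustration in plain Python dicts: the row and column names follow the web-table example in the Bigtable paper, while the page contents and the helper functions `reverse_host` and `latest` are invented here.

```python
# Bigtable data model sketched as nested dicts:
#   row key -> "family:qualifier" -> timestamp -> value
webtable = {
    "com.cnn.www": {                       # reversed URL as the row name
        "contents:": {6: "<html>...v3", 5: "<html>...v2", 3: "<html>...v1"},
        "anchor:cnnsi.com": {9: "CNN"},    # anchor text of a referring page
        "anchor:my.look.ca": {8: "CNN.com"},
    }
}

def reverse_host(host):
    """Turn www.cnn.com into com.cnn.www so pages of one domain sort together."""
    return ".".join(reversed(host.split(".")))

def latest(table, row, column):
    """Return the value with the largest timestamp in one cell."""
    versions = table[row][column]
    return versions[max(versions)]
```

For example, `latest(webtable, reverse_host("www.cnn.com"), "contents:")` picks the newest stored version of the page.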

BigTable Performance

Document Database - mongoDB
[Figure: a table in a relational db next to documents in a collection]
§ Initial release 2009
§ Open source, document db
§ JSON-like document with dynamic schema

mongoDB Product Deployment
And much more…

Graph Database
Data Model Abstraction:
• Nodes
• Relations
• Properties
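The three abstractions can be sketched as a tiny in-memory property graph. This is a minimal illustration, not the API of Neo4j or any other graph database; the class names, the `KNOWS` relation type, and the sample people are all made up.

```python
from dataclasses import dataclass, field

@dataclass
class Node:
    node_id: int
    properties: dict = field(default_factory=dict)   # K-V pairs on the node

@dataclass
class Relation:
    source: int
    target: int
    kind: str
    properties: dict = field(default_factory=dict)   # K-V pairs on the relation

class PropertyGraph:
    def __init__(self):
        self.nodes = {}
        self.relations = []

    def add_node(self, node_id, **properties):
        self.nodes[node_id] = Node(node_id, properties)

    def relate(self, source, target, kind, **properties):
        self.relations.append(Relation(source, target, kind, properties))

    def neighbors(self, node_id, kind=None):
        """Follow outgoing relations, optionally filtered by relation type."""
        return [r.target for r in self.relations
                if r.source == node_id and (kind is None or r.kind == kind)]

g = PropertyGraph()
g.add_node(1, name="Alice")
g.add_node(2, name="Bob")
g.relate(1, 2, "KNOWS", since=2018)
```

Both nodes and relations carry their own key-value properties, which matches the "K-V on both" data model listed earlier in the lecture.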

Neo4j - Build a Graph
Slide from neotechnology, "A NoSQL Overview and the Benefits of Graph Databases"

A Debatable Performance Evaluation

Conclusion
§ Use the right data model for the right problem

THE HADOOP ECOSYSTEM

Single vs Cluster
§ How to analyze such large datasets? First thing, how to store them?
§ Single machine? 4 TB HDDs are coming out
§ Cluster of machines?
– How many machines?
– Need to handle machine and drive failure
– Need data backup, redundancy, recovery, etc.
§ 3% of 100K HDDs fail in <= 3 months
– Failure Trends in a Large Disk Drive Population, http://static.googleusercontent.com/external_content/untrusted_dlcp/research.google.com/en/us/archive/disk_failures.pdf

Hadoop
§ Open source software
– Reliable, scalable, distributed computing
§ Can handle thousands of machines
– Linear scalability (the ideal case): if you have 2 machines, your job runs twice as fast
§ Written in Java
§ A simple programming model (MapReduce)
§ HDFS (Hadoop Distributed File System)
– Fault tolerant (can recover from machine/disk failure, no need to restart computation)
http://hadoop.apache.org

Idea and Solution
§ Issue: Copying data over a network takes time
§ Idea:
– Bring computation close to the data
– Store files multiple times for reliability
§ Map-reduce addresses these problems
– Google's computational/data manipulation model
– Elegant way to work with big data
– Storage Infrastructure – File system
• Google: GFS. Hadoop: HDFS
– Programming model
• Map-Reduce

Map-Reduce [Dean and Ghemawat 2004]
§ Abstraction for simple computing
– Hides details of parallelization, fault-tolerance, data-balancing
– MUST Read! http://static.googleusercontent.com/media/research.google.com/en/us/archive/mapreduce-osdi04.pdf

Hadoop VS NoSQL
§ Hadoop: computing framework
– Supports data-intensive applications
– Includes MapReduce, HDFS etc. (we will study MR mainly next)
§ NoSQL: Not only SQL databases
– Can be built ON Hadoop. E.g. HBase.

Storage Infrastructure
§ Problem:
– If nodes fail, how to store data persistently?
§ Answer:
– Distributed File System:
• Provides global file namespace
• Google GFS; Hadoop HDFS
§ Typical usage pattern
– Huge files (100s of GB to TB)
– Data is rarely updated in place
– Reads and appends are common

Distributed File System
§ Chunk servers
– File is split into contiguous chunks
– Typically each chunk is 16-64 MB
– Each chunk replicated (usually 2x or 3x)
– Try to keep replicas in different racks
§ Master node
– a.k.a. NameNode in Hadoop's HDFS
– Stores metadata about where files are stored
– Might be replicated
§ Client library for file access
– Talks to master to find chunk servers
– Connects directly to chunk servers to access data
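The chunking and replication rules above can be sketched in a few lines. This is a toy model under stated assumptions, not how GFS or HDFS actually choose replica locations: the 64 MB chunk size is the upper end of the range on the slide, the round-robin rack placement is invented for illustration, and the cluster layout and names are hypothetical.

```python
CHUNK_SIZE = 64 * 1024 * 1024  # 64 MB, upper end of the range on the slide

def split_into_chunks(file_size, chunk_size=CHUNK_SIZE):
    """Return (offset, length) for each contiguous chunk of a file."""
    return [(off, min(chunk_size, file_size - off))
            for off in range(0, file_size, chunk_size)]

def place_replicas(num_chunks, servers_by_rack, replicas=3):
    """Toy placement: each replica of a chunk goes to a different rack."""
    racks = sorted(servers_by_rack)
    if replicas > len(racks):
        raise ValueError("need at least as many racks as replicas")
    placement = {}
    for chunk in range(num_chunks):
        chosen = []
        for i in range(replicas):
            rack = racks[(chunk + i) % len(racks)]
            servers = servers_by_rack[rack]
            chosen.append((rack, servers[chunk % len(servers)]))
        placement[chunk] = chosen
    return placement

# Hypothetical cluster: three racks with two chunk servers each.
cluster = {"rack0": ["s0", "s1"], "rack1": ["s2", "s3"], "rack2": ["s4", "s5"]}
chunks = split_into_chunks(200 * 1024 * 1024)   # a 200 MB file
master_metadata = place_replicas(len(chunks), cluster)
```

The `master_metadata` dict plays the role of the master node: a client would ask it where a chunk lives and then read from one of the listed chunk servers directly.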

Programming Model: MapReduce
Warm-up task:
§ We have a huge text document
§ Count the number of times each distinct word appears in the file
§ Sample application:
– Analyze web server logs to find popular URLs

Task: Word Count
Case 1:
– File too large for memory, but all <word, count> pairs fit in memory
Case 2:
§ Count occurrences of words:
– words(doc.txt) | sort | uniq -c
• where words takes a file and outputs the words in it, one per line
§ Case 2 captures the essence of MapReduce
– Great thing is that it is naturally parallelizable
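The two cases can be compared side by side in a short sketch. It assumes whitespace tokenization and a small in-memory sample; note that Python's `sorted` holds all words in memory, whereas the Unix `sort` in the pipeline can spill to disk, so the second function only mirrors the shape of Case 2.

```python
import io
from collections import Counter
from itertools import groupby

def words(lines):
    """Yield the words of a text stream one at a time (the `words` step)."""
    for line in lines:
        yield from line.split()

def count_in_memory(lines):
    """Case 1: stream the file, keep only the <word, count> table in memory."""
    counts = Counter()
    for w in words(lines):
        counts[w] += 1
    return dict(counts)

def count_by_sorting(lines):
    """Case 2: mimic `words | sort | uniq -c`: sort, then count each run."""
    return {w: sum(1 for _ in run) for w, run in groupby(sorted(words(lines)))}

doc = io.StringIO("the crew of the space shuttle\nthe crew\n")
case1 = count_in_memory(doc)
doc.seek(0)                      # rewind the sample stream for the second pass
case2 = count_by_sorting(doc)
```

In the second version, `words` plays the part of map, the sort is the group-by-key step, and counting each run is the reduce.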

MapReduce: Overview
§ Sequentially read a lot of data
§ Map:
– Extract something you care about
§ Group by key: Sort and Shuffle
§ Reduce:
– Aggregate, summarize, filter or transform
§ Write the result
Outline stays the same, Map and Reduce change to fit the problem

MapReduce: The Map Step
[Figure: map is applied to each input key-value pair (k, v) and produces intermediate key-value pairs]

MapReduce: The Reduce Step
[Figure: intermediate key-value pairs are grouped by key into key-value groups (k, v, v, ...); reduce is applied to each group and produces output key-value pairs]

More Specifically
§ Input: a set of key-value pairs
§ Programmer specifies two methods:
– Map(k, v) → <k', v'>*
• Takes a key-value pair and outputs a set of key-value pairs
– E.g., key is the filename, value is a single line in the file
• There is one Map call for every (k, v) pair
– Reduce(k', <v'>*) → <k', v''>*
• All values v' with same key k' are reduced together and processed in v' order
• There is one Reduce function call per unique key k'
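The two signatures can be made concrete with a single-process stand-in for the framework. The driver `run_mapreduce` below is an assumption made for illustration, not Hadoop's API; the example job (counting lines per file) uses the slide's suggestion that the key is the file name and the value is a single line.

```python
from collections import defaultdict

def run_mapreduce(inputs, map_fn, reduce_fn):
    """Single-process stand-in for the framework (not Hadoop's API).

    inputs:    iterable of (k, v) pairs
    map_fn:    (k, v)        -> iterable of (k', v')
    reduce_fn: (k', [v'...]) -> iterable of (k', v'')
    """
    groups = defaultdict(list)
    for k, v in inputs:                       # one Map call per (k, v) pair
        for k2, v2 in map_fn(k, v):
            groups[k2].append(v2)
    output = []
    for k2 in sorted(groups):                 # one Reduce call per unique k'
        output.extend(reduce_fn(k2, sorted(groups[k2])))   # values in v' order
    return output

# Example job: key is the file name, value is a single line; count lines per file.
def map_lines(filename, line):
    yield filename, 1

def reduce_sum(filename, ones):
    yield filename, sum(ones)

lines = [("a.txt", "first"), ("b.txt", "only"), ("a.txt", "second")]
result = run_mapreduce(lines, map_lines, reduce_sum)
```

Only `map_lines` and `reduce_sum` are specific to the problem; the driver stays the same for any other job.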

MapReduce: Word Counting
[Figure: word count as a MapReduce pipeline; the data is read sequentially, and only sequential reads are needed]

Big document:
The crew of the space shuttle Endeavor recently returned to Earth as ambassadors, harbingers of a new era of space exploration. Scientists at NASA are saying that the recent assembly of the Dextre bot is the first step in a long-term space-based man/machine partnership. "The work we're doing now -- the robotics we're doing -- is what we're going to need ...

MAP (provided by the programmer): Read input and produce a set of (key, value) pairs
(The,1) (crew,1) (of,1) (the,1) (space,1) (shuttle,1) (Endeavor,1) (recently,1) ...

Group by key: Collect all pairs with the same key
(crew,1) (crew,1) (space,1) (the,1) (the,1) (the,1) (shuttle,1) (recently,1) ...

Reduce (provided by the programmer): Collect all values belonging to the key and output (key, value)
(crew,2) (space,1) (the,3) (shuttle,1) (recently,1) ...

Word Count Using MapReduce

map(key, value):
  // key: document name; value: text of the document
  for each word w in value:
    emit(w, 1)

reduce(key, values):
  // key: a word; values: an iterator over counts
  result = 0
  for each count v in values:
    result += v
  emit(key, result)
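The pseudocode translates almost line for line into runnable Python. In this sketch `emit` becomes `yield`, and the small `word_count` function stands in for the framework's group-by-key step. Lowercasing the words and the regular expression used for tokenizing are assumptions that the slide does not specify.

```python
import re
from collections import defaultdict

def map_fn(doc_name, text):
    """key: document name; value: text of the document."""
    for w in re.findall(r"[A-Za-z']+", text):
        yield w.lower(), 1            # emit(w, 1); lowercasing is an assumption

def reduce_fn(word, counts):
    """key: a word; values: an iterator over counts."""
    result = 0
    for v in counts:
        result += v
    yield word, result                # emit(key, result)

def word_count(documents):
    groups = defaultdict(list)        # stand-in for the framework's group-by-key
    for name, text in documents.items():
        for w, one in map_fn(name, text):
            groups[w].append(one)
    return dict(pair for w in groups for pair in reduce_fn(w, iter(groups[w])))

counts = word_count({"doc.txt": "The crew of the space shuttle. The crew returned."})
```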

Map-Reduce (MR) as SQL
§ select count(*) from DOCUMENT group by word
– Mapper: emits the group-by key (word)
– Reducer: computes count(*) for each key

Map-Reduce: Environment
Map-Reduce environment takes care of:
§ Partitioning the input data
§ Scheduling the program's execution across a set of machines
§ Performing the group by key step
§ Handling machine failures
§ Managing required inter-machine communication

Map-Reduce: A diagram
[Figure: Big document → MAP → Group by key → Reduce]
MAP: Read input and produce a set of key-value pairs
Group by key: Collect all pairs with the same key (Hash merge, Shuffle, Sort, Partition)
Reduce: Collect all values belonging to the key and output

Map-Reduce: In Parallel
All phases are distributed with many tasks doing the work

Map-Reduce
§ Programmer specifies:
– Map and Reduce and input files
§ Workflow:
– Read inputs as a set of key-value pairs
– Map transforms input kv-pairs into a new set of k'v'-pairs
– Sorts & Shuffles the k'v'-pairs to output nodes
– All k'v'-pairs with a given k' are sent to the same reduce
– Reduce processes all k'v'-pairs grouped by key into new k''v''-pairs
– Write the resulting pairs to files
§ All phases are distributed with many tasks doing the work
[Figure: Input 0, Input 1, Input 2 → Map 0, Map 1, Map 2 → Shuffle → Reduce 0, Reduce 1 → Out 0, Out 1]
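The shuffle step, which sends every pair with a given k' to the same reduce task, can be sketched with a hash partitioner. Hashing the key modulo the number of reducers is a common choice, assumed here rather than taken from the slide. The sample map outputs are hypothetical, and `zlib.crc32` is used only because Python's built-in `hash` for strings changes between runs.

```python
import zlib
from collections import defaultdict

def partition(key, num_reducers):
    """Deterministic hash partitioner: the same key always maps to one reducer."""
    return zlib.crc32(key.encode("utf-8")) % num_reducers

def shuffle(map_outputs, num_reducers):
    """Route every (k', v') pair to one reducer, grouping values by key there."""
    reducers = [defaultdict(list) for _ in range(num_reducers)]
    for pairs in map_outputs:                 # one list per map task
        for k, v in pairs:
            reducers[partition(k, num_reducers)][k].append(v)
    return reducers

# Hypothetical outputs of Map 0, Map 1, Map 2 for a word count.
map_outputs = [
    [("the", 1), ("crew", 1)],
    [("the", 1), ("space", 1)],
    [("crew", 1), ("the", 1)],
]
reducer_inputs = shuffle(map_outputs, num_reducers=2)
totals = {k: sum(vs) for r in reducer_inputs for k, vs in r.items()}
```

Because each key lands on exactly one reducer, the reducers can run independently and their outputs (Out 0, Out 1 in the figure) never overlap.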

Data Flow
§ Input and final output are stored on a distributed file system (FS):
– Scheduler tries to schedule map tasks "close" to physical storage location of input data
§ Intermediate results are stored on local FS of Map and Reduce workers
§ Output is often input to another MapReduce task

Coordination: Master
§ Master node takes care of coordination:
– Task status: (idle, in-progress, completed)
– Idle tasks get scheduled as workers become available
– When a map task completes, it sends the master the location and sizes of its R intermediate files, one for each reducer
– Master pushes this info to reducers
§ Master pings workers periodically to detect failures

Dealing with Failures
§ Map worker failure
– Map tasks completed or in-progress at worker are reset to idle (completed map output sits on the failed worker's local FS, so it is lost too)
– Reduce workers are notified when task is rescheduled on another worker
§ Reduce worker failure
– Only in-progress tasks are reset to idle
– Reduce task is restarted
§ Master failure
– MapReduce task is aborted and client is notified
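The reset rules for worker failures can be written down as a small state update. This is a sketch of the bookkeeping only, under the assumption that the master keeps one record per task; the `Task` class and the worker names are hypothetical, and real rescheduling and notification are left out.

```python
from dataclasses import dataclass
from typing import Optional

IDLE, IN_PROGRESS, COMPLETED = "idle", "in-progress", "completed"

@dataclass
class Task:
    kind: str                      # "map" or "reduce"
    state: str = IDLE
    worker: Optional[str] = None

def handle_worker_failure(tasks, failed_worker):
    """Reset tasks of a failed worker following the rules on the slide."""
    for task in tasks:
        if task.worker != failed_worker:
            continue
        lost_map = task.kind == "map" and task.state in (IN_PROGRESS, COMPLETED)
        lost_reduce = task.kind == "reduce" and task.state == IN_PROGRESS
        if lost_map or lost_reduce:
            task.state, task.worker = IDLE, None   # eligible for rescheduling

tasks = [
    Task("map", COMPLETED, "w1"),
    Task("map", IN_PROGRESS, "w2"),
    Task("reduce", COMPLETED, "w1"),
    Task("reduce", IN_PROGRESS, "w1"),
]
handle_worker_failure(tasks, "w1")
states = [t.state for t in tasks]
```

The completed reduce task on the failed worker keeps its state because its output is already on the distributed file system, while the completed map task is redone.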

PROBLEMS SUITED FOR MAP-REDUCE

Example: Host size
§ Suppose we have a large web corpus
§ Look at the metadata file
– Lines of the form: (URL, size, date, …)
§ For each host, find the total number of bytes
– That is, the sum of the page sizes for all URLs from that particular host
§ Other examples:
– Link analysis and graph processing
– Machine Learning algorithms
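The host-size job fits the same map and reduce shape as word count. In the sketch below the metadata records are made up, each record is assumed to be a (URL, size, date) tuple, and the host is taken from the URL with the standard library's `urlparse`; the in-memory grouping again stands in for the framework.

```python
from collections import defaultdict
from urllib.parse import urlparse

def map_host_size(record):
    """Map: (URL, size, date) -> (host, size)."""
    url, size, _date = record
    yield urlparse(url).netloc, size

def reduce_host_size(host, sizes):
    """Reduce: sum the page sizes of one host."""
    yield host, sum(sizes)

def host_sizes(records):
    groups = defaultdict(list)        # stand-in for the group-by-key step
    for record in records:
        for host, size in map_host_size(record):
            groups[host].append(size)
    return dict(p for h, s in groups.items() for p in reduce_host_size(h, s))

# Made-up metadata lines for illustration.
metadata = [
    ("http://example.com/a.html", 1200, "2018-01-01"),
    ("http://example.com/b.html", 800, "2018-01-02"),
    ("http://example.org/index.html", 500, "2018-01-03"),
]
totals = host_sizes(metadata)
```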

Example: Language Model
§ Statistical machine translation:
– Need to count number of times every 5-word sequence occurs in a large corpus of documents
§ Very easy with MapReduce:
– Map:
• Extract (5-word sequence, count) from document
– Reduce:
• Combine the counts

In HW5
§ You'll deal with n-grams
– n-gram is a contiguous sequence of n items from a given sequence of text or speech
§ Example
§ Sentence: "the rain in Spain falls mainly on the plain"
– 2-grams: the rain, rain in, in Spain, Spain falls, etc.
– 3-grams: the rain in, rain in Spain, in Spain falls, ….
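The definition can be checked on the example sentence with a few lines of code. The sketch assumes that the items are whitespace-separated words; the function name `ngrams` is chosen here for illustration.

```python
def ngrams(text, n):
    """Return the contiguous n-word sequences of a sentence, in order."""
    tokens = text.split()
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

sentence = "the rain in Spain falls mainly on the plain"
bigrams = ngrams(sentence, 2)
trigrams = ngrams(sentence, 3)
```

A sentence with k words has k - n + 1 n-grams, so the nine-word example gives eight 2-grams and seven 3-grams.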

In HW5
§ You will work with the Google 4-gram corpus. Example:
– analysis is often described 1991 10 1 1
– analysis is often described 1992 30 2 1
§ We will ask you to
– Find total occurrence counts (this will be similar to just word count)
• in the example above "analysis is often described" occurs a total of 10 + 30 = 40 times.
– Convert 4-grams to 2-grams (think what should be the mapper and reducer for this)
• Example: "analysis is often described" will give rise to the following 2-grams: analysis is, is often, often described

Degree of graph Example
§ Find degree of every node in a graph
Example: In a friendship graph, what is the number of friends of every person:
Node 1 = 2
Node 2 = 3
Node 3 = 2
Node 4 = 3
Node 5 = 3
Node 6 = 1

Degree of each node in a graph
§ Suppose you have the edge list == a table!
Schema? Edges(from, to)
6 4
4 6
4 3
3 4
4 5
5 4
...
§ SQL for degree list?
SELECT from, count(*) FROM Edges GROUP BY from
(from is a reserved word in SQL, so a real query would quote or rename that column)
§ MapReduce?
Mapper: emit(from, 1)
Reducer: emit(from, count())
I.e., essentially equivalent to the 'word-count' example :)
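The equivalence between the SQL query and the MapReduce job can be checked on the visible part of the edge list. This sketch uses only the six rows shown on the slide (the rest of the list is elided there), runs the query with Python's standard `sqlite3` module with the keyword column names quoted, and uses a small in-memory grouping as a stand-in for the framework.

```python
import sqlite3
from collections import defaultdict

# Only the six rows visible on the slide; the rest of the list is elided there.
edges = [(6, 4), (4, 6), (4, 3), (3, 4), (4, 5), (5, 4)]

def degrees_sql(edges):
    conn = sqlite3.connect(":memory:")
    # "from" and "to" are SQL keywords, so the column names are quoted.
    conn.execute('CREATE TABLE Edges ("from" INTEGER, "to" INTEGER)')
    conn.executemany("INSERT INTO Edges VALUES (?, ?)", edges)
    rows = conn.execute(
        'SELECT "from", COUNT(*) FROM Edges GROUP BY "from"').fetchall()
    conn.close()
    return dict(rows)

def degrees_mapreduce(edges):
    groups = defaultdict(list)
    for src, _dst in edges:          # Mapper: emit(from, 1)
        groups[src].append(1)
    return {node: sum(ones) for node, ones in groups.items()}  # Reducer: count

sql_result = degrees_sql(edges)
mr_result = degrees_mapreduce(edges)
```

Since each undirected edge appears in both directions in the list, counting rows per `from` value gives the degree of that node.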

Conclusions
§ Hadoop is a distributed data-intensive computing framework
§ MapReduce
– Simple programming paradigm
– Surprisingly powerful (may not be suitable for all tasks though)
§ Hadoop has specialized File System, Master-Slave Architecture to scale up

NoSQL and Hadoop
§ Hot area with several new problems
– Good for academic research
– Good for industry
= Fun AND Profit :)

POINTERS AND FURTHER READING

Implementations
§ Google
– Not available outside Google
§ Hadoop
– An open-source implementation in Java
– Uses HDFS for stable storage
– Download: http://lucene.apache.org/hadoop/
§ Aster Data
– Cluster-optimized SQL Database that also implements MapReduce

Cloud Computing
§ Ability to rent computing by the hour
– Additional services e.g., persistent storage
§ Amazon's "Elastic Compute Cloud" (EC2)
§ Aster Data and Hadoop can both be run on EC2

Reading
§ Jeffrey Dean and Sanjay Ghemawat: MapReduce: Simplified Data Processing on Large Clusters
– http://labs.google.com/papers/mapreduce.html
§ Sanjay Ghemawat, Howard Gobioff, and Shun-Tak Leung: The Google File System
– http://labs.google.com/papers/gfs.html

Resources
§ Hadoop Wiki
– Introduction
• http://wiki.apache.org/lucene-hadoop/
– Getting Started
• http://wiki.apache.org/lucene-hadoop/GettingStartedWithHadoop
– Map/Reduce Overview
• http://wiki.apache.org/lucene-hadoop/HadoopMapReduce
• http://wiki.apache.org/lucene-hadoop/HadoopMapRedClasses
– Eclipse Environment
• http://wiki.apache.org/lucene-hadoop/EclipseEnvironment
§ Javadoc
– http://lucene.apache.org/hadoop/docs/api/

Resources
§ Releases from Apache download mirrors
– http://www.apache.org/dyn/closer.cgi/lucene/hadoop/
§ Nightly builds of source
– http://people.apache.org/dist/lucene/hadoop/nightly/
§ Source code from subversion
– http://lucene.apache.org/hadoop/version_control.html

Further Reading
§ Programming model inspired by functional language primitives
§ Partitioning/shuffling similar to many large-scale sorting systems
– NOW-Sort ['97]
§ Re-execution for fault tolerance
– BAD-FS ['04] and TACC ['97]
§ Locality optimization has parallels with Active Disks/Diamond work
– Active Disks ['01], Diamond ['04]
§ Backup tasks similar to Eager Scheduling in Charlotte system
– Charlotte ['96]
§ Dynamic load balancing solves similar problem as River's distributed queues
– River ['99]