Page 1:

CS 347: Parallel and Distributed

Data Management

Notes X: S4

Hector Garcia-Molina

Page 2:

Material based on:

Page 3:

S4

• Platform for processing unbounded data streams
  – general purpose
  – distributed
  – scalable
  – partially fault tolerant

(whatever this means!)

Page 4:

Data Stream

  [A=34, B=abcd, C=15, D=TRUE]
  [A=3,  B=def,  C=0,  D=FALSE]
  [A=34, B=abcd, C=78]

Terminology: event (data record), key, attribute

Question: Can an event have duplicate attributes for the same key? (I think so...)

The stream is unbounded, generated by, say, user queries, purchase transactions, phone calls, sensor readings, ...
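For concreteness, a minimal sketch of one way such an event could be represented (plain Python; the Event class and its fields are invented for illustration, and the list of (key, attribute) pairs deliberately allows duplicate keys):

    # Illustrative only: an S4-style event as a stream name plus (key, attribute) pairs.
    from dataclasses import dataclass, field

    @dataclass
    class Event:
        stream: str                                  # e.g. "purchases"
        attrs: list = field(default_factory=list)    # [(key, value), ...], duplicates allowed

        def values(self, key):
            """Return every attribute value stored under a given key."""
            return [v for k, v in self.attrs if k == key]

    e = Event("purchases", [("A", 34), ("B", "abcd"), ("C", 15), ("D", True)])
    print(e.values("A"))                             # [34]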

Page 5:

S4 Processing Workflow

user-specified “processing unit”

  [A=34, B=abcd, C=15, D=TRUE]

Page 6:

Inside a Processing Unit

(Figure: a processing unit contains processing elements PE1, PE2, ..., PEn, one per key value: key=a, key=b, ..., key=z.)

Page 7:

Example:

• Stream of English quotes
• Produce a sorted list of the top K most frequent words
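As a rough sketch of how this example might decompose into keyed PEs (the class names WordCountPE and TopKPE are invented for this illustration, not taken from the slides):

    # Illustrative sketch: per-word counting PEs feed a single top-K PE.
    from collections import Counter

    class WordCountPE:
        """One instance per word (the key); counts occurrences of that word."""
        def __init__(self, word):
            self.word, self.count = word, 0
        def on_event(self, _event):
            self.count += 1
            return ("wordcount", self.word, self.count)   # emit the updated count

    class TopKPE:
        """A single instance (one key) that keeps the K most frequent words."""
        def __init__(self, k):
            self.k, self.counts = k, Counter()
        def on_event(self, word, count):
            self.counts[word] = count
            return self.counts.most_common(self.k)        # sorted [(word, count), ...]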

Page 8:

Page 9:

Processing Nodes

(Figure: events are routed by hash(key), with hash values 1..m selecting a processing node; each processing node hosts the PEs for the key values it receives, e.g. key=a, key=b, ...)

Page 10:

Dynamic Creation of PEs

• As a processing node sees new key attributes, it dynamically creates new PEs to handle them

• Think of PEs as threads
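A minimal sketch of that dispatch logic (illustrative names only; it assumes a PE class whose constructor takes the key value and that exposes an on_event method, as in the earlier sketches):

    # Illustrative: hash(key) picks the node; the node lazily creates one PE
    # per previously unseen key value (think of each PE as a thread).
    class ProcessingNode:
        def __init__(self, pe_class):
            self.pe_class = pe_class
            self.pes = {}                                # key value -> PE instance

        def on_event(self, key_value, event):
            if key_value not in self.pes:                # first time this key value is seen
                self.pes[key_value] = self.pe_class(key_value)
            return self.pes[key_value].on_event(event)

    def route(nodes, key_value, event):
        node = nodes[hash(key_value) % len(nodes)]       # hash(key) selects the node
        return node.on_event(key_value, event)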

Page 11:

Another View of Processing Node

Page 12:

Failures

• Communication layer detects node failures and provides failover to standby nodes

• What happens to events in transit during a failure? (My guess: events are lost!)

Page 13:

How do we do DB operations on top of S4?

• Selects & projects are easy!

• What about joins?

Page 14:

What is a Stream Join?

• For a true join, we would need to store all inputs forever! Not practical...

• Instead, define a window join:
  – at time t a new R tuple arrives
  – it only joins with the previous w S tuples

(Figure: two input streams, R and S.)

Page 15:

One Idea for Window Join

(Figure: streams R and S feed a PE keyed on the join attribute.)

  [key=2, rel=R, C=15, D=TRUE]
  [key=2, rel=S, C=0, D=FALSE]

code for PE:
  for each event e:
    if e.rel = R:
      store e in Rset (last w)
      for s in Sset: output join(e, s)
    else ...

“key” is the join attribute

Page 16:

One Idea for Window Join

(same streams, events, and PE code as on Page 15)

Is this right??? (this enforces the window on a per-key-value basis)

Maybe add sequence numbers to events to enforce the correct window?
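A sketch of such a window-join PE in Python (illustrative only: it enforces the window per join-key value, as questioned above, keeps the last w tuples of each relation, and fills in the symmetric S branch that the slide elides):

    # Illustrative window-join PE keyed on the join attribute: keep the last w
    # tuples of each relation and join each new arrival against the other side.
    from collections import deque

    class WindowJoinPE:
        def __init__(self, key, w):
            self.key = key
            self.r_window = deque(maxlen=w)              # last w R tuples for this key
            self.s_window = deque(maxlen=w)              # last w S tuples for this key

        def on_event(self, e):
            """Return the joined tuples produced by this arrival (dict merge stands in for join(e, s))."""
            if e["rel"] == "R":
                self.r_window.append(e)
                return [{**e, **s} for s in self.s_window]
            else:                                        # e["rel"] == "S"
                self.s_window.append(e)
                return [{**r, **e} for r in self.r_window]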

Page 17:

Another Idea for Window Join

(Figure: streams R and S feed a single PE.)

  [key=fake, rel=R, C=15, D=TRUE]
  [key=fake, rel=S, C=0, D=FALSE]

code for PE:
  for each event e:
    if e.rel = R:
      store e in Rset (last w)
      for s in Sset:
        if e.C = s.C then output join(e, s)
    else if e.rel = S ...

All R & S events have “key=fake”; say the join key is C.

Page 18:

Another Idea for Window Join

(same streams, events, and PE code as on Page 17)

All R & S events have “key=fake”; say the join key is C. The entire join is done in one PE, so there is no parallelism.

Page 19:

Do You Have a Better Idea for Window Join?

(Figure: streams R and S.)

  [key=?, rel=R, C=15, D=TRUE]
  [key=?, rel=S, C=0, D=FALSE]

Page 20:

Final Comment: Managing state of PE

Who manages state?
  S4: the user does
  Mupet: the system does
Is state persistent?

(Figure: a PE processing the event [A=34, B=abcd, C=15, D=TRUE] alongside its state.)

Page 21:

CS 347: Parallel and Distributed

Data Management

Notes X: Hyracks

Hector Garcia-Molina

Page 22:

Hyracks

• Generalization of map-reduce
• Infrastructure for “big data” processing
• Material here based on:

Appeared in ICDE 2011

Page 23:

A Hyracks data object:

• Records partitioned across N sites

• Simple record schema is available (more than just key-value)

(Figure: the records of the data object spread across site 1, site 2, and site 3.)

Page 24:

Operations

operator

distribution rule

Page 25:

Operations (parallel execution)


op 1

op 2

op 3

Page 26:

Example: Hyracks Specification

Page 27:

Example: Hyracks Specification

initial partitions

input & output is a set of partitions

mapping to new partitions

2 replicated partitions

Page 28:

Notes

• Job specification can be done manually or automatically

Page 29:

Example: Activity Node Graph

Page 30:

Example: Activity Node Graph


sequencing constraint

activities

Page 31:

Example: Parallel Instantiation

Page 32:

Example: Parallel Instantiation

stage (start after input stages finish)

Page 33:

System Architecture

Page 34:

Map-Reduce on Hyracks

Page 35:

Library of Operators:

• File readers/writers
• Mappers
• Sorters
• Joiners (various types)
• Aggregators

• Can add more

Page 36:

Library of Connectors:

• N:M hash partitioner
• N:M hash-partitioning merger (input sorted)
• N:M range partitioner (with partition vector)
• N:M replicator
• 1:1

• Can add more!
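To make the connector idea concrete, here is a small sketch of an N:M hash partitioner in Python (purely illustrative, not the Hyracks API; it assumes each sender partition pushes records into the M receiver partitions' queues):

    # Illustrative N:M hash-partitioning connector: each of N sender partitions
    # routes every record to one of M receiver partitions by hashing a key field.
    class HashPartitioner:
        def __init__(self, receivers, key_fn):
            self.receivers = receivers                   # M receiver queues (plain lists here)
            self.key_fn = key_fn                         # extracts the partitioning key

        def send(self, record):
            m = len(self.receivers)
            self.receivers[hash(self.key_fn(record)) % m].append(record)

    # Usage: 2 sender-side connectors, 3 receivers, partition on the first field.
    receivers = [[], [], []]
    senders = [HashPartitioner(receivers, key_fn=lambda r: r[0]) for _ in range(2)]
    senders[0].send(("apple", 1)); senders[1].send(("apple", 2))   # same key, same receiver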

Page 37:

Hyracks Fault Tolerance: Work in progress?

Page 38:

Hyracks Fault Tolerance: Work in progress?

save output to files

save output to files

The Hadoop/MR approach: save partial results to persistent storage; after a failure, redo all the work needed to reconstruct the missing data.

Page 39:

Hyracks Fault Tolerance: Work in progress?

Can we do better? Maybe: each process retains its previous results until they are no longer needed?

Pi output: r1, r2, r3, r4, r5, r6, r7, r8
  (the earlier results have “made way” to the final result; the rest are the current output)

Page 40:

CS 347: Parallel and Distributed

Data Management

Notes X: Pregel

Hector Garcia-Molina

Page 41:

Material based on:

• In SIGMOD 2010

• Note there is an open-source version of Pregel called GIRAPH

Page 42:

Pregel

• A computational model/infrastructure for processing large graphs

• Prototypical example: Page Rank

(Figure: a small directed graph with nodes a, b, d, e, f, x; a and b have edges into x.)

PR[i+1, x] = f( PR[i, a]/n_a , PR[i, b]/n_b )    (n_a, n_b = out-degrees of a and b)

Page 43:

Pregel

• Synchronous computation in iterations
• In one iteration, each node:
  – gets messages from neighbors
  – computes
  – sends data to neighbors

(same graph and PageRank formula as on Page 42)

Page 44:

Pregel vs Map-Reduce/S4/Hyracks/...

• In Map-Reduce, S4, Hyracks, ..., the workflow is separate from the data

• In Pregel, the data (the graph) drives the data flow

Page 45:

Pregel Motivation

• Many applications require graph processing

• Map-Reduce and other workflow systems are not a good fit for graph processing

• Need to run graph algorithms on many processors

Page 46:

Example of Graph Computation

Page 47:

Termination

• After each iteration, each vertex votes to halt or not

• If all vertexes vote to halt, computation terminates

Page 48:

Vertex Compute (simplified)

• Available data for iteration i:
  – InputMessages: { [from, value] }
  – OutputMessages: { [to, value] }
  – OutEdges: { [to, value] }
  – MyState: value

• OutEdges and MyState are remembered for the next iteration
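A minimal Python sketch of this vertex model (field names follow the slide; the superstep driver, which collects OutputMessages and delivers them as the next iteration's InputMessages, is assumed and not shown):

    # Illustrative Pregel-style vertex: OutEdges and MyState persist across
    # supersteps; compute() consumes InputMessages, fills OutputMessages,
    # and may vote to halt.
    class Vertex:
        def __init__(self, vertex_id, value, out_edges):
            self.id = vertex_id
            self.MyState = value                         # remembered between iterations
            self.OutEdges = out_edges                    # [(to, edge_value), ...]
            self.OutputMessages = []                     # [(to, value), ...], refilled each step
            self.halted = False

        def send(self, to, value):
            self.OutputMessages.append((to, value))

        def vote_to_halt(self):
            self.halted = True

        def compute(self, superstep, InputMessages):
            raise NotImplementedError                    # algorithm-specific; see the examples below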

Page 49:

Max Computation

• change := false
• for [f, w] in InputMessages do
    if w > MyState.value then
      [ MyState.value := w
        change := true ]
• if (superstep = 1) OR change then
    for [t, w] in OutEdges do
      add [t, MyState.value] to OutputMessages
  else
    vote to halt
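The same algorithm as a runnable method on the Vertex sketch above (each vertex keeps and floods the largest value it has seen; MyState holds the value directly in this sketch):

    # Illustrative: maximum-value propagation on the Vertex sketch above.
    class MaxVertex(Vertex):
        def compute(self, superstep, InputMessages):
            change = False
            for _frm, w in InputMessages:
                if w > self.MyState:
                    self.MyState, change = w, True
            if superstep == 1 or change:
                for to, _edge_value in self.OutEdges:
                    self.send(to, self.MyState)          # forward the current maximum
            else:
                self.vote_to_halt()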

Page 50:

Page Rank Example

Page 51:

Page Rank Example

(Figure: PageRank vertex code, with callouts for the iteration count, iterating through InputMessages, MyState.value, and a shorthand for sending the same message along all out-edges.)
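As a hedged reconstruction of a typical Pregel-style PageRank compute(), using the Vertex sketch above (the damping factor 0.85, the 30-superstep cutoff, and the known vertex count N follow the standard published example and are assumptions, not taken from the slide):

    # Illustrative PageRank vertex in the spirit of the Pregel paper's example.
    # MyState is assumed to be initialized to 1/N.
    class PageRankVertex(Vertex):
        N = 1000                                         # assumed total number of vertices

        def compute(self, superstep, InputMessages):
            if superstep > 1:
                s = sum(value for _frm, value in InputMessages)   # iterate thru InputMessages
                self.MyState = 0.15 / self.N + 0.85 * s           # update MyState.value
            if superstep < 30:                                    # fixed iteration count
                n = len(self.OutEdges)
                if n:
                    for to, _ in self.OutEdges:                   # send same msg to all
                        self.send(to, self.MyState / n)
            else:
                self.vote_to_halt()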

Page 52:

Single-Source Shortest Paths

Page 53:

Architecture

(Figure: a master node and workers A, B, C, plus input data 1 and input data 2; the graph has nodes a, b, c, d, ...; sample input record: [a, value].)

Page 54:

Architecture

(Figure: the graph is partitioned and assigned to workers: worker A gets vertexes a, b, c; worker B gets d, e; worker C gets f, g, h.)

Page 55:

Architecture

(Figure: the workers read the input data; worker A forwards input values to the appropriate workers.)

Page 56:

Architecture

(Figure: same setup; run superstep 1.)

Page 57:

Architecture

(Figure: same setup; at the end of superstep 1 the workers send their messages and report to the master: halt?)

Page 58:

Architecture

(Figure: same setup; run superstep 2.)

Page 59:

Architecture

(Figure: same setup; checkpoint.)

Page 60:

Architecture

(Figure: checkpoint: each worker writes to stable store its MyState, OutEdges, and InputMessages (or OutputMessages).)

Page 61:

Architecture

(Figure: worker C is gone; if a worker dies, find a replacement and restart from the latest checkpoint.)

Page 62:

Architecture

(Figure: the full setup again: master, worker A (vertexes a, b, c), worker B (d, e), worker C (f, g, h), and the input data.)

Page 63:

Interesting Challenge

• How best to partition graph for efficiency?

(Figure: a small graph with nodes a, b, d, e, f, g, x to be partitioned across workers.)

Page 64:

CS 347: Parallel and Distributed

Data Management

Notes X:BigTable, HBASE, Cassandra

Hector Garcia-Molina

Page 65:

Sources

• HBASE: The Definitive Guide, Lars George, O’Reilly Publishers, 2011.

• Cassandra: The Definitive Guide, Eben Hewitt, O’Reilly Publishers, 2011.

• BigTable: A Distributed Storage System for Structured Data, F. Chang et al., ACM Transactions on Computer Systems, Vol. 26, No. 2, June 2008.

Page 66:

Lots of Buzz Words!

• “Apache Cassandra is an open-source, distributed, decentralized, elastically scalable, highly available, fault-tolerant, tunably consistent, column-oriented database that bases its distribution design on Amazon’s dynamo and its data model on Google’s Big Table.”

• Clearly, it is buzz-word compliant!!

Page 67:

Basic Idea: Key-Value Store

Table T:

  key   value
  k1    v1
  k2    v2
  k3    v3
  k4    v4

Page 68:

Basic Idea: Key-Value Store

Table T (same as on Page 67); keys are sorted

• API:
  – lookup(key) → value
  – lookup(key range) → values
  – getNext → value
  – insert(key, value)
  – delete(key)

• Each row has a timestamp
• Single-row actions are atomic (but not persistent in some systems?)
• No multi-key transactions
• No query language!
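A toy sketch of this API in Python (an in-memory, single-node stand-in with sorted keys and per-row timestamps; the names follow the slide, and this is not the BigTable/HBase/Cassandra client API):

    # Illustrative in-memory key-value "table" exposing the API from the slide.
    import bisect, time

    class Table:
        def __init__(self):
            self.keys, self.rows = [], {}                # sorted keys; key -> (value, timestamp)

        def insert(self, key, value):
            if key not in self.rows:
                bisect.insort(self.keys, key)            # keep the keys sorted
            self.rows[key] = (value, time.time())        # each row carries a timestamp

        def lookup(self, key):
            return self.rows[key][0]

        def lookup_range(self, lo, hi):                  # lookup(key range) -> values
            i = bisect.bisect_left(self.keys, lo)
            j = bisect.bisect_right(self.keys, hi)
            return [self.rows[k][0] for k in self.keys[i:j]]

        def delete(self, key):
            self.keys.remove(key)
            del self.rows[key]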

Page 69:

Fragmentation (Sharding)

A table with keys k1..k10 (values v1..v10) is split into tablets:
  server 1: k1..k4
  server 2: k5..k6
  server 3: k7..k10

• use a partition vector
• “auto-sharding”: the vector is selected automatically

(each fragment is a tablet)
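A quick sketch of how a partition vector maps a key to its tablet/server (illustrative only; real systems compare binary row keys rather than these toy strings):

    # Illustrative: a partition vector of split keys locates the tablet for a key.
    import bisect

    splits  = ["k5", "k7"]                               # keys < k5 -> 0, k5..k6 -> 1, >= k7 -> 2
    servers = ["server 1", "server 2", "server 3"]

    def server_for(key):
        return servers[bisect.bisect_right(splits, key)]

    print(server_for("k3"), server_for("k6"), server_for("k9"))   # server 1 server 2 server 3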

Page 70:

Tablet Replication

(Figure: the tablet holding k7..k10 is stored on server 3 (primary), server 4 (backup), and server 5 (backup).)

• Cassandra:
  – Replication Factor (# of copies)
  – R/W Rule: One, Quorum, All
  – Policy (e.g., Rack Unaware, Rack Aware, ...)
  – Read all copies (return the fastest reply, do repairs if necessary)

• HBase: does not manage replication itself; relies on HDFS

Page 71:

Need a “directory”

• For each (Table Name, Key): the server that stores the key, and the backup servers

• Can be implemented as a special table.

Page 72:

Tablet Internals

memory segment:
  k3 v3, k8 v8, k9 delete, k15 v15

disk segments:
  k2 v2, k6 v6, k9 v9, k12 v12
  k4 v4, k5 delete, k10 v10, k20 v20, k22 v22

Design philosophy (?): the primary scenario is where all data is in memory; disk storage added as an afterthought.

Page 73:

Tablet Internals

(same segments as on Page 72; the memory segment is flushed to disk periodically)

• the tablet is the merge of all segments (files)
• disk segments are immutable
• writes are efficient; reads are only efficient when all data is in memory
• periodically reorganize into a single segment

(the “delete” entries are tombstones)
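A toy sketch of the read path this design implies (illustrative only: consult the memory segment first, then the disk segments from newest to oldest; a tombstone hides any older value):

    # Illustrative: merge-on-read across a memory segment and immutable disk
    # segments; a "delete" tombstone hides older versions of the key.
    TOMBSTONE = "delete"

    memory = {"k3": "v3", "k8": "v8", "k9": TOMBSTONE, "k15": "v15"}
    disk   = [{"k2": "v2", "k6": "v6", "k9": "v9", "k12": "v12"},        # newer segment
              {"k4": "v4", "k5": TOMBSTONE, "k10": "v10", "k20": "v20"}] # older segment

    def read(key):
        for segment in [memory] + disk:                  # newest first
            if key in segment:
                value = segment[key]
                return None if value == TOMBSTONE else value
        return None

    print(read("k9"))    # None: the tombstone in memory hides the older v9 on disk
    print(read("k12"))   # v12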

Page 74:

Column Family

  K    A     B     C     D     E
  k1   a1    b1    c1    d1    e1
  k2   a2    null  c2    d2    e2
  k3   null  null  null  d3    e3
  k4   a4    b4    c4    e4    e4
  k5   a5    b5    null  null  null

Page 75:

Column Family

(same table as on Page 74)

• for storage, treat each row as a single “super value”
• the API provides access to sub-values
  (use family:qualifier to refer to sub-values, e.g., price:euros, price:dollars)
• Cassandra allows a “super-column”: two-level nesting of columns
  (e.g., column A can have sub-columns X & Y)

Page 76:

Vertical Partitions

The table from Page 74 can be manually implemented as:

  K  A         K  B        K  C        K  D   E
  k1 a1        k1 b1       k1 c1       k1 d1  e1
  k2 a2        k4 b4       k2 c2       k2 d2  e2
  k4 a4        k5 b5       k4 c4       k3 d3  e3
  k5 a5                                k4 e4  e4

Page 77:

Vertical Partitions

(same tables as on Page 76; each vertical partition is a column family)

• good for sparse data
• good for column scans
• not so good for tuple reads
• are atomic updates to a row still supported?
• the API supports actions on the full table, mapped to actions on the column tables
• the API supports column “project”
• to decide on a vertical partition, you need to know the access patterns

Page 78:

Failure Recovery (BigTable, HBase)

(Figure: each tablet server keeps its tablet in memory and does write-ahead logging to GFS or HDFS; the master node pings the tablet servers and has a spare tablet server ready to take over.)

Page 79:

Failure recovery (Cassandra)

• No master node, all nodes in “cluster” equal


server 1 server 2 server 3

Page 80:

Failure recovery (Cassandra)

• No master node, all nodes in “cluster” equal

server 1 server 2 server 3

access any table in the cluster at any server; that server sends requests to the other servers

Page 81:

CS 347: Parallel and Distributed

Data Management

Notes X: MemCacheD

Hector Garcia-Molina

Page 82:

MemCacheD

• General-purpose distributed memory caching system

• Open source

Page 83:

What MemCacheD Should Be (but ain't)

(Figure: data sources 1..3, each with a cache 1..3, presented to the client as a single distributed cache; the client simply calls get_object(X).)

Page 84:

What MemCacheD Should Be (but ain't)

(Figure: same as Page 83; the distributed cache locates the object X in whichever cache holds it and returns it.)

Page 85:

What MemCacheD Is (as far as I can tell)

(Figure: data sources 1..3 with caches 1..3; the client itself names a specific cache: put(cache 1, myName, X), then get_object(cache 1, MyName).)

Page 86:

What MemCacheD Is

(same figure as on Page 85)

A cache can purge MyName whenever it wants. Each cache is a hash table of (name, value) pairs. There is no connection between the caches.
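A toy sketch of that usage pattern (illustrative only, not the memcached protocol or a real client library: the application picks a cache itself, often by hashing the object name, and falls back to the data source on a miss):

    # Illustrative: independent (name, value) hash tables; the client chooses
    # which cache to talk to and reloads from the data source on a miss.
    caches = [dict(), dict(), dict()]                    # cache 1..3, no connection between them

    def pick(name):
        return caches[hash(name) % len(caches)]          # client-side choice of cache

    def get_object(name, load_from_source):
        cache = pick(name)
        if name not in cache:                            # miss, or the cache purged it
            cache[name] = load_from_source(name)
        return cache[name]

    print(get_object("MyName", lambda n: "value-of-" + n))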

Page 87:

CS 347: Parallel and Distributed

Data Management

Notes X: ZooKeeper

Hector Garcia-Molina

Page 88:

ZooKeeper

• Coordination service for distributed processes
• Provides clients with a high-throughput, highly available, memory-only file system

(Figure: several clients talk to ZooKeeper, which holds a tree of znodes rooted at /: a and b under the root, c and d under a, e under d; the znode /a/d/e has state.)

Page 89:

ZooKeeper Servers

(Figure: clients connect to a set of ZooKeeper servers; each server keeps a replica of the state.)

Page 90:

ZooKeeper Servers

(Figure: same as Page 89; a read is answered by the server the client is connected to.)

Page 91:

ZooKeeper Servers

(Figure: same as Page 89; a write is propagated and synced to the other replicas. Writes are totally ordered, using the Zab algorithm.)

Page 92:

Failures

(Figure: same as Page 89; if your server dies, just connect to a different one!)

Page 93:

ZooKeeper Notes

• Differences with a file system:
  – all nodes can store data
  – storage size is limited
• API: insert node, read node, read children, delete node, ...
• Can set triggers on nodes
• Clients and servers must know all servers
• ZooKeeper works as long as a majority of servers are available
• Writes are totally ordered; reads are ordered w.r.t. writes
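For flavor, roughly what that API looks like from a client, using the kazoo Python client library (an assumption: the slides do not mention kazoo, and the host/port and znode paths are made up):

    # Illustrative ZooKeeper usage via the kazoo client library.
    from kazoo.client import KazooClient

    zk = KazooClient(hosts="127.0.0.1:2181")
    zk.start()

    zk.ensure_path("/a/d")
    zk.create("/a/d/e", b"some state")        # insert node (znode with data)
    data, stat = zk.get("/a/d/e")             # read node
    children = zk.get_children("/a")          # read children

    @zk.DataWatch("/a/d/e")                   # a "trigger" (watch) on a node
    def watch_node(data, stat):
        print("znode changed:", data)

    zk.delete("/a/d/e")                       # delete node
    zk.stop()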

Page 94:

CS 347: Parallel and Distributed

Data Management

Notes X: Kestrel

Hector Garcia-Molina

Page 95:

Kestrel

• Kestrel server handles a set of reliable, ordered message queues

• A server does not communicate with other servers (advertised as good for scalability!)

(Figure: clients talk to independent Kestrel servers; each server holds its own queues.)

Page 96:

Kestrel

(same points as on Page 95)

(Figure: same setup as Page 95; one client does put q, another does get q, and each server journals its queues to a local log.)