cs 347lecture 11 cs 347: parallel and distributed data management notes 01: introduction brian...

65
CS 347 Lecture 1 1 CS 347: Parallel and Distributed Data Management Notes 01: Introduction Brian Cooper Based on Material by Hector Garcia- Molina

Upload: percival-campbell

Post on 16-Dec-2015

220 views

Category:

Documents


1 download

TRANSCRIPT

Page 1: CS 347Lecture 11 CS 347: Parallel and Distributed Data Management Notes 01: Introduction Brian Cooper Based on Material by Hector Garcia-Molina

CS 347 Lecture 1 1

CS 347: Parallel and Distributed

Data Management

Notes 01: Introduction

Brian Cooper

Based on Material by Hector Garcia-Molina

Page 2: CS 347Lecture 11 CS 347: Parallel and Distributed Data Management Notes 01: Introduction Brian Cooper Based on Material by Hector Garcia-Molina

CS 347 Lecture 1 2

In CS245: Centralized DB system

Software:

ApplicationSQL Front EndQuery ProcessorTransaction Proc.File Access

P

M ...

• Simplifications:• single front end• one place to keep locks• if processor fails, system fails, ...

Page 3: CS 347Lecture 11 CS 347: Parallel and Distributed Data Management Notes 01: Introduction Brian Cooper Based on Material by Hector Garcia-Molina

CS 347 Lecture 1 3

In CS347

• Multiple processors ( + memories)• Heterogeneity and autonomy of

“components”

Page 4: CS 347Lecture 11 CS 347: Parallel and Distributed Data Management Notes 01: Introduction Brian Cooper Based on Material by Hector Garcia-Molina

CS 347 Lecture 1 4

Multiple processors

• Opportunity for parallelism• Opportunity for reliability• Synchronization issues

To illustrate synchronization problems: Two Generals Problem

Page 5: CS 347Lecture 11 CS 347: Parallel and Distributed Data Management Notes 01: Introduction Brian Cooper Based on Material by Hector Garcia-Molina

CS 347 Lecture 1 5

The one general problem (Trivial!)

Battlefield

G

Troops

Page 6: CS 347Lecture 11 CS 347: Parallel and Distributed Data Management Notes 01: Introduction Brian Cooper Based on Material by Hector Garcia-Molina

CS 347 Lecture 1 6

The two general problem:

<------------------------------->messengers

Blue army Red army

BlueG Red

G

Enemy

Page 7: CS 347Lecture 11 CS 347: Parallel and Distributed Data Management Notes 01: Introduction Brian Cooper Based on Material by Hector Garcia-Molina

CS 347 Lecture 1 7

• Blue and red army must attack at same time

• Blue and red generals synchronize through messengers

• Messengers can be lost

Rules:

Page 8: CS 347Lecture 11 CS 347: Parallel and Distributed Data Management Notes 01: Introduction Brian Cooper Based on Material by Hector Garcia-Molina

CS 347 Lecture 1 8

How Many Messages Do We Need?

BG RG

attack at 9am

assume blue starts...

Is this enough??

Page 9: CS 347Lecture 11 CS 347: Parallel and Distributed Data Management Notes 01: Introduction Brian Cooper Based on Material by Hector Garcia-Molina

CS 347 Lecture 1 9

How Many Messages Do We Need?

BG RG

attack at 9am

assume blue starts...

Is this enough??

ack (red goes at 9am)

Page 10: CS 347Lecture 11 CS 347: Parallel and Distributed Data Management Notes 01: Introduction Brian Cooper Based on Material by Hector Garcia-Molina

CS 347 Lecture 1 10

How Many Messages Do We Need?

BG RG

attack at 9am

assume blue starts...

Is this enough??

ack (red goes at 9am)

got ack

Page 11: CS 347Lecture 11 CS 347: Parallel and Distributed Data Management Notes 01: Introduction Brian Cooper Based on Material by Hector Garcia-Molina

CS 347 Lecture 1 11

Stated problem is Impossible!

• Theorem: There is no protocol that uses a finite number of messages that solves the two-generals problem (as stated here)

Alternatives??

Page 12: CS 347Lecture 11 CS 347: Parallel and Distributed Data Management Notes 01: Introduction Brian Cooper Based on Material by Hector Garcia-Molina

CS 347 Lecture 1 12

Probabilistic Approach?

• Send as many messages as possible, hope one gets through...

BG RG

attack at 9am

assume blue starts...

attack at 9am

attack at 9am

attack at 9am

Page 13: CS 347Lecture 11 CS 347: Parallel and Distributed Data Management Notes 01: Introduction Brian Cooper Based on Material by Hector Garcia-Molina

CS 347 Lecture 1 13

Eventual Commit

• Eventually both sides attack...

BG RG

attack ASAP

assume blue starts...

on my way!

retransmitsretransmits

Page 14: CS 347Lecture 11 CS 347: Parallel and Distributed Data Management Notes 01: Introduction Brian Cooper Based on Material by Hector Garcia-Molina

CS 347 Lecture 1 14

Eventual Commit

• One message sent every time unit• Probability of success one message is p• What is probability that red commits by

time t?

BG RGattack ASAP

on my way!

retransmitsretransmits

Page 15: CS 347Lecture 11 CS 347: Parallel and Distributed Data Management Notes 01: Introduction Brian Cooper Based on Material by Hector Garcia-Molina

CS 347 Lecture 1 15

Eventual Commit

BG RGattack ASAP

on my way!

retransmitsretransmits

• C(1) = p

Page 16: CS 347Lecture 11 CS 347: Parallel and Distributed Data Management Notes 01: Introduction Brian Cooper Based on Material by Hector Garcia-Molina

CS 347 Lecture 1 16

Eventual Commit

BG RGattack ASAP

on my way!

retransmitsretransmits

• C(1) = p• C(2) = p + (1-p)p

Page 17: CS 347Lecture 11 CS 347: Parallel and Distributed Data Management Notes 01: Introduction Brian Cooper Based on Material by Hector Garcia-Molina

CS 347 Lecture 1 17

Eventual Commit

BG RGattack ASAP

on my way!

retransmitsretransmits

• C(1) = p• C(2) = p + (1-p)p• C(3) = p + (1-p)p + (1-p)2p • C(4) = p + (1-p)p + (1-p)2p + (1-

p)3p

Page 18: CS 347Lecture 11 CS 347: Parallel and Distributed Data Management Notes 01: Introduction Brian Cooper Based on Material by Hector Garcia-Molina

Eventual Commit

CS 347 Lecture 1 18

C(t)

t

p

Page 19: CS 347Lecture 11 CS 347: Parallel and Distributed Data Management Notes 01: Introduction Brian Cooper Based on Material by Hector Garcia-Molina

CS 347 Lecture 1 19

Eventual Commit

BG RGattack ASAP

on my way!

retransmitsretransmits

• How expensive is protocol?• E = expected number of messages• Homework: compute E (function of

p)

Page 20: CS 347Lecture 11 CS 347: Parallel and Distributed Data Management Notes 01: Introduction Brian Cooper Based on Material by Hector Garcia-Molina

CS 347 Lecture 1 20

2-Phase Eventual Commit

• Eventually both sides attack...

BG RG

ready to attack?

assume blue starts...

yes, at your disposal

attack ASAP

ack

retransmits

retransmits

phase 1

phase 2

Page 21: CS 347Lecture 11 CS 347: Parallel and Distributed Data Management Notes 01: Introduction Brian Cooper Based on Material by Hector Garcia-Molina

CS 347 Lecture 1 21

Commit Protocols

• Will study commit protocols like these...

Page 22: CS 347Lecture 11 CS 347: Parallel and Distributed Data Management Notes 01: Introduction Brian Cooper Based on Material by Hector Garcia-Molina

CS 347 Lecture 1 22

Heterogeneity

Select new

investmentsApplication

RDBMS FilesStocktickertape

Portfolio History ofdividends,ratios,...

Page 23: CS 347Lecture 11 CS 347: Parallel and Distributed Data Management Notes 01: Introduction Brian Cooper Based on Material by Hector Garcia-Molina

CS 347 Lecture 1 23

Autonomy

Example: unable to get statisticsfor query optimization

Example: blue general may have mind of his (or her) own!

Page 24: CS 347Lecture 11 CS 347: Parallel and Distributed Data Management Notes 01: Introduction Brian Cooper Based on Material by Hector Garcia-Molina

CS 347 Lecture 1 24

• So, in CS347 we study data management

with multiple processors and possible

autonomy, heterogeneity– Impact on:

• Data organization• Query processing• Access structures• Concurrency control• Recovery

Page 25: CS 347Lecture 11 CS 347: Parallel and Distributed Data Management Notes 01: Introduction Brian Cooper Based on Material by Hector Garcia-Molina

CS 347 Lecture 1 25

• Renewed Interest in Distributed/ParallelData Processing!– Massive web data, manage with many

computers– How to crawl and search the web?– Peer-to-peer systems manage huge amounts of

data– Data from many sources (e.g., comparison

shopping): how to integrate?– Sensor Networks: data generated an many

sensors/devices, need to analyze– Multi-player games (e.g., World of Warcraft):

tons of distributed data

Page 26: CS 347Lecture 11 CS 347: Parallel and Distributed Data Management Notes 01: Introduction Brian Cooper Based on Material by Hector Garcia-Molina

CS 347 Lecture 1 26

It’s the Economy, Stupid!

• Example: Multi-player games

Data

state

P

P

PP

P

P

P

P

P

P

Page 27: CS 347Lecture 11 CS 347: Parallel and Distributed Data Management Notes 01: Introduction Brian Cooper Based on Material by Hector Garcia-Molina

CS 347 Lecture 1 27

It’s the Economy, Stupid!

• Example: Multi-player games

Data

state

P

P

PP

P

P

P

P

P

P

state

Page 28: CS 347Lecture 11 CS 347: Parallel and Distributed Data Management Notes 01: Introduction Brian Cooper Based on Material by Hector Garcia-Molina

CS 347 Lecture 1 28

Logistics• LECTURES: Mondays and Wednesdays 12:50pm

to 2:05pm, Gates B01• INSTRUCTOR: Brian Cooper; Email:

[email protected]; Office Hours: Mondays, Wednesdays 11:30 to 12:30.

• TEACHING ASSISTANT: – Vasilis Verroios, Email: [email protected]

• Piazza forum (tentative): https://piazza.com/class#spring2015/cs347.

• SECRETARY: Marianne Siroker; Office: Gates Hall 435;Email: [email protected]; Phone: (650) 723-0872

Page 29: CS 347Lecture 11 CS 347: Parallel and Distributed Data Management Notes 01: Introduction Brian Cooper Based on Material by Hector Garcia-Molina

CS 347 Lecture 1 29

Logistics

• TEXTBOOK: No required textbook. You'll be expected to read several research papers.

• CLASS WEB PAGE: http://www.stanford.edu/class/cs347Will contain homework assignments, course news, etc. Be sure to check it periodically.

• ASSIGNMENTS: about 5 homeworks• GRADING: Homeworks: 20%, Midterm 30%, Final:

50%.

Page 30: CS 347Lecture 11 CS 347: Parallel and Distributed Data Management Notes 01: Introduction Brian Cooper Based on Material by Hector Garcia-Molina

CS 347 Lecture 1 30

Tentative Syllabus 2014 (Part I)

DATE TOPIC• Monday March 30 Introduction [N01]• Wednesday April 1 Data Fragmentation [N02]• Monday April 6 Query processing [N03]• Wednesday April 8Query processing & Optimization [N04]• Monday April 13 Concurrency Control, Failures [N05]• Wednesday April 15 Reliable Data Management [N06]• Monday April 20 Reliable Data Management [N06]• Wednesday April 22 Guest Lecture – Cloudera Impala• Monday April 27 Replicated Data Management

[N07]• Wednesday April 29 Midterm

Page 31: CS 347Lecture 11 CS 347: Parallel and Distributed Data Management Notes 01: Introduction Brian Cooper Based on Material by Hector Garcia-Molina

CS 347 Lecture 1 31

Tentative Syllabus 2014 (Part II)DATE TOPIC

• Monday May 4 Partitions, Entity Resolution [N11]• Wednesday May 6 Peer to Peer Systems [N08]• Monday May 11 Map-Reduce & Pig [N09]• Wednesday May 13 Map-Reduce & Pig [N09]• Monday May 18 Other Open Source Systems [N09b]• Wednesday May 20 Other Open Source Systems [N09b]• Wednesday May 27 Distributed IR [N10]• Monday June 1 Time [N12]• Wednesday June 3 Publish/Subscribe Systems [N13]• Monday June 8 8:30 am!!! FINAL EXAM

Page 32: CS 347Lecture 11 CS 347: Parallel and Distributed Data Management Notes 01: Introduction Brian Cooper Based on Material by Hector Garcia-Molina

Interesting New Systems

• Storm (from Twitter)• S4 (from Yahoo)• Cassandra (key-value store)• Hive (SQL over Hadoop)• Pregel (graph execution)• Kestrel (queues?)• ZooKeeper (replicated data)• Sparkl or Spark (Berkeley?)• H-Base• HyRacks (UC Irvine)

CS 347 Lecture 1 32

• MemCache-D • Pnuts (Yahoo)• Dynamo (Amazon)• Spanner (Google)• Paxos• G-Store (UC Santa Barbara)• Elastras (UC Santa Barbara)• Tao (Facebook)• Kafka (LinkedIn)• Impala (Cloudera)

Page 33: CS 347Lecture 11 CS 347: Parallel and Distributed Data Management Notes 01: Introduction Brian Cooper Based on Material by Hector Garcia-Molina

CS 347 Lecture 1 33

Concepts you should be familiar with:

• CS245: query plan, cost estimation, join algorithms, recovery, logging,…

• Interconnection networks (bus, mesh, hypercube,…)

• Computer networks (LAN, WAN,…)

Page 34: CS 347Lecture 11 CS 347: Parallel and Distributed Data Management Notes 01: Introduction Brian Cooper Based on Material by Hector Garcia-Molina

CS 347 Lecture 1 34

Introductory topics

• Database architectures• Client-server systems• Distributed vs. parallel DB systems• Cloud Computing

Page 35: CS 347Lecture 11 CS 347: Parallel and Distributed Data Management Notes 01: Introduction Brian Cooper Based on Material by Hector Garcia-Molina

CS 347 Lecture 1 35

DB architectures

(1) Shared memory

P P P...

M

Page 36: CS 347Lecture 11 CS 347: Parallel and Distributed Data Management Notes 01: Introduction Brian Cooper Based on Material by Hector Garcia-Molina

CS 347 Lecture 1 36

DB architectures

(2) Shared disk

...

...

P

M

P P

M M

Page 37: CS 347Lecture 11 CS 347: Parallel and Distributed Data Management Notes 01: Introduction Brian Cooper Based on Material by Hector Garcia-Molina

CS 347 Lecture 1 37

DB architectures

(2B) Shared data storage (disk or file?)

...

...

P

M

P P

M M

• storage area network (SAN)• Hadoop/Google file system

Page 38: CS 347Lecture 11 CS 347: Parallel and Distributed Data Management Notes 01: Introduction Brian Cooper Based on Material by Hector Garcia-Molina

CS 347 Lecture 1 38

DB architectures

(3) Shared nothing

P

M

P

M

P

M

...

Page 39: CS 347Lecture 11 CS 347: Parallel and Distributed Data Management Notes 01: Introduction Brian Cooper Based on Material by Hector Garcia-Molina

CS 347 Lecture 1 39

DB architectures

(4) Hybrid example

M

P P P...

M

P P P...

Page 40: CS 347Lecture 11 CS 347: Parallel and Distributed Data Management Notes 01: Introduction Brian Cooper Based on Material by Hector Garcia-Molina

CS 347 Lecture 1 40

DB architectures

(4) Hybrid example 2

P

M

P

M

P

M

P

M

... ... ...

WAN

LAN #1

R R

LAN #2

Page 41: CS 347Lecture 11 CS 347: Parallel and Distributed Data Management Notes 01: Introduction Brian Cooper Based on Material by Hector Garcia-Molina

CS 347 Lecture 1 41

DB architectures

(4) Hybrid Tandem-like also in: Microsoft SQLServer Parallel Data Warehouse

P

M

P

M

P

M

P

M

...

Page 42: CS 347Lecture 11 CS 347: Parallel and Distributed Data Management Notes 01: Introduction Brian Cooper Based on Material by Hector Garcia-Molina

CS 347 Lecture 1 42

DB architectures

(5) Unusual?Datacycle (Broadcast disks)

P P P

MMM

Entire DB broadcast

Page 43: CS 347Lecture 11 CS 347: Parallel and Distributed Data Management Notes 01: Introduction Brian Cooper Based on Material by Hector Garcia-Molina

CS 347 Lecture 1 43

(5) Unusual Sorting network

Sortnet

P

M

P

M

...

Page 44: CS 347Lecture 11 CS 347: Parallel and Distributed Data Management Notes 01: Introduction Brian Cooper Based on Material by Hector Garcia-Molina

CS 347 Lecture 1 44

(5) Unusual — processor per track or processor per disk

M

P

P’

P’

P’

...

“small” processors+ “tiny” memories

Related idea inOracle Exadata"DB machine"

Page 45: CS 347Lecture 11 CS 347: Parallel and Distributed Data Management Notes 01: Introduction Brian Cooper Based on Material by Hector Garcia-Molina

CS 347 Lecture 1 45

(6) Unusual — sensor networks

P’

M

M

PB

M

PB

M

PB

M

PB

M

PB

data collection node

sensor

battery

Page 46: CS 347Lecture 11 CS 347: Parallel and Distributed Data Management Notes 01: Introduction Brian Cooper Based on Material by Hector Garcia-Molina

CS 347 Lecture 1 46

Issues for selecting architecture

• Reliability• Scalability• Geographic distribution of data• Data “clusters”• Performance• Cost

Page 47: CS 347Lecture 11 CS 347: Parallel and Distributed Data Management Notes 01: Introduction Brian Cooper Based on Material by Hector Garcia-Molina

CS 347 Lecture 1 47

Client-Server Systems

(or how to partition software)

ApplicationFront EndQuery ProcessorTransaction ProcessingFile Access

client

server

Page 48: CS 347Lecture 11 CS 347: Parallel and Distributed Data Management Notes 01: Introduction Brian Cooper Based on Material by Hector Garcia-Molina

CS 347 Lecture 1 48

Client-Server Systems

(or how to partition software)

ApplicationFront EndQuery ProcessorTransaction ProcessingFile Access

client

server

Page 49: CS 347Lecture 11 CS 347: Parallel and Distributed Data Management Notes 01: Introduction Brian Cooper Based on Material by Hector Garcia-Molina

CS 347 Lecture 1 49

Client-Server Systems

(or how to partition software)

ApplicationFront EndQuery ProcessorTransaction ProcessingFile Access

client

server

Page 50: CS 347Lecture 11 CS 347: Parallel and Distributed Data Management Notes 01: Introduction Brian Cooper Based on Material by Hector Garcia-Molina

CS 347 Lecture 1 50

Transaction Servers

• Clients ship transactions consisting of 1 or more SQL commands

E.g., Open DataBase Connectivity (ODBC)

(standard API)

Page 51: CS 347Lecture 11 CS 347: Parallel and Distributed Data Management Notes 01: Introduction Brian Cooper Based on Material by Hector Garcia-Molina

CS 347 Lecture 1 51

Data Servers

• Client requests pages or records• Popular for OODB systems

Page 52: CS 347Lecture 11 CS 347: Parallel and Distributed Data Management Notes 01: Introduction Brian Cooper Based on Material by Hector Garcia-Molina

CS 347 Lecture 1 52

Issues

• Object granularity• Where is data cached?• Where is locking done?

Page 53: CS 347Lecture 11 CS 347: Parallel and Distributed Data Management Notes 01: Introduction Brian Cooper Based on Material by Hector Garcia-Molina

CS 347 Lecture 1 53

Basic Tradeoff

• Offloading work to clients• Data transmitted

C C

SS

Get pages

Reservehotel room

Page 54: CS 347Lecture 11 CS 347: Parallel and Distributed Data Management Notes 01: Introduction Brian Cooper Based on Material by Hector Garcia-Molina

CS 347 Lecture 1 54

Note: Similar issues arise when we partition software/functionality within server

Reservehotel room P

M

P

M

P

M

...

•Where is data cached?

•Where is locking done?

Page 55: CS 347Lecture 11 CS 347: Parallel and Distributed Data Management Notes 01: Introduction Brian Cooper Based on Material by Hector Garcia-Molina

CS 347 Lecture 1 55

Parallel or distributed DB system?• More similarities than differences!

Page 56: CS 347Lecture 11 CS 347: Parallel and Distributed Data Management Notes 01: Introduction Brian Cooper Based on Material by Hector Garcia-Molina

CS 347 Lecture 1 56

• Typically, parallel DBs:– Fast interconnect– Homogeneous software– High performance is goal– Transparency is goal

Page 57: CS 347Lecture 11 CS 347: Parallel and Distributed Data Management Notes 01: Introduction Brian Cooper Based on Material by Hector Garcia-Molina

CS 347 Lecture 1 57

• Typically, distributed DBs:– Geographically distributed– Data sharing is goal (may run into heterogeneity, autonomy)– Disconnected operation possible

Page 58: CS 347Lecture 11 CS 347: Parallel and Distributed Data Management Notes 01: Introduction Brian Cooper Based on Material by Hector Garcia-Molina

Cloud Computing

• Is CC just a marketing term??– utility (like power)– data or CPU cycles?– many processors, many storage units– business model

CS 347 Lecture 1 58

Page 59: CS 347Lecture 11 CS 347: Parallel and Distributed Data Management Notes 01: Introduction Brian Cooper Based on Material by Hector Garcia-Molina

Is CC a subset, superset, disjoint from, or overlaps with:

• grid computing• distributed computing• Web 2.0• Cluster Computing• Peer-to-peer computing• software as a service• client-server computing• data center as a computer• massively parallel

computingCS 347 Lecture 1 59

(A)

CC(B)

CC(C)

CC(D)

CC

Page 60: CS 347Lecture 11 CS 347: Parallel and Distributed Data Management Notes 01: Introduction Brian Cooper Based on Material by Hector Garcia-Molina

Clash of the Clouds (Economist April 4, 2009)

CS 347 Lecture 1 60

Page 61: CS 347Lecture 11 CS 347: Parallel and Distributed Data Management Notes 01: Introduction Brian Cooper Based on Material by Hector Garcia-Molina

Silver lining (cloud price war) (Economist, August 30, 2014)

CS 347 Lecture 1 61

Page 62: CS 347Lecture 11 CS 347: Parallel and Distributed Data Management Notes 01: Introduction Brian Cooper Based on Material by Hector Garcia-Molina

CC Issues

• Customer lock-in• Privacy• Standards• Software licensing

CS 347 Lecture 1 62

Page 63: CS 347Lecture 11 CS 347: Parallel and Distributed Data Management Notes 01: Introduction Brian Cooper Based on Material by Hector Garcia-Molina

CS 347 Lecture 1 63

Next

• How to describe distributed data• Query processing in parallel DBs• Query processing in distributed DBs

Page 64: CS 347Lecture 11 CS 347: Parallel and Distributed Data Management Notes 01: Introduction Brian Cooper Based on Material by Hector Garcia-Molina

CS 347 Lecture 1 64

Query processing in parallel DBs:

• Typically: we can distribute/partition/sort…. data to make certain DBoperations (e.g., Join) fast

Page 65: CS 347Lecture 11 CS 347: Parallel and Distributed Data Management Notes 01: Introduction Brian Cooper Based on Material by Hector Garcia-Molina

CS 347 Lecture 1 65

Query processing in distributed DBs:

• Typically: we are given data distribution; we need to find query processing strategy to minimize cost

(e.g., communication cost)