cs 347lecture 11 cs 347: parallel and distributed data management notes 01: introduction brian...
TRANSCRIPT
CS 347 Lecture 1 1
CS 347: Parallel and Distributed
Data Management
Notes 01: Introduction
Brian Cooper
Based on Material by Hector Garcia-Molina
CS 347 Lecture 1 2
In CS245: Centralized DB system
Software:
ApplicationSQL Front EndQuery ProcessorTransaction Proc.File Access
P
M ...
• Simplifications:• single front end• one place to keep locks• if processor fails, system fails, ...
CS 347 Lecture 1 3
In CS347
• Multiple processors ( + memories)• Heterogeneity and autonomy of
“components”
CS 347 Lecture 1 4
Multiple processors
• Opportunity for parallelism• Opportunity for reliability• Synchronization issues
To illustrate synchronization problems: Two Generals Problem
CS 347 Lecture 1 5
The one general problem (Trivial!)
Battlefield
G
Troops
CS 347 Lecture 1 6
The two general problem:
<------------------------------->messengers
Blue army Red army
BlueG Red
G
Enemy
CS 347 Lecture 1 7
• Blue and red army must attack at same time
• Blue and red generals synchronize through messengers
• Messengers can be lost
Rules:
CS 347 Lecture 1 8
How Many Messages Do We Need?
BG RG
attack at 9am
assume blue starts...
Is this enough??
CS 347 Lecture 1 9
How Many Messages Do We Need?
BG RG
attack at 9am
assume blue starts...
Is this enough??
ack (red goes at 9am)
CS 347 Lecture 1 10
How Many Messages Do We Need?
BG RG
attack at 9am
assume blue starts...
Is this enough??
ack (red goes at 9am)
got ack
CS 347 Lecture 1 11
Stated problem is Impossible!
• Theorem: There is no protocol that uses a finite number of messages that solves the two-generals problem (as stated here)
Alternatives??
CS 347 Lecture 1 12
Probabilistic Approach?
• Send as many messages as possible, hope one gets through...
BG RG
attack at 9am
assume blue starts...
attack at 9am
attack at 9am
attack at 9am
CS 347 Lecture 1 13
Eventual Commit
• Eventually both sides attack...
BG RG
attack ASAP
assume blue starts...
on my way!
retransmitsretransmits
CS 347 Lecture 1 14
Eventual Commit
• One message sent every time unit• Probability of success one message is p• What is probability that red commits by
time t?
BG RGattack ASAP
on my way!
retransmitsretransmits
CS 347 Lecture 1 15
Eventual Commit
BG RGattack ASAP
on my way!
retransmitsretransmits
• C(1) = p
CS 347 Lecture 1 16
Eventual Commit
BG RGattack ASAP
on my way!
retransmitsretransmits
• C(1) = p• C(2) = p + (1-p)p
CS 347 Lecture 1 17
Eventual Commit
BG RGattack ASAP
on my way!
retransmitsretransmits
• C(1) = p• C(2) = p + (1-p)p• C(3) = p + (1-p)p + (1-p)2p • C(4) = p + (1-p)p + (1-p)2p + (1-
p)3p
Eventual Commit
CS 347 Lecture 1 18
C(t)
t
p
CS 347 Lecture 1 19
Eventual Commit
BG RGattack ASAP
on my way!
retransmitsretransmits
• How expensive is protocol?• E = expected number of messages• Homework: compute E (function of
p)
CS 347 Lecture 1 20
2-Phase Eventual Commit
• Eventually both sides attack...
BG RG
ready to attack?
assume blue starts...
yes, at your disposal
attack ASAP
ack
retransmits
retransmits
phase 1
phase 2
CS 347 Lecture 1 21
Commit Protocols
• Will study commit protocols like these...
CS 347 Lecture 1 22
Heterogeneity
Select new
investmentsApplication
RDBMS FilesStocktickertape
Portfolio History ofdividends,ratios,...
CS 347 Lecture 1 23
Autonomy
Example: unable to get statisticsfor query optimization
Example: blue general may have mind of his (or her) own!
CS 347 Lecture 1 24
• So, in CS347 we study data management
with multiple processors and possible
autonomy, heterogeneity– Impact on:
• Data organization• Query processing• Access structures• Concurrency control• Recovery
CS 347 Lecture 1 25
• Renewed Interest in Distributed/ParallelData Processing!– Massive web data, manage with many
computers– How to crawl and search the web?– Peer-to-peer systems manage huge amounts of
data– Data from many sources (e.g., comparison
shopping): how to integrate?– Sensor Networks: data generated an many
sensors/devices, need to analyze– Multi-player games (e.g., World of Warcraft):
tons of distributed data
CS 347 Lecture 1 26
It’s the Economy, Stupid!
• Example: Multi-player games
Data
state
P
P
PP
P
P
P
P
P
P
CS 347 Lecture 1 27
It’s the Economy, Stupid!
• Example: Multi-player games
Data
state
P
P
PP
P
P
P
P
P
P
state
CS 347 Lecture 1 28
Logistics• LECTURES: Mondays and Wednesdays 12:50pm
to 2:05pm, Gates B01• INSTRUCTOR: Brian Cooper; Email:
[email protected]; Office Hours: Mondays, Wednesdays 11:30 to 12:30.
• TEACHING ASSISTANT: – Vasilis Verroios, Email: [email protected]
• Piazza forum (tentative): https://piazza.com/class#spring2015/cs347.
• SECRETARY: Marianne Siroker; Office: Gates Hall 435;Email: [email protected]; Phone: (650) 723-0872
CS 347 Lecture 1 29
Logistics
• TEXTBOOK: No required textbook. You'll be expected to read several research papers.
• CLASS WEB PAGE: http://www.stanford.edu/class/cs347Will contain homework assignments, course news, etc. Be sure to check it periodically.
• ASSIGNMENTS: about 5 homeworks• GRADING: Homeworks: 20%, Midterm 30%, Final:
50%.
CS 347 Lecture 1 30
Tentative Syllabus 2014 (Part I)
DATE TOPIC• Monday March 30 Introduction [N01]• Wednesday April 1 Data Fragmentation [N02]• Monday April 6 Query processing [N03]• Wednesday April 8Query processing & Optimization [N04]• Monday April 13 Concurrency Control, Failures [N05]• Wednesday April 15 Reliable Data Management [N06]• Monday April 20 Reliable Data Management [N06]• Wednesday April 22 Guest Lecture – Cloudera Impala• Monday April 27 Replicated Data Management
[N07]• Wednesday April 29 Midterm
CS 347 Lecture 1 31
Tentative Syllabus 2014 (Part II)DATE TOPIC
• Monday May 4 Partitions, Entity Resolution [N11]• Wednesday May 6 Peer to Peer Systems [N08]• Monday May 11 Map-Reduce & Pig [N09]• Wednesday May 13 Map-Reduce & Pig [N09]• Monday May 18 Other Open Source Systems [N09b]• Wednesday May 20 Other Open Source Systems [N09b]• Wednesday May 27 Distributed IR [N10]• Monday June 1 Time [N12]• Wednesday June 3 Publish/Subscribe Systems [N13]• Monday June 8 8:30 am!!! FINAL EXAM
Interesting New Systems
• Storm (from Twitter)• S4 (from Yahoo)• Cassandra (key-value store)• Hive (SQL over Hadoop)• Pregel (graph execution)• Kestrel (queues?)• ZooKeeper (replicated data)• Sparkl or Spark (Berkeley?)• H-Base• HyRacks (UC Irvine)
CS 347 Lecture 1 32
• MemCache-D • Pnuts (Yahoo)• Dynamo (Amazon)• Spanner (Google)• Paxos• G-Store (UC Santa Barbara)• Elastras (UC Santa Barbara)• Tao (Facebook)• Kafka (LinkedIn)• Impala (Cloudera)
CS 347 Lecture 1 33
Concepts you should be familiar with:
• CS245: query plan, cost estimation, join algorithms, recovery, logging,…
• Interconnection networks (bus, mesh, hypercube,…)
• Computer networks (LAN, WAN,…)
CS 347 Lecture 1 34
Introductory topics
• Database architectures• Client-server systems• Distributed vs. parallel DB systems• Cloud Computing
CS 347 Lecture 1 35
DB architectures
(1) Shared memory
P P P...
M
CS 347 Lecture 1 36
DB architectures
(2) Shared disk
...
...
P
M
P P
M M
CS 347 Lecture 1 37
DB architectures
(2B) Shared data storage (disk or file?)
...
...
P
M
P P
M M
• storage area network (SAN)• Hadoop/Google file system
CS 347 Lecture 1 38
DB architectures
(3) Shared nothing
P
M
P
M
P
M
...
CS 347 Lecture 1 39
DB architectures
(4) Hybrid example
M
P P P...
M
P P P...
CS 347 Lecture 1 40
DB architectures
(4) Hybrid example 2
P
M
P
M
P
M
P
M
... ... ...
WAN
LAN #1
R R
LAN #2
CS 347 Lecture 1 41
DB architectures
(4) Hybrid Tandem-like also in: Microsoft SQLServer Parallel Data Warehouse
P
M
P
M
P
M
P
M
...
CS 347 Lecture 1 42
DB architectures
(5) Unusual?Datacycle (Broadcast disks)
P P P
MMM
Entire DB broadcast
CS 347 Lecture 1 43
(5) Unusual Sorting network
Sortnet
P
M
P
M
...
CS 347 Lecture 1 44
(5) Unusual — processor per track or processor per disk
M
P
P’
P’
P’
...
“small” processors+ “tiny” memories
Related idea inOracle Exadata"DB machine"
CS 347 Lecture 1 45
(6) Unusual — sensor networks
P’
M
M
PB
M
PB
M
PB
M
PB
M
PB
data collection node
sensor
battery
CS 347 Lecture 1 46
Issues for selecting architecture
• Reliability• Scalability• Geographic distribution of data• Data “clusters”• Performance• Cost
CS 347 Lecture 1 47
Client-Server Systems
(or how to partition software)
ApplicationFront EndQuery ProcessorTransaction ProcessingFile Access
client
server
CS 347 Lecture 1 48
Client-Server Systems
(or how to partition software)
ApplicationFront EndQuery ProcessorTransaction ProcessingFile Access
client
server
CS 347 Lecture 1 49
Client-Server Systems
(or how to partition software)
ApplicationFront EndQuery ProcessorTransaction ProcessingFile Access
client
server
CS 347 Lecture 1 50
Transaction Servers
• Clients ship transactions consisting of 1 or more SQL commands
E.g., Open DataBase Connectivity (ODBC)
(standard API)
CS 347 Lecture 1 51
Data Servers
• Client requests pages or records• Popular for OODB systems
CS 347 Lecture 1 52
Issues
• Object granularity• Where is data cached?• Where is locking done?
CS 347 Lecture 1 53
Basic Tradeoff
• Offloading work to clients• Data transmitted
C C
SS
Get pages
Reservehotel room
CS 347 Lecture 1 54
Note: Similar issues arise when we partition software/functionality within server
Reservehotel room P
M
P
M
P
M
...
•Where is data cached?
•Where is locking done?
CS 347 Lecture 1 55
Parallel or distributed DB system?• More similarities than differences!
CS 347 Lecture 1 56
• Typically, parallel DBs:– Fast interconnect– Homogeneous software– High performance is goal– Transparency is goal
CS 347 Lecture 1 57
• Typically, distributed DBs:– Geographically distributed– Data sharing is goal (may run into heterogeneity, autonomy)– Disconnected operation possible
Cloud Computing
• Is CC just a marketing term??– utility (like power)– data or CPU cycles?– many processors, many storage units– business model
CS 347 Lecture 1 58
Is CC a subset, superset, disjoint from, or overlaps with:
• grid computing• distributed computing• Web 2.0• Cluster Computing• Peer-to-peer computing• software as a service• client-server computing• data center as a computer• massively parallel
computingCS 347 Lecture 1 59
(A)
CC(B)
CC(C)
CC(D)
CC
Clash of the Clouds (Economist April 4, 2009)
CS 347 Lecture 1 60
Silver lining (cloud price war) (Economist, August 30, 2014)
CS 347 Lecture 1 61
CC Issues
• Customer lock-in• Privacy• Standards• Software licensing
CS 347 Lecture 1 62
CS 347 Lecture 1 63
Next
• How to describe distributed data• Query processing in parallel DBs• Query processing in distributed DBs
CS 347 Lecture 1 64
Query processing in parallel DBs:
• Typically: we can distribute/partition/sort…. data to make certain DBoperations (e.g., Join) fast
CS 347 Lecture 1 65
Query processing in distributed DBs:
• Typically: we are given data distribution; we need to find query processing strategy to minimize cost
(e.g., communication cost)