Distributed Systems Maciej Łopatka

Post on 22-Sep-2020

Distributed Systems

Maciej Łopatka

Facebook Inbox Search

Authors: Avinash Lakshman (one of the authors of Amazon's Dynamo) and Prashant Malik

Facebook code dump

Community

Transfer to Apache Software Foundation

An Apache top level project

BigTable data model

An Amazon Dynamo-like infrastructure

Distributed multidimensional map indexed by a key

Four or five dimensions

Key, Value, Timestamp

Data

Keyspace → Column Family

Column Family → Column Family Row

Column Family Row → Columns

Column → Data value
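
The nesting above can be sketched as a plain in-memory nested map. This is only an illustration: the helper names `insert` and `get`, the sample `Users` column family, and the rule that the write with the newest timestamp wins are assumptions made for this sketch, not code from the slides.

```python
import time

# keyspace -> column family -> row key -> column name -> (value, timestamp)
keyspace = {}

def insert(keyspace, column_family, row_key, column, value, timestamp=None):
    """Store a column, keeping the write that carries the newest timestamp."""
    if timestamp is None:
        timestamp = time.time()
    row = keyspace.setdefault(column_family, {}).setdefault(row_key, {})
    current = row.get(column)
    if current is None or timestamp >= current[1]:
        row[column] = (value, timestamp)

def get(keyspace, column_family, row_key, column):
    """Return the stored value, or None if the column is absent."""
    entry = keyspace.get(column_family, {}).get(row_key, {}).get(column)
    return None if entry is None else entry[0]

insert(keyspace, "Users", "alice", "email", "alice@example.com", timestamp=1)
insert(keyspace, "Users", "alice", "email", "alice@new.example.com", timestamp=2)
insert(keyspace, "Users", "alice", "email", "stale@example.com", timestamp=1)
print(get(keyspace, "Users", "alice", "email"))
```

The third write carries an older timestamp, so it does not replace the value stored by the second one.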

Keyspace → Super Column Family

Super Column Family → Super Column Family Row

Super Column Family Row → Column Row

Column Row → Columns

Column → Data value
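
With a super column family the map gains one more level. A minimal sketch, assuming an in-memory structure; the function names and the sample `address` data are hypothetical and only illustrate the extra dimension.

```python
from collections import defaultdict

def make_super_column_family():
    # row key -> column row (super column) name -> column name -> (value, timestamp)
    return defaultdict(lambda: defaultdict(dict))

def put(scf, row_key, column_row, column, value, timestamp):
    """Write one column inside a named column row."""
    scf[row_key][column_row][column] = (value, timestamp)

def get_column_row(scf, row_key, column_row):
    """Return a {column name: value} view of one column row."""
    return {name: value for name, (value, _) in scf[row_key][column_row].items()}

addresses = make_super_column_family()
put(addresses, "alice", "home", "city", "Warsaw", 1)
put(addresses, "alice", "home", "street", "Banacha 2", 1)
put(addresses, "alice", "work", "city", "Krakow", 1)
print(get_column_row(addresses, "alice", "home"))
```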

Replication

Log file

Bootstrapping

Partitioning

Consistent Hashing

Periodic Data Compaction

Gossip

Anti-Entropy data sync (uses Merkle tree)

Write and Read Quorum

W + R > N
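
The condition W + R > N means that any set of W replicas that acknowledged a write and any set of R replicas consulted by a read share at least one replica, so the read sees the latest acknowledged write. The sketch below checks the inequality and, as a sanity check, confirms it by enumerating all replica subsets for small N. The function names are made up for this illustration.

```python
from itertools import combinations

def quorums_overlap(n, w, r):
    """True when every read quorum shares a replica with every write quorum."""
    if not (1 <= w <= n and 1 <= r <= n):
        raise ValueError("w and r must be between 1 and n")
    return w + r > n

def overlap_by_enumeration(n, w, r):
    """Brute-force check: compare every write set with every read set."""
    replicas = range(n)
    return all(set(write_set) & set(read_set)
               for write_set in combinations(replicas, w)
               for read_set in combinations(replicas, r))

# N = 3 replicas: W = 2, R = 2 overlaps; W = 1, R = 1 does not.
print(quorums_overlap(3, 2, 2), quorums_overlap(3, 1, 1))
```

Lowering W or R trades this overlap guarantee for lower latency, which is what makes the consistency level tunable.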

RandomPartitioner

OrderPreservingPartitioner
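
The two partitioners differ in how a row key becomes a position (token) on the consistent-hashing ring: RandomPartitioner hashes the key with MD5, while OrderPreservingPartitioner uses the key itself so that key order is kept. The `Ring` class below is a simplified sketch written for this illustration (one token per node, hypothetical node names), not Cassandra's implementation.

```python
import bisect
import hashlib

def md5_token(key):
    """Random-style token: the MD5 digest of the key as an integer."""
    return int(hashlib.md5(key.encode("utf-8")).hexdigest(), 16)

def order_preserving_token(key):
    """Order-preserving token: the key itself, so key order is token order."""
    return key

class Ring:
    def __init__(self, node_tokens):
        # node_tokens: {node name: token}; each node owns one ring position
        self._ring = sorted((token, node) for node, token in node_tokens.items())
        self._tokens = [token for token, _ in self._ring]

    def replicas(self, token, n=1):
        """Walk clockwise from the token and return the first n nodes."""
        start = bisect.bisect_left(self._tokens, token) % len(self._ring)
        count = min(n, len(self._ring))
        return [self._ring[(start + i) % len(self._ring)][1] for i in range(count)]

hashed = Ring({"n1": 2**126, "n2": 2**127, "n3": 2**128 - 1})
print(hashed.replicas(md5_token("alice"), n=2))

ordered = Ring({"a": "g", "b": "p", "c": "z"})
print(ordered.replicas(order_preserving_token("hello")))
```

Hashing spreads keys evenly across nodes but loses range scans; keeping key order allows range scans but can leave the load unbalanced.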

Terabytes of data

Replaced MySQL

Detecting failures in 15 seconds

ZooKeeper used to locate nodes

Replaced by HBase

50+ TB of data on a 150-node cluster, east and west coast data centers

Term search: UserId → Word → MessageId columns

Interaction search: UserId → Recipient UserId → MessageId columns
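
The two index layouts can be pictured as nested maps keyed by user id. The `InboxIndex` class below is a hypothetical in-memory sketch; whitespace tokenization and indexing each message for both the sender and the recipient are assumptions made for the example.

```python
from collections import defaultdict

class InboxIndex:
    def __init__(self):
        # user id -> word -> set of message ids (term search)
        self.terms = defaultdict(lambda: defaultdict(set))
        # user id -> other user's id -> set of message ids (interaction search)
        self.interactions = defaultdict(lambda: defaultdict(set))

    def add_message(self, message_id, sender, recipient, text):
        """Index one message under both participants."""
        for user, other in ((sender, recipient), (recipient, sender)):
            for word in text.lower().split():
                self.terms[user][word].add(message_id)
            self.interactions[user][other].add(message_id)

    def term_search(self, user, word):
        return sorted(self.terms[user][word.lower()])

    def interaction_search(self, user, other):
        return sorted(self.interactions[user][other])

index = InboxIndex()
index.add_message("m1", "alice", "bob", "Lunch tomorrow")
index.add_message("m2", "carol", "alice", "lunch plans")
print(index.term_search("alice", "lunch"))
print(index.interaction_search("alice", "bob"))
```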

Latency Stat    Search Interactions    Term Search
Min             7.69 ms                7.78 ms
Median          15.69 ms               18.27 ms
Max             26.13 ms               44.41 ms

Tab. Read performance

Workload A (update heavy): 50 percent reads and 50 percent updates. (a) Read operations, (b) update operations.

Six server-class machines (dual 64-bit quad-core 2.5 GHz Intel Xeon CPUs, 8 GB of RAM, 6-disk RAID-10 array, and gigabit Ethernet)

Workload B (read heavy): 95 percent reads and 5 percent updates. (a) Read operations, (b) update operations.

Six server-class machines (dual 64-bit quad-core 2.5 GHz Intel Xeon CPUs, 8 GB of RAM, 6-disk RAID-10 array, and gigabit Ethernet)

Designed to run on cheap commodity hardware

Handle high write throughput while not sacrificing read efficiency

Decentralized

Elasticity

Fault-tolerant

Tunable consistency

http://en.wikipedia.org/wiki/Apache_Cassandra

Cassandra - A Decentralized Structured Storage System, Avinash Lakshman, Prashant Malik, Facebook

http://maxgrinev.com/2010/07/09/a-quick-introduction-to-the-cassandra-data-model/

http://www.facebook.com/note.php?note_id=454991608919

http://horicky.blogspot.com/2010/10/bigtable-model-with-cassandra-and-hbase.html

http://www.datastax.com/docs/1.0/ddl/index

Benchmarking Cloud Serving Systems with YCSB, Brian F. Cooper, Adam Silberstein, Erwin Tam, Raghu Ramakrishnan, Russell Sears