Cassandra and Sigmod contest, Cloud computing group, Haiping Wang, 2009-12-19
![Page 1: Cassandra and Sigmod contest Cloud computing group Haiping Wang 2009-12-19](https://reader030.vdocuments.net/reader030/viewer/2022020718/5519015955034626428b45aa/html5/thumbnails/1.jpg)
Cassandra and Sigmod contest
Cloud computing group
Haiping Wang
2009-12-19
![Page 2: Cassandra and Sigmod contest Cloud computing group Haiping Wang 2009-12-19](https://reader030.vdocuments.net/reader030/viewer/2022020718/5519015955034626428b45aa/html5/thumbnails/2.jpg)
Outline
Cassandra
• Cassandra overview
• Data model
• Architecture
• Read and write
Sigmod contest 2009
Sigmod contest 2010
![Page 3: Cassandra and Sigmod contest Cloud computing group Haiping Wang 2009-12-19](https://reader030.vdocuments.net/reader030/viewer/2022020718/5519015955034626428b45aa/html5/thumbnails/3.jpg)
Cassandra overview
Highly scalable, distributed
Eventually consistent
Structured key-value store
Dynamo + Bigtable, P2P
Random reads and random writes
Java
![Page 4: Cassandra and Sigmod contest Cloud computing group Haiping Wang 2009-12-19](https://reader030.vdocuments.net/reader030/viewer/2022020718/5519015955034626428b45aa/html5/thumbnails/4.jpg)
Data Model
KEY
ColumnFamily1 (Name : MailList, Type : Simple, Sort : Name)
– Columns tid1, tid2, tid3, tid4, each with Value : <Binary> and TimeStamp : t1 to t4
ColumnFamily2 (Name : WordList, Type : Super, Sort : Time)
– SuperColumn aloha: columns (C1, V1, T1), (C2, V2, T2), (C3, V3, T3), (C4, V4, T4)
– SuperColumn dude: columns (C2, V2, T2), (C6, V6, T6)
ColumnFamily3 (Name : System, Type : Super, Sort : Name)
– SuperColumns hint1, hint2, hint3, hint4, each holding a <Column List>
• Column Families are declared upfront
• SuperColumns are added and modified dynamically
• Columns are added and modified dynamically
![Page 5: Cassandra and Sigmod contest Cloud computing group Haiping Wang 2009-12-19](https://reader030.vdocuments.net/reader030/viewer/2022020718/5519015955034626428b45aa/html5/thumbnails/5.jpg)
Cassandra Architecture
![Page 6: Cassandra and Sigmod contest Cloud computing group Haiping Wang 2009-12-19](https://reader030.vdocuments.net/reader030/viewer/2022020718/5519015955034626428b45aa/html5/thumbnails/6.jpg)
Cassandra API
• Data structures
• Exceptions
• Service API
• ConsistencyLevel(4)
• Retrieval methods(5)
• Range query: returns matching keys(1)
• Modification methods(3)
• Others
![Page 7: Cassandra and Sigmod contest Cloud computing group Haiping Wang 2009-12-19](https://reader030.vdocuments.net/reader030/viewer/2022020718/5519015955034626428b45aa/html5/thumbnails/7.jpg)
Cassandra commands
![Page 8: Cassandra and Sigmod contest Cloud computing group Haiping Wang 2009-12-19](https://reader030.vdocuments.net/reader030/viewer/2022020718/5519015955034626428b45aa/html5/thumbnails/8.jpg)
Partitioning and replication(1)
• Consistent hashing
• DHT
• Balance
• Monotonicity
• Spread
• Load
• Virtual nodes
• Coordinator
• Preference list
![Page 9: Cassandra and Sigmod contest Cloud computing group Haiping Wang 2009-12-19](https://reader030.vdocuments.net/reader030/viewer/2022020718/5519015955034626428b45aa/html5/thumbnails/9.jpg)
Partitioning and replication(2)
[Figure: consistent-hashing ring running from 0 to 1 with nodes A to F; keys are placed at h(key1) and h(key2); replication factor N=3]
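The ring, virtual nodes and preference list can be illustrated with a minimal sketch. It is not Cassandra's or Dynamo's actual partitioner: the `Ring` class, the `ring_position` helper, the use of MD5 and the choice of 8 virtual nodes per node are all assumptions made for illustration.

```python
import bisect
import hashlib


def ring_position(name: str) -> int:
    """Map a string to a position on the ring (MD5 is an arbitrary choice here)."""
    return int(hashlib.md5(name.encode("utf-8")).hexdigest(), 16)


class Ring:
    """Hypothetical consistent-hashing ring with virtual nodes."""

    def __init__(self, nodes, vnodes=8):
        # Each physical node owns several virtual nodes (tokens) on the ring.
        self.tokens = sorted(
            (ring_position(f"{node}#{i}"), node)
            for node in nodes
            for i in range(vnodes)
        )
        self.positions = [pos for pos, _ in self.tokens]
        self.node_count = len(set(nodes))

    def preference_list(self, key: str, n: int = 3):
        """Walk clockwise from h(key) and collect the first n distinct nodes."""
        n = min(n, self.node_count)
        start = bisect.bisect_right(self.positions, ring_position(key))
        result = []
        for step in range(len(self.tokens)):
            node = self.tokens[(start + step) % len(self.tokens)][1]
            if node not in result:
                result.append(node)
                if len(result) == n:
                    break
        return result


ring = Ring(["A", "B", "C", "D", "E", "F"])
replicas = ring.preference_list("key1", n=3)   # N=3 as on the slide
coordinator = replicas[0]                       # first node clockwise from h(key1)
```

Skipping tokens that belong to an already chosen node is what keeps the N replicas on distinct physical machines when virtual nodes are used.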
![Page 10: Cassandra and Sigmod contest Cloud computing group Haiping Wang 2009-12-19](https://reader030.vdocuments.net/reader030/viewer/2022020718/5519015955034626428b45aa/html5/thumbnails/10.jpg)
Data Versioning
• Always writeable
• Multiple versions
– put() may return before the update reaches all replicas
– get() may return many versions
• Vector clocks
• Reconciliation during reads by clients
![Page 11: Cassandra and Sigmod contest Cloud computing group Haiping Wang 2009-12-19](https://reader030.vdocuments.net/reader030/viewer/2022020718/5519015955034626428b45aa/html5/thumbnails/11.jpg)
Vector clock
• List of (node, counter) pairs
E.g. [x,2][y,3] vs. [x,3][y,4][z,1]
[x,1][y,3] vs. [z,1][y,3]
• Use timestamp
E.g. D([x,1]:t1,[y,1]:t2)
• Remove the oldest pair when the list reaches a threshold
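The two example pairs above can be checked with a small comparison routine. This is a sketch only: representing a clock as a dict from node name to counter, and the function names `descends` and `compare`, are assumptions for illustration.

```python
def descends(a: dict, b: dict) -> bool:
    """True if clock a has seen everything clock b has seen."""
    return all(a.get(node, 0) >= counter for node, counter in b.items())


def compare(a: dict, b: dict) -> str:
    if a == b:
        return "equal"
    if descends(a, b):
        return "a is newer"
    if descends(b, a):
        return "b is newer"
    # Neither clock dominates: the versions are concurrent and must be reconciled.
    return "conflict"


# [x,2][y,3] vs. [x,3][y,4][z,1]: the second clock dominates the first.
first = compare({"x": 2, "y": 3}, {"x": 3, "y": 4, "z": 1})
# [x,1][y,3] vs. [z,1][y,3]: each has an update the other lacks.
second = compare({"x": 1, "y": 3}, {"z": 1, "y": 3})
```

In the first case the older version can be discarded; in the second both versions are kept and handed to the client for reconciliation on read.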
![Page 12: Cassandra and Sigmod contest Cloud computing group Haiping Wang 2009-12-19](https://reader030.vdocuments.net/reader030/viewer/2022020718/5519015955034626428b45aa/html5/thumbnails/12.jpg)
Vector clock
Return all the objects at the leaves
D3,4([Sx,2],[Sy,1],[Sz,1])
Single new version
![Page 13: Cassandra and Sigmod contest Cloud computing group Haiping Wang 2009-12-19](https://reader030.vdocuments.net/reader030/viewer/2022020718/5519015955034626428b45aa/html5/thumbnails/13.jpg)
Execution operations
• Two strategies
– Route requests through a generic load balancer that picks a node based on load
  • Easy: the client does not have to link any store-specific code
– Send requests directly to the node
  • Achieves lower latency
![Page 14: Cassandra and Sigmod contest Cloud computing group Haiping Wang 2009-12-19](https://reader030.vdocuments.net/reader030/viewer/2022020718/5519015955034626428b45aa/html5/thumbnails/14.jpg)
Put() operation
[Figure: the client sends put() to the coordinator; the coordinator sends the object with its vector clock to replicas P1, P2, ..., PN-1 and waits for W-1 responses]
![Page 15: Cassandra and Sigmod contest Cloud computing group Haiping Wang 2009-12-19](https://reader030.vdocuments.net/reader030/viewer/2022020718/5519015955034626428b45aa/html5/thumbnails/15.jpg)
Cluster Membership
• Gossip protocol
• State disseminated in O(logN) rounds
• Every T seconds, each node increases its heartbeat counter and sends its list to another node
• Merge operations
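The heartbeat and merge steps can be sketched as follows. The membership list is modelled as a dict from node name to heartbeat counter, and the names `tick` and `merge` are hypothetical; real implementations also track timestamps and failure suspicion, which are left out here.

```python
def tick(local: dict, me: str) -> None:
    """Called every T seconds: bump our own heartbeat counter."""
    local[me] = local.get(me, 0) + 1


def merge(local: dict, received: dict) -> None:
    """Keep the highest heartbeat counter seen for every node."""
    for node, heartbeat in received.items():
        if heartbeat > local.get(node, -1):
            local[node] = heartbeat


a_view = {"A": 4, "B": 1}   # what node A currently believes
b_view = {"B": 7, "C": 2}   # what node B currently believes

tick(a_view, "A")           # A's counter goes from 4 to 5
merge(b_view, a_view)       # A gossips its list to B, and B merges it
```

After the merge, B knows about A with heartbeat 5 and keeps its own fresher counter for itself, since A's entry for B was stale.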
![Page 16: Cassandra and Sigmod contest Cloud computing group Haiping Wang 2009-12-19](https://reader030.vdocuments.net/reader030/viewer/2022020718/5519015955034626428b45aa/html5/thumbnails/16.jpg)
![Page 17: Cassandra and Sigmod contest Cloud computing group Haiping Wang 2009-12-19](https://reader030.vdocuments.net/reader030/viewer/2022020718/5519015955034626428b45aa/html5/thumbnails/17.jpg)
![Page 18: Cassandra and Sigmod contest Cloud computing group Haiping Wang 2009-12-19](https://reader030.vdocuments.net/reader030/viewer/2022020718/5519015955034626428b45aa/html5/thumbnails/18.jpg)
![Page 19: Cassandra and Sigmod contest Cloud computing group Haiping Wang 2009-12-19](https://reader030.vdocuments.net/reader030/viewer/2022020718/5519015955034626428b45aa/html5/thumbnails/19.jpg)
Failure
• Data center(s) failure
– Multiple data centers
• Temporary failure
• Permanent failure
– Merkle tree
![Page 20: Cassandra and Sigmod contest Cloud computing group Haiping Wang 2009-12-19](https://reader030.vdocuments.net/reader030/viewer/2022020718/5519015955034626428b45aa/html5/thumbnails/20.jpg)
Temporary failure
![Page 21: Cassandra and Sigmod contest Cloud computing group Haiping Wang 2009-12-19](https://reader030.vdocuments.net/reader030/viewer/2022020718/5519015955034626428b45aa/html5/thumbnails/21.jpg)
Merkle tree
![Page 22: Cassandra and Sigmod contest Cloud computing group Haiping Wang 2009-12-19](https://reader030.vdocuments.net/reader030/viewer/2022020718/5519015955034626428b45aa/html5/thumbnails/22.jpg)
Bloom filter
A space-efficient probabilistic data structure used to test whether an element is a member of a set. False positives are possible; false negatives are not.
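A minimal Bloom filter sketch shows where the false positives come from. The class name, the 1024-bit size, the three hash functions and the use of salted SHA-256 are illustrative assumptions, not the parameters Cassandra uses.

```python
import hashlib


class BloomFilter:
    def __init__(self, size_bits: int = 1024, num_hashes: int = 3):
        self.size = size_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(size_bits)  # one byte per bit, for readability

    def _indexes(self, item: str):
        # Derive k indexes by salting one hash function with the index number.
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{item}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.size

    def add(self, item: str) -> None:
        for idx in self._indexes(item):
            self.bits[idx] = 1

    def might_contain(self, item: str) -> bool:
        # True may be a false positive; False is always correct.
        return all(self.bits[idx] for idx in self._indexes(item))


bf = BloomFilter()
for key in ("K1", "K2", "K3"):
    bf.add(key)
```

A lookup for a key that was never added returns True only if all of its bit positions happen to have been set by other keys, which is why a negative answer lets a read skip a data file safely.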
![Page 23: Cassandra and Sigmod contest Cloud computing group Haiping Wang 2009-12-19](https://reader030.vdocuments.net/reader030/viewer/2022020718/5519015955034626428b45aa/html5/thumbnails/23.jpg)
Compactions
Input data files, each sorted by key:
– K1 <Serialized data>, K2 <Serialized data>, K3 <Serialized data>, ...
– K2 <Serialized data>, K10 <Serialized data>, K30 <Serialized data>, ...
– K4 <Serialized data>, K5 <Serialized data>, K10 <Serialized data>, ...
MERGE SORT
Data File (sorted): K1, K2, K3, K4, K5, K10, K30, each with <Serialized data>
Index File: K1 Offset, K5 Offset, K30 Offset
Bloom Filter: loaded in memory
Input files: DELETED
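The merge step of a compaction can be sketched with a k-way merge over the sorted input files. The record shape `(key, timestamp, value)`, integer keys, and the rule that the newest timestamp wins for duplicate keys are assumptions for this example; the index interval of three is arbitrary.

```python
import heapq


def compact(sstables):
    """Merge sorted lists of (key, timestamp, value); keep the newest value per key."""
    merged = []
    for key, ts, value in heapq.merge(*sstables):
        if merged and merged[-1][0] == key:
            # Same key again: entries arrive in timestamp order, so this one is newer.
            merged[-1] = (key, ts, value)
        else:
            merged.append((key, ts, value))
    return merged


files = [
    [(1, 10, "a"), (2, 10, "b"), (3, 10, "c")],
    [(2, 20, "b2"), (10, 20, "j"), (30, 20, "z")],
    [(4, 30, "d"), (5, 30, "e"), (10, 30, "j2")],
]
data_file = compact(files)
# Sparse index: one (key, offset) entry for every third key.
index = [(key, offset) for offset, (key, _, _) in enumerate(data_file)][::3]
```

Because every input is already sorted, the merge reads each file sequentially once, which is why compaction needs no random disk access.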
![Page 24: Cassandra and Sigmod contest Cloud computing group Haiping Wang 2009-12-19](https://reader030.vdocuments.net/reader030/viewer/2022020718/5519015955034626428b45aa/html5/thumbnails/24.jpg)
Write
Key (CF1 , CF2 , CF3)
– Commit Log: binary serialized Key ( CF1 , CF2 , CF3 ), on a dedicated disk
– Memtable ( CF1), Memtable ( CF2), Memtable ( CF3)
– Memtables are flushed to disk based on:
  • Data size
  • Number of Objects
  • Lifetime
Data file on disk
<Key name><Size of key Data><Index of columns/supercolumns><Serialized column family>
---
<Key name><Size of key Data><Index of columns/supercolumns><Serialized column family>
BLOCK Index <Key Name> Offset, <Key Name> Offset
K128 Offset
K256 Offset
K384 Offset
Bloom Filter
(Index in memory)
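The write path (commit log first, then the memtable of each column family, then a flush to a sorted data file) can be sketched as below. The `Store` class is hypothetical, plain lists stand in for files on disk, and only the "number of objects" flush trigger is modelled.

```python
class Store:
    def __init__(self, flush_threshold: int = 2):
        self.commit_log = []      # append-only; stands in for the log on a dedicated disk
        self.memtables = {}       # column family name -> {key: columns}
        self.data_files = []      # flushed (column family, sorted rows) pairs
        self.flush_threshold = flush_threshold

    def write(self, key: str, column_family: str, columns: dict) -> None:
        # 1. Append to the commit log first, for durability.
        self.commit_log.append((key, column_family, dict(columns)))
        # 2. Update the in-memory table of that column family.
        memtable = self.memtables.setdefault(column_family, {})
        memtable.setdefault(key, {}).update(columns)
        # 3. Flush when the memtable holds too many objects.
        if len(memtable) >= self.flush_threshold:
            self.flush(column_family)

    def flush(self, column_family: str) -> None:
        memtable = self.memtables.pop(column_family)
        rows = sorted(memtable.items())  # data files are sorted by key
        self.data_files.append((column_family, rows))


store = Store(flush_threshold=2)
store.write("k2", "CF1", {"tid1": b"x"})
store.write("k1", "CF1", {"tid2": b"y"})   # second key triggers a flush of CF1
store.write("k1", "CF2", {"c1": b"z"})
```

Writes touch only the sequential log and memory, so no disk seek or read is needed on the write path.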
![Page 25: Cassandra and Sigmod contest Cloud computing group Haiping Wang 2009-12-19](https://reader030.vdocuments.net/reader030/viewer/2022020718/5519015955034626428b45aa/html5/thumbnails/25.jpg)
Read
[Figure: the client sends a query to the Cassandra cluster; the closest replica (Replica A) returns the result, while digest queries go to Replica B and Replica C, which return digest responses; the result is returned to the client]
Read repair if digests differ
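The digest comparison and read repair can be sketched as follows. Replicas are modelled as in-memory dicts holding `(timestamp, value)` per key, MD5 is an arbitrary digest choice, and the repair runs synchronously here for simplicity; all of these are assumptions, not a description of Cassandra's internals.

```python
import hashlib


def digest(value: bytes) -> str:
    return hashlib.md5(value).hexdigest()


def read(key, replicas):
    """replicas: list of dicts mapping key -> (timestamp, value); the first is the closest."""
    closest, others = replicas[0], replicas[1:]
    ts, value = closest[key]                       # full data from the closest replica
    digests = [digest(r[key][1]) for r in others]  # digest responses from the rest
    if any(d != digest(value) for d in digests):
        # Digests differ: fetch full versions, pick the newest, push it to every replica.
        ts, value = max(r[key] for r in replicas)
        for r in replicas:
            r[key] = (ts, value)
    return value


replica_a = {"k": (1, b"old")}
replica_b = {"k": (2, b"new")}
replica_c = {"k": (2, b"new")}
result = read("k", [replica_a, replica_b, replica_c])
```

Sending digests instead of full values keeps the common case cheap: full data moves only from one replica unless a mismatch is detected.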
![Page 26: Cassandra and Sigmod contest Cloud computing group Haiping Wang 2009-12-19](https://reader030.vdocuments.net/reader030/viewer/2022020718/5519015955034626428b45aa/html5/thumbnails/26.jpg)
Outline
Cassandra
• Cassandra overview
• Data model
• Architecture
• Read and write
Sigmod contest 2009
Sigmod contest 2010
![Page 27: Cassandra and Sigmod contest Cloud computing group Haiping Wang 2009-12-19](https://reader030.vdocuments.net/reader030/viewer/2022020718/5519015955034626428b45aa/html5/thumbnails/27.jpg)
Sigmod contest 2009
Task overview
API
Data structure
Architecture
Test
![Page 28: Cassandra and Sigmod contest Cloud computing group Haiping Wang 2009-12-19](https://reader030.vdocuments.net/reader030/viewer/2022020718/5519015955034626428b45aa/html5/thumbnails/28.jpg)
Task overview
• Index system for main memory data
• Running on multi-core machine
• Many threads with multiple indices
• Serialize execution of user-specified transactions
• Basic function
exact match queries, range queries, updates, inserts, deletes
![Page 29: Cassandra and Sigmod contest Cloud computing group Haiping Wang 2009-12-19](https://reader030.vdocuments.net/reader030/viewer/2022020718/5519015955034626428b45aa/html5/thumbnails/29.jpg)
API
![Page 30: Cassandra and Sigmod contest Cloud computing group Haiping Wang 2009-12-19](https://reader030.vdocuments.net/reader030/viewer/2022020718/5519015955034626428b45aa/html5/thumbnails/30.jpg)
Record
![Page 31: Cassandra and Sigmod contest Cloud computing group Haiping Wang 2009-12-19](https://reader030.vdocuments.net/reader030/viewer/2022020718/5519015955034626428b45aa/html5/thumbnails/31.jpg)
HashTable
Fields: Hsize, size, hashTab, average(64), deviation, nbEl, domain, warpMode(bool)
hashTab: buckets 0 1 ... size-1, each pointing to a chain of key nodes
Node: dataType key ; int64_t hashKey ; char * payload ; *next
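The bucket array with chained nodes can be sketched in a few lines. This is not the contest submission's code: only the field names `size`, `hashTab`, `nbEl`, `key`, `hashKey`, `payload` and the next pointer come from the slide (rendered here in Python naming), and Python's built-in `hash` stands in for whatever hash function was actually used.

```python
class Node:
    def __init__(self, key, hash_key, payload, next_node=None):
        self.key = key
        self.hash_key = hash_key
        self.payload = payload
        self.next = next_node       # next node in the same bucket's chain


class HashTable:
    def __init__(self, size=8):
        self.size = size
        self.hash_tab = [None] * size   # bucket heads, indexes 0 .. size-1
        self.nb_el = 0

    def insert(self, key, payload):
        hash_key = hash(key)
        slot = hash_key % self.size
        # New nodes are pushed at the head of the bucket's chain.
        self.hash_tab[slot] = Node(key, hash_key, payload, self.hash_tab[slot])
        self.nb_el += 1

    def get(self, key):
        hash_key = hash(key)
        node = self.hash_tab[hash_key % self.size]
        while node is not None:
            # Compare the cached hash first; it is cheaper than comparing keys.
            if node.hash_key == hash_key and node.key == key:
                return node.payload
            node = node.next
        return None


table = HashTable(size=8)
table.insert(1, "first")
table.insert(9, "second")   # 1 and 9 land in the same bucket when size is 8
```

Storing the full hash in each node, as the `hashKey` field on the slide suggests, lets a lookup skip most key comparisons while walking a chain.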
![Page 32: Cassandra and Sigmod contest Cloud computing group Haiping Wang 2009-12-19](https://reader030.vdocuments.net/reader030/viewer/2022020718/5519015955034626428b45aa/html5/thumbnails/32.jpg)
HashShared
nbNameIndex
ni
0 1 2 3 ... 999
0 100 200 300 ... 999000
1 101 201 301 ... 99901
2 102 202 302 ... 99902
... ... ... ... ... ...
199 299 399 ... 9999999
data of type int
data of type NameIndex
idx
str \0
object of type IdxState
![Page 33: Cassandra and Sigmod contest Cloud computing group Haiping Wang 2009-12-19](https://reader030.vdocuments.net/reader030/viewer/2022020718/5519015955034626428b45aa/html5/thumbnails/33.jpg)
TxnState
state
indexActive
th
iNbR
indexToReset
nbIndex
0123...
199
0123...
199
![Page 34: Cassandra and Sigmod contest Cloud computing group Haiping Wang 2009-12-19](https://reader030.vdocuments.net/reader030/viewer/2022020718/5519015955034626428b45aa/html5/thumbnails/34.jpg)
IdxState
• Keeps track of an index
• Created by openIndex()
• Destroyed by closeIndex()
• Inherited by IdxStateType
• Contains pointers pointing to
– a hashtable
– a FixedAllocator
– an Allocator
– an array with the type of action
![Page 35: Cassandra and Sigmod contest Cloud computing group Haiping Wang 2009-12-19](https://reader030.vdocuments.net/reader030/viewer/2022020718/5519015955034626428b45aa/html5/thumbnails/35.jpg)
Architecture
indexManager
Allocator
transactor
DeadLockDetector
![Page 36: Cassandra and Sigmod contest Cloud computing group Haiping Wang 2009-12-19](https://reader030.vdocuments.net/reader030/viewer/2022020718/5519015955034626428b45aa/html5/thumbnails/36.jpg)
IndexManager
hs
nbIndexTab
indexTab
indexTab[0]
indexTab[1]
indexTab[2]
indexTab[3]
indexTab[i]
indexTab[..]
![Page 37: Cassandra and Sigmod contest Cloud computing group Haiping Wang 2009-12-19](https://reader030.vdocuments.net/reader030/viewer/2022020718/5519015955034626428b45aa/html5/thumbnails/37.jpg)
DeadLockDetector
![Page 38: Cassandra and Sigmod contest Cloud computing group Haiping Wang 2009-12-19](https://reader030.vdocuments.net/reader030/viewer/2022020718/5519015955034626428b45aa/html5/thumbnails/38.jpg)
Transactor
• a HashOnlyGet object with type TxnState
nbNameIndex
id
mutex
iThread
nbElement
0 1 2 3 ... 999
0 100 200 300 ... 999000
1 101 201 301 ... 99901
2 102 202 302 ... 99902
... ... ... ... ... ...
199 299 399 ... 9999999
pt
data T
![Page 39: Cassandra and Sigmod contest Cloud computing group Haiping Wang 2009-12-19](https://reader030.vdocuments.net/reader030/viewer/2022020718/5519015955034626428b45aa/html5/thumbnails/39.jpg)
Allocator
• Allocates the memory for the payloads
• Uses pools and linked lists
• Pools are sized by payload length; the max length of a payload is 100
• Payloads of the same length are kept in the same list
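A pool allocator of this kind can be sketched as one free list per payload length. The `Allocator` class below is a hypothetical illustration in Python, with `bytearray` buffers standing in for raw memory; it assumes that buffers are grouped by exact length, which is one reading of the slide.

```python
MAX_PAYLOAD = 100  # maximum payload length stated on the slide


class Allocator:
    def __init__(self):
        # free_lists[n] holds released buffers of exactly n bytes.
        self.free_lists = [[] for _ in range(MAX_PAYLOAD + 1)]

    def allocate(self, length: int) -> bytearray:
        if not 0 <= length <= MAX_PAYLOAD:
            raise ValueError("payload too long")
        pool = self.free_lists[length]
        # Reuse a released buffer of this length if one exists, else create one.
        return pool.pop() if pool else bytearray(length)

    def release(self, buf: bytearray) -> None:
        self.free_lists[len(buf)].append(buf)


alloc = Allocator()
first = alloc.allocate(16)
alloc.release(first)
second = alloc.allocate(16)   # comes from the free list
third = alloc.allocate(16)    # list is empty again, so a fresh buffer
```

Grouping buffers by length makes both allocation and release constant-time and avoids fragmentation, at the cost of keeping up to 101 separate lists.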
![Page 40: Cassandra and Sigmod contest Cloud computing group Haiping Wang 2009-12-19](https://reader030.vdocuments.net/reader030/viewer/2022020718/5519015955034626428b45aa/html5/thumbnails/40.jpg)
Unit Tests
• three threads, run over three indices
• the primary thread
– creates the primary index
– inserts, deletes and accesses data in the primary index
• the second thread
– simultaneously runs some basic tests over a separate index
• the third thread
– ensures the transactional guarantees
– continuously queries the primary index
![Page 41: Cassandra and Sigmod contest Cloud computing group Haiping Wang 2009-12-19](https://reader030.vdocuments.net/reader030/viewer/2022020718/5519015955034626428b45aa/html5/thumbnails/41.jpg)
Outline
Cassandra
• Cassandra overview
• Data model
• Architecture
• Read and write
Sigmod contest 2009
Sigmod contest 2010
![Page 42: Cassandra and Sigmod contest Cloud computing group Haiping Wang 2009-12-19](https://reader030.vdocuments.net/reader030/viewer/2022020718/5519015955034626428b45aa/html5/thumbnails/42.jpg)
Task overview
• Implement a simple distributed query executor with the help of the in-memory index
• Given centralized query plans, translate them into distributed query plans
• Given a parsed SQL query, return the right results
• Data stored on disk, the indexes are all in memory
• Measure the total time costs
![Page 43: Cassandra and Sigmod contest Cloud computing group Haiping Wang 2009-12-19](https://reader030.vdocuments.net/reader030/viewer/2022020718/5519015955034626428b45aa/html5/thumbnails/43.jpg)
SQL query form
SELECT alias_name.field_name, ...
FROM table_name AS alias_name,…
WHERE condition1 AND ... AND conditionN
Condition
alias_name.field_name = fixed value
alias_name.field_name > fixed value
alias_name.field_name1 = alias_name.field_name2
![Page 44: Cassandra and Sigmod contest Cloud computing group Haiping Wang 2009-12-19](https://reader030.vdocuments.net/reader030/viewer/2022020718/5519015955034626428b45aa/html5/thumbnails/44.jpg)
Initialization phase
![Page 45: Cassandra and Sigmod contest Cloud computing group Haiping Wang 2009-12-19](https://reader030.vdocuments.net/reader030/viewer/2022020718/5519015955034626428b45aa/html5/thumbnails/45.jpg)
Connection phase
![Page 46: Cassandra and Sigmod contest Cloud computing group Haiping Wang 2009-12-19](https://reader030.vdocuments.net/reader030/viewer/2022020718/5519015955034626428b45aa/html5/thumbnails/46.jpg)
Query phase
![Page 47: Cassandra and Sigmod contest Cloud computing group Haiping Wang 2009-12-19](https://reader030.vdocuments.net/reader030/viewer/2022020718/5519015955034626428b45aa/html5/thumbnails/47.jpg)
Closing phase
![Page 48: Cassandra and Sigmod contest Cloud computing group Haiping Wang 2009-12-19](https://reader030.vdocuments.net/reader030/viewer/2022020718/5519015955034626428b45aa/html5/thumbnails/48.jpg)
Tests
• An initial computation
• On synthetic and real-world datasets
• Tested on a single machine
• Tested on an ad-hoc cluster of peers
• Teams that pass a collection of unit tests are provided with an Amazon Web Services account worth 100 USD
![Page 49: Cassandra and Sigmod contest Cloud computing group Haiping Wang 2009-12-19](https://reader030.vdocuments.net/reader030/viewer/2022020718/5519015955034626428b45aa/html5/thumbnails/49.jpg)
Benchmarks (stage 1)
• Assume a partition always covers the entire table; the data is not replicated.
• Unit-tests
• Benchmarks
– On a single node, selects with an equal condition on the primary key
– On a single node, selects with an equal condition on an indexed field
– On a single node, 2 to 5 joins on tables of different size
– On a single node, 1 join and a "greater than" condition on an indexed field
– On three nodes, one join on two tables of different size, the two tables being on two different nodes
![Page 50: Cassandra and Sigmod contest Cloud computing group Haiping Wang 2009-12-19](https://reader030.vdocuments.net/reader030/viewer/2022020718/5519015955034626428b45aa/html5/thumbnails/50.jpg)
Benchmarks (stage 2)
• Tables are now stored on multiple nodes
• Part of a table, or the whole table, may be replicated on multiple nodes
• Queries will be sent in parallel, with up to 50 simultaneous connections
• Benchmarks
– Selects with an equal condition on the primary key, the values being uniformly distributed
– Selects with an equal condition on the primary key, the values being non-uniformly distributed
– Multiple joins on tables separated on different nodes
![Page 51: Cassandra and Sigmod contest Cloud computing group Haiping Wang 2009-12-19](https://reader030.vdocuments.net/reader030/viewer/2022020718/5519015955034626428b45aa/html5/thumbnails/51.jpg)
Important Dates
![Page 52: Cassandra and Sigmod contest Cloud computing group Haiping Wang 2009-12-19](https://reader030.vdocuments.net/reader030/viewer/2022020718/5519015955034626428b45aa/html5/thumbnails/52.jpg)
Thank you!!!