20110515 google megastore
![Page 1: 20110515 google megastore](https://reader036.vdocuments.net/reader036/viewer/2022062319/5551477db4c905c6268b4e7b/html5/thumbnails/1.jpg)
Megastore: Providing Scalable, Highly Available Storage for Interactive Services
J. Baker, C. Bond,
J.C. Corbett, JJ Furman,
A. Khorlin, J. Larson,
J-M Léon, Y. Li, A. Lloyd,
V. Yushprakh
Google Inc.
[email protected]. 2011
![Page 2: 20110515 google megastore](https://reader036.vdocuments.net/reader036/viewer/2022062319/5551477db4c905c6268b4e7b/html5/thumbnails/2.jpg)
Agenda
Motivation
Architecture
ACID over NOSQL Database
Replication via Paxos
Operational Results
![Page 3: 20110515 google megastore](https://reader036.vdocuments.net/reader036/viewer/2022062319/5551477db4c905c6268b4e7b/html5/thumbnails/3.jpg)
Motivation
Build a system to please everyone (users, admins, developers).
![Page 4: 20110515 google megastore](https://reader036.vdocuments.net/reader036/viewer/2022062319/5551477db4c905c6268b4e7b/html5/thumbnails/4.jpg)
Motivation
High availability – Fully functional during planned maintenance periods, as well as most unplanned infrastructure issues.
Scalability – Serve a huge audience of potential users.
ACID – Makes applications easier to write and deploy.
![Page 5: 20110515 google megastore](https://reader036.vdocuments.net/reader036/viewer/2022062319/5551477db4c905c6268b4e7b/html5/thumbnails/5.jpg)
Agenda
Motivation
Megastore Architecture
ACID over NOSQL Database
Replication via Paxos
Operational Results
![Page 6: 20110515 google megastore](https://reader036.vdocuments.net/reader036/viewer/2022062319/5551477db4c905c6268b4e7b/html5/thumbnails/6.jpg)
Megastore Overview
Widely deployed in Google for several years.
Used by more than 100 production applications.
Handles more than 3 billion write and 20 billion read transactions daily.
Stores nearly a petabyte of primary data across many global datacenters.
Available on GAE since Jan 2011.
![Page 7: 20110515 google megastore](https://reader036.vdocuments.net/reader036/viewer/2022062319/5551477db4c905c6268b4e7b/html5/thumbnails/7.jpg)
Architecture
Built on top of Bigtable and Chubby.
Blends the scalability of a NoSQL datastore with the convenience of a traditional RDBMS.
Synchronous replication based on Paxos across datacenters.
![Page 8: 20110515 google megastore](https://reader036.vdocuments.net/reader036/viewer/2022062319/5551477db4c905c6268b4e7b/html5/thumbnails/8.jpg)
Architecture
![Page 9: 20110515 google megastore](https://reader036.vdocuments.net/reader036/viewer/2022062319/5551477db4c905c6268b4e7b/html5/thumbnails/9.jpg)
Architecture
Scalable replication.
![Page 10: 20110515 google megastore](https://reader036.vdocuments.net/reader036/viewer/2022062319/5551477db4c905c6268b4e7b/html5/thumbnails/10.jpg)
Architecture
Operation across Entity Groups
![Page 11: 20110515 google megastore](https://reader036.vdocuments.net/reader036/viewer/2022062319/5551477db4c905c6268b4e7b/html5/thumbnails/11.jpg)
Agenda
Motivation
Megastore Architecture
ACID over NOSQL Database
Replication via Paxos
Operational Results
![Page 12: 20110515 google megastore](https://reader036.vdocuments.net/reader036/viewer/2022062319/5551477db4c905c6268b4e7b/html5/thumbnails/12.jpg)
Data Model
Somewhere between an RDBMS and the row-column storage of NoSQL.
Schemas
Tables (entity group root table or child table; a child table must have a single distinguished foreign key referencing a root table)
Entities
Properties
![Page 13: 20110515 google megastore](https://reader036.vdocuments.net/reader036/viewer/2022062319/5551477db4c905c6268b4e7b/html5/thumbnails/13.jpg)
Sample Schema
![Page 14: 20110515 google megastore](https://reader036.vdocuments.net/reader036/viewer/2022062319/5551477db4c905c6268b4e7b/html5/thumbnails/14.jpg)
Mapping to Bigtable
Primary Keys are chosen to cluster entities that will be read together.
Each entity is mapped into a single Bigtable row.
The “IN TABLE” directive colocates tables in the same Bigtable, and key ordering ensures that Photo entities are stored adjacent to the corresponding User.
Bigtable column name = Megastore table name + property name
![Page 15: 20110515 google megastore](https://reader036.vdocuments.net/reader036/viewer/2022062319/5551477db4c905c6268b4e7b/html5/thumbnails/15.jpg)
Indexes
Two levels of indexes:
Local index: A separate index for each entity group. Stored in the entity group and updated atomically and consistently with its data.
Global index: Spans entity groups. Not guaranteed to reflect all recent updates.
![Page 16: 20110515 google megastore](https://reader036.vdocuments.net/reader036/viewer/2022062319/5551477db4c905c6268b4e7b/html5/thumbnails/16.jpg)
Transactions & Concurrency
Each entity group is a mini-database providing serializable ACID semantics.
MVCC (multiversion concurrency control) using transaction timestamps.
Reads and writes are isolated.
![Page 17: 20110515 google megastore](https://reader036.vdocuments.net/reader036/viewer/2022062319/5551477db4c905c6268b4e7b/html5/thumbnails/17.jpg)
Transactions & Concurrency
Three levels of read consistency:
Current: Apply all previously committed log entries before reading, within a single entity group.
Snapshot: Read at the timestamp of the last known fully applied transaction, within a single entity group.
Inconsistent: Ignore the state of the log and read the latest values directly.
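The difference between the three read levels can be sketched with a toy in-memory model. This is an illustration only: the `EntityGroup` class, its method names, and the `only_keys` argument (used to simulate a partially applied log entry) are all hypothetical and not part of Megastore's API.

```python
class EntityGroup:
    """Toy model of one entity group: a commit log plus a multiversion store."""

    def __init__(self):
        self.log = []        # committed entries, each a dict of key -> value
        self.applied = 0     # number of log entries fully applied to the store
        self.versions = {}   # key -> {log position: value}

    def commit(self, mutations):
        self.log.append(dict(mutations))

    def apply_next(self, only_keys=None):
        """Apply the next committed entry; only_keys simulates a partial apply."""
        position = self.applied + 1
        for key, value in self.log[self.applied].items():
            if only_keys is None or key in only_keys:
                self.versions.setdefault(key, {})[position] = value
        if only_keys is None:
            self.applied = position

    def _read_at(self, key, position):
        visible = [p for p in self.versions.get(key, {}) if p <= position]
        return self.versions[key][max(visible)] if visible else None

    def current_read(self, key):
        # Catch up on every committed entry before reading.
        while self.applied < len(self.log):
            self.apply_next()
        return self._read_at(key, self.applied)

    def snapshot_read(self, key):
        # Read as of the last fully applied entry, without catching up.
        return self._read_at(key, self.applied)

    def inconsistent_read(self, key):
        # Ignore the log state and return whatever version is newest.
        cell = self.versions.get(key, {})
        return cell[max(cell)] if cell else None

group = EntityGroup()
group.commit({"a": 1, "b": 1})
group.apply_next()
group.commit({"a": 2, "b": 2})
group.apply_next(only_keys={"a"})   # entry 2 is only partially applied

inconsistent = group.inconsistent_read("a")  # sees the half-applied write
snapshot = group.snapshot_read("a")          # sees the last fully applied state
current = group.current_read("a")            # applies entry 2, then reads
```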
![Page 18: 20110515 google megastore](https://reader036.vdocuments.net/reader036/viewer/2022062319/5551477db4c905c6268b4e7b/html5/thumbnails/18.jpg)
Transactions & Concurrency
Write transaction:
Current read: Obtain the timestamp and log position of the last committed transaction.
Application logic: Read from Bigtable and gather writes into a log entry.
Commit: Use Paxos to achieve consensus on appending the log entry to the log.
Apply: Write mutations to the entities and indexes in Bigtable.
Clean up: Delete temporary data.
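The write lifecycle above can be sketched as optimistic concurrency on the next log position. In this simplified, single-process illustration the Paxos round is replaced by a local compare-and-append, and the clean-up step is omitted because no temporary data is created; `EntityGroupLog` and `write_transaction` are hypothetical names.

```python
class EntityGroupLog:
    """Toy write-ahead log for one entity group (no replication)."""

    def __init__(self):
        self.entries = []   # committed log entries
        self.store = {}     # applied entity data

    def current_read(self):
        """Step 1: return the log position of the last committed entry."""
        return len(self.entries)

    def commit(self, read_position, entry):
        """Step 3: append at read_position + 1 only if that slot is still free.

        Stand-in for the Paxos round: exactly one writer wins each position.
        """
        if len(self.entries) != read_position:
            return False  # another writer already took this log position
        self.entries.append(entry)
        return True

    def apply(self, entry):
        """Step 4: write the mutations to the store."""
        self.store.update(entry)

def write_transaction(log, logic):
    position = log.current_read()
    entry = logic(dict(log.store))   # Step 2: application logic gathers writes
    if not log.commit(position, entry):
        return False                 # lost the race; the caller may retry
    log.apply(entry)
    return True

log = EntityGroupLog()
ok = write_transaction(log, lambda state: {"balance": state.get("balance", 0) + 10})

# Two writers that read the same log position: only the first commit succeeds.
pos_a = log.current_read()
pos_b = log.current_read()
first = log.commit(pos_a, {"balance": 20})
second = log.commit(pos_b, {"balance": 99})
```

A losing writer would normally restart from the current read, which is why contention within one entity group limits its write throughput.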
![Page 19: 20110515 google megastore](https://reader036.vdocuments.net/reader036/viewer/2022062319/5551477db4c905c6268b4e7b/html5/thumbnails/19.jpg)
Transactions & Concurrency
Queues provide transactional messaging between entity groups. Declaring a queue automatically creates an inbox on each entity group, so queues scale automatically.
Two-phase commit is also supported for atomic updates across entity groups.
Queues are recommended over two-phase commit.
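The queue pattern can be sketched as follows: the sender updates its own entities and records an outgoing message in one local transaction, and the receiver later consumes the message in a separate transaction on its own entity group. Every name here is hypothetical, and the explicit `outbox` and `deliver` step are assumptions made to keep the sketch self-contained; the slides only mention a per-group inbox.

```python
from collections import deque

class EntityGroup:
    def __init__(self):
        self.entities = {}
        self.outbox = deque()  # assumed: messages committed with the sender's transaction
        self.inbox = deque()   # messages awaiting a transaction on this group

def commit_with_message(sender, updates, message):
    """One transaction on the sender: update entities and enqueue a message."""
    sender.entities.update(updates)
    sender.outbox.append(message)

def deliver(sender, receiver):
    """Asynchronous hand-off between groups, outside any single transaction."""
    while sender.outbox:
        receiver.inbox.append(sender.outbox.popleft())

def receive_all(receiver, handler):
    """One transaction on the receiver: consume messages and update entities."""
    while receiver.inbox:
        receiver.entities.update(handler(receiver.inbox.popleft()))

alice, bob = EntityGroup(), EntityGroup()
commit_with_message(
    alice,
    {"event:1": "Lunch (organizer)"},
    {"event": "event:1", "title": "Lunch"},
)
before_delivery = dict(bob.entities)   # bob has not seen the invitation yet
deliver(alice, bob)
receive_all(bob, lambda m: {m["event"]: m["title"] + " (invited)"})
```

Each group only ever runs transactions against its own data, which is the property that lets queues avoid the latency and contention of two-phase commit.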
![Page 20: 20110515 google megastore](https://reader036.vdocuments.net/reader036/viewer/2022062319/5551477db4c905c6268b4e7b/html5/thumbnails/20.jpg)
Agenda
Motivation
Megastore Architecture
ACID over NOSQL Database
Replication via Paxos
Operational Results
![Page 21: 20110515 google megastore](https://reader036.vdocuments.net/reader036/viewer/2022062319/5551477db4c905c6268b4e7b/html5/thumbnails/21.jpg)
Paxos
Basic Paxos
Multi-Paxos
![Page 22: 20110515 google megastore](https://reader036.vdocuments.net/reader036/viewer/2022062319/5551477db4c905c6268b4e7b/html5/thumbnails/22.jpg)
Reads
![Page 23: 20110515 google megastore](https://reader036.vdocuments.net/reader036/viewer/2022062319/5551477db4c905c6268b4e7b/html5/thumbnails/23.jpg)
Writes
![Page 24: 20110515 google megastore](https://reader036.vdocuments.net/reader036/viewer/2022062319/5551477db4c905c6268b4e7b/html5/thumbnails/24.jpg)
Failure Detection
Coordinators obtain specific Chubby locks in remote datacenters at startup.
If a coordinator ever loses a majority of its locks from a crash or network partition, it considers all entity groups in its purview to be out-of-date.
Reads at that replica must then query the log position from a majority of replicas until the locks are regained and the coordinator's entries are revalidated.
All writers must wait for the coordinator's Chubby locks to expire before writes can complete.
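The coordinator's lock-majority rule can be sketched as below. This is a toy model with hypothetical names: real Chubby locks are leases held through a lock service, whereas here they are plain booleans.

```python
class Coordinator:
    """Tracks which entity groups are up to date at the local replica."""

    def __init__(self, lock_names):
        self.locks = {name: True for name in lock_names}  # True = lock held
        self.up_to_date = set()  # entity groups known to be current here

    def holds_majority(self):
        held = sum(self.locks.values())
        return held > len(self.locks) // 2

    def lose_lock(self, name):
        self.locks[name] = False
        if not self.holds_majority():
            # Conservative default: treat every entity group as out-of-date.
            self.up_to_date.clear()

    def can_serve_local_read(self, group):
        # Without a majority, readers must ask a majority of replicas instead.
        return self.holds_majority() and group in self.up_to_date

coord = Coordinator(["dc-a", "dc-b", "dc-c"])
coord.up_to_date.add("group-1")

coord.lose_lock("dc-a")
after_one_loss = coord.can_serve_local_read("group-1")   # 2 of 3 locks held

coord.lose_lock("dc-b")
after_two_losses = coord.can_serve_local_read("group-1") # 1 of 3 locks held
```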
![Page 25: 20110515 google megastore](https://reader036.vdocuments.net/reader036/viewer/2022062319/5551477db4c905c6268b4e7b/html5/thumbnails/25.jpg)
Agenda
Motivation
Megastore Architecture
ACID over NOSQL Database
Replication via Paxos
Operational Results
![Page 26: 20110515 google megastore](https://reader036.vdocuments.net/reader036/viewer/2022062319/5551477db4c905c6268b4e7b/html5/thumbnails/26.jpg)
Distribution of Availability
![Page 27: 20110515 google megastore](https://reader036.vdocuments.net/reader036/viewer/2022062319/5551477db4c905c6268b4e7b/html5/thumbnails/27.jpg)
Distribution of Average Latencies
![Page 28: 20110515 google megastore](https://reader036.vdocuments.net/reader036/viewer/2022062319/5551477db4c905c6268b4e7b/html5/thumbnails/28.jpg)
Conclusion
Most users see five-nines availability.
Average read latencies are tens of milliseconds, indicating that most reads are local.
Most writes cost 100-400 milliseconds.
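For intuition about what five nines means, the short calculation below converts an availability level into an annual downtime budget. It assumes availability is measured as a fraction of time; the slides do not say how it was measured, so treat the result as a rough equivalent rather than a reported figure.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes, ignoring leap years

def allowed_downtime_minutes(nines):
    """Annual downtime budget for an availability of 1 - 10**-nines."""
    unavailability = 10 ** -nines
    return MINUTES_PER_YEAR * unavailability

five_nines = allowed_downtime_minutes(5)   # roughly 5.3 minutes per year
three_nines = allowed_downtime_minutes(3)  # roughly 8.8 hours per year
```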
![Page 29: 20110515 google megastore](https://reader036.vdocuments.net/reader036/viewer/2022062319/5551477db4c905c6268b4e7b/html5/thumbnails/29.jpg)
Questions?