20110515 google megastore
Post on 12-May-2015
Megastore - Providing Scalable, Highly Available Storage for Interactive Services
J. Baker, C. Bond,
J.C. Corbett, JJ Furman,
A. Khorlin, J. Larson,
J-M Léon, Y. Li, A. Lloyd,
V. Yushprakh
Google Inc.
bergwolf@linuxfb.org
May 2011
Agenda
Motivation
Architecture
ACID over NOSQL Database
Replication via Paxos
Operational Results
Motivation
Build a system to please everyone (users, admins, developers).
Motivation
High availability – Fully functional during planned maintenance periods, as well as most unplanned infrastructure issues.
Scalability – Serve a huge audience of potential users.
ACID – Makes writing and deploying applications easier.
Agenda
Motivation
Megastore Architecture
ACID over NOSQL Database
Replication via Paxos
Operational Results
Megastore Overview
Widely deployed in Google for several years.
Used by more than 100 production applications.
Handles more than 3 billion write and 20 billion read transactions daily.
Stores nearly a petabyte of primary data across many global datacenters.
Available on GAE since Jan 2011.
Architecture
Built on top of Bigtable and Chubby.
Blends the scalability of a NoSQL datastore with the convenience of a traditional RDBMS.
Synchronous replication based on Paxos across datacenters.
Architecture
Architecture
Scalable replication.
Architecture
Operation across Entity Groups
Agenda
Motivation
Megastore Architecture
ACID over NOSQL Database
Replication via Paxos
Operational Results
Data Model
Somewhere between RDBMS and row-column storage of NOSQL.
Schemas
Tables (entity group root table / child table; a child table must have a single distinguished foreign key referencing the root table)
Entities
Properties
Sample Schema
Mapping to Bigtable
Primary Keys are chosen to cluster entities that will be read together.
Each entity is mapped into a single Bigtable row.
The “IN TABLE” directive tells Megastore to colocate tables in the same Bigtable, and key ordering ensures that Photo entities are stored adjacent to the corresponding User.
Bigtable column name = Megastore table name + property name
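The mapping above can be illustrated with a small sketch. It is a toy model, not Megastore code: the `.` separator in the column name, the tuple row keys, and the sample User/Photo values are assumptions made for illustration.

```python
# Toy model of the Megastore-to-Bigtable mapping (illustrative only).
# Assumptions: row keys are tuples of primary-key parts, and the column
# name joins table and property with "." (the slide only says "table
# name + property name").

def bigtable_column(table: str, prop: str) -> str:
    """Build a column name from the Megastore table and property names."""
    return f"{table}.{prop}"

rows = {}  # row key (tuple) -> {column name: value}

def put(table: str, key: tuple, props: dict) -> None:
    """Store one entity as a single row."""
    row = rows.setdefault(key, {})
    for prop, value in props.items():
        row[bigtable_column(table, prop)] = value

# Hypothetical data: Photo keys start with the owning User's key.
put("User", (101,), {"name": "John"})
put("User", (102,), {"name": "Mary"})
put("Photo", (101, 500), {"tag": "Dinner"})
put("Photo", (102, 600), {"tag": "Office"})
put("Photo", (101, 502), {"tag": "Betty"})

# Sorting by key places each User row directly before its Photo rows.
sorted_keys = sorted(rows)
```

Because a Photo key shares its leading component with the owning User key, a sorted scan returns each user followed by that user's photos, which is the clustering effect the slide describes.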
Indexes
Two levels of indexes:
Local index: Separate indexes for each entity group. Stored in the entity group and updated atomically and consistently.
Global index: Spans entity groups. Not guaranteed to reflect all recent updates.
Transactions & Concurrency
Entity group is a mini-database providing serializable ACID semantics.
MVCC (MultiVersion Concurrency Control) using transaction timestamp
Reads and Writes are isolated
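A minimal sketch of timestamp-based MVCC for a single cell, assuming an in-memory list of versions. The class and method names are hypothetical; the real system relies on Bigtable's ability to store multiple timestamped values per cell.

```python
import bisect

class VersionedCell:
    """One cell holding several timestamped versions (hypothetical model)."""

    def __init__(self):
        self._timestamps = []  # kept sorted in ascending order
        self._values = []      # _values[i] was written at _timestamps[i]

    def write(self, timestamp: int, value) -> None:
        """Add a new version; existing versions are never overwritten."""
        i = bisect.bisect_right(self._timestamps, timestamp)
        self._timestamps.insert(i, timestamp)
        self._values.insert(i, value)

    def read(self, timestamp: int):
        """Return the newest value written at or before `timestamp`."""
        i = bisect.bisect_right(self._timestamps, timestamp)
        return self._values[i - 1] if i else None

cell = VersionedCell()
cell.write(10, "a")
cell.write(20, "b")
value_at_15 = cell.read(15)  # sees the version from timestamp 10
```

A reader that fixes its timestamp at 15 keeps seeing the version written at 10 even after a later write at 20 lands, which is how reads and writes stay isolated from each other.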
Transactions & Concurrency
Three levels of read consistency:
Current: apply all previously committed log entries before reading, within a single entity group.
Snapshot: pick the last known fully applied transaction and read at that point, within a single entity group.
Inconsistent: ignore the state of the log and read the latest value directly.
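The difference between the three read levels can be sketched with a toy entity group that tracks a committed log and how much of it has been applied. Everything here (class name, `apply_next`, the `only_keys` knob that simulates a half-finished apply) is an assumption made for illustration.

```python
class EntityGroup:
    """Toy entity group: a committed log plus partially applied storage."""

    def __init__(self):
        self.log = []          # committed entries, each a dict key -> value
        self.applied = 0       # number of fully applied log entries
        self.snapshots = [{}]  # snapshots[n] = state after n applied entries
        self.raw = {}          # latest stored values, possibly mid-apply

    def commit(self, mutations: dict) -> None:
        self.log.append(dict(mutations))

    def apply_next(self, only_keys=None) -> None:
        """Apply the next entry; `only_keys` simulates an interrupted apply."""
        entry = self.log[self.applied]
        if only_keys is not None:
            self.raw.update({k: v for k, v in entry.items() if k in only_keys})
            return
        self.raw.update(entry)
        self.snapshots.append({**self.snapshots[-1], **entry})
        self.applied += 1

    def current_read(self, key):
        """Catch up on every committed entry, then read."""
        while self.applied < len(self.log):
            self.apply_next()
        return self.snapshots[-1].get(key)

    def snapshot_read(self, key):
        """Read at the last fully applied transaction."""
        return self.snapshots[self.applied].get(key)

    def inconsistent_read(self, key):
        """Ignore the log state and return whatever is stored right now."""
        return self.raw.get(key)
```

With one entry fully applied and a second applied only halfway, a snapshot read returns the first transaction's values, an inconsistent read may return a mix of both, and a current read first finishes applying the log.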
Transactions & Concurrency
Write transaction:
Current read: Obtain the timestamp and log position of the last committed transaction.
Application logic: Read from Bigtable and gather writes into a log entry.
Commit: Use Paxos to achieve consensus for appending the log entry to log.
Apply: Write mutations to the entities and indexes in Bigtable.
Clean up: Delete temp data.
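The write lifecycle can be sketched on a single node as follows. This is a simplification under a loud assumption: the Paxos round is replaced by a local check that the log position read at the start is still the next free slot, so only the optimistic-concurrency shape of the protocol is shown. All names are hypothetical.

```python
class ConflictError(Exception):
    """Raised when another writer has already taken the log position."""

class EntityGroupLog:
    def __init__(self):
        self.log = []   # committed log entries (dicts of mutations)
        self.data = {}  # applied entity state

    def current_read(self):
        """Step 1: return the next log position and a view of the data."""
        return len(self.log), dict(self.data)

    def commit(self, expected_position: int, entry: dict) -> None:
        """Step 3: stand-in for Paxos; only one writer wins a position."""
        if expected_position != len(self.log):
            raise ConflictError(f"position {expected_position} already taken")
        self.log.append(entry)

    def apply(self, position: int) -> None:
        """Step 4: write the mutations of one log entry to the data."""
        self.data.update(self.log[position])

def run_write_transaction(group: EntityGroupLog, logic) -> int:
    position, view = group.current_read()  # 1. current read
    entry = logic(view)                    # 2. application logic builds entry
    group.commit(position, entry)          # 3. commit (Paxos in Megastore)
    group.apply(position)                  # 4. apply
    return position                        # 5. clean up: nothing temporary here

group = EntityGroupLog()
run_write_transaction(group, lambda view: {"count": view.get("count", 0) + 1})
```

Two writers that read the same log position race for it; the loser gets a `ConflictError` and would retry from the current read.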
Transactions & Concurrency
Queues provide transactional messaging between entity groups. Declaring a queue automatically creates an inbox on each entity group (scale automatically).
Two-phase commit is supported for atomic updates across entity groups.
Queues are recommended over two-phase commit.
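A rough sketch of queue-based messaging between entity groups, assuming each group owns an in-memory inbox. The function names and the handler signature are hypothetical, and durability and exactly-once delivery are not modelled.

```python
from collections import deque

class Group:
    """Toy entity group with its own data and a message inbox."""

    def __init__(self, name: str):
        self.name = name
        self.data = {}
        self.inbox = deque()

def commit_with_messages(sender: Group, mutations: dict, outgoing) -> None:
    """Apply the sender's mutations and enqueue its outgoing messages.

    In this sketch both happen in one step, standing in for "the send is
    part of the sender's transaction".
    """
    sender.data.update(mutations)
    for receiver, message in outgoing:
        receiver.inbox.append(message)

def drain_inbox(receiver: Group, handler) -> None:
    """Later, the receiver processes each message in its own transaction."""
    while receiver.inbox:
        message = receiver.inbox.popleft()
        receiver.data.update(handler(receiver.data, message))

alice, bob = Group("alice"), Group("bob")
commit_with_messages(alice, {"sent": 1}, [(bob, {"invite_from": "alice"})])
drain_inbox(bob, lambda data, msg: {"pending_invite": msg["invite_from"]})
```

The sender never touches the receiver's data directly; it only appends to the inbox, which is why this approach avoids the cross-group coordination that two-phase commit requires.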
Agenda
Motivation
Megastore Architecture
ACID over NOSQL Database
Replication via Paxos
Operational Results
Paxos
Basic Paxos
Multi-Paxos
Reads
Writes
Failure Detection
Coordinators obtain specific Chubby locks in remote datacenters at startup.
If a coordinator ever loses a majority of its locks because of a crash or network partition, it considers all entity groups in its purview to be out of date.
Reads at that replica must then query the log position from a majority of replicas until the locks are regained and the coordinator's entries are revalidated.
All writers must wait for the coordinator's Chubby locks to expire before their writes can complete.
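The coordinator rule above can be sketched as a small state machine. Chubby itself is not modelled; a lock is just a boolean, and the class, method, and lock names are assumptions made for illustration.

```python
class Coordinator:
    """Toy coordinator tracking Chubby locks and up-to-date entity groups."""

    def __init__(self, lock_names):
        self.locks = {name: True for name in lock_names}  # True = held
        self.up_to_date = set()  # entity groups known to be current locally

    def lost_majority(self) -> bool:
        lost = sum(1 for held in self.locks.values() if not held)
        return lost * 2 > len(self.locks)

    def lose_lock(self, name: str) -> None:
        self.locks[name] = False
        if self.lost_majority():
            # Conservatively treat every entity group as out of date.
            self.up_to_date.clear()

    def regain_lock(self, name: str) -> None:
        self.locks[name] = True

    def validate(self, group: str) -> None:
        """Mark a group current again; only allowed while healthy."""
        if not self.lost_majority():
            self.up_to_date.add(group)

    def can_serve_local_read(self, group: str) -> bool:
        """False means the reader must ask a majority of replicas instead."""
        return not self.lost_majority() and group in self.up_to_date

coord = Coordinator(["dc-a", "dc-b", "dc-c"])  # hypothetical lock names
coord.validate("user:101")
coord.lose_lock("dc-a")  # one of three lost: still healthy
coord.lose_lock("dc-b")  # majority lost: everything is invalidated
```

Regaining the locks alone does not restore local reads in this sketch; each entity group has to be revalidated first, mirroring the revalidation step on the slide.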
Agenda
Motivation
Megastore Architecture
ACID over NOSQL Database
Replication via Paxos
Operational Results
Distribution of Availability
Distribution of Average Latencies
Conclusion
Most users see five-nines availability.
Average read latencies are tens of milliseconds, indicating that most reads are local.
Most writes cost 100-400 milliseconds.
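To put "five nines" in concrete terms, the short calculation below converts an availability level into allowed downtime per year. It assumes a 365-day year; the function name is made up for this example.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # assumes a non-leap year

def downtime_minutes_per_year(nines: int) -> float:
    """Allowed downtime for an availability of `nines` nines (5 = 99.999%)."""
    unavailability = 10 ** -nines
    return unavailability * MINUTES_PER_YEAR

five_nines = downtime_minutes_per_year(5)
```

Five nines works out to a little over five minutes of unavailability per year, compared with nearly nine hours for three nines.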
Questions?