Distributed Systems Tutorial 9 – Windows Azure Storage
Written by Alex Libov, based on the SOSP 2011 presentation
Winter semester, 2011-2012
Windows Azure Storage (WAS)
• A scalable cloud storage system
• In production since November 2008
• Used inside Microsoft for applications such as social networking search, serving video, music and game content, managing medical records, and more
• Thousands of customers outside Microsoft; anyone can sign up over the Internet to use the system
WAS Abstractions
• Blobs – file system in the cloud
• Tables – massively scalable structured storage
• Queues – reliable storage and delivery of messages
• A common usage pattern: incoming and outgoing data is shipped via Blobs, Queues provide the overall workflow for processing the Blobs, and intermediate service state and final results are kept in Tables or Blobs (see the sketch below)
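To make that pattern concrete, here is a minimal, self-contained Python sketch. The names blob_store, queue, and table are toy in-memory stand-ins, not the WAS client API, and the processing step is invented for illustration:

```python
# Toy stand-ins for the three WAS abstractions (assumptions, not the real API).
blob_store, table = {}, {}
queue = []

def ingest(name, data):
    """Incoming data is shipped via a Blob; a Queue message drives the workflow."""
    blob_store[name] = data     # 1. upload the raw data as a blob
    queue.append(name)          # 2. enqueue a work item referencing it

def worker():
    """A worker dequeues a message, processes the blob, and stores the result."""
    while queue:
        name = queue.pop(0)                 # 3. receive the next work item
        result = blob_store[name].upper()   # 4. process the blob (toy step)
        table[name] = result                # 5. keep the final result in a Table

ingest("video-0001", b"raw bytes")
worker()
print(table)  # {'video-0001': b'RAW BYTES'}
```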
Design Goals
• Highly available with strong consistency – provide access to data in the face of failures/partitioning
• Durability – replicate data several times within and across data centers
• Scalability – need to scale to exabytes and beyond; provide a global namespace to access data around the world; automatically load balance data to meet peak traffic demands
Global Partitioned Namespace
• http(s)://AccountName.<service>.core.windows.net/PartitionName/ObjectName
• <service> can be blob, table, or queue
• AccountName is the customer-selected account name for accessing storage. The account name specifies the data center where the data is stored. An application may use multiple AccountNames to store its data across different locations.
• PartitionName locates the data once a request reaches the storage cluster. When a PartitionName holds many objects, the ObjectName identifies individual objects within that partition.
• The system supports atomic transactions across objects with the same PartitionName value.
• The ObjectName is optional since, for some types of data, the PartitionName uniquely identifies the object within the account.
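As a concrete illustration of the naming scheme, the following sketch builds such URLs; the account, partition, and object names are hypothetical:

```python
def was_url(account, service, partition, obj=""):
    """Build a WAS URL. <service> is 'blob', 'table' or 'queue';
    the ObjectName is optional when the PartitionName alone identifies the object."""
    url = f"https://{account}.{service}.core.windows.net/{partition}"
    return f"{url}/{obj}" if obj else url

print(was_url("myaccount", "blob", "photos", "sunset.jpg"))
# https://myaccount.blob.core.windows.net/photos/sunset.jpg
print(was_url("myaccount", "queue", "jobs"))
# https://myaccount.queue.core.windows.net/jobs
```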
Storage Stamps
• A storage stamp is a cluster of N racks of storage nodes.
• Each rack is built out as a separate fault domain with redundant networking and power.
• Clusters typically range from 10 to 20 racks with 18 disk-heavy storage nodes per rack.
• The first-generation storage stamps hold approximately 2 PB of raw storage each.
• The next-generation stamps hold up to 30 PB of raw storage each.
High Level Architecture
[Diagram: a Location Service directs data access to storage stamps through load balancers (LB). Each storage stamp contains Front-Ends, a Partition Layer, and a Stream Layer with intra-stamp replication; inter-stamp (geo) replication copies data between stamps.]
Access blob storage via the URL: http://<account>.blob.core.windows.net/
Storage Stamp Architecture – Stream Layer
• Append-only distributed file system
• All data from the Partition Layer is stored into files (extents) in the Stream Layer
• An extent is replicated 3 times across different fault and upgrade domains, with random selection for where to place replicas
• All stored data is checksummed, and the checksum is verified on every client read (a sketch follows)
• Data is re-replicated on disk/node/rack failure or checksum mismatch
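A minimal sketch of the checksum-on-read idea; CRC-32 is an assumption here, since the slides do not name the checksum algorithm:

```python
import zlib

def store_block(data):
    """Every stored block carries a checksum."""
    return data, zlib.crc32(data)

def read_block(data, checksum):
    """The checksum is verified on every read; in WAS a mismatch
    triggers re-replication from a healthy replica."""
    if zlib.crc32(data) != checksum:
        raise IOError("checksum mismatch: block is corrupt")
    return data

block, crc = store_block(b"some extent data")
assert read_block(block, crc) == b"some extent data"
```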
[Diagram: the Stream Layer (a distributed file system) consists of extent nodes (EN) and stream masters (M) coordinated via Paxos.]
Storage Stamp Architecture – Partition Layer
• Provides transaction semantics and strong consistency for Blobs, Tables, and Queues
• Stores and reads objects to/from extents in the Stream Layer
• Provides inter-stamp (geo) replication by shipping logs to other stamps
• Scalable object index via partitioning
[Diagram: the Partition Layer consists of multiple partition servers coordinated by a partition master and a lock service.]
Storage Stamp Architecture – Front End Layer
• Stateless servers
• Authentication + authorization
• Request routing
Storage Stamp Architecture
[Diagram: an incoming write request arrives at one of the stateless front-ends (FE) in the Front-End Layer, which routes it to the owning partition server in the Partition Layer (partition servers, a partition master, and a lock service); the partition server persists the write in the Stream Layer (extent nodes and Paxos-replicated masters M), and the ack flows back to the client.]
Partition Layer – Scalable Object Index
• 100s of billions of blobs, entities, and messages across all accounts can be stored in a single stamp
• Need to efficiently enumerate, query, get, and update them
• Traffic patterns can be highly dynamic: hot objects, peak load, traffic bursts, etc.
• Need a scalable index for the objects that can spread the index across 100s of servers, dynamically load balance, and dynamically change which servers serve each part of the index based on load
Scalable Object Index via Partitioning
• The Partition Layer maintains an internal Object Index Table for each data abstraction:
• Blob Index – contains all blob objects for all accounts in a stamp
• Table Entity Index – contains all table entities for all accounts in a stamp
• Queue Message Index – contains all messages for all accounts in a stamp
• Scalability is provided for each Object Index:
• Monitor load to each part of the index to determine hot spots
• The index is dynamically split into thousands of Index RangePartitions based on load
• Index RangePartitions are automatically load balanced across servers to quickly adapt to changes in load
[Diagram: the Blob Index is one logical table keyed by (AccountName, ContainerName, BlobName), spanning the full key space from (aaaa, aaaa, aaaaa) to (zzzz, zzzz, zzzzz).]
• Split index into RangePartitions based on load
• Split at PartitionKey boundaries
• PartitionMap tracks Index RangePartition assignment to partition servers
• Front-End caches the PartitionMap to route user requests
• Each part of the index is assigned to only one Partition Server at a time
[Diagram: within a storage stamp, the Blob Index is split across partition servers under a partition master; for example, one partition server serves the rows from (harry, pictures, sunset) through (richard, videos, soccer), and another serves (richard, videos, tennis) through (zzzz, zzzz, zzzzz).]
Partition Layer – Index Range Partitioning
[Diagram: the Blob Index is split at PartitionKey boundaries into three RangePartitions: A-H served by PS1, H'-R by PS2, and R'-Z by PS3. The Front-End server caches the Partition Map (A-H: PS1, H'-R: PS2, R'-Z: PS3) and routes each request to the single partition server owning the range; for example, the key (harry, pictures, sunrise) falls in A-H and is routed to PS1.]
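A minimal sketch of how a Front-End might route a request using its cached Partition Map. The ranges follow the diagram above, while the lookup code and the representation of the split points H' and R' as "I" and "S" are illustrative assumptions:

```python
import bisect

# Cached Partition Map: start of each key range and the server owning it.
lower_bounds = ["A", "I", "S"]          # A-H, H'-R, R'-Z (H'/S' approximated)
servers      = ["PS1", "PS2", "PS3"]

def route(partition_key):
    """Find the single partition server serving this key's range."""
    i = bisect.bisect_right(lower_bounds, partition_key.upper()) - 1
    return servers[i]

print(route("harry"))    # PS1 (falls in A-H)
print(route("richard"))  # PS2 (falls in H'-R)
print(route("tennis"))   # PS3 (falls in R'-Z)
```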
Partition Layer – RangePartition
• A RangePartition uses a Log-Structured Merge-Tree to maintain its persistent data.
• A RangePartition consists of its own set of streams in the Stream Layer, and the streams belong solely to that RangePartition:
• Metadata Stream – the root stream for a RangePartition. The PM assigns a partition to a PS by providing the name of the RangePartition's metadata stream.
• Commit Log Stream – a commit log used to store the recent insert, update, and delete operations applied to the RangePartition since the last checkpoint was generated for the RangePartition.
• Row Data Stream – stores the checkpoint row data and index for the RangePartition.
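The interplay of these streams can be sketched as a toy log-structured write path. The class and method names are hypothetical, and Python lists stand in for stream-layer appends:

```python
class RangePartition:
    """Toy LSM-style write path: log first, apply in memory, checkpoint."""
    def __init__(self):
        self.commit_log = []     # Commit Log Stream: mutations since last checkpoint
        self.memory_table = {}   # in-memory view of recent mutations
        self.row_data = []       # Row Data Stream: checkpoints

    def update(self, key, value):
        self.commit_log.append(("update", key, value))  # 1. log the operation
        self.memory_table[key] = value                  # 2. then apply it

    def checkpoint(self):
        """Persist the memory table as a checkpoint; the log can then be truncated."""
        self.row_data.append(dict(self.memory_table))
        self.commit_log.clear()

    def load(self):
        """On partition load: start from the latest checkpoint, then replay the log."""
        state = dict(self.row_data[-1]) if self.row_data else {}
        for _, key, value in self.commit_log:
            state[key] = value
        return state

rp = RangePartition()
rp.update("key1", "v1")
rp.checkpoint()
rp.update("key2", "v2")
print(rp.load())  # {'key1': 'v1', 'key2': 'v2'}
```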
Stream Layer
• Append-only distributed file system
• Streams are very large files
• Has a file-system-like directory namespace
• Stream operations: open, close, and delete streams; rename streams; concatenate streams together; append for writing; random reads
Stream Layer Concepts
• Block – the minimum unit of write/read; checksummed; up to N bytes (e.g. 4 MB)
• Extent – the unit of replication; a sequence of blocks; size limit (e.g. 1 GB); sealed or unsealed
• Stream – lives in a hierarchical namespace; an ordered list of pointers to extents; supports append and concatenate
[Diagram: the stream //foo/myfile.data is an ordered list of pointers to extents E1, E2, E3, and E4; each extent is a sequence of blocks; E1, E2, and E3 are sealed, while E4 is unsealed.]
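A toy model of the block/extent/stream hierarchy, using the example limits above (4 MB blocks, 1 GB extents); all names and methods are illustrative assumptions:

```python
BLOCK_LIMIT  = 4 * 1024**2   # "up to N bytes (e.g. 4MB)"
EXTENT_LIMIT = 1024**3       # "size limit (e.g. 1GB)"

class Extent:
    """Unit of replication: a sequence of blocks, sealed or unsealed."""
    def __init__(self):
        self.blocks, self.sealed = [], False

    def append(self, block):
        assert not self.sealed and len(block) <= BLOCK_LIMIT
        self.blocks.append(block)   # a block is the minimum unit of write/read

    def size(self):
        return sum(len(b) for b in self.blocks)

class Stream:
    """An ordered list of pointers to extents; only the last one is unsealed."""
    def __init__(self, name):
        self.name, self.extents = name, [Extent()]

    def append(self, block):
        last = self.extents[-1]
        if last.size() + len(block) > EXTENT_LIMIT:
            last.sealed = True              # seal the full extent...
            self.extents.append(Extent())   # ...and open a new one
            last = self.extents[-1]
        last.append(block)

    def concatenate(self, other):
        self.extents += other.extents       # streams can be concatenated

s = Stream("//foo/myfile.data")
s.append(b"block 1")
s.append(b"block 2")
print(len(s.extents), s.extents[0].size())  # 1 14
```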
Creating an Extent
[Diagram: the Partition Layer sends a Create Stream/Extent request to the Stream Master (SM, replicated via Paxos); the SM allocates the extent replica set across extent nodes, here EN1 as primary with EN2 and EN3 as secondaries.]
Replication Flow
[Diagram: the Partition Layer sends an append to the primary (EN1); the primary forwards it to the secondaries (EN2 and EN3), and once all replicas have written the data, the ack flows back to the Partition Layer.]
Providing Bit-wise Identical Replicas
• Want all replicas for an extent to be bit-wise the same, up to a committed length
• Want to store pointers from the partition layer index to an extent+offset
• Want to be able to read from any replica
• Replication flow (sketched below):
• All appends to an extent go to the Primary
• The Primary orders all incoming appends and picks the offset for the append in the extent
• The Primary then forwards the offset and data to the secondaries
• The Primary performs in-order acks back to clients for extent appends
• The Primary returns the offset of the append in the extent
• An extent offset can commit back to the client once all replicas have written that offset and all prior offsets have also been completely written
• This represents the committed length of the extent
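The flow above can be sketched as follows; replicas are modeled as simple byte counters, an illustrative assumption rather than the actual EN implementation:

```python
class SecondaryEN:
    def __init__(self):
        self.length = 0

    def write(self, offset, data):
        assert offset == self.length  # appends arrive in the primary's order
        self.length += len(data)

class PrimaryEN:
    def __init__(self, secondaries):
        self.length = 0
        self.secondaries = secondaries

    def append(self, data):
        offset = self.length          # the primary picks the offset
        self.length += len(data)
        for s in self.secondaries:    # forward offset + data to secondaries
            s.write(offset, data)
        # All replicas have now written this offset and all prior offsets,
        # so the append is committed; return its offset to the client.
        return offset

    def committed_length(self):
        return min([self.length] + [s.length for s in self.secondaries])

secondaries = [SecondaryEN(), SecondaryEN()]
primary = PrimaryEN(secondaries)
print(primary.append(b"abc"))       # 0 (offset of the first append)
print(primary.append(b"defg"))      # 3
print(primary.committed_length())   # 7
```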
Dealing with Write Failures
Failure during append:
1. Ack from primary lost when going back to partition layer
• A retry from the partition layer can cause multiple blocks to be appended (duplicate records)
2. Unresponsive/unreachable Extent Node (EN)
• The append will not be acked back to the partition layer
• Seal the failed extent
• Allocate a new extent and append immediately
[Diagram: the stream //foo/myfile.dat initially points to extents E1-E4; after the failure, E4 is sealed and a new extent E5 is allocated and appended, so the stream now points to E1, E2, E3, E4, E5.]
Extent Sealing (Scenario 1)
[Diagram: after a failed append, the Stream Master asks the extent nodes it can reach for their current length; both report 120, so the extent is sealed at 120 and the SM sends Seal Extent to those replicas.]
Extent Sealing (Scenario 1)
[Diagram: when the remaining replica becomes reachable again, it syncs with the SM and learns the extent was sealed at 120; its copy already matches that length.]
Extent Sealing (Scenario 2)
[Diagram: after a failed append, the SM asks the reachable replicas for their current length; one reports 120 and the other 100, so the extent is sealed at 100, the smallest reported length.]
Extent Sealing (Scenario 2)
[Diagram: when the unavailable replica later syncs with the SM, it learns the extent was sealed at 100 and adjusts its copy to match the sealed length of 100.]
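Both scenarios follow one rule, sketched below under the assumption (consistent with the numbers in the diagrams) that the SM seals at the smallest length reported by the reachable replicas:

```python
def seal_extent(reported_lengths):
    """reported_lengths: current lengths from the replicas the SM can reach."""
    return min(reported_lengths)

print(seal_extent([120, 120]))  # Scenario 1: sealed at 120
print(seal_extent([120, 100]))  # Scenario 2: sealed at 100
# A replica that was unreachable during sealing later syncs with the SM
# and truncates (or confirms) its copy to the sealed length.
```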
Providing Consistency for Data Streams
[Diagram: a network partition leaves the partition server (PS) able to talk to EN3 while the SM cannot; EN1 is the primary, EN2 and EN3 are secondaries.]
• For data streams (the row and blob data streams), the Partition Layer only reads from offsets returned by successful appends, which are committed on all replicas
• Such an offset is valid on any replica, so it is safe to read from EN3
Providing Consistency for Log Streams
[Diagram: the same network partition: the PS can talk to EN3, but the SM cannot.]
• Logs (the commit and metadata log streams) are read on partition load
• Check the commit length first: only read from an unsealed replica if all replicas have the same commit length
• Here the SM cannot reach EN3 to check it, so the extent is sealed and the sealed replicas on EN1 and EN2 are used for loading (a sketch follows)
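A sketch of this load-time decision; the function and its inputs are hypothetical, with an unreachable replica modeled as None:

```python
def replicas_to_load_from(commit_lengths):
    """commit_lengths: EN name -> commit length, or None if the SM
    could not reach that EN to check and seal its replica."""
    reachable = {en: n for en, n in commit_lengths.items() if n is not None}
    if len(reachable) == len(commit_lengths) and len(set(reachable.values())) == 1:
        return sorted(reachable)   # all replicas agree: any may be read
    # Otherwise: seal the extent and load only from replicas sealed
    # at the chosen (smallest reachable) commit length.
    sealed_length = min(reachable.values())
    return sorted(en for en, n in reachable.items() if n == sealed_length)

# The network partition from the diagram: the SM cannot check EN3.
print(replicas_to_load_from({"EN1": 100, "EN2": 100, "EN3": None}))
# ['EN1', 'EN2'] -- use EN1, EN2 for loading
```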
Summary
• Highly available cloud storage with strong consistency
• Scalable data abstractions to build your applications:
• Blobs – files and large objects
• Tables – massively scalable structured storage
• Queues – reliable delivery of messages
More information at: http://www.sigops.org/sosp/sosp11/current/2011-Cascais/11-calder-online.pdf