Distributed Systems Tutorial 9 – Windows Azure Storage
Written by Alex Libov, based on the SOSP 2011 presentation
Winter semester, 2011-2012
Windows Azure Storage (WAS)
• A scalable cloud storage system
• In production since November 2008
• Used inside Microsoft for applications such as social networking search, serving video, music and game content, managing medical records, and more
• Thousands of customers outside Microsoft; anyone can sign up over the Internet to use the system
WAS Abstractions
• Blobs – file system in the cloud
• Tables – massively scalable structured storage
• Queues – reliable storage and delivery of messages
• A common usage pattern: incoming and outgoing data is shipped via Blobs, Queues provide the overall workflow for processing the Blobs, and intermediate service state and final results are kept in Tables or Blobs (see the sketch below)
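To make that pattern concrete, here is a minimal, self-contained Python sketch. The names blob_store, queue, and table are toy in-memory stand-ins, not the WAS client API, and the processing step is invented for illustration:

```python
# Toy stand-ins for the three WAS abstractions (assumptions, not the real API).
blob_store, table = {}, {}
queue = []

def ingest(name, data):
    """Incoming data is shipped via a Blob; a Queue message drives the workflow."""
    blob_store[name] = data     # 1. upload the raw data as a blob
    queue.append(name)          # 2. enqueue a work item referencing it

def worker():
    """A worker dequeues a message, processes the blob, and stores the result."""
    while queue:
        name = queue.pop(0)                 # 3. receive the next work item
        result = blob_store[name].upper()   # 4. process the blob (toy step)
        table[name] = result                # 5. keep the final result in a Table

ingest("video-0001", b"raw bytes")
worker()
print(table)  # {'video-0001': b'RAW BYTES'}
```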
Design Goals
• Highly available with strong consistency – provide access to data in the face of failures/partitioning
• Durability – replicate data several times within and across data centers
• Scalability – need to scale to exabytes and beyond; provide a global namespace to access data around the world; automatically load balance data to meet peak traffic demands
Global Partitioned Namespace
• http(s)://AccountName.<service>.core.windows.net/PartitionName/ObjectName
• <service> can be blob, table, or queue
• AccountName is the customer-selected account name for accessing storage. The account name specifies the data center where the data is stored. An application may use multiple AccountNames to store its data across different locations.
• PartitionName locates the data once a request reaches the storage cluster. When a PartitionName holds many objects, the ObjectName identifies individual objects within that partition.
• The system supports atomic transactions across objects with the same PartitionName value.
• The ObjectName is optional since, for some types of data, the PartitionName uniquely identifies the object within the account.
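As a concrete illustration of the naming scheme, the following sketch builds such URLs; the account, partition, and object names are hypothetical:

```python
def was_url(account, service, partition, obj=""):
    """Build a WAS URL. <service> is 'blob', 'table' or 'queue';
    the ObjectName is optional when the PartitionName alone identifies the object."""
    url = f"https://{account}.{service}.core.windows.net/{partition}"
    return f"{url}/{obj}" if obj else url

print(was_url("myaccount", "blob", "photos", "sunset.jpg"))
# https://myaccount.blob.core.windows.net/photos/sunset.jpg
print(was_url("myaccount", "queue", "jobs"))
# https://myaccount.queue.core.windows.net/jobs
```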
Storage Stamps
• A storage stamp is a cluster of N racks of storage nodes.
• Each rack is built out as a separate fault domain with redundant networking and power.
• Clusters typically range from 10 to 20 racks with 18 disk-heavy storage nodes per rack.
• The first-generation storage stamps hold approximately 2 PB of raw storage each.
• The next-generation stamps hold up to 30 PB of raw storage each.
High Level Architecture
[Diagram: a Location Service directs data access to storage stamps through load balancers (LB). Each storage stamp contains Front-Ends, a Partition Layer, and a Stream Layer with intra-stamp replication; inter-stamp (geo) replication copies data between stamps.]
Access blob storage via the URL: http://<account>.blob.core.windows.net/
Storage Stamp Architecture – Stream Layer
• Append-only distributed file system
• All data from the Partition Layer is stored into files (extents) in the Stream Layer
• An extent is replicated 3 times across different fault and upgrade domains, with random selection for where to place replicas
• All stored data is checksummed, and the checksum is verified on every client read (a sketch follows)
• Data is re-replicated on disk/node/rack failure or checksum mismatch
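A minimal sketch of the checksum-on-read idea; CRC-32 is an assumption here, since the slides do not name the checksum algorithm:

```python
import zlib

def store_block(data):
    """Every stored block carries a checksum."""
    return data, zlib.crc32(data)

def read_block(data, checksum):
    """The checksum is verified on every read; in WAS a mismatch
    triggers re-replication from a healthy replica."""
    if zlib.crc32(data) != checksum:
        raise IOError("checksum mismatch: block is corrupt")
    return data

block, crc = store_block(b"some extent data")
assert read_block(block, crc) == b"some extent data"
```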
[Diagram: the Stream Layer (a distributed file system) consists of extent nodes (EN) and stream masters (M) coordinated via Paxos.]
Storage Stamp Architecture – Partition Layer
• Provides transaction semantics and strong consistency for Blobs, Tables, and Queues
• Stores and reads objects to/from extents in the Stream Layer
• Provides inter-stamp (geo) replication by shipping logs to other stamps
• Scalable object index via partitioning
[Diagram: the Partition Layer consists of multiple partition servers coordinated by a partition master and a lock service.]
Storage Stamp Architecture – Front End Layer
• Stateless servers
• Authentication + authorization
• Request routing
Storage Stamp Architecture
[Diagram: an incoming write request arrives at one of the stateless front-ends (FE) in the Front-End Layer, which routes it to the owning partition server in the Partition Layer (partition servers, a partition master, and a lock service); the partition server persists the write in the Stream Layer (extent nodes and Paxos-replicated masters M), and the ack flows back to the client.]
Partition Layer – Scalable Object Index
• 100s of billions of blobs, entities, and messages across all accounts can be stored in a single stamp
• Need to efficiently enumerate, query, get, and update them
• Traffic patterns can be highly dynamic: hot objects, peak load, traffic bursts, etc.
• Need a scalable index for the objects that can spread the index across 100s of servers, dynamically load balance, and dynamically change which servers serve each part of the index based on load
Scalable Object Index via Partitioning
• The Partition Layer maintains an internal Object Index Table for each data abstraction:
• Blob Index – contains all blob objects for all accounts in a stamp
• Table Entity Index – contains all table entities for all accounts in a stamp
• Queue Message Index – contains all messages for all accounts in a stamp
• Scalability is provided for each Object Index:
• Monitor load to each part of the index to determine hot spots
• The index is dynamically split into thousands of Index RangePartitions based on load
• Index RangePartitions are automatically load balanced across servers to quickly adapt to changes in load
[Diagram: the Blob Index is one logical table keyed by (AccountName, ContainerName, BlobName), spanning the full key space from (aaaa, aaaa, aaaaa) to (zzzz, zzzz, zzzzz).]
• Split index into RangePartitions based on load
• Split at PartitionKey boundaries
• PartitionMap tracks Index RangePartition assignment to partition servers
• Front-End caches the PartitionMap to route user requests
• Each part of the index is assigned to only one Partition Server at a time
[Diagram: within a storage stamp, the Blob Index is split across partition servers under a partition master; for example, one partition server serves the rows from (harry, pictures, sunset) through (richard, videos, soccer), and another serves (richard, videos, tennis) through (zzzz, zzzz, zzzzz).]
Partition Layer – Index Range Partitioning
[Diagram: the Blob Index is split at PartitionKey boundaries into three RangePartitions: A-H served by PS1, H'-R by PS2, and R'-Z by PS3. The Front-End server caches the Partition Map (A-H: PS1, H'-R: PS2, R'-Z: PS3) and routes each request to the single partition server owning the range; for example, the key (harry, pictures, sunrise) falls in A-H and is routed to PS1.]
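A minimal sketch of how a Front-End might route a request using its cached Partition Map. The ranges follow the diagram above, while the lookup code and the representation of the split points H' and R' as "I" and "S" are illustrative assumptions:

```python
import bisect

# Cached Partition Map: start of each key range and the server owning it.
lower_bounds = ["A", "I", "S"]          # A-H, H'-R, R'-Z (H'/S' approximated)
servers      = ["PS1", "PS2", "PS3"]

def route(partition_key):
    """Find the single partition server serving this key's range."""
    i = bisect.bisect_right(lower_bounds, partition_key.upper()) - 1
    return servers[i]

print(route("harry"))    # PS1 (falls in A-H)
print(route("richard"))  # PS2 (falls in H'-R)
print(route("tennis"))   # PS3 (falls in R'-Z)
```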
Partition Layer – RangePartition
• A RangePartition uses a Log-Structured Merge-Tree to maintain its persistent data.
• A RangePartition consists of its own set of streams in the Stream Layer, and the streams belong solely to that RangePartition:
• Metadata Stream – the root stream for a RangePartition. The PM assigns a partition to a PS by providing the name of the RangePartition's metadata stream.
• Commit Log Stream – a commit log used to store the recent insert, update, and delete operations applied to the RangePartition since the last checkpoint was generated for the RangePartition.
• Row Data Stream – stores the checkpoint row data and index for the RangePartition.
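The interplay of these streams can be sketched as a toy log-structured write path. The class and method names are hypothetical, and Python lists stand in for stream-layer appends:

```python
class RangePartition:
    """Toy LSM-style write path: log first, apply in memory, checkpoint."""
    def __init__(self):
        self.commit_log = []     # Commit Log Stream: mutations since last checkpoint
        self.memory_table = {}   # in-memory view of recent mutations
        self.row_data = []       # Row Data Stream: checkpoints

    def update(self, key, value):
        self.commit_log.append(("update", key, value))  # 1. log the operation
        self.memory_table[key] = value                  # 2. then apply it

    def checkpoint(self):
        """Persist the memory table as a checkpoint; the log can then be truncated."""
        self.row_data.append(dict(self.memory_table))
        self.commit_log.clear()

    def load(self):
        """On partition load: start from the latest checkpoint, then replay the log."""
        state = dict(self.row_data[-1]) if self.row_data else {}
        for _, key, value in self.commit_log:
            state[key] = value
        return state

rp = RangePartition()
rp.update("key1", "v1")
rp.checkpoint()
rp.update("key2", "v2")
print(rp.load())  # {'key1': 'v1', 'key2': 'v2'}
```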
Stream Layer
• Append-only distributed file system
• Streams are very large files
• Has a file-system-like directory namespace
• Stream operations: open, close, and delete streams; rename streams; concatenate streams together; append for writing; random reads
Stream Layer Concepts
• Block – the minimum unit of write/read; checksummed; up to N bytes (e.g. 4 MB)
• Extent – the unit of replication; a sequence of blocks; size limit (e.g. 1 GB); sealed or unsealed
• Stream – lives in a hierarchical namespace; an ordered list of pointers to extents; supports append and concatenate
[Diagram: the stream //foo/myfile.data is an ordered list of pointers to extents E1, E2, E3, and E4; each extent is a sequence of blocks; E1, E2, and E3 are sealed, while E4 is unsealed.]
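A toy model of the block/extent/stream hierarchy, using the example limits above (4 MB blocks, 1 GB extents); all names and methods are illustrative assumptions:

```python
BLOCK_LIMIT  = 4 * 1024**2   # "up to N bytes (e.g. 4MB)"
EXTENT_LIMIT = 1024**3       # "size limit (e.g. 1GB)"

class Extent:
    """Unit of replication: a sequence of blocks, sealed or unsealed."""
    def __init__(self):
        self.blocks, self.sealed = [], False

    def append(self, block):
        assert not self.sealed and len(block) <= BLOCK_LIMIT
        self.blocks.append(block)   # a block is the minimum unit of write/read

    def size(self):
        return sum(len(b) for b in self.blocks)

class Stream:
    """An ordered list of pointers to extents; only the last one is unsealed."""
    def __init__(self, name):
        self.name, self.extents = name, [Extent()]

    def append(self, block):
        last = self.extents[-1]
        if last.size() + len(block) > EXTENT_LIMIT:
            last.sealed = True              # seal the full extent...
            self.extents.append(Extent())   # ...and open a new one
            last = self.extents[-1]
        last.append(block)

    def concatenate(self, other):
        self.extents += other.extents       # streams can be concatenated

s = Stream("//foo/myfile.data")
s.append(b"block 1")
s.append(b"block 2")
print(len(s.extents), s.extents[0].size())  # 1 14
```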
Creating an Extent
[Diagram: the Partition Layer sends a Create Stream/Extent request to the Stream Master (SM, replicated via Paxos); the SM allocates the extent replica set across extent nodes, here EN1 as primary with EN2 and EN3 as secondaries.]
Replication Flow
[Diagram: the Partition Layer sends an append to the primary (EN1); the primary forwards it to the secondaries (EN2 and EN3), and once all replicas have written the data, the ack flows back to the Partition Layer.]
Providing Bit-wise Identical Replicas
• Want all replicas for an extent to be bit-wise the same, up to a committed length
• Want to store pointers from the partition layer index to an extent+offset
• Want to be able to read from any replica
• Replication flow (sketched below):
• All appends to an extent go to the Primary
• The Primary orders all incoming appends and picks the offset for the append in the extent
• The Primary then forwards the offset and data to the secondaries
• The Primary performs in-order acks back to clients for extent appends
• The Primary returns the offset of the append in the extent
• An extent offset can commit back to the client once all replicas have written that offset and all prior offsets have also been completely written
• This represents the committed length of the extent
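The flow above can be sketched as follows; replicas are modeled as simple byte counters, an illustrative assumption rather than the actual EN implementation:

```python
class SecondaryEN:
    def __init__(self):
        self.length = 0

    def write(self, offset, data):
        assert offset == self.length  # appends arrive in the primary's order
        self.length += len(data)

class PrimaryEN:
    def __init__(self, secondaries):
        self.length = 0
        self.secondaries = secondaries

    def append(self, data):
        offset = self.length          # the primary picks the offset
        self.length += len(data)
        for s in self.secondaries:    # forward offset + data to secondaries
            s.write(offset, data)
        # All replicas have now written this offset and all prior offsets,
        # so the append is committed; return its offset to the client.
        return offset

    def committed_length(self):
        return min([self.length] + [s.length for s in self.secondaries])

secondaries = [SecondaryEN(), SecondaryEN()]
primary = PrimaryEN(secondaries)
print(primary.append(b"abc"))       # 0 (offset of the first append)
print(primary.append(b"defg"))      # 3
print(primary.committed_length())   # 7
```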
Dealing with Write Failures
Failure during append:
1. Ack from primary lost when going back to partition layer
• A retry from the partition layer can cause multiple blocks to be appended (duplicate records)
2. Unresponsive/unreachable Extent Node (EN)
• The append will not be acked back to the partition layer
• Seal the failed extent
• Allocate a new extent and append immediately
[Diagram: the stream //foo/myfile.dat initially points to extents E1-E4; after the failure, E4 is sealed and a new extent E5 is allocated and appended, so the stream now points to E1, E2, E3, E4, E5.]
Extent Sealing (Scenario 1)
[Diagram: after a failed append, the Stream Master asks the extent nodes it can reach for their current length; both report 120, so the extent is sealed at 120 and the SM sends Seal Extent to those replicas.]
Extent Sealing (Scenario 1)
[Diagram: when the remaining replica becomes reachable again, it syncs with the SM and learns the extent was sealed at 120; its copy already matches that length.]
Extent Sealing (Scenario 2)
[Diagram: after a failed append, the SM asks the reachable replicas for their current length; one reports 120 and the other 100, so the extent is sealed at 100, the smallest reported length.]
Extent Sealing (Scenario 2)
[Diagram: when the unavailable replica later syncs with the SM, it learns the extent was sealed at 100 and adjusts its copy to match the sealed length of 100.]
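Both scenarios follow one rule, sketched below under the assumption (consistent with the numbers in the diagrams) that the SM seals at the smallest length reported by the reachable replicas:

```python
def seal_extent(reported_lengths):
    """reported_lengths: current lengths from the replicas the SM can reach."""
    return min(reported_lengths)

print(seal_extent([120, 120]))  # Scenario 1: sealed at 120
print(seal_extent([120, 100]))  # Scenario 2: sealed at 100
# A replica that was unreachable during sealing later syncs with the SM
# and truncates (or confirms) its copy to the sealed length.
```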
Providing Consistency for Data Streams
[Diagram: a network partition leaves the partition server (PS) able to talk to EN3 while the SM cannot; EN1 is the primary, EN2 and EN3 are secondaries.]
• For data streams (the row and blob data streams), the Partition Layer only reads from offsets returned by successful appends, which are committed on all replicas
• Such an offset is valid on any replica, so it is safe to read from EN3
Providing Consistency for Log Streams
[Diagram: the same network partition: the PS can talk to EN3, but the SM cannot.]
• Logs (the commit and metadata log streams) are read on partition load
• Check the commit length first: only read from an unsealed replica if all replicas have the same commit length
• Here the SM cannot reach EN3 to check it, so the extent is sealed and the sealed replicas on EN1 and EN2 are used for loading (a sketch follows)
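A sketch of this load-time decision; the function and its inputs are hypothetical, with an unreachable replica modeled as None:

```python
def replicas_to_load_from(commit_lengths):
    """commit_lengths: EN name -> commit length, or None if the SM
    could not reach that EN to check and seal its replica."""
    reachable = {en: n for en, n in commit_lengths.items() if n is not None}
    if len(reachable) == len(commit_lengths) and len(set(reachable.values())) == 1:
        return sorted(reachable)   # all replicas agree: any may be read
    # Otherwise: seal the extent and load only from replicas sealed
    # at the chosen (smallest reachable) commit length.
    sealed_length = min(reachable.values())
    return sorted(en for en, n in reachable.items() if n == sealed_length)

# The network partition from the diagram: the SM cannot check EN3.
print(replicas_to_load_from({"EN1": 100, "EN2": 100, "EN3": None}))
# ['EN1', 'EN2'] -- use EN1, EN2 for loading
```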
Summary
• Highly available cloud storage with strong consistency
• Scalable data abstractions to build your applications:
• Blobs – files and large objects
• Tables – massively scalable structured storage
• Queues – reliable delivery of messages
More information at: http://www.sigops.org/sosp/sosp11/current/2011-Cascais/11-calder-online.pdf