Transcript
Page 1: Architecting for the cloud storage misc topics

© Matthew Bass 2013

Architecting for the Cloud

Storage in the Cloud

Len and Matt Bass

Page 2: Architecting for the cloud storage misc topics


Outline

This section will focus on storage in the cloud

• We will first look at relational databases

• What solutions emerged for the cloud

• Storage options for NoSQL databases

• Architecture of typical NoSQL databases


Page 4: Architecting for the cloud storage misc topics


History

• The relational data model was created in the late 1960s

• In the 1980s relational databases became commercially successful

– Replacing hierarchical and network databases

• Relational databases remain the dominant database model today

Page 5: Architecting for the cloud storage misc topics


Relational Databases

• The relational model is a mathematical model for describing the structure of data

– We will not go into this model

• Let’s quickly review first and second normal form, however

Page 6: Architecting for the cloud storage misc topics


Example

Imagine you sell car parts

– You have warehouses

– You have part inventories

– You have orders

What’s the problem?

Warehouse | Warehouse Address | Part

Page 7: Architecting for the cloud storage misc topics


What Happens Here?

Warehouse 1 | 123 Main Street | Transmission, Steering wheel, Brake pads, …

What about here?

Warehouse 1 | 123 Main Street | Transmission
Warehouse 1 | 123 Main Street | Steering wheel
Warehouse 1 | 123 Main Street | Brake Pads

Page 8: Architecting for the cloud storage misc topics


The Solution …

Warehouse Table:   Warehouse ID | Warehouse Address

Parts Table:       Part ID | Part Description

Relations Table:   Warehouse ID | Part ID
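The normalized design above can be sketched with Python's built-in sqlite3 module; the join reassembles the one-row-per-part view from the three tables (column names here are illustrative, chosen to match the slides):

```python
import sqlite3

# In-memory database with the three normalized tables from the slide.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE warehouse (warehouse_id INTEGER PRIMARY KEY, address TEXT);
CREATE TABLE part      (part_id INTEGER PRIMARY KEY, description TEXT);
CREATE TABLE relations (warehouse_id INTEGER REFERENCES warehouse,
                        part_id      INTEGER REFERENCES part);
""")
conn.execute("INSERT INTO warehouse VALUES (1, '123 Main Street')")
conn.executemany("INSERT INTO part VALUES (?, ?)",
                 [(10, 'Transmission'), (11, 'Steering wheel'), (12, 'Brake pads')])
conn.executemany("INSERT INTO relations VALUES (1, ?)", [(10,), (11,), (12,)])

# The join reconstructs the denormalized "one row per warehouse/part" view.
rows = conn.execute("""
    SELECT w.address, p.description
    FROM relations r
    JOIN warehouse w ON w.warehouse_id = r.warehouse_id
    JOIN part p      ON p.part_id      = r.part_id
    ORDER BY p.part_id
""").fetchall()
```

Note that the join is exactly the overhead mentioned on the next slide: the flexible structure is paid for at query time.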

Page 9: Architecting for the cloud storage misc topics


This Works

• We have a standard language for querying the data (SQL)

• We can now extract data in a very flexible way

• We can read, write, update, and delete data pretty efficiently

– Joins add some overhead

Page 10: Architecting for the cloud storage misc topics


Moreover We Have RDBMS

• We have robust software systems that manage the data

• These systems provide many advanced features including:

– Behavior

– Concurrency control

– Transactions

– Referential integrity

– Optimization

Page 11: Architecting for the cloud storage misc topics


Behavior

• DBMSs provide mechanisms for building behavior into the database

• These include mechanisms like

– Stored procedures

– PL/SQL

• This allows you to simplify the application logic

Page 12: Architecting for the cloud storage misc topics


Concurrency Control

• DBMSs support access by multiple users

• They will lock tables during updates to ensure that writes are complete prior to reads

• They will manage multiple updates to ensure integrity and consistency of data

Page 13: Architecting for the cloud storage misc topics


Transactions

• Transactions are supported

• This ensures that updates either happen completely or not at all

– Often an atomic update is a set of updates to individual records across multiple tables

– If only some of these updates happen the integrity of the overall database is compromised

Page 14: Architecting for the cloud storage misc topics


Referential Integrity

• Ensures that references from one table refer to a valid entry in another table

Page 15: Architecting for the cloud storage misc topics


Optimization

• Database systems will perform a variety of actions to optimize based on usage patterns

• They will

– Create indexes

– Create virtual tables

– Cache values

– …

Page 16: Architecting for the cloud storage misc topics


Impedance Mismatch

• There is, however, a mismatch

– We need to translate between the relational structure and the organizational needs

• Think about the reports needed for the warehouse

– Purchase orders

– History of orders for customer

– Parts inventory per warehouse

– …

• This means we will need lots of Joins

– This isn’t too much of an issue until we scale …

Page 17: Architecting for the cloud storage misc topics


Speaking of Scaling …

Do relational databases scale?

Page 18: Architecting for the cloud storage misc topics


Internet Scale Is Difficult

• We can “shard” the data

– Split the data across the machines

• This is very difficult to do efficiently

• This makes joins more costly

– Remember joins are common

• This also has a practical limit

– At some point you will need to replicate the data

• The database becomes slow …

Page 19: Architecting for the cloud storage misc topics


Change is Needed

• For this reason internet scale applications moved to distributed file systems

– Google was the first

– Many others followed

• This allowed the data to be partitioned across nodes more efficiently

– We’ll talk about this in a minute

Page 20: Architecting for the cloud storage misc topics


Outline

This section will focus on storage in the cloud

• We will first look at relational databases

• What solutions emerged for the cloud

• Storage options for NoSQL databases

• Architecture of typical NoSQL databases

Page 21: Architecting for the cloud storage misc topics


Needs

• Let’s explore the needs in a bit more detail

• The file system needed to:

– Be fault-tolerant

– Handle large files

– Accommodate extremely large data sets

– Accommodate many concurrent clients

– Be flexible enough to handle multiple kinds of applications

Page 22: Architecting for the cloud storage misc topics


Fault-Tolerance

• Due to the scale of the systems they were deployed on hundreds or thousands of servers

• This meant that at any given time some of these nodes would not be operational

• Problems from application bugs, operating system bugs, human error, hardware failures, and network failures are common

Page 23: Architecting for the cloud storage misc topics


Large Files/Large Data Sets

• It’s common for files in these systems to be multiple GBs

• Each file could have millions of objects

– E.g. many individual web pages

• The data sets grow quickly

• The data sets can be multiple terabytes or petabytes

Page 24: Architecting for the cloud storage misc topics


Many Concurrent Clients

• The system needs to efficiently handle multiple clients

• These clients could be reading or writing

Page 25: Architecting for the cloud storage misc topics


Multiple Applications

• Additionally the system needs to be flexible enough to handle multiple applications

• Applications have a variety of needs

– Long streaming reads

– Throughput oriented operations

– Low latency reads

– …

Page 26: Architecting for the cloud storage misc topics


Addressing Needs

• There were a number of things that were done to address the needs

• One primary decision was the de-normalization of the data

– We’ll talk about this more in the next slides

• Other decisions include (we’ll talk about these in a bit)

– Block size

– Replication strategy

– Data consistency checks

– API and capability of the system

Page 27: Architecting for the cloud storage misc topics


De-Normalizing Data

• Remember what was difficult with relational models?

– Joins across nodes are expensive

– As is synchronization for replicated data

• If the data is de-normalized it can be “localized”

– Data that will likely be accessed together can be collocated

– In other words store it as you will use it

Page 28: Architecting for the cloud storage misc topics


Example

• Imagine a Purchase Order

• Typically this would contain

– Customer information

– Product information

– Pricing

Page 29: Architecting for the cloud storage misc topics


Relational Purchase Order

• The data would be split across multiple tables such as

– Customer

– Product Catalog

– Inventory

– …

• If the data set is large enough the data would be distributed

Page 30: Architecting for the cloud storage misc topics


De-Normalized Purchase Order

• In a file system without a relational model the data doesn’t need to be split up

• The purchase order data would be co-located

• If the data set was very large purchase orders would still be co-located

– Different purchase orders could be distributed

– A single purchase order, however, would not be
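A minimal sketch of this co-location idea: each purchase order is stored whole as one blob under its PO number, so a single lookup returns everything. The function and field names below are illustrative, not any real store's API:

```python
import json

# Toy key-value store: PO number -> serialized purchase order blob.
po_store = {}

def put_order(po_number, order):
    # The store sees only an opaque blob; structure lives in the application.
    po_store[po_number] = json.dumps(order)

def get_order(po_number):
    # One lookup returns the entire purchase order, however large.
    return json.loads(po_store[po_number])

put_order(1, {
    "customer": {"id": 8790, "name": "Matt"},
    "line_items": [{"product_id": 2, "quantity": 2},
                   {"product_id": 34, "quantity": 1}],
})
order = get_order(1)
```

Different PO numbers could live on different nodes, but any single order is never split.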

Page 31: Architecting for the cloud storage misc topics


Relational vs NoSQL

Relational Model

NoSQL

Customers Product Catalog Inventory

Orders 1 - 100 Orders 101 - 200 Orders 201 - 300

Page 32: Architecting for the cloud storage misc topics


What Does This Mean?

• Data has no explicit structure (not entirely true … but we’ll talk about this)

– Data is largely treated as a blob

• This has several implications

– You can change the nature of the data as needed

– You can collocate the data as desired

– The application now has increased burden

Page 33: Architecting for the cloud storage misc topics


Back to Purchase Order

PO Number PO

1 Contents of PO1 …

2 Contents of PO2 …

3 Contents of PO3 …

4 Contents of PO4 …

Key Value

Page 34: Architecting for the cloud storage misc topics


Retrieving Data

• To retrieve the purchase order data you provide the reference key

• The file system routes you to the appropriate node (more later)

• The single node returns the entire purchase order

• This can happen quickly … regardless of how many purchase orders you have

Do you see any potential issues?

Page 35: Architecting for the cloud storage misc topics


Data Locality

• First, being able to retrieve the data quickly depends on the location of the data

• If the data is distributed it’s difficult to retrieve quickly

– Imagine you want to get the number of times a customer ordered product X

– More on this later

• While there is no explicit structure, there is an implicit structure

– The design of this structure is important

Page 36: Architecting for the cloud storage misc topics


Data Processing

• As the file system treats the data as unstructured it’s not able to preprocess the data

• Getting an ordered list, for example, has to be done in the application

• The validity of the data needs to be checked by the application

Page 37: Architecting for the cloud storage misc topics


Updating Data

• What happens if you want to change the data?

– Imagine trying to update the customer’s address

• Updates tend to be difficult

• In this environment you tend to not update data

– Instead you will append the new data

– You can establish rules for the lifetime of the data
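A minimal append-only sketch of this: an "update" is just another timestamped append, and the newest version is treated as current. The names and the lifetime rule are illustrative assumptions:

```python
# Append-only history of a customer's address: (timestamp, value) pairs.
history = []

def append_value(value, ts):
    # Never overwrite in place; just append a new version.
    history.append((ts, value))

def current_value():
    # The version with the newest timestamp wins.
    return max(history)[1]

def expire_before(ts):
    # A simple lifetime rule: drop versions older than ts,
    # but always keep the current one.
    keep_newest = max(history)
    history[:] = [v for v in history if v[0] >= ts] or [keep_newest]

append_value("123 Main Street", ts=1)
append_value("456 Oak Avenue", ts=2)   # the "update" is an append
```
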

Page 38: Architecting for the cloud storage misc topics


Other Issues

• Things like data integrity are not managed by the file system

• You don’t (typically) have full support for transactions

• There is no notion of referential integrity

• There is support for some concurrent access, but with built in assumptions

• Consistency is not typically guaranteed (more later)

Page 39: Architecting for the cloud storage misc topics


A New Tool in Your Toolbox

• You’ve been given a new kind of hammer

– Remember that everything is not a nail

– In other words these kinds of data stores are good for some things … and not others

• Today there are many different flavors of these data stores

– Both in terms of structures and features

Page 40: Architecting for the cloud storage misc topics


Multiple Data Structures

• Today many options exist

– Key value stores

– Document centric data stores

– Column databases

• We’ve also started to see old models reemerge e.g.

– Hierarchical data stores

Page 41: Architecting for the cloud storage misc topics


Key Value Databases

• Basically you have a key that maps to some “value”

• This value is just a blob

– The database doesn’t care about the content or structure of this value

• The operations are quite simple, e.g.

– Read (get the value given a key)

– Insert (inserts a key/value pair)

– Remove (removes the value associated with a given key)
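The three operations above can be sketched as a toy in-memory store; the class and method names are illustrative, not any particular product's API:

```python
class KeyValueStore:
    """Toy in-memory key-value store with the three operations from the slide."""

    def __init__(self):
        self._data = {}

    def read(self, key):
        return self._data[key]      # get the value given a key

    def insert(self, key, value):
        self._data[key] = value     # insert a key/value pair

    def remove(self, key):
        del self._data[key]         # remove the value associated with a key

kv = KeyValueStore()
kv.insert("session:42", {"user": "matt", "cart": ["brake pads"]})
session = kv.read("session:42")     # the store never inspects this value
kv.remove("session:42")
```
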

Page 42: Architecting for the cloud storage misc topics


Key Value Databases II

• There is no real schema

– Basically you query a key and get the value

– This can be useful when accessing things like user sessions, shopping carts, …

• Concurrency

– Concurrency only makes sense at the level of a single key

– Can have either optimistic writes or eventual consistency – we’ll talk about this more later

• Replication

– Can be handled by the client or the data store – more about this later

Page 43: Architecting for the cloud storage misc topics


Uses

• Very fast reads

• Scales well

• Good for quick access of data without complex querying needs

– The classic example is for session management

• Not good for

– Situations where data integrity is critical

– Data with complex querying needs

Page 44: Architecting for the cloud storage misc topics


Document Centric Databases

• Stores a “document”

ID : 123
Customer : 8790
Line Items : [ {product id: 2, quantity: 2},
               {product id: 34, quantity: 1} ]

Page 45: Architecting for the cloud storage misc topics


Document Centric

• No schema

• You can query the data store

– Can return all or part of the document

– Typically query the store by using the id (or key)

• As with key value, discussing concurrency only makes sense at the level of a single document

Page 46: Architecting for the cloud storage misc topics


Advantages

• A document centric data store is similar in many ways to a key/value data store

• It does, however, allow for more complex queries

– For example you can query using a non-primary key
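As a sketch of querying by a non-primary key, plain Python dicts can stand in for documents; the `find` helper and its field names are hypothetical:

```python
# Toy document store: a list of dicts stands in for stored documents.
documents = [
    {"id": 123, "customer": 8790,
     "line_items": [{"product_id": 2, "quantity": 2},
                    {"product_id": 34, "quantity": 1}]},
    {"id": 124, "customer": 5555,
     "line_items": [{"product_id": 2, "quantity": 1}]},
]

def find(docs, **criteria):
    """Return documents whose fields match all given criteria."""
    return [d for d in docs
            if all(d.get(field) == value for field, value in criteria.items())]

# Query by a non-primary key: all orders for customer 8790.
orders_for_8790 = find(documents, customer=8790)
```

A key-value store could only answer this by fetching every blob and filtering in the application.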

Page 47: Architecting for the cloud storage misc topics


Column Databases

• Row key maps to “column families”

Row key: 1234

Profile (column family):
  Name            | Matt
  Billing Address | 123 Main st
  Phone           | 412 770-4145

Orders (column family):
  Order Data …
  Order Data …
  Order Data …

Page 48: Architecting for the cloud storage misc topics


Column Databases - Rows

• Rows are grouped together to form units of load balancing

– Row keys are ordered and grouped together by locality

– In this example consecutive rows would be from the same domain (CNN)

• Concurrency makes sense at the level of a row

Key         | Contents  | Anchor:cnnsi.com | Anchor:my.look.ca
com.cnn.www | Html page | …                | …

Page 49: Architecting for the cloud storage misc topics


Column Databases – Columns

• Columns are grouped into “column families”

• Column families form the unit of access control

– Clients may or may not have access to all column families

• Column keys can be used to query data

Page 50: Architecting for the cloud storage misc topics


Column Databases – Timestamps

• The cells in a column database can be versioned with a timestamp

• The cells can contain multiple versions

– The application can typically specify how many versions to keep or when a version times out

• You can either use a client-generated timestamp or one generated by the storage node
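A toy model of versioned cells, assuming the application keeps a fixed number of versions per cell (the names, the version limit, and the lookup rule are illustrative):

```python
# Each (row, family, column) cell keeps up to MAX_VERSIONS timestamped values.
MAX_VERSIONS = 3
cells = {}

def put(row, family, column, value, timestamp):
    versions = cells.setdefault((row, family, column), [])
    versions.append((timestamp, value))
    versions.sort(reverse=True)      # newest first
    del versions[MAX_VERSIONS:]      # expire the oldest versions

def get(row, family, column, timestamp=None):
    versions = cells[(row, family, column)]
    if timestamp is None:
        return versions[0][1]        # latest version
    # Otherwise, the newest version at or before the given timestamp.
    return next(v for t, v in versions if t <= timestamp)

for ts in range(1, 5):
    put("1234", "Profile", "Phone", f"number-v{ts}", ts)
```
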

Page 51: Architecting for the cloud storage misc topics


Examples

Document Centric

• MongoDB

• CouchDB

• RavenDB

Key Value

• DynamoDB

• Azure Table

• Redis

• Riak

Column

• HBase

• Cassandra

• Hypertable

• SimpleDB

Page 52: Architecting for the cloud storage misc topics


NoSQL vs RDBMS

• Explicit vs Implicit Schema

– NoSQL databases do have an implicit schema – at least in most cases

• Distribution of data

• Consistency

• Efficiency of storage

• Additional capabilities

Page 53: Architecting for the cloud storage misc topics


Schema

• Clearly with a relational DB there is an explicit schema

• You have an implicit schema with a NoSQL DB as well

– You typically want to do something with the data

• With a relational schema, distributed data has a big performance impact

• The data model of NoSQL data impacts performance as well

– It is easier to distribute data so that related data is co-located

Page 54: Architecting for the cloud storage misc topics


Consistency - CAP Theorem

• When data becomes distributed you need to worry about a network partition

– Essentially this means that instances of your data store can’t communicate

• When this happens you need to choose between availability or consistency

Page 55: Architecting for the cloud storage misc topics


Let’s Demonstrate

• Imagine we start a store that takes orders

– Who wants to work at this store?

• The operators need to be able to:

– Take orders

– Give order history

– Modify orders

• We will start with one operator until business grows …

Page 56: Architecting for the cloud storage misc topics


Consistency in the Cloud

• Many NoSQL databases give you options

– Eventual consistency

– Optimistic consistency

– …

• They all come with different trade offs

• You must understand the needs of your system to ensure appropriate behavior

– We’ll talk more about this later

Page 57: Architecting for the cloud storage misc topics


Outline

This section will focus on storage in the cloud

• We will first look at relational databases

• What solutions emerged for the cloud

• Storage options for NoSQL databases

• Architecture of typical NoSQL databases

Page 58: Architecting for the cloud storage misc topics


Fault Tolerance

• As we said earlier fault tolerance was a prime motivator for many of the decisions

• These systems are built with commodity components that are prone to failure

• They also need to deal with other issues (previously mentioned) that arise

• We’ll look at a representative example of such a system to understand what decisions have been made

Page 59: Architecting for the cloud storage misc topics


Google File System

• Grew out of “BigFiles”

• Distributed, scalable and portable file system

• Implemented in C++

• Supports the kinds of applications we discussed earlier

– Search

– Large data retrieval

Page 60: Architecting for the cloud storage misc topics


Leads to following requirements

1. High reliability through commodity hardware

– Even with RAID, disks at this scale still fail daily. Since the system has to deal with failure smoothly in any case, it is much more economical to use commodity hardware.

– Even if disks do not fail, data blocks may get corrupted.

2. Minimal synchronization on writes

– Require each application process to write to a distinct file. File merge can take place after files are written.

– This means minimal locking during the write process (or read process).

3. Data blocks are all the same size

– Streaming data. All blocks are 64 MBytes.

– GFS is unaware of any internal logic of the data; the internal logic must be managed by the application

Page 61: Architecting for the cloud storage misc topics


GFS Interfaces

• Supports the following commands

– Open

– Create

– Read

– Write

– Close

– Append

– Snapshot

Page 62: Architecting for the cloud storage misc topics


Organization of GFS

• Organized into clusters

• Each cluster might have thousands of machines

• Within each cluster you have the following kinds of entities

– Clients

– Master servers

– Chunk servers

Page 63: Architecting for the cloud storage misc topics


GFS Clients

• Clients are any entity that makes a file request

• Requests are often to retrieve existing files

• They might also include manipulating or creating files

• Clients are other computers or applications

– Think of the web server that serves your search engine as a client

Page 64: Architecting for the cloud storage misc topics


Chunk Servers

• Responsible for storing the data “chunks”

– These chunks are all 64 MB blocks

• These chunk servers are the work horses of the file system

• They receive requests for data and send the chunks directly to the client

• The client also writes files directly to the appropriate chunk servers

– The references for replicas come from the master as well

• The chunk server is responsible for determining the correctness of the write (more later)

Page 65: Architecting for the cloud storage misc topics


Master Servers

• Acts as a coordinator for the cluster

• Keeps track of the metadata

– This is data that describes the data blocks (or chunks)

– It maps each file to the chunks that hold its contents

• Master tells the client where the chunk is located

• Master keeps an operations log

– Logs the activities of the cluster

– One of the mechanisms used to keep service outages to a minimum (more later)

Page 66: Architecting for the cloud storage misc topics


Two Additional Concepts

Lease:

• A lease is the minimal locking that is performed. A client receives a lease on a file when it is opened and, until the file is closed or the lease expires, no other process can write to that file. This prevents accidentally using the same file name twice.

• The client must renew the lease periodically (~1 minute) or the lease expires.

Block:

• Every file managed by GFS is divided into 64 MByte blocks. Each read/write is in terms of <file, block #>.

• Each block is replicated – three is the default number of replicas.

• As far as GFS is concerned there is no internal structure to a block. The application must perform any parsing of the data that is necessary.

Page 67: Architecting for the cloud storage misc topics


Basic Read Operation

The read path involves the Client, the Master, and the Chunk Servers:

1. The client requests the location of a file from the Master
2. The Master returns the location (which chunk server holds the chunk)
3. The client sends the read request directly to that chunk server
4. The chunk server returns the file content directly to the client

Page 68: Architecting for the cloud storage misc topics


Basic Write Operation

The write path also involves the Client, the Master, and the Chunk Servers:

1. The client requests the locations of the primary and secondary replicas from the Master
2. The Master returns the locations
3. The client caches the locations and sends the data to write directly to the chunk servers
4. The chunk servers apply the mutations

Page 69: Architecting for the cloud storage misc topics


Reliability Mechanisms

• Master and chunk replication

• Rebalancing

• Stale replication detection

• Checksumming

• Garbage removal

Page 70: Architecting for the cloud storage misc topics


Master Replication

• One active Master per cluster

• “Shadow” masters exist on other machines

– These shadows may perform limited functions (e.g. reads)

• The shadow monitors the operations of the active master

– Through the operations log

• It maintains contact with the Chunk Servers by polling

– It does this to keep track of the data

• If the Master fails the shadow takes over

Page 71: Architecting for the cloud storage misc topics


Data Replication/Rebalancing

• The file system replicates chunks of data

• It stores data on different machines across different racks

– That way if a machine or rack fails another replica exists

• The Master also monitors the cluster as a whole

• It periodically rebalances the load across the cluster

– All chunk servers run at near capacity but never at full capacity

• The Master also monitors each chunk to ensure its data is current

– If not, it is designated as a stale replica

– The stale replica becomes garbage

Page 72: Architecting for the cloud storage misc topics


Checksum

• In order to detect data corruption checksumming is used

• The system breaks each 64 MB chunk into 64 KB blocks

• Each block has its own 32-bit checksum

• The Master monitors the checksums for each block

• If a checksum doesn’t match what the Master has on record, the block is deleted and a new replica is created
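The per-block checksum idea can be sketched with Python's zlib.crc32. The 64 KB block size and 32-bit checksum follow the slide; the function names and the corruption scenario are illustrative:

```python
import zlib

BLOCK = 64 * 1024   # 64 KB sub-blocks, each with its own 32-bit checksum

def checksums(chunk: bytes):
    """One CRC-32 per 64 KB block of a chunk."""
    return [zlib.crc32(chunk[i:i + BLOCK]) for i in range(0, len(chunk), BLOCK)]

chunk = bytes(3 * BLOCK)            # a 192 KB chunk (three blocks) of zeros
recorded = checksums(chunk)         # what the system has on record

# Simulate silent corruption of one byte in the second block ...
corrupted = bytearray(chunk)
corrupted[BLOCK + 7] ^= 0xFF
observed = checksums(bytes(corrupted))

# ... and detect exactly which block no longer matches the record.
bad_blocks = [i for i, (a, b) in enumerate(zip(recorded, observed)) if a != b]
```

Only the corrupted 64 KB block needs to be replaced from a replica, not the whole 64 MB chunk.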

Page 73: Architecting for the cloud storage misc topics


Failure Scenarios

• Let’s look at the following failure scenarios to see what happens

– Client failure

– Corrupt disk

– Chunk server failure

– Master failure

Page 74: Architecting for the cloud storage misc topics


Client Failure

• Client fails while file open

• Master recognizes this because lease expires

• File is placed in intermediate state where client can re-activate lease

• After the intermediate state expires (~1 hour), the Master informs the Chunk Servers that have blocks for that file to delete them

• Master removes all entries associated with file

• Chunk Server deletes blocks

Page 75: Architecting for the cloud storage misc topics


Corrupt Disk

This is the case where a block becomes corrupted after writing.

Replica1 writes a checksum for every 64 KB in a parallel file.

Replica1 returns checksums along with the block during a read.

The client checks the checksum when the block is returned.

If there is an error then Client:

• Retries read from different Replica2

• Informs Master of corrupt block on Replica1

Master:

• Allocates new replica for that block on Replica3

• Informs Replica2, which has a valid replica, to copy it to Replica3.

• Informs Replica1, which has the corrupted block, to delete that block.

Page 76: Architecting for the cloud storage misc topics


Chunk Server Failure

Master sends Heartbeat request to Chunk Server

• Active Replica responds with a list of block #, replica #s it has.

• Failed Replica does not respond

Master recognizes Replica’s failure.

Master maintains block #, replica # -> Chunk Server mapping from last Heartbeat.

The Master queues all of the blocks replicated on the failed Chunk Server to generate an additional replica.

The generation of an additional replica of Block A:

• Allocate new replica on an active Chunk Server say Replica1

• Instruct one of the Chunk Servers with valid replica of Block A to copy it to Replica1.

Page 77: Architecting for the cloud storage misc topics


Master Failure

• The backup Master maintains a copy of the log

• It is responsible for creating the checkpoint image and trimming the EditLog

• BackupNode takes over in case of Master failure

• BackupNode may also fail

[Diagram: the Master and the BackupNode share the EditLog and the Checkpoint Image]

Page 78: Architecting for the cloud storage misc topics


More about Master Structure

Four Threads:

• Main – perform file management operations.

• Ping/Echo – check on status of Chunk Servers and receive responses from Chunk Servers

• Replica Management – manage new replica creation and replica deletion

• Lease Management – cancel leases when they expire. Queues replicas for deletion for files whose client has failed.

Three Modes

• Normal operations

• Safe mode – when the Master is restarted, no new requests are accepted until a percentage of Chunk Servers have reported their block allocations

• Backup – act as Master backup

Page 79: Architecting for the cloud storage misc topics


Summary

• Relational databases are difficult to distribute efficiently

– Scalability can be problematic

• NoSQL databases offer an alternative

– Data is typically schema-less

• Aggregates of data that mirror primary use cases are considered a unit of data

• Queries across nodes require an efficient mechanism for aggregation

Page 80: Architecting for the cloud storage misc topics


Architecting for the Cloud

Misc Topics

Page 81: Architecting for the cloud storage misc topics


Topics

These are topics that have architectural implications and do not fit neatly into one of the other lectures.

• Zookeeper

• Failure in the cloud

• Business continuity

• Release planning

• Managing configuration parameters

• Monitoring


Page 82: Architecting for the cloud storage misc topics


Zookeeper

• Zookeeper is intended to manage distributed coordination

– Synchronization

– Data


Page 83: Architecting for the cloud storage misc topics


Distributed applications

• Zookeeper provides a guaranteed (mostly) consistent data structure for every instance of a distributed application.

– The definition of “mostly” is within the eventual consistency lag (but this is small). More on eventual consistency later.

• Zookeeper deals with managing failure as well as consistency.

– Done using the Paxos algorithm.

• Zookeeper guarantees that service requests are linearly ordered and processed in FIFO order

Page 84: Architecting for the cloud storage misc topics


Model

• Zookeeper maintains a file type data structure

– Hierarchical

– Data in every node (called znode)

– The amount of data in each node is assumed small (< 1 MB)

– Intended for metadata

• Configuration

• Location

• Group

Page 85: Architecting for the cloud storage misc topics


Zookeeper znode structure

/              <data>
├── /b1        <data>
│   ├── /b1/c1     <data>
│   └── /b1/c2     <data>
└── /b2        <data>
    └── /b2/c1     <data>

Page 86: Architecting for the cloud storage misc topics


API

Function      | Type
create        | write
delete        | write
exists        | read
get children  | read
get data      | read
set data      | write
+ others

• All calls return atomic views of state – they either succeed or fail, with no partial state returned. Writes are also atomic: they either succeed or fail, and a failed write has no side effects.
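A minimal in-memory model of the znode tree and this API, purely for intuition; a real client would use a library such as kazoo, and these function names mirror the table above rather than any real client's signatures:

```python
# Toy znode tree: path -> data. Purely illustrative, no sessions or watches.
znodes = {"/": b""}

def create(path, data=b""):
    if path in znodes:
        raise FileExistsError(path)      # create fails if the node exists
    znodes[path] = data

def delete(path):
    del znodes[path]

def exists(path):
    return path in znodes

def get_children(path):
    prefix = path.rstrip("/") + "/"
    # Direct children only: the remainder after the prefix has no further "/".
    return [p for p in znodes
            if p.startswith(prefix) and "/" not in p[len(prefix):]]

def get_data(path):
    return znodes[path]

def set_data(path, data):
    znodes[path] = data

# Mirrors the group-membership example a few slides later.
create("/Servers")
create("/Servers/10.0.0.1")
create("/Servers/10.0.0.2")
```
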

Page 87: Architecting for the cloud storage misc topics


Example - Group membership

• Remember the load balancer: it has a list of registered servers.

• The load balancer wants to know which of its servers are

– Alive

– Providing service

• The list must be

– Highly available

– Able to reflect failure of individual servers

• There are strict performance requirements on the list manager

Page 88: Architecting for the cloud storage misc topics


Using Zookeeper to manage group membership

• Load balancer on initialization connects to Zookeeper

– Gets the list of Zookeeper servers

– Creates a session (if a server fails – automatic failover)

• Load balancer issues a Create /”Servers” call

– If it already exists, it gets a failure

• Servers register by creating /Servers/my_IP

• The load balancer can list the children of /Servers and get their IPs.

• A watcher will inform the load balancer if a server fails or leaves.

• Latency is low (on the order of microseconds) since Zookeeper keeps its data structures in memory.

Page 89: Architecting for the cloud storage misc topics


Other use cases

• Leader election

• Distributed locks

• Synchronization

• Configuration

Page 90: Architecting for the cloud storage misc topics


Topics

• Zookeeper

• Failure in the cloud

• Business continuity

• Release planning

• Managing configuration parameters

• Monitoring


Page 91: Architecting for the cloud storage misc topics


Failures in the cloud

• Cloud failures, large and small

• The Long Tail

• Techniques for dealing with the long tail

Page 92: Architecting for the cloud storage misc topics


Sometimes the whole cloud fails …

Page 93: Architecting for the cloud storage misc topics


Selected Cloud Outages - 2013

• July 10, Google down for 10 minutes

• June 18, Facebook down for 30 minutes

• Aug 14-17 Outlook.com offline for three days

• Aug 19, Amazon.com down for 40-45 minutes

• Aug 22, Apple iCloud down for 11 hours

• Aug 16, Google down for 5 minutes

• Sept 13, AWS down for ~two hours

• Nov 21, Microsoft services intermittent for ~2 hours

Page 94: Architecting for the cloud storage misc topics


And sometimes just a part of it fails …

Page 95: Architecting for the cloud storage misc topics


A year in the life of a Google datacenter

• Typical first year for a new cluster:
– ~0.5 overheatings (power down most machines in <5 mins, ~1-2 days to recover)
– ~1 PDU failure (~500-1000 machines suddenly disappear, ~6 hours to come back)
– ~20 rack failures (40-80 machines instantly disappear, 1-6 hours to get back)
– ~5 racks go wonky (40-80 machines see 50% packet loss)
– ~8 network maintenances (4 might cause ~30-minute random connectivity losses)
– ~12 router reloads (takes out DNS and external VIPs for a couple minutes)
– ~3 router failures (have to immediately pull traffic for an hour)
– ~dozens of minor 30-second blips for DNS
– ~1000 individual machine failures
– ~thousands of hard drive failures

• Plus slow disks, bad memory, misconfigured machines, flaky machines, dead horses, etc.

Page 96: Architecting for the cloud storage misc topics


Amazon failure statistics

• In a data center with ~64,000 servers, each with 2 disks, ~5 servers and ~17 disks fail every day.

Page 97: Architecting for the cloud storage misc topics


What does this mean for a consumer of the cloud?

• You need to be concerned about “long tail” latency distributions for requests, due to partial failure

• You need to be concerned about business continuity due to overall failure.

Page 98: Architecting for the cloud storage misc topics


Short digression into probability

• A distribution describes the probability that any given reading will have a particular value.

• Many phenomena in nature are “normally distributed”.
• Most values cluster around the mean, with progressively smaller numbers of values toward the edges.
• In a normal distribution the mean is equal to the median.

Page 99: Architecting for the cloud storage misc topics


Long Tail

• In a long tail distribution, there are some values far from the mean.

• These values are sufficient to influence the mean.

• The mean and the median are dramatically different in a long tail distribution.

Page 100: Architecting for the cloud storage misc topics


What does this mean?

• If there is a partial failure of the cloud some activities will take a long time to complete and exhibit a long tail.

• The figure shows the distribution of 1000 AWS “launch instance” calls.
• 4.5% of the calls were “long tail”.

                     Mean    Median   STD     Max
EC2 launch instance  27.81   23.10    25.12   202.3   (seconds)
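The gap between mean and median is easy to reproduce. The latency values below are invented, but shaped like the measurements above: most calls near the median, with a few long-tail stragglers pulling the mean up.

```python
# A long-tailed sample: a few slow calls drag the mean well above the
# median, which tracks the "typical" call.
import statistics

latencies = [22, 23, 23, 24, 25, 26, 28, 30, 95, 202]  # seconds
print(statistics.mean(latencies))    # → 49.8
print(statistics.median(latencies))  # → 25.5
```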

Page 101: Architecting for the cloud storage misc topics


What can you do to prevent long tail problems?

• “Hedged” request. Suppose you wish to launch 10 instances. Issue 11 requests and, when 10 have completed, terminate the one that has not.

• “Alternative” request. In the above scenario, issue 10 requests. When 8 requests have completed, issue 2 more. Cancel the last 2 to respond.

• Using these techniques reduced the longest of the 1000 launch instance requests from 202 sec to 51 sec.
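The hedged-request technique can be sketched with Python threads. launch_instance below is a stand-in that simulates a variable-latency cloud call, not a real AWS API; the hedged helper issues needed+hedge requests and keeps the first needed results.

```python
# Sketch of hedged requests: issue one extra request and drop the
# straggler once enough have completed.
import concurrent.futures as cf
import random
import time

def launch_instance(i):
    # Stand-in for a slow remote call with variable latency.
    time.sleep(random.uniform(0.01, 0.05))
    return f"instance-{i}"

def hedged(call, needed, hedge=1):
    results = []
    with cf.ThreadPoolExecutor(max_workers=needed + hedge) as pool:
        futures = [pool.submit(call, i) for i in range(needed + hedge)]
        for f in cf.as_completed(futures):
            results.append(f.result())
            if len(results) == needed:   # enough done; drop stragglers
                for g in futures:
                    g.cancel()
                break
    return results

print(len(hedged(launch_instance, needed=10)))  # → 10
```

The “alternative request” variant differs only in timing: it submits the extra requests after most of the original batch has completed, trading a little latency for fewer wasted calls.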

Page 102: Architecting for the cloud storage misc topics


Topics

• Zookeeper

• Failure in the cloud

• Business continuity

• Release planning

• Managing configuration parameters

• Monitoring


Page 103: Architecting for the cloud storage misc topics


Business continuity

• Business continuity means that the business should continue to provide service even if a disaster such as a fire, flood, or cloud outage occurs.

• Two numbers characterize a business continuity strategy
– RTO is the Recovery Time Objective – how long before the service is available again
– RPO is the Recovery Point Objective – the point in time that the system rolls back to, i.e. how much data can potentially be lost

• These allow for cost/benefit trade-offs.

• Many industries, such as banking, have compliance rules that require business continuity policies and practices.

Page 104: Architecting for the cloud storage misc topics


How does business continuity work?

• Replicate site in physically distant location.

• Recall DNS server with multiple sites

• If first site does not respond promptly, client will try second site.

[Figure: DNS for Website.com returns two addresses, 123.45.67.89 (Site 1) and 456.77.88.99 (Site 2); Site 1 is marked failed, so clients fall back to Site 2.]
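The client-side fallback described above can be sketched as follows. fetch is a stand-in that simulates Site 1 being down; a real client would make an HTTP call with a timeout.

```python
# Sketch of client failover across the addresses returned by DNS.

def fetch(address, timeout=1.0):
    # Stand-in for a real HTTP request; pretend the primary is down.
    if address == "123.45.67.89":
        raise TimeoutError(address)
    return f"200 OK from {address}"

def resilient_get(addresses):
    for addr in addresses:
        try:
            return fetch(addr)       # first site to respond wins
        except TimeoutError:
            continue                 # site did not respond promptly
    raise RuntimeError("all sites failed")

print(resilient_get(["123.45.67.89", "456.77.88.99"]))
```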

Page 105: Architecting for the cloud storage misc topics


What does it mean to “replicate site”?

• Must have a parallel datacenter

• Data must be replicated within the RPO
– If the RPO is small or zero, this implies DB replication
– If the RPO is larger, other means can be used to replicate data

• Software must also be replicated
– Versions must be identical in both sites

• Using different versions in different sites may result in different results.

• Configurations in the two sites will be different but must yield the same results.

• Replicating a site incurs costs. You may wish to increase the RPO and just copy (back up) data to another site.

Page 106: Architecting for the cloud storage misc topics


Recall discussion about DNS servers

• There is a hierarchy of DNS servers.

• Local DNS servers are under the control of the local organization.

• When a disaster happens, the new data center can be made operative by changing the IP address in the local DNS server.

Page 107: Architecting for the cloud storage misc topics


What are the architectural implications?

• State maintained in servers will be lost if a disaster happens

• Dependencies other than configuration parameters must be identical in a replicated site.

• Applications must be architected to be movable from one environment to another.

Page 108: Architecting for the cloud storage misc topics


Topics

• Zookeeper

• Failure in the cloud

• Business continuity

• Release planning

• Managing configuration parameters

• Monitoring


Page 109: Architecting for the cloud storage misc topics


Dependencies

• There exist many different types of dependencies within a system, e.g.:
– Inter-component
– Version
– Configuration parameters
– Hardware
– Location
– Names
– DB schemas
– Platform
– Libraries

• Inconsistency among these dependencies is a common source of production time errors.
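One way to catch such inconsistencies before they become production errors is to compare a manifest of the dependencies the code was built against with what the target environment actually provides. A toy sketch (the names and version strings are invented):

```python
# Compare a build-time dependency manifest against a target environment
# and report what is missing or at the wrong version.

def drift(expected, actual):
    missing = {k: v for k, v in expected.items() if k not in actual}
    mismatched = {k: (v, actual[k]) for k, v in expected.items()
                  if k in actual and actual[k] != v}
    return missing, mismatched

dev = {"java": "1.7.0_45", "python": "2.7.5"}   # what you developed against
prod = {"java": "1.6.0_30"}                     # stale Java, no Python
missing, mismatched = drift(dev, prod)
print(missing)      # → {'python': '2.7.5'}
print(mismatched)   # → {'java': ('1.7.0_45', '1.6.0_30')}
```

Real deployment pipelines automate this kind of check with package managers and environment manifests rather than hand-built dicts, but the comparison is the same idea.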

Page 110: Architecting for the cloud storage misc topics


For example

• You develop some code on your desktop.
– You have installed the latest Java update
– You configure your code to use a Python script to do some data cleansing
– You depend on a component that your colleagues are simultaneously developing

• You deploy your code into production.
– The latest Java version has not been installed
– Python has not been installed in the production environment
– Your colleagues are delayed in their development

Page 111: Architecting for the cloud storage misc topics


You finally get your code into production

• A user has a problem and calls the help desk.

• The help desk doesn’t know how to solve the problem and escalates it back to you.

• You have gone on vacation.

Page 112: Architecting for the cloud storage misc topics


Problems lead to a requirement for a formal “release plan”

1. Define and agree release and deployment plans with customers/stakeholders.

2. Ensure that each release package consists of a set of related assets and service components that are compatible with each other.

3. Ensure that integrity of a release package and its constituent components is maintained throughout the transition activities and recorded accurately in the configuration management system.

4. Ensure that all release and deployment packages can be tracked, installed, tested, verified, and/or uninstalled or backed out, if appropriate.

5. Ensure that change is managed during the release and deployment activities.

6. Record and manage deviations, risks, issues related to the new or changed service, and take necessary corrective action.

7. Ensure that there is knowledge transfer to enable the customers and users to optimise their use of the service to support their business activities.

8. Ensure that skills and knowledge are transferred to operations and support staff to enable them to effectively and efficiently deliver, support and maintain the service, according to required warranties and service levels.

*http://en.wikipedia.org/wiki/Deployment_Plan


Page 113: Architecting for the cloud storage misc topics


Release planning is labor intensive

• Note the requirements for coordination in the release plan

• Each item requires multiple people and time-consuming activities.
– Time-consuming activities delay introducing the features included in the release.

• Open questions
– Which items are dealt with through process?
– Which items are dealt with through tool support?
– Which items are dealt with through architecture design?
– Which items are dealt with through a combination of the above?

• We will see an architecture designed to reduce team coordination in a subsequent lecture.

Page 114: Architecting for the cloud storage misc topics


Topics

• Zookeeper

• Failure in the cloud

• Business continuity

• Release planning

• Managing configuration parameters

• Monitoring


Page 115: Architecting for the cloud storage misc topics


What is a configuration parameter?

• A configuration parameter or environment variable is a parameter for an application that either controls the behavior of the application or specifies a connection of the application to its environment

– Thread pool or database connection pool sizes control the behavior of the application.

– A database URL specifies a connection of the app to a database.

Page 116: Architecting for the cloud storage misc topics


When are configuration parameters bound?

• Recommended practice is to bind these at initialization time for the app.
– App is loaded into an execution environment
– App is told where to find configuration parameters through language, OS, or environment specific means, e.g. the parameters of main in C
– App reads configuration parameters from the specified location.

• The virtue of this approach is that an app can be loaded into different execution environments and doesn’t need to be aware of which environment it is.
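A minimal sketch of initialization-time binding, assuming the parameters arrive as environment variables (DB_URL and POOL_SIZE are illustrative names, not a standard):

```python
# Read all configuration once, at startup; the app never hard-codes
# which environment it is running in.
import os

def load_config():
    return {
        # connection parameter: fail fast if it is missing
        "db_url": os.environ["DB_URL"],
        # behavioral parameter with a sensible default
        "pool_size": int(os.environ.get("POOL_SIZE", "10")),
    }

# The execution environment (not the app) sets the parameters:
os.environ["DB_URL"] = "postgres://test-db:5432/app"
cfg = load_config()
print(cfg["pool_size"])  # → 10
```

Pointing DB_URL at a unit-test fake, a test database, or production changes the app's behavior without changing the app, which is exactly the virtue described above.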

Page 117: Architecting for the cloud storage misc topics


Use DB as an example – Unit test

• App is given URL for database access component.

• In the case of unit test, the database access component is a component that maintains some fake data in memory for fast access without the overhead of the full DB.

Page 118: Architecting for the cloud storage misc topics


Integration Test

• Test database is maintained for integration testing.

• Test database has subset of full data base.

• URL of test database is provided to App

• App can read or write test database

Page 119: Architecting for the cloud storage misc topics


Performance testing

• A special database access component exists for performance testing
– Passes reads through to the production database
– Writes to a mirror database

• App is given the URL of the special database access component

• Allows testing with real data but blocks writes to the real database

• Mirror database is checked at end of test for correctness.
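The pass-through/mirror idea can be sketched as follows. The dict-backed stores and the MirroringStore class are illustrative stand-ins for real database access components.

```python
# Reads see production data; writes are diverted to a mirror so the
# production store is never modified during the performance test.

class MirroringStore:
    def __init__(self, production, mirror):
        self.production = production
        self.mirror = mirror

    def read(self, key):
        # Check the mirror first so the test observes its own writes.
        if key in self.mirror:
            return self.mirror[key]
        return self.production[key]

    def write(self, key, value):
        self.mirror[key] = value   # never touches production

prod = {"part:1": "transmission"}
store = MirroringStore(prod, mirror={})
store.write("part:2", "brake pads")
print(store.read("part:1"))   # → transmission (real data)
print(prod)                   # production unchanged
```

At the end of the test, the mirror holds exactly the writes the test performed, which is what gets checked for correctness.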

Page 120: Architecting for the cloud storage misc topics


Other configuration parameters

• Other configuration parameters should be identical from integration test through to production.

• Reduces possibility of incorrect specification of configuration parameters.

– Incorrect specification of configuration parameters is a major source of deployment errors.

Page 121: Architecting for the cloud storage misc topics


Topics

• Zookeeper

• Failure in the cloud

• Business continuity

• Release planning

• Managing configuration parameters

• Monitoring


Page 122: Architecting for the cloud storage misc topics


Monitoring

• When is this done?

• Why is it done?

• What can you get from monitoring?

• Data sources – monitors/logs

Page 123: Architecting for the cloud storage misc topics


What is monitoring?

• Monitoring is the collection of data from individual or collections of systems during the runtime of these systems.

• Isn’t this an operations problem and not an architectural problem? – No.

• Operators are first class stakeholders and their needs should be considered when designing the system.

• In the modern world, difficult runtime problems are solved by the architect, so it's to your advantage that the correct information is available.

• Other reasons are implicit in the uses of monitoring information which we are about to go into.

Page 124: Architecting for the cloud storage misc topics


Why monitor?

1. Identifying failures and the associated faults both at runtime and during post-mortems held after a failure has occurred.

2. Identifying performance problems both of individual systems and collections of interacting systems.

3. Characterizing workload for both short term and long term billing and capacity planning purposes.

4. Measuring user reactions to various types of interfaces or business offerings. We will discuss A/B testing later.

5. Detecting intruders who are attempting to break into the system. (outside of our scope).

Page 125: Architecting for the cloud storage misc topics


Basic metrics

• Per VM instance, the provider will collect
– CPU utilization
– Disk reads/writes
– Messages in/out

• These metrics are used for
– Charging
– Scaling
– Mapping utilization to workload

• Similar types of metrics exist for storage and utilities

• These metrics can be aggregated over autoscaling groups, regions, accounts, etc.

Page 126: Architecting for the cloud storage misc topics


Other metrics

• The problem with the basic metrics is that they are not related to particular activities whether business or internal.

• Other things to monitor
– Transactions – transactions per second gives the business an idea of how many customers are utilizing the system.

– Transactions by type.

– Messages from one portion of the system to another.

– Error conditions detected by different portions of the system

– … anything you want

Page 127: Architecting for the cloud storage misc topics


How do I decide what to monitor?

• Look at the reasons for monitoring
– Failure detection
– Performance degradation
– Workload characterization
– User reactions

• For each reason:
– Decide what symptoms you would like reported
– Place responsibilities to detect symptoms in various modules
– Decide on active/passive monitoring (discussed soon)
– Decide what constitutes an alarm (discussed soon)
– Keep the reporting logic under configuration control – levels of reporting

Page 128: Architecting for the cloud storage misc topics


Metadata is crucial

• Data by itself is not that useful.
• It must be tagged with identifying information, including a timestamp.
• For example:
– VM CPU usage, divided among which processes
– I/O requests to which disks, triggered from which VM process
– Messages from which component to which other component, in response to what user requests

• Ideally, each user request is given a tag, and all monitoring information generated as a consequence of satisfying that request is tagged with the request ID.

• Other monitoring activities are tagged with an ID that identifies why the activity was triggered.
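The request-tagging ideal can be sketched in a few lines. The log/handle_order functions and field names are illustrative; real systems would emit structured records to a collector rather than a list.

```python
# Every record emitted while serving a request carries that request's
# ID, so effects can later be joined back to their cause.
import time
import uuid

LOG = []

def log(request_id, event, **fields):
    LOG.append({"ts": time.time(), "request_id": request_id,
                "event": event, **fields})

def handle_order(part):
    request_id = str(uuid.uuid4())   # tag assigned when the request arrives
    log(request_id, "request_received", part=part)
    log(request_id, "db_write", table="orders")
    log(request_id, "response_sent", status=200)
    return request_id

rid = handle_order("brake pads")
# All records produced for this request share one ID:
print(all(r["request_id"] == rid for r in LOG))  # → True
```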

Page 129: Architecting for the cloud storage misc topics


Why this emphasis on metadata?

• Any of the uses enumerated for monitoring data require associating effect with its cause.

• The monitoring data represents the effect.

• The metadata enables determining the cause.

Page 130: Architecting for the cloud storage misc topics


Active/Passive

• Active data collection involves the component that generates the data emitting it itself, periodically or based on a triggering event
– To a key-value store

– To a file

– A message to a known location

• Passive data collection involves the component that generates the data making it available to an agent in the same address space. The agent emits the data either periodically or based on events.

Page 131: Architecting for the cloud storage misc topics


Data Collection

• Whether collection is active or passive, the data is emitted from a component to a known location, periodically or based on events.

[Figure: two systems, each containing an application and an agent; both agents send data to a central monitoring system.]

Page 132: Architecting for the cloud storage misc topics


Monitoring Systems

• Data collection tools

– Nagios

– Sensu

– Icinga

– CloudWatch – AWS specific

Page 133: Architecting for the cloud storage misc topics


Volumes of data

• It is possible to generate huge amounts of data.

• That is the purpose of data collating tools
– Logstash
– Splunk

• Features of such tools
– Collating data from different instances
– Visualization
– Filtering
– Organizing data
– Reports

Page 134: Architecting for the cloud storage misc topics


Alarms

• An alarm is a specific message about some condition needing attention.

– Can be e-mail, text, or on screen for operators.

• Problems with alarms

– False positives – an alarm is raised without justification

– False negatives – justification exists but no alarm is raised.
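One common way to trade false positives against detection delay is to require several consecutive bad readings before raising an alarm, which suppresses one-off blips. A minimal sketch (the threshold and window are illustrative):

```python
# Raise an alarm only after k consecutive readings exceed the
# threshold; isolated spikes do not fire.

def alarms(readings, threshold=90.0, k=3):
    fired, streak = [], 0
    for i, value in enumerate(readings):
        streak = streak + 1 if value > threshold else 0
        if streak == k:        # fire once per sustained excursion
            fired.append(i)
    return fired

cpu = [50, 95, 96, 40, 92, 93, 97, 99]   # % utilization samples
print(alarms(cpu))  # → [6]
```

Raising k lowers the false-positive rate but delays detection of real problems, so the right value depends on how costly each kind of error is.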

Page 135: Architecting for the cloud storage misc topics


Summary

• Distributed coordination problems are simplified when using a tool such as Zookeeper

• You must expect failure in the cloud and prepare for it.

• A disaster is when everything has failed, and for that case you need business continuity plans

• Flexibility in the cloud is managed by setting configuration parameters and they need to be managed.

• Monitoring lets you know what is going on with your system from whatever perspective you wish. But, you must choose your perspective.

Page 136: Architecting for the cloud storage misc topics


Questions??

