
CSci5221: Data Centers, Cloud Computing, …

Cloud Computing and Data Centers: Overview
• What's Cloud Computing?
• Data Centers and "Computing at Scale"
• Case Studies:
  – Google File System
  – Map-Reduce Programming Model
Optional Material
• Google Bigtable
Readings: Do required readings; also do some of the optional readings if interested

Why Study Cloud Computing and Data Centers?
Using Google as an example: GFS, MapReduce, etc.
• mostly related to distributed systems, not really "networking" stuff
Two Primary Goals:
• they represent part of current and "future" trends
  – how applications will be serviced, delivered, …
  – what are important "new" networking problems?
• more importantly, what lessons can we learn in terms of (future) networking design?
  – closely related, and there are many similar issues/challenges (availability, reliability, scalability, manageability, …)
  – (but of course, there are also unique challenges in networking)

Internet and Web
• Simple client-server model
  – a number of clients served by a single server
  – performance determined by "peak load"
  – doesn't scale well when the # of clients suddenly increases ("flash crowd" — e.g., the server crashes)
• From single server to blade server to server farm (or data center)

Internet and Web …
• From "traditional" web to "web services" (or SOA)
  – no longer simply "file" (or web page) downloads
    • pages often dynamically generated, more complicated "objects" (e.g., Flash videos used in YouTube)
  – HTTP is used simply as a "transfer" protocol
    • many other "application protocols" layered on top of HTTP
  – web services & SOA (service-oriented architecture)
• A schematic representation of "modern" web services:
  – front-end: web rendering, request routing, aggregators, …
  – back-end: database, storage, computing, …

Data Center and Cloud Computing
• Data center: large server farms + data warehouses
  – not simply for web/web services
  – managed infrastructure: expensive!
• From web hosting to cloud computing
  – individual web/content providers: must provision for peak load
    • expensive, and typically resources are under-utilized
  – web hosting: a third party provides and owns the (server farm) infrastructure, hosting web services for content providers
  – "server consolidation" via virtualization: apps run on a guest OS inside a VM on a VMM (hypervisor), with the VM under client web service control

Cloud Computing
• Cloud computing and cloud-based services:
  – beyond web-based "information access" or "information delivery"
  – computing, storage, …
• Cloud Computing: NIST Definition
  "Cloud computing is a model for enabling convenient, on-demand network access to a shared pool of configurable computing resources (e.g., networks, servers, storage, applications, and services) that can be rapidly provisioned and released with minimal management effort or service provider interaction."
• Models of Cloud Computing
  – "Infrastructure as a Service" (IaaS), e.g., Amazon EC2, Rackspace
  – "Platform as a Service" (PaaS), e.g., Microsoft Azure
  – "Software as a Service" (SaaS), e.g., Google

Data Centers: Key Challenges
With thousands of servers within a data center,
• How to write applications (services) for them?
• How to allocate resources, and manage them?
  – in particular, how to ensure performance, reliability, availability, …
• Scale and complexity bring other key challenges
  – with thousands of machines, failures are the default case!
  – load-balancing, handling "heterogeneity," …
• Data center (server cluster) as a "computer"
• "Super-computer" vs. "cluster computer"
  – a single "super-high-performance" and highly reliable computer
  – vs. a "computer" built out of thousands of "cheap & unreliable" PCs
  – pros and cons?

Case Studies
• Google File System (GFS)
  – a "file system" (or "OS") for a "cluster computer"
    • an "overlay" on top of the "native" OS on individual machines
  – designed with certain (common) types of applications in mind, and designed with failures as default cases
• Google MapReduce (cf. Microsoft Dryad)
  – MapReduce: a new "programming paradigm" for certain (common) types of applications, built on top of GFS
• Other examples (optional):
  – BigTable: a (semi-)structured database for efficient key-value queries, etc., built on top of GFS
  – Amazon Dynamo: a distributed <key, value> storage system; high availability is a key design goal
  – Google's Chubby, Sawzall, etc.
  – Open source systems: Hadoop, …

Google Scale and Philosophy
• Lots of data
  – copies of the web, satellite data, user data, email and USENET, Subversion backing store
• Workloads are large and easily parallelizable
• No commercial system big enough
  – couldn't afford it if there was one
  – might not have made appropriate design choices
  – but truckloads of low-cost machines
    • 450,000 machines (NYTimes estimate, June 14th 2006)
• Failures are the norm
  – even reliable systems fail at Google scale
• Software must tolerate failures
  – which machine an application is running on should not matter
  – firm believers in the "end-to-end" argument
• Care about perf/$, not absolute machine perf

Typical Cluster at Google

[Diagram: machines 1–3 each run Linux with a Scheduler Slave, a GFS Chunkserver, and user tasks; some machines also run BigTable Servers. Cluster-wide services: a Cluster Scheduling Master, a Lock Service, a GFS Master, and a BigTable Master.]

Google: System Building Blocks
• Google File System (GFS): raw storage
• (Cluster) Scheduler: schedules jobs onto machines
• Lock service: distributed lock manager
  – also can reliably hold tiny files (100s of bytes) w/ high availability
• Bigtable: a multi-dimensional database
• MapReduce: simplified large-scale data processing
• …

Chubby: Distributed Lock Service
• {lock/file/name} service
• Coarse-grained locks; can store a small amount of data in a lock
• 5 replicas; need a majority vote to be active
• Also an OSDI '06 paper

Google File System: Key Design Considerations
• Component failures are the norm
  – hardware component failures, software bugs, human errors, power supply issues, …
  – solutions: built-in mechanisms for monitoring, error detection, fault tolerance, automatic recovery
• Files are huge by traditional standards
  – multi-GB files are common, billions of objects
  – most writes (modifications or "mutations") are "appends"
  – two types of reads: a large # of "stream" (i.e., sequential) reads, with a small # of "random" reads
• High concurrency (multiple "producers/consumers" on a file)
  – atomicity with minimal synchronization
• Sustained bandwidth more important than latency

GFS Architectural Design
• A GFS cluster:
  – a single master + multiple chunkservers per master
  – running on commodity Linux machines
• A file: a sequence of fixed-size chunks (64 MB)
  – labeled with 64-bit unique global IDs
  – stored at chunkservers (as "native" Linux files, on local disk)
  – each chunk mirrored across (default 3) chunkservers
• Master server: maintains all metadata
  – namespace, access control, file-to-chunk mappings, garbage collection, chunk migration
  – why only a single master? (with read-only shadow masters)
    • simple, and it only answers chunk location queries from clients!
• Chunkservers ("slaves" or "workers"):
  – interact directly with clients, perform reads/writes, …

GFS Architecture: Illustration
• GFS clients
  – consult the master for metadata
  – typically ask for multiple chunk locations per request
  – access data from chunkservers
• Separation of control and data flows

Chunk Size and Metadata
• Chunk size: 64 MB
  – fewer chunk location requests to the master
  – client can perform many operations on a chunk
    • reduces overhead to access a chunk
    • can establish a persistent TCP connection to a chunkserver
  – fewer metadata entries
    • metadata can be kept in memory (at the master)
    • in-memory data structures allow fast periodic scanning
  – some potential problems with fragmentation
• Metadata
  – file and chunk namespaces (files and chunk identifiers)
  – file-to-chunk mappings
  – locations of a chunk's replicas

Chunk Locations and Logs
• Chunk locations:
  – the master does not keep a persistent record of chunk locations
  – it polls chunkservers at startup, and uses heartbeat messages to monitor chunkservers: simplicity!
    • because of chunkserver failures, it is hard to keep a persistent record of chunk locations
  – on-demand approach vs. coordination
    • on-demand wins when changes (failures) are frequent
• Operation logs
  – maintain a historical record of critical metadata changes: namespace and mappings
  – for reliability and consistency, replicate the operation log on multiple remote machines ("shadow masters")

Clients and APIs
• GFS is not transparent to clients
  – requires clients to perform certain "consistency" verification (using chunk id & version #), make snapshots (if needed), …
• APIs:
  – open, delete, read, write (as expected)
  – append: at least once, possibly with gaps and/or inconsistencies among clients
  – snapshot: quickly create a copy of a file
• Separation of data and control: the client
  – issues control (metadata) requests to the master server
  – issues data requests directly to chunkservers
• Client caches metadata, but does no caching of data
  – no consistency difficulties among clients
  – streaming reads (read once) and append writes (write once) don't benefit much from caching at the client

System Interaction: Read
• Client sends master: read(file name, chunk index)
• Master's reply: chunk ID, chunk version #, locations of replicas
• Client sends "closest" chunkserver w/ a replica: read(chunk ID, byte range)
  – "closest" determined by IP address on a simple rack-based network topology
• Chunkserver replies with data
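The read path above can be sketched as a toy model. This is illustrative only: the `Master`/`Chunkserver` classes and `gfs_read` names are invented, the "closest" replica is simply the first one, and everything runs in one process.

```python
# Toy sketch of the GFS read path (invented names; not the real API).
CHUNK_SIZE = 64 * 1024 * 1024  # 64 MB chunks

class Master:
    """Holds metadata only: file -> {chunk index: (chunk_id, version, replicas)}."""
    def __init__(self, file_table):
        self.file_table = file_table

    def lookup(self, file_name, chunk_index):
        return self.file_table[file_name][chunk_index]

class Chunkserver:
    """Stores chunk contents; serves byte-range reads directly to clients."""
    def __init__(self, chunks):
        self.chunks = chunks  # chunk_id -> bytes

    def read(self, chunk_id, offset, length):
        return self.chunks[chunk_id][offset:offset + length]

def gfs_read(master, servers, file_name, offset, length):
    chunk_index = offset // CHUNK_SIZE                      # which chunk?
    chunk_id, version, replicas = master.lookup(file_name, chunk_index)
    closest = servers[replicas[0]]                          # pick a replica
    return closest.read(chunk_id, offset % CHUNK_SIZE, length)
```

Note that the master hands out only metadata; the data bytes flow directly between client and chunkserver, which is the "separation of control and data flows" from the previous slide.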

System Interactions: Write and Record Append
• Write and Record Append
  – slightly different semantics: record append is "atomic"
• The master grants a chunk lease to a chunkserver (the primary), and replies back to the client
• Client first pushes data to all chunkservers
  – pushed linearly: each replica forwards as it receives
  – pipelined transfer: 13 MB/second with a 100 Mbps network
• Then the client issues a write/append to the primary chunkserver
• The primary chunkserver determines the order of updates to all replicas
  – in record append: the primary chunkserver checks whether the record append would exceed the maximum chunk size
  – if yes, pad the chunk (and ask secondaries to do the same), and then ask the client to append to the next chunk

Leases and Mutation Order
• Lease:
  – 60-second timeouts; can be extended indefinitely
  – extension requests are piggybacked on heartbeat messages
  – after a timeout expires, the master can grant new leases
• Use leases to maintain a consistent mutation order across replicas
• Master grants a lease to one of the replicas -> the primary
• Primary picks a serial order for all mutations
• Other replicas follow the primary's order
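The master's lease bookkeeping can be sketched as below. All names (`LeaseTable`, `grant`, `extend`) are invented for illustration, and `now` is passed in explicitly instead of using the clock; in real GFS, extensions ride on heartbeat messages rather than a direct call.

```python
# Toy sketch of lease management: at most one unexpired primary per chunk.
LEASE_TIMEOUT = 60.0  # seconds, per the slide

class LeaseTable:
    def __init__(self):
        self.primary = {}  # chunk_id -> (server, expiry_time)

    def grant(self, chunk_id, server, now):
        holder = self.primary.get(chunk_id)
        if holder is not None and holder[1] > now:
            return holder[0]  # an unexpired lease exists; keep that primary
        self.primary[chunk_id] = (server, now + LEASE_TIMEOUT)
        return server

    def extend(self, chunk_id, server, now):
        holder = self.primary.get(chunk_id)
        if holder is not None and holder[0] == server and holder[1] > now:
            self.primary[chunk_id] = (server, now + LEASE_TIMEOUT)
            return True  # extension accepted (indefinitely repeatable)
        return False
```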

Consistency Model
• Changes to namespace (i.e., metadata) are atomic
  – done by the single master server!
  – the master uses a log to define a global total order of namespace-changing operations
• Relaxed consistency
  – concurrent changes are consistent but "undefined"
    • defined: after a data mutation, the file region is consistent, and all clients see the entire mutation
  – an append is atomically committed at least once
    • occasional duplications
• All changes to a chunk are applied in the same order to all replicas
• Version numbers are used to detect missed updates

Master Namespace Management & Logs
• Namespace: files and their chunks
  – metadata maintained as "flat names", no hard/symbolic links
  – full path name to metadata mapping, with prefix compression
• Each node in the namespace has an associated read-write lock (-> a total global order, no deadlock)
  – concurrent operations can be properly serialized by this locking mechanism
• Metadata updates are logged
  – logs replicated on remote machines
  – take global snapshots (checkpoints) to truncate logs (checkpoints can be created while updates arrive)
• Recovery: latest checkpoint + subsequent log files

Replica Placement
• Goals:
  – maximize data reliability and availability
  – maximize network bandwidth
• Need to spread chunk replicas across machines and racks
• Higher priority to re-replicating chunks with lower replication factors
• Limited resources spent on replication

Other Operations
• Locking operations
  – one lock per path; a directory can be modified concurrently
    • to access /d1/d2/leaf, need to lock /d1, /d1/d2, and /d1/d2/leaf
    • each thread acquires read locks on the directories & a write lock on the file
  – totally ordered locking to prevent deadlocks
• Garbage Collection:
  – simpler than eager deletion due to
    • unfinished replica creation, lost deletion messages
  – deleted files are hidden for three days, then they are garbage collected
    • combined with other background ops (e.g., taking snapshots)
  – safety net against accidents

Fault Tolerance and Diagnosis
• Fast recovery
  – the master and chunkservers are designed to restore their state and start in seconds regardless of termination conditions
• Chunk replication
• Data integrity
  – a chunk is divided into 64-KB blocks, each with its own checksum
  – verified at read and write times
  – also background scans for rarely used data
• Master replication
  – shadow masters provide read-only access when the primary master is down
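The per-block integrity check can be sketched as follows. This is a simplified stand-in: GFS keeps a 32-bit checksum per 64 KB block, and `zlib.crc32` is used here purely for illustration.

```python
# Sketch of per-block checksumming: a chunk is divided into 64 KB blocks,
# each with its own checksum, verified when the block is read.
import zlib

BLOCK_SIZE = 64 * 1024  # 64 KB

def checksum_blocks(chunk: bytes):
    """Compute one checksum per 64 KB block of a chunk."""
    return [zlib.crc32(chunk[i:i + BLOCK_SIZE])
            for i in range(0, len(chunk), BLOCK_SIZE)]

def verify_read(chunk: bytes, checksums):
    """Return True iff every block still matches its stored checksum."""
    return checksum_blocks(chunk) == checksums
```

Because checksums are per block rather than per chunk, a read of a small byte range only needs to verify the blocks it touches, and a background scrubber can detect bit rot in rarely read data.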

GFS: Summary
• GFS is a distributed file system that supports large-scale data-processing workloads on commodity hardware
  – GFS sits at a different point in the design space
    • component failures as the norm
    • optimized for huge files
  – success: used actively by Google to support search and other applications
  – but performance may not be good for all apps
    • assumes a read-once, write-once workload (no client caching!)
• GFS provides fault tolerance
  – replicating data (via chunk replication), fast and automatic recovery
• GFS has a simple, centralized master that does not become a bottleneck
• Semantics not transparent to apps ("end-to-end" principle?)
  – must verify file contents to avoid inconsistent regions, repeated appends (at-least-once semantics)

Google MapReduce
• The problem
  – many simple operations at Google
    • grep for data, compute indexes, compute summaries, etc.
  – but the input data is large, really large
    • the whole Web, billions of pages
  – Google has lots of machines (clusters of 10K, etc.)
  – many computations over VERY large datasets
  – the question is: how do you use a large # of machines efficiently?
• Can reduce the computational model down to two steps
  – Map: take one operation, apply to many many data tuples
  – Reduce: take the results, aggregate them
• MapReduce: a generalized interface for massively parallel cluster processing

MapReduce Programming Model
• Intuitively just like map/fold from functional languages (Scheme, Lisp, Haskell, etc.)
• Map: initial parallel computation
  – map (in_key, in_value) -> list(out_key, intermediate_value)
  – in: a set of key/value pairs
  – out: a set of intermediate key/value pairs
  – note: keys might change during Map
• Reduce: aggregation of intermediate values by key
  – reduce (out_key, list(intermediate_value)) -> list(out_value)
  – combines all intermediate values for a particular key
  – produces a set of merged output values (usually just one)

Example: Word Counting
• Goal: count the # of occurrences of each word in many documents
• Sample data
  – Page 1: the weather is good
  – Page 2: today is good
  – Page 3: good weather is good
• In MapReduce (pseudocode):

  map(String key, String value):
    // key: document name
    // value: document contents
    for each word w in value:
      EmitIntermediate(w, "1");

  reduce(String key, Iterator values):
    // key: a word
    // values: a list of counts
    int result = 0;
    for each v in values:
      result += ParseInt(v);
    Emit(AsString(result));
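The same word count in runnable form — a minimal single-process simulation of the map, shuffle (group by key), and reduce phases, not a distributed implementation; the `map_reduce` driver name is invented.

```python
# Single-process simulation of the word-count job above.
from collections import defaultdict

def map_fn(doc_name, contents):
    for word in contents.split():
        yield (word, 1)               # EmitIntermediate(w, 1)

def reduce_fn(word, counts):
    return sum(counts)                # add up the 1s for this word

def map_reduce(inputs, map_fn, reduce_fn):
    intermediate = defaultdict(list)
    for key, value in inputs:
        for out_key, out_value in map_fn(key, value):   # Map phase
            intermediate[out_key].append(out_value)     # shuffle: group by key
    return {k: reduce_fn(k, vs) for k, vs in intermediate.items()}  # Reduce

pages = [("Page 1", "the weather is good"),
         ("Page 2", "today is good"),
         ("Page 3", "good weather is good")]
counts = map_reduce(pages, map_fn, reduce_fn)
# counts == {'the': 1, 'weather': 2, 'is': 3, 'good': 4, 'today': 1}
```

The grouping step in the middle is exactly the "feed" stage shown on the next slide; in a real cluster it is a distributed shuffle, not an in-memory dict.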

Map/Reduce in Action
• Input:
  – Page 1: the weather is good
  – Page 2: today is good
  – Page 3: good weather is good
• Map:
  – Worker 1: (the 1), (weather 1), (is 1), (good 1)
  – Worker 2: (today 1), (is 1), (good 1)
  – Worker 3: (good 1), (weather 1), (is 1), (good 1)
• Feed (group by key):
  – Worker 1: (the 1)
  – Worker 2: (is 1), (is 1), (is 1)
  – Worker 3: (weather 1), (weather 1)
  – Worker 4: (today 1)
  – Worker 5: (good 1), (good 1), (good 1), (good 1)
• Reduce:
  – Worker 1: (the 1); Worker 2: (is 3); Worker 3: (weather 2); Worker 4: (today 1); Worker 5: (good 4)

Illustration

[Figure: MapReduce execution overview]

More Examples
• Distributed Grep
  – Map: emit a line if it matches a given pattern P
  – Reduce: identity function; just copy the result to the output
• Count of URL Access Frequency
  – Map: parses URL access logs, outputs <URL, 1>
  – Reduce: adds together counts for the same unique URL, outputs <URL, totalCount>
• Reverse web-link graph: who links to this page?
  – Map: go through all source pages, generate all links <target, source>
  – Reduce: for each target, concatenate all source links: <target, list(sources)>
• Many more examples: see the paper [MapReduce, OSDI'04]
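The reverse web-link graph example follows the same map/group/reduce shape; here is a compact single-process sketch (the `reverse_links` name and input format are invented for illustration).

```python
# Toy reverse web-link graph: Map emits (target, source) per link,
# Reduce collects the sources for each target.
from collections import defaultdict

def reverse_links(pages):
    """pages: dict of source URL -> list of target URLs it links to."""
    grouped = defaultdict(list)
    for source, targets in pages.items():
        for target in targets:            # Map: emit (target, source)
            grouped[target].append(source)
    # Reduce: concatenate sources per target (sorted for determinism)
    return {t: sorted(s) for t, s in grouped.items()}
```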

MapReduce Architecture
• Single Master node
• Many worker bees

MapReduce Operation
• Initial data is split into 64 MB blocks
• Map results are computed and locally stored
• Master is informed of the result locations
• Master sends the data locations to the Reduce workers
• Final output is written

What if Workers Die?
• And you know they will…
• Master periodically pings workers
  – still alive and working? Good…
• If a corpse is found…
  – allocate its task to the next idle worker (ruthless!)
  – if a Map worker dies, need to recompute all its data. Why? (its intermediate results live on its local disk)
• If a corpse comes back to life… (zombies!)
  – give it a task, and a clean slate
• What if the Master dies?
  – only 1 Master; if he/she dies, the whole thing stops
  – fairly rare occurrence

What if You Find Stragglers?
• Some workers can be slower than others
  – faulty hardware
  – software misconfiguration / bug
  – whatever …
• Near completion of the task:
  – Master looks at stragglers and their tasks
  – assigns "backup" workers to also compute these tasks
  – whoever finishes first wins!
  – can now leave stragglers behind!


Optional Materials: Google BigTable

Google Bigtable
• Distributed multi-level map
  – with an interesting data model
• Fault-tolerant, persistent
• Scalable
  – thousands of servers
  – terabytes of in-memory data
  – petabytes of disk-based data
  – millions of reads/writes per second, efficient scans
• Self-managing
  – servers can be added/removed dynamically
  – servers adjust to load imbalance
• Key points:
  – data model and implementation structure
    • tablets, SSTables, compactions, locality groups, …
  – API and details: shared logs, compression, replication, …

Basic Data Model
• Distributed multi-dimensional sparse map: (row, column, timestamp) -> cell contents
• Good match for most of Google's applications

[Figure: the Webtable example — row "www.cnn.com", column "contents", cell versions "<html>…" at timestamps t1, t2, t3; axes labeled ROWS, COLUMNS, TIMESTAMPS]

Rows
• A row key is an arbitrary string
  – typically 10-100 bytes in size, up to 64 KB
  – every read or write of data under a single row key is atomic
• Data is maintained in lexicographic order by row key
• The row range for a table is dynamically partitioned
• Each partition (row range) is called a tablet
  – unit of distribution and load-balancing
• Objective: make read operations single-sited!
  – e.g., in Webtable, pages in the same domain are grouped together by reversing the hostname components of the URLs: com.google.maps instead of maps.google.com
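The row-key trick at the end of the slide is easy to show concretely; a small sketch (the `webtable_row_key` helper is an invented name):

```python
# Reverse the hostname components of a URL so that pages from the same
# domain sort adjacently under lexicographic row-key order.
def webtable_row_key(url: str) -> str:
    host, _, path = url.partition("/")
    reversed_host = ".".join(reversed(host.split(".")))
    return reversed_host + ("/" + path if path else "")
```

Since Bigtable keeps rows in lexicographic order and splits tablets at row boundaries, all `com.google.*` rows land in a contiguous range, making whole-domain scans single-sited.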

Columns
• Columns have a two-level name structure:
  – family:optional_qualifier, e.g., Language:English
• Column family
  – a column family must be created before data can be stored under a column key
  – unit of access control
  – has associated type information
• Qualifiers give unbounded columns
  – an additional level of indexing, if desired

[Figure: row "cnn.com" with column "contents:" holding "CNN homepage", and columns "anchor:cnnsi.com" and "anchor:stanford.edu" holding anchor text such as "CNN"]

Locality Groups
• Column families can be assigned to a locality group
  – used to organize the underlying storage representation for performance
  – data in a locality group can be mapped into memory, and is stored in its own SSTables
  – avoids mingling data, e.g., page contents and page metadata
• Can compress locality groups
• Bloom filters on SSTables in a locality group
  – avoid searching an SSTable if the bit is not set
• Tablet movement
  – major compaction (with concurrent updates)
  – minor compaction (to catch up with updates) without any concurrent updates
  – load on the new server without requiring any recovery action

Timestamps (64-bit integers)
• Used to store different versions of data in a cell
  – new writes default to the current time, but timestamps for writes can also be set explicitly by clients
• Assigned by:
  – Bigtable: real time in microseconds
  – client application: when unique timestamps are a necessity
• Items in a cell are stored in decreasing timestamp order
  – the application specifies how many versions (n) of data items are maintained in a cell
  – Bigtable garbage-collects obsolete versions
• Lookup options:
  – "return most recent K values"
  – "return all values in timestamp range (or all values)"
• Column families can be marked w/ attributes:
  – "only retain most recent K values in a cell"
  – "keep values until they are older than K seconds"
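The per-cell versioning described above can be sketched as a tiny class; `Cell`, `write`, and `read_recent` are invented names, and garbage collection is done eagerly here (a real implementation defers it to compactions).

```python
# Sketch of a cell with timestamped versions, newest first, and a
# "retain the most recent K values" garbage-collection policy.
class Cell:
    def __init__(self, max_versions=3):
        self.max_versions = max_versions
        self.versions = []  # list of (timestamp, value), decreasing timestamp

    def write(self, value, timestamp):
        self.versions.append((timestamp, value))
        self.versions.sort(key=lambda tv: -tv[0])   # decreasing timestamp order
        del self.versions[self.max_versions:]       # GC obsolete versions

    def read_recent(self, k=1):
        """'Return most recent K values.'"""
        return [v for _, v in self.versions[:k]]
```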

Tablets
• Large tables are broken into tablets at row boundaries
  – a tablet holds a contiguous range of rows
    • clients can often choose row keys to achieve locality
  – aim for ~100 MB to 200 MB of data per tablet
• Each serving machine is responsible for ~100 tablets
  – fast recovery:
    • 100 machines each pick up 1 tablet from a failed machine
  – fine-grained load balancing:
    • migrate tablets away from an overloaded machine
    • master makes load-balancing decisions

Tablets & Splitting

[Figure: a table with columns "contents" and "language", rows from aaa.com through cnn.com, cnn.com/sports.html, …, Website.com, …, Zuppa.com/menu.html, partitioned into TABLETS at row boundaries; sample cell values "<html>…" and "EN"]

Tablets & Splitting …

[Figure: the same table after growth — a tablet is split at a row boundary, here between Yahoo.com/kids.html and Yahoo.com/kids.html?D]

Table, Tablet and SSTable
• Multiple tablets make up a table
• SSTables can be shared between tablets
• Tablets do not overlap; SSTables can overlap

[Figure: two tablets, "aardvark"–"apple" and "apple_two_E"–"boat", built from four SSTables, one of which is shared]

Tablet Representation
• SSTable: immutable on-disk ordered map from string -> string
• String keys: <row, column, timestamp> triples

[Figure: a tablet = a write buffer in memory (random-access) plus several SSTables on GFS (possibly mmap-ed); writes go to an append-only log on GFS and then the write buffer; reads see the merged view]

Tablet
• Contains some range of rows of the table (e.g., start: aardvark, end: apple)
• Built out of multiple SSTables, each consisting of 64K blocks plus an index

SSTable
• Immutable, sorted file of key-value pairs
• Chunks of data (64K blocks) plus an index
  – the index is of block ranges, not values
• An SSTable is stored in GFS
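The block-index lookup can be sketched in a few lines. This is a simplification: real SSTables use 64K byte blocks and an on-disk index, while this toy groups a fixed number of entries per block and keeps everything in memory; all names are invented.

```python
# Sketch of an SSTable-style lookup: sorted key-value pairs grouped into
# blocks, with an index of each block's first key. A lookup binary-searches
# the index, then scans only the one candidate block.
import bisect

class SSTable:
    def __init__(self, sorted_items, block_len=2):
        # sorted_items: (key, value) pairs in key order; immutable after build
        self.blocks = [sorted_items[i:i + block_len]
                       for i in range(0, len(sorted_items), block_len)]
        self.index = [blk[0][0] for blk in self.blocks]  # first key per block

    def get(self, key):
        i = bisect.bisect_right(self.index, key) - 1  # candidate block
        if i < 0:
            return None
        for k, v in self.blocks[i]:
            if k == key:
                return v
        return None
```

Indexing block ranges rather than individual keys keeps the index small enough to hold in memory, at the cost of reading one block per lookup.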

Immutability
• SSTables are immutable
  – simplifies caching, sharing across GFS, etc.
  – no need for concurrency control
  – the SSTables of a tablet are recorded in the METADATA table
  – garbage collection of SSTables is done by the master
  – on a tablet split, the split tablets can start off quickly on shared SSTables, splitting them lazily
• Only the memtable has concurrent reads and updates
  – copy-on-write rows allow concurrent read/write

Overall Architecture

[Figure: clients talk to a Master Server and multiple Tablet Servers; each machine also runs a GFS chunkserver and a CMS client; a GFS master server and Chubby complete the cell]

• Chubby:
  – handles master election
  – stores the bootstrap location of Bigtable data
  – discovers tablet servers
  – stores access control lists
• Master:
  – assigns tablets
  – detects the addition and expiration of tablet servers
  – balances tablet-server load
  – handles schema changes
• CMS server:
  – schedules jobs
  – manages resources on the cluster
  – deals with machine failures

Tablet Assignment
• Master keeps track of the set of live tablet servers and the current assignment of tablets to tablet servers, including which tablets are unassigned

[Figure: 1) the cluster manager starts a tablet server; 2) the tablet server creates a lock file in Chubby; 3) it acquires the lock, and 4) Chubby monitors it; 5) the master assigns tablets after 6) checking the lock status; 8) on failure the master acquires and deletes the lock, and 9) reassigns the unassigned tablets]

Placement of Tablets
• A tablet is assigned to one tablet server at a time
• Master maintains:
  – the set of live tablet servers
  – the current assignment of tablets to tablet servers (including the unassigned ones)
• Chubby maintains tablet servers:
  – a tablet server creates and acquires an eXclusive lock on a uniquely named file in a specific Chubby directory (the server directory)
  – the master monitors the server directory to discover tablet servers
  – a tablet server stops processing requests if it loses its X lock (network partitioning)
    • the tablet server will try to re-acquire an X lock on its uniquely named file as long as the file exists
    • if the uniquely named file of a tablet server no longer exists, the tablet server kills itself; it goes back to a free pool to be assigned tablets by the master

How a Client Locates Tablets
• 3-level hierarchical lookup scheme for tablets
  – a location is the ip:port of the relevant server
  – 1st level: bootstrapped from the lock service; points to the owner of META0
  – 2nd level: uses META0 data to find the owner of the appropriate META1 tablet
  – 3rd level: the META1 tablets hold the locations of the tablets of all other tables
    • the META1 table itself can be split into multiple tablets

Editing a Table
• Mutations are logged, then applied to an in-memory memtable
  – may contain "deletion" entries to handle updates
  – group commit on the log: collect multiple updates before a log flush

[Figure: inserts and deletes are appended to the tablet log on GFS and applied to the memtable in memory; the tablet's SSTables on GFS remain immutable]

Client Write & Read Operations
• A write operation arrives at a tablet server:
  – the server ensures the client has sufficient privileges for the write operation (Chubby)
  – a log record is generated to the commit log file
  – once the write commits, its contents are inserted into the memtable
  – once the memtable reaches a threshold:
    • the memtable is frozen
    • a new memtable is created
    • the frozen memtable is converted to an SSTable and written to GFS
• A read operation arrives at a tablet server:
  – the server ensures the client has sufficient privileges for the read operation (Chubby)
  – the read is performed on a merged view of (a) the SSTables that constitute the tablet, and (b) the memtable
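The "merged view" read can be sketched as follows. This is a toy (invented `tablet_read` name, dicts instead of merged iterators over sorted files), but it shows the essential rule: the memtable is consulted first, then SSTables newest-to-oldest, and a deletion entry shadows older values.

```python
# Sketch of a tablet read over the merged view of memtable + SSTables.
DELETED = object()  # sentinel standing in for a "deletion" entry

def tablet_read(key, memtable, sstables):
    """memtable: dict; sstables: list of dicts, newest first."""
    for layer in [memtable] + sstables:
        if key in layer:
            value = layer[key]
            return None if value is DELETED else value
    return None  # key not present in any layer
```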

Compaction

[Figure: write ops go to the tablet log and memtable in memory; read ops see the memtable plus the SSTable files on the DFS]

• Minor compaction: when the memtable fills, create a new memtable; the frozen memtable is converted to a new SSTable
• Merging compaction: memtable + a few SSTables -> a new SSTable; done periodically; deleted data are still alive
• Major compaction: memtable + all SSTables -> one SSTable; deleted data are removed, and storage can be re-used

System Structure (Again)

[Figure: a Bigtable cell — a Bigtable client uses the Bigtable client library (Open()) to reach the Bigtable master (performs metadata ops, load balancing) and Bigtable tablet servers (serve data); the Cluster Scheduling Master handles failover and monitoring; GFS holds tablet data and logs; the Lock service holds metadata and handles master election]

Master's Tasks
• Uses Chubby to monitor the health of tablet servers, restarts failed servers
• A tablet server registers itself by getting a lock in a specific Chubby directory
  – Chubby gives a "lease" on the lock, which must be renewed periodically
  – a server loses its lock if it gets disconnected
• Master monitors this directory to find which servers exist/are alive
  – if a server is not contactable/has lost its lock, the master grabs the lock and reassigns its tablets
  – GFS replicates data; prefer to start a tablet server on the same machine the data is already at

Master's Tasks …
When a (new) master starts, it:
• grabs the master lock on Chubby
  – ensures only one master at a time
• finds live servers (scans the Chubby directory)
• communicates with the servers to find their assigned tablets
• scans the metadata table to find all tablets
  – keeps track of unassigned tablets, and assigns them
  – the metadata root comes from Chubby; the other metadata tablets are assigned before scanning

Shared Logs
• Designed for 1M tablets, 1000s of tablet servers
  – 1M logs being simultaneously written performs badly
• Solution: shared logs
  – write one log file per tablet server instead of per tablet
    • updates for many tablets co-mingled in the same file
  – start new log chunks every so often (64 MB)
• Problem: during recovery, a server needs to read log data to apply mutations for a tablet
  – lots of wasted I/O if many machines need to read data for many tablets from the same log chunk

Shared Log Recovery
• Servers inform the master of the log chunks they need to read
• Master aggregates and orchestrates sorting of the needed chunks
  – assigns log chunks to different tablet servers to be sorted
  – those servers sort the chunks by tablet, and write the sorted data to local disk
• Other tablet servers ask the master which servers have the sorted chunks they need
• Tablet servers issue direct RPCs to peer tablet servers to read the sorted data for their tablets
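The core of the recovery-time sort is just "group co-mingled log entries by tablet, order each group for replay." A minimal sketch (the `sort_log_chunk` name and entry format are invented; real entries carry sequence numbers in the log records):

```python
# Group shared-log entries by tablet and order them for replay, so each
# recovering server reads only its own tablets' mutations.
from collections import defaultdict

def sort_log_chunk(entries):
    """entries: list of (tablet_id, seq_no, mutation) in arrival order."""
    by_tablet = defaultdict(list)
    for tablet_id, seq_no, mutation in entries:
        by_tablet[tablet_id].append((seq_no, mutation))
    for muts in by_tablet.values():
        muts.sort()  # replay order within each tablet
    return dict(by_tablet)
```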

Application 1: Google Analytics
• Enables webmasters to analyze traffic patterns at their web sites. Statistics such as:
  – number of unique visitors per day and page views per URL per day
  – percentage of users that made a purchase, given that they earlier viewed a specific page
• How?
  – a small JavaScript program that the webmaster embeds in their web pages
  – every time a page is visited, the program is executed
  – the program records the following information about each request:
    • user identifier
    • the page being fetched

Application 2: Personalized Search
• Records user queries and clicks across Google properties
• Users browse their search histories and request personalized search results based on their historical usage patterns
• One Bigtable:
  – the row name is the userid
  – a column family is reserved for each action type, e.g., web queries, clicks
  – user profiles are generated using MapReduce
    • these profiles personalize live search results
  – replicated geographically to reduce latency and increase availability

Recap: Google Scale and Philosophy
• Lots of data
  – copies of the web, satellite data, user data, email and USENET, Subversion backing store
• Workloads are large and easily parallelizable
• No commercial system big enough
  – couldn't afford it if there was one
  – might not have made appropriate design choices
  – but truckloads of low-cost machines
    • 450,000 machines (NYTimes estimate, June 14th 2006)
• Failures are the norm
  – even reliable systems fail at Google scale
• Software must tolerate failures
  – which machine an application is running on should not matter
  – firm believers in the "end-to-end" argument
• Care about perf/$, not absolute machine perf