TRANSCRIPT
INSE 6620 (Cloud Computing Security and Privacy)
Cloud Computing 101
Prof. Lingyu Wang
1
Enabling Technologies
Cloud computing relies on:
1. Hardware advancements
2. Web x.0 technologies
3. Virtualization
4. Distributed file system
Ghemawat et al., The Google File System; Dean et al., MapReduce: Simplified Data Processing on Large Clusters; Chang et al., Bigtable: A Distributed Storage System for Structured Data
2
Google Server Farms
Early days…
…today
3
How Does it Work?
How are data stored? The Google File System (GFS)
How are data organized? Bigtable
How are computations supported? MapReduce
4
Google File System (GFS) Motivation
Need a scalable DFS for
Large distributed data-intensive applications
Performance, reliability, scalability, and availability
More than a traditional DFS
Component failure is the norm, not the exception
Built from inexpensive commodity components
Files are large (multi-GB)
Workloads: large streaming reads, sequential writes
Co-design of applications and file system API
Sustained bandwidth more critical than low latency
5
File Structure
Files are divided into chunks
Fixed-size chunks (64 MB)
Replicated over chunkservers, called replicas
3 replicas by default
Unique 64-bit chunk handles
Chunks stored as Linux files
[Figure: a file is divided into chunks, each stored as blocks of a Linux file]
6
Architecture
[Figure: GFS architecture; the client exchanges metadata with the master and data with the chunkservers]
Contact single master
Obtain chunk locations
Contact one of the chunkservers
Obtain data
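A minimal sketch of this read path, assuming hypothetical master and chunkserver objects (lookup and read_chunk are illustrative names, not the actual GFS API):

# Sketch of a GFS-style read: metadata from the master, data from a chunkserver.
CHUNK_SIZE = 64 * 1024 * 1024  # fixed-size 64 MB chunks

def read(master, chunkservers, filename, offset, length):
    # 1. Translate the byte offset into a chunk index within the file
    chunk_index = offset // CHUNK_SIZE

    # 2. Ask the single master for the chunk handle and replica locations (metadata only)
    handle, replica_locations = master.lookup(filename, chunk_index)

    # 3. Contact one of the chunkservers holding a replica and read the data
    server = chunkservers[replica_locations[0]]
    return server.read_chunk(handle, offset % CHUNK_SIZE, length)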
7
Architecture - Master
Master stores three types of metadata
File & chunk namespaces
Mapping from files to chunks
Location of chunk replicas
Stored in memory
Heartbeats
Having one master
Global knowledge allows better placement / replication
Simplifies design
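A rough picture of the master's in-memory metadata (illustrative structures and values, not the actual GFS layout):

# Sketch of the three kinds of metadata kept in the master's memory.
master_state = {
    # 1. File and chunk namespaces
    "namespace": {"/logs/web-00", "/logs/web-01"},
    # 2. Mapping from each file to its ordered list of chunk handles
    "file_to_chunks": {"/logs/web-00": [0x1A2B, 0x3C4D]},
    # 3. Locations of each chunk's replicas, kept fresh via heartbeats
    #    rather than stored persistently
    "chunk_locations": {0x1A2B: ["chunkserver-7", "chunkserver-12", "chunkserver-31"]},
}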
8
Mutation Operations
Primary replica
Holds lease assigned by master
Assigns serial order for all mutation operations performed on replicas
Write operation
1-2: client obtains replica locations and identity of primary replica
3: client pushes data to replicas
4: client issues update request to primary
5: primary forwards/performs write request
6: primary receives replies from replicas
7: primary replies to client
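A client-side sketch of steps 1-7 with illustrative names (not the real GFS API; in the real system the data in step 3 is pushed along a chain of chunkservers, which is simplified here):

# Sketch of the GFS write path from the client's perspective (names are illustrative).
def write(master, filename, chunk_index, data):
    # Steps 1-2: obtain replica locations and the identity of the primary (lease holder)
    primary, secondaries = master.get_lease_holder(filename, chunk_index)

    # Step 3: push the data to all replicas (buffered at the replicas, not yet applied)
    for replica in [primary] + secondaries:
        replica.push_data(data)

    # Step 4: ask the primary to apply the write; the primary assigns a serial order
    # Step 5: the primary forwards the write request to the secondary replicas
    # Step 6: the secondaries reply to the primary once they have applied the mutation
    # Step 7: the primary replies to the client (or reports which replicas failed)
    return primary.apply_write(filename, chunk_index, secondaries)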
9
Fault Tolerance and Diagnosis
Fast Recovery
Both master and chunkservers are designed to restart in seconds
Chunk replication
Each chunk is replicated on multiple chunkservers on different racks
Master replication
Master’s state is replicated
Monitoring outside GFS may restart the master process
Data integrity
Checksumming to detect corruption of stored data
Each chunkserver independently verifies integrity
Same data may look different on different chunkservers
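A sketch of per-block checksumming as a chunkserver might apply it on every read; the 64 KB block size and 32-bit checksums follow the paper's description, but the code itself is purely illustrative:

# Illustrative per-block checksum verification on a chunkserver.
import zlib

BLOCK_SIZE = 64 * 1024  # checksums kept per 64 KB block of a chunk

def checksum_blocks(chunk_bytes):
    """Compute one 32-bit checksum per block when data is written."""
    return [zlib.crc32(chunk_bytes[i:i + BLOCK_SIZE])
            for i in range(0, len(chunk_bytes), BLOCK_SIZE)]

def verify(chunk_bytes, stored_checksums):
    """Re-compute checksums on read; a mismatch means local corruption."""
    return checksum_blocks(chunk_bytes) == stored_checksums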
10
Conclusion
Major Innovations
File system API tailored to stylized workload
Single-master design to simplify coordination
Metadata fit in memory
Flat namespace
11
MapReduce Motivation
Recall “Cost associativity”: 1k servers * 1 hr = 1 server * 1k hrs
Nice, but how?
How to run my task on 1k servers?
Distributed computing: many things to worry about
Customized task: can’t use standard applications
MapReduce: a programming model/abstraction that supports this while hiding messy details:
Parallelization
Data distribution
Fault tolerance
Load balancing
12
Map/Reduce
Inspired by LISP
(map square '(1 2 3 4))
(1 4 9 16)
(reduce + '(1 4 9 16))
(+ 16 (+ 9 (+ 4 1)))
30
(reduce + (map square (map - l1 l2)))
13
Programming Model
Input & output: each a set of key/value pairs
Programmer specifies two functions:
map(in_key, in_value) -> list(out_key, intermediate_value)
Processes input key/value pair to generate intermediate pairs
(transparently, the underlying system groups/sorts intermediate values based on out_keys)
reduce(out_key, list(intermediate_value)) -> list(out_value)
Given all intermediate values for a particular key, produces a set of merged output values (usually just one)
Many real-world problems can be represented using these two functions
14
Example: Count Word Occurrences
Input consists of (url, contents) pairs
map(key=url, val=contents):
For each word w in contents, emit (w, “1”)
reduce(key=word, values=uniq_counts):
Sum all “1”s in the values list
Emit result (word, sum)
15
Example: Count Word Occurrences
map(key=url, val=contents):
For each word w in contents, emit (w, “1”)
reduce(key=word, values=uniq_counts):
Sum all “1”s in the values list
Emit result (word, sum)
[Figure: word-count dataflow]
Input documents: “see bob throw”, “see spot run”
Map output: (see, 1) (bob, 1) (throw, 1) (see, 1) (spot, 1) (run, 1)
After grouping/sorting: (bob, [1]) (run, [1]) (see, [1, 1]) (spot, [1]) (throw, [1])
Reduce output: (bob, 1) (run, 1) (see, 2) (spot, 1) (throw, 1)
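The same dataflow as a tiny self-contained sketch of the programming model (plain Python for illustration, not the MapReduce library; the grouping step below stands in for the system's shuffle/sort):

# Self-contained word-count sketch of the map/group/reduce dataflow shown above.
from collections import defaultdict

def map_fn(url, contents):
    for word in contents.split():
        yield (word, 1)                      # emit (w, "1") for each word

def reduce_fn(word, values):
    return (word, sum(values))               # sum all the 1s for this word

documents = {"doc1": "see bob throw", "doc2": "see spot run"}

# Grouping/sorting step normally performed by the underlying system
groups = defaultdict(list)
for url, contents in documents.items():
    for key, value in map_fn(url, contents):
        groups[key].append(value)

print([reduce_fn(w, vs) for w, vs in sorted(groups.items())])
# [('bob', 1), ('run', 1), ('see', 2), ('spot', 1), ('throw', 1)]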
16
Example: Distributed Grep
Input consists of (url+offset, single line)
map(key=url+offset, val=line):
If the line matches the regexp, emit (line, “1”)
reduce(key=line, values=uniq_counts):
Don’t do anything; just emit line
17
Reverse Web-Link Graph
Map
For each target URL found in page source
Emit a <target, source> pair
Reduce
Concatenate a list of all source URLs
Output: <target, list(source)> pairs
18
Inverted Index
Map
For each word in a document, emit a <word, document ID> pair
Reduce
For each word, sort the corresponding document IDs and output a <word, list(document ID)> pair
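A minimal sketch of this job in the same style as the word-count example (illustrative Python, hypothetical document IDs):

# Inverted index: map emits (word, document ID); reduce collects the IDs per word.
from collections import defaultdict

def map_fn(doc_id, contents):
    for word in contents.split():
        yield (word, doc_id)

def reduce_fn(word, doc_ids):
    return (word, sorted(set(doc_ids)))      # <word, list(document ID)>

docs = {"d1": "cloud security", "d2": "cloud privacy"}
groups = defaultdict(list)
for doc_id, contents in docs.items():
    for word, d in map_fn(doc_id, contents):
        groups[word].append(d)

print(sorted(reduce_fn(w, ids) for w, ids in groups.items()))
# [('cloud', ['d1', 'd2']), ('privacy', ['d2']), ('security', ['d1'])]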
19
More Examples
Distributed sort
Map: extracts key from each record, emits a <key, record> pair
Reduce: emits all pairs unchanged
Relies on underlying partitioning and ordering functionalities
20
Widely Used at Google
Example uses:
distributed grep
distributed sort
web link-graph reversal
term-vector per host
web access log stats
inverted index construction
document clustering
machine learning
statistical machine translation
...
21
Usage in Aug 2004
Number of jobs: 29,423
Average job completion time: 634 secs
Machine days used: 79,186 days
Input data read: 3,288 TB
Intermediate data produced: 758 TB
Output data written: 193 TB
Average worker machines per job: 157
Average worker deaths per job: 1.2
Average map tasks per job: 3,351
Average reduce tasks per job: 55
Unique map implementations: 395
Unique reduce implementations: 269
Unique map/reduce combinations: 426
22
Implementation Overview
Typical cluster:
100s-1000s of 2-CPU x86 machines, 2-4 GB of memory
100 Mbps or 1 Gbps networking, but limited bisection bandwidth
Storage is on local IDE disks
GFS: distributed file system manages the data
Job scheduling system: jobs made up of tasks, scheduler assigns tasks to machines
Implementation is a C++ library linked into user programs
23
Parallelization
How is the task distributed?
Partition input key/value pairs into equal-sized chunks of 16-64 MB, run map() tasks in parallel
After all map()s are complete, consolidate all emitted values for each unique emitted key
Now partition the space of output map keys, and run reduce() in parallel
Typical setting:
2,000 machines
M = 200,000
R = 5,000
24
Execution Overview
(0) mapreduce(spec, &result)
[Figure: MapReduce execution overview]
M input splits of 16-64 MB each
R regions
• Read all intermediate data
• Sort it by intermediate keys
Partitioning function: hash(intermediate_key) mod R
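A sketch of such a partitioning function; hashlib is used here only to get a hash that is stable across worker processes (an illustration, not the library's actual partitioner):

# Illustrative partitioner: hash(intermediate_key) mod R.
import hashlib

def partition(intermediate_key: str, R: int) -> int:
    digest = hashlib.md5(intermediate_key.encode()).hexdigest()
    return int(digest, 16) % R   # same key always maps to the same reduce region

print(partition("see", 5000), partition("bob", 5000))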
25
Execution Details
26
Task Granularity & Pipelining
Fine-granularity tasks: map tasks >> machines
Minimizes time for fault recovery
Better dynamic load balancing
Often use 200,000 map & 5,000 reduce tasks, running on 2,000 machines
27
Fault Tolerance
Worker failure handled via re-execution
Detect failure via periodic heartbeats
Re-execute completed + in-progress map tasks
Due to inaccessible results
Only re-execute in-progress reduce tasks
Results of completed tasks are stored in the global file system
Robust: lost 80 machines once, finished OK
Master failure not handled
Rare in practice
Abort and re-run at client
28
Refinement: Redundant Execution
Problem: slow workers may significantly delay completion time when close to the end of tasks
Other jobs consuming resources on the machine
Bad disks w/ soft errors transfer data slowly
Weird things: processor caches disabled
Solution: near the end of a phase, spawn backup tasks
Whichever one finishes first “wins”
Dramatically shortens job completion time
29
Refinement: Locality Optimization
Network bandwidth is a relatively scarce resource, so to save it:
Input data is stored on local disks in GFS
Schedule a map task on a machine hosting a replica
If that can’t be done, schedule it close to a replica (e.g., a host using the same switch)
Effect
Thousands of machines read input at local disk speed
Without this, rack switches limit read rate
30
Refinement: Combiner Function
Purpose: reduce data sent over the network
Combiner function: performs partial merging of intermediate data at the map worker
Typically, combiner function == reducer function
Only difference is how the output is handled
E.g., word count
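For word count, the combiner is essentially the reduce function run locally at the map worker, so the worker ships one (word, partial_sum) pair instead of many (word, 1) pairs. A small illustrative sketch:

# Combiner for word count: partial sums computed at the map worker.
from collections import Counter

def map_with_combiner(contents):
    pairs = [(w, 1) for w in contents.split()]       # raw map output
    combined = Counter()
    for word, count in pairs:                        # partial merge before the shuffle
        combined[word] += count
    return list(combined.items())                    # (word, partial_sum) sent over network

print(map_with_combiner("see bob see see bob"))
# [('see', 3), ('bob', 2)]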
31
Performance
Tests run on a cluster of 1,800 machines:
4 GB of memory, dual-processor 2 GHz Xeons
Dual 160 GB IDE disks
Gigabit Ethernet NIC, bisection bandwidth 100 Gbps
Two benchmarks:
Grep: scan 10^10 100-byte records to extract records matching a rare pattern (92K matching records)
M = 15,000 (input split size about 64 MB), R = 1
Sort: sort 10^10 100-byte records
M = 15,000 (input split size about 64 MB), R = 4,000
32
Grep
Locality optimization helps:
1,800 machines read 1 TB at peak ~31 GB/s
Without this, rack switches would limit the rate to 10 GB/s
Startup overhead is significant for short jobs
Total time about 150 seconds; 1 minute startup time
33
Sort
[Figure: Sort data transfer rates; with backup tasks disabled the run takes 44% longer, and with 200 tasks killed it takes 5% longer than normal execution]
34
Experience
Rewrote Google's production indexing system using MapReduce
Set of 10, 14, 17, 21, 24 MapReduce operations
New code is simpler, easier to understand
C++ code: 3,800 lines reduced to 700
Easier to understand and change the indexing process (from months to days)
Easier to operate
MapReduce handles failures, slow machines
Easy to improve performance
Add more machines
35
Conclusion
MapReduce has proven to be a useful abstraction
Greatly simplifies large-scale computations
Fun to use:
focus on problem, let library deal w/ messy details
36
Bigtable Motivation
Storage for (semi-)structured data
e.g., Google Earth, Google Finance, Personalized Search
Scale
Lots of data
Millions of machines
Different projects/applications
Hundreds of millions of users
37
Why Not a DBMS?
Few DBMSs support the requisite scale
Required: a DB with wide scalability, wide applicability, high performance, and high availability
Couldn’t afford it if there was one
Most DBMSs require very expensive infrastructure
DBMSs provide more than Google needs
e.g., full transactions, SQL
Google has highly optimized lower-level systems that could be exploited
GFS, Chubby, MapReduce, job scheduling
38
Bigtable
“A Bigtable is a sparse, distributed, persistent multidimensional sorted map. The map is indexed by a row key, a column key, and a timestamp; each value in the map is an uninterpreted array of bytes.”
39
Data Model
(row, column, timestamp) -> cell contents
Rows
Arbitrary strings
Access to data in a row is atomic
Ordered lexicographically
40
Data Model
Columns
Two-level name structure: column families and columns
Column family is the unit of access control
41
Data Model
Timestamps
Store different versions of data in a cell
Lookup options
Return most recent K values
Return all values
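A small sketch of the (row, column, timestamp) -> value model and the lookup options; the nested-dictionary layout is an assumption for illustration, and the sample row/columns follow the webtable example in the Bigtable paper:

# Sketch of the Bigtable data model: (row, "family:qualifier", timestamp) -> value.
table = {
    "com.cnn.www": {                                   # row key (arbitrary string)
        "contents:": [(6, "<html>v3"), (5, "<html>v2"), (4, "<html>v1")],  # newest first
        "anchor:cnnsi.com": [(9, "CNN")],              # column family "anchor"
    }
}

def lookup(row, column, k=None):
    """Return the most recent k (timestamp, value) versions of a cell, or all if k is None."""
    versions = table[row][column]
    return versions if k is None else versions[:k]

print(lookup("com.cnn.www", "contents:", k=1))   # [(6, '<html>v3')]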
42
Data Model
The row range for a table is dynamically partitioned into “tablets”
A tablet is the unit of distribution and load balancing
43
Building Blocks
Google File System (GFS)
Stores persistent data
Scheduler
Schedules jobs onto machines
Chubby
Lock service: distributed lock manager
e.g., master election, location bootstrapping
MapReduce (optional)
Data processing
Read/write Bigtable data
44
Implementation
Single-master distributed system
Three major components
Library linked into every client
One master server
Assigns tablets to tablet servers
Handles addition and expiration of tablet servers, balancing tablet-server load
Metadata operations
Many tablet servers
Tablet servers handle read and write requests to their tablets
Split tablets that have grown too large
45
Implementation
46
How to locate a Tablet?
Given a row, how do clients find the location of the tablet whose row range covers the target row?
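In the Bigtable design (Chang et al., cited earlier), clients resolve this with a three-level hierarchy: a file in Chubby points to the root tablet, the root tablet indexes the METADATA tablets, and METADATA tablets map row ranges to the user tablets; clients cache these locations. A purely illustrative lookup sketch (toy data, not Bigtable's actual METADATA format):

# Toy three-level tablet-location lookup. Level 1 (the Chubby file holding the root
# tablet's location) is omitted; we start at the root tablet itself.
import bisect

root_tablet = [("meta:m", "server-5"), ("meta:~", "server-8")]   # end key -> METADATA tablet
metadata_tablets = {                                             # end key -> user tablet
    "server-5": [("user:g", "server-11"), ("user:m", "server-12")],
    "server-8": [("user:~", "server-14")],
}

def find(entries, key):
    """Return the location stored in the first entry whose end key is >= key."""
    keys = [k for k, _ in entries]
    return entries[bisect.bisect_left(keys, key)][1]

def locate_tablet(row):
    meta_server = find(root_tablet, "meta:" + row)               # which METADATA tablet?
    return find(metadata_tablets[meta_server], "user:" + row)    # which user tablet?

print(locate_tablet("horse"))   # "server-12" in this toy example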
47
Tablet Assignment
Chubby
Tablet server registers itself by getting a lock in a specific Chubby directory
Chubby gives a “lease” on the lock, which must be renewed periodically
Server loses the lock if it gets disconnected
Master monitors this directory to find which servers exist/are alive
If a server is not contactable/has lost its lock, the master grabs the lock and reassigns its tablets
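A sketch of this registration/liveness protocol using a hypothetical lock-service interface (acquire, list, and lock_is_held are assumed methods, not the real Chubby API):

# Illustrative tablet-server registration via a directory of locks (hypothetical API).
SERVERS_DIR = "/bigtable/servers/"

def tablet_server_start(chubby, server_id):
    # The tablet server registers itself by acquiring a lock on its own file;
    # the lease on this lock must be renewed periodically or it is lost.
    return chubby.acquire(SERVERS_DIR + server_id)

def master_scan(chubby, assignments):
    # The master lists the directory to learn which tablet servers exist.
    for server_id in chubby.list(SERVERS_DIR):
        if not chubby.lock_is_held(SERVERS_DIR + server_id):
            # Server is unreachable or lost its lock: grab the lock and reassign.
            chubby.acquire(SERVERS_DIR + server_id)
            reassign_tablets(assignments, server_id)

def reassign_tablets(assignments, dead_server):
    # Mark tablets that were served by the dead server as unassigned.
    for tablet, server in assignments.items():
        if server == dead_server:
            assignments[tablet] = None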
48