TRANSCRIPT
INSE 6620 (Cloud Computing Security and Privacy)
Cloud Computing 101
Prof. Lingyu Wang
1
Enabling Technologies
Cloud computing relies on:
1. Hardware advancements
2. Web x.0 technologies
3. Virtualization
4. Distributed file system
Ghemawat et al., The Google File System; Dean et al., MapReduce: Simplified Data Processing on Large Clusters; Chang et al., Bigtable: A Distributed Storage System for Structured Data
2
Google Server Farms
Early days…
…today
3
How Does it Work?
How are data stored? The Google File System (GFS)
How are data organized? Bigtable
How are computations supported? MapReduce
4
Google File System (GFS) Motivation
Need a scalable DFS for
Large distributed data-intensive applications
Performance, reliability, scalability, and availability
More than a traditional DFS
Component failure is the norm, not the exception
Built from inexpensive commodity components
Files are large (multi-GB)
Workloads: large streaming reads, sequential writes
Co-design of applications and file system API
Sustained bandwidth more critical than low latency
5
File Structure
Files are divided into chunks
Fixed-size chunks (64 MB)
Replicated over chunkservers, called replicas
3 replicas by default
Unique 64-bit chunk handles
Chunks stored as Linux files
[Figure: a file is divided into chunks, each stored as blocks of a Linux file]
6
Architecture
[Figure: GFS architecture; the client exchanges metadata with the master and data with the chunkservers]
Contact single master
Obtain chunk locations
Contact one of the chunkservers
Obtain data
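A minimal sketch of this read path, assuming hypothetical master and chunkserver objects (lookup and read_chunk are illustrative names, not the actual GFS API):

# Sketch of a GFS-style read: metadata from the master, data from a chunkserver.
CHUNK_SIZE = 64 * 1024 * 1024  # fixed-size 64 MB chunks

def read(master, chunkservers, filename, offset, length):
    # 1. Translate the byte offset into a chunk index within the file
    chunk_index = offset // CHUNK_SIZE

    # 2. Ask the single master for the chunk handle and replica locations (metadata only)
    handle, replica_locations = master.lookup(filename, chunk_index)

    # 3. Contact one of the chunkservers holding a replica and read the data
    server = chunkservers[replica_locations[0]]
    return server.read_chunk(handle, offset % CHUNK_SIZE, length)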
7
Architecture - Master
Master stores three types of metadata
File & chunk namespaces
Mapping from files to chunks
Location of chunk replicas
Stored in memory
Heartbeats
Having one master
Global knowledge allows better placement / replication
Simplifies design
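A rough picture of the master's in-memory metadata (illustrative structures and values, not the actual GFS layout):

# Sketch of the three kinds of metadata kept in the master's memory.
master_state = {
    # 1. File and chunk namespaces
    "namespace": {"/logs/web-00", "/logs/web-01"},
    # 2. Mapping from each file to its ordered list of chunk handles
    "file_to_chunks": {"/logs/web-00": [0x1A2B, 0x3C4D]},
    # 3. Locations of each chunk's replicas, kept fresh via heartbeats
    #    rather than stored persistently
    "chunk_locations": {0x1A2B: ["chunkserver-7", "chunkserver-12", "chunkserver-31"]},
}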
8
Mutation Operations
Primary replica
Holds lease assigned by master
Assigns serial order for all mutation operations performed on replicas
Write operation
1-2: client obtains replica locations and identity of primary replica
3: client pushes data to replicas
4: client issues update request to primary
5: primary forwards/performs write request
6: primary receives replies from replicas
7: primary replies to client
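A client-side sketch of steps 1-7 with illustrative names (not the real GFS API; in the real system the data in step 3 is pushed along a chain of chunkservers, which is simplified here):

# Sketch of the GFS write path from the client's perspective (names are illustrative).
def write(master, filename, chunk_index, data):
    # Steps 1-2: obtain replica locations and the identity of the primary (lease holder)
    primary, secondaries = master.get_lease_holder(filename, chunk_index)

    # Step 3: push the data to all replicas (buffered at the replicas, not yet applied)
    for replica in [primary] + secondaries:
        replica.push_data(data)

    # Step 4: ask the primary to apply the write; the primary assigns a serial order
    # Step 5: the primary forwards the write request to the secondary replicas
    # Step 6: the secondaries reply to the primary once they have applied the mutation
    # Step 7: the primary replies to the client (or reports which replicas failed)
    return primary.apply_write(filename, chunk_index, secondaries)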
9
Fault Tolerance and Diagnosis
Fast Recovery
Both master and chunkservers are designed to restart in seconds
Chunk replication
Each chunk is replicated on multiple chunkservers on different racks
Master replication
Master’s state is replicated
Monitoring outside GFS may restart the master process
Data integrity
Checksumming to detect corruption of stored data
Each chunkserver independently verifies integrity
Same data may look different on different chunkservers
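A sketch of per-block checksumming as a chunkserver might apply it on every read; the 64 KB block size and 32-bit checksums follow the paper's description, but the code itself is purely illustrative:

# Illustrative per-block checksum verification on a chunkserver.
import zlib

BLOCK_SIZE = 64 * 1024  # checksums kept per 64 KB block of a chunk

def checksum_blocks(chunk_bytes):
    """Compute one 32-bit checksum per block when data is written."""
    return [zlib.crc32(chunk_bytes[i:i + BLOCK_SIZE])
            for i in range(0, len(chunk_bytes), BLOCK_SIZE)]

def verify(chunk_bytes, stored_checksums):
    """Re-compute checksums on read; a mismatch means local corruption."""
    return checksum_blocks(chunk_bytes) == stored_checksums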
10
Conclusion
Major Innovations
File system API tailored to stylized workload
Single-master design to simplify coordination
Metadata fit in memory
Flat namespace
11
MapReduce Motivation
Recall “Cost associativity”: 1k servers * 1 hr = 1 server * 1k hrs
Nice, but how?
How to run my task on 1k servers?
Distributed computing: many things to worry about
Customized task: can’t use standard applications
MapReduce: a programming model/abstraction that supports this while hiding messy details:
Parallelization
Data distribution
Fault tolerance
Load balancing
12
Map/Reduce
Inspired by LISP
(map square '(1 2 3 4))
(1 4 9 16)
(reduce + '(1 4 9 16))
(+ 16 (+ 9 (+ 4 1)))
30
(reduce + (map square (map - l1 l2)))
13
Programming Model
Input & output: each a set of key/value pairs
Programmer specifies two functions:
map(in_key, in_value) -> list(out_key, intermediate_value)
Processes input key/value pair to generate intermediate pairs
(transparently, the underlying system groups/sorts intermediate values based on out_keys)
reduce(out_key, list(intermediate_value)) -> list(out_value)
Given all intermediate values for a particular key, produces a set of merged output values (usually just one)
Many real-world problems can be represented using these two functions
14
Example: Count Word Occurrences
Input consists of (url, contents) pairs
map(key=url, val=contents):
For each word w in contents, emit (w, “1”)
reduce(key=word, values=uniq_counts):
Sum all “1”s in the values list
Emit result (word, sum)
15
Example: Count Word Occurrences
map(key=url, val=contents):
For each word w in contents, emit (w, “1”)
reduce(key=word, values=uniq_counts):
Sum all “1”s in the values list
Emit result (word, sum)
[Figure: word-count dataflow]
Input documents: “see bob throw”, “see spot run”
Map output: (see, 1) (bob, 1) (throw, 1) (see, 1) (spot, 1) (run, 1)
After grouping/sorting: (bob, [1]) (run, [1]) (see, [1, 1]) (spot, [1]) (throw, [1])
Reduce output: (bob, 1) (run, 1) (see, 2) (spot, 1) (throw, 1)
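The same dataflow as a tiny self-contained sketch of the programming model (plain Python for illustration, not the MapReduce library; the grouping step below stands in for the system's shuffle/sort):

# Self-contained word-count sketch of the map/group/reduce dataflow shown above.
from collections import defaultdict

def map_fn(url, contents):
    for word in contents.split():
        yield (word, 1)                      # emit (w, "1") for each word

def reduce_fn(word, values):
    return (word, sum(values))               # sum all the 1s for this word

documents = {"doc1": "see bob throw", "doc2": "see spot run"}

# Grouping/sorting step normally performed by the underlying system
groups = defaultdict(list)
for url, contents in documents.items():
    for key, value in map_fn(url, contents):
        groups[key].append(value)

print([reduce_fn(w, vs) for w, vs in sorted(groups.items())])
# [('bob', 1), ('run', 1), ('see', 2), ('spot', 1), ('throw', 1)]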
16
Example: Distributed Grep
Input consists of (url+offset, single line)
map(key=url+offset, val=line):
If the line matches the regexp, emit (line, “1”)
reduce(key=line, values=uniq_counts):
Don’t do anything; just emit line
17
Reverse Web-Link Graph
Map
For each target URL found in page source
Emit a <target, source> pair
Reduce
Concatenate a list of all source URLs
Output: <target, list(source)> pairs
18
Inverted Index
Map
For each word in a document, emit a <word, document ID> pair
Reduce
For each word, sort the corresponding document IDs and output a <word, list(document ID)> pair
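A minimal sketch of this job in the same style as the word-count example (illustrative Python, hypothetical document IDs):

# Inverted index: map emits (word, document ID); reduce collects the IDs per word.
from collections import defaultdict

def map_fn(doc_id, contents):
    for word in contents.split():
        yield (word, doc_id)

def reduce_fn(word, doc_ids):
    return (word, sorted(set(doc_ids)))      # <word, list(document ID)>

docs = {"d1": "cloud security", "d2": "cloud privacy"}
groups = defaultdict(list)
for doc_id, contents in docs.items():
    for word, d in map_fn(doc_id, contents):
        groups[word].append(d)

print(sorted(reduce_fn(w, ids) for w, ids in groups.items()))
# [('cloud', ['d1', 'd2']), ('privacy', ['d2']), ('security', ['d1'])]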
19
More Examples
Distributed sort
Map: extracts key from each record, emits a <key, record> pair
Reduce: emits all pairs unchanged
Relies on underlying partitioning and ordering functionalities
20
Widely Used at Google
Example uses:
distributed grep
distributed sort
web link-graph reversal
term-vector per host
web access log stats
inverted index construction
document clustering
machine learning
statistical machine translation
...
21
Usage in Aug 2004
Number of jobs: 29,423
Average job completion time: 634 secs
Machine days used: 79,186 days
Input data read: 3,288 TB
Intermediate data produced: 758 TB
Output data written: 193 TB
Average worker machines per job: 157
Average worker deaths per job: 1.2
Average map tasks per job: 3,351
Average reduce tasks per job: 55
Unique map implementations: 395
Unique reduce implementations: 269
Unique map/reduce combinations: 426
22
Implementation Overview
Typical cluster:
100s-1000s of 2-CPU x86 machines, 2-4 GB of memory
100 Mbps or 1 Gbps networking, but limited bisection bandwidth
Storage is on local IDE disks
GFS: distributed file system manages the data
Job scheduling system: jobs made up of tasks, scheduler assigns tasks to machines
Implementation is a C++ library linked into user programs
23
Parallelization
How is the task distributed?
Partition input key/value pairs into equal-sized chunks of 16-64 MB, run map() tasks in parallel
After all map()s are complete, consolidate all emitted values for each unique emitted key
Now partition the space of output map keys, and run reduce() in parallel
Typical setting:
2,000 machines
M = 200,000
R = 5,000
24
Execution Overview
(0) mapreduce(spec, &result)
[Figure: MapReduce execution overview]
M input splits of 16-64 MB each
R regions
• Read all intermediate data
• Sort it by intermediate keys
Partitioning function: hash(intermediate_key) mod R
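A sketch of such a partitioning function; hashlib is used here only to get a hash that is stable across worker processes (an illustration, not the library's actual partitioner):

# Illustrative partitioner: hash(intermediate_key) mod R.
import hashlib

def partition(intermediate_key: str, R: int) -> int:
    digest = hashlib.md5(intermediate_key.encode()).hexdigest()
    return int(digest, 16) % R   # same key always maps to the same reduce region

print(partition("see", 5000), partition("bob", 5000))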
25
Execution Details
26
Task Granularity & Pipelining
Fine-granularity tasks: map tasks >> machines
Minimizes time for fault recovery
Better dynamic load balancing
Often use 200,000 map & 5,000 reduce tasks, running on 2,000 machines
27
Fault Tolerance
Worker failure handled via re-execution
Detect failure via periodic heartbeats
Re-execute completed + in-progress map tasks
Due to inaccessible results
Only re-execute in-progress reduce tasks
Results of completed tasks are stored in the global file system
Robust: lost 80 machines once, finished OK
Master failure not handled
Rare in practice
Abort and re-run at client
28
Refinement: Redundant Execution
Problem: slow workers may significantly delay completion time when close to the end of tasks
Other jobs consuming resources on the machine
Bad disks w/ soft errors transfer data slowly
Weird things: processor caches disabled
Solution: near the end of a phase, spawn backup tasks
Whichever one finishes first “wins”
Dramatically shortens job completion time
29
Refinement: Locality Optimization
Network bandwidth is a relatively scarce resource, so to save it:
Input data is stored on local disks in GFS
Schedule a map task on a machine hosting a replica
If that can’t be done, schedule it close to a replica (e.g., a host using the same switch)
Effect
Thousands of machines read input at local disk speed
Without this, rack switches limit read rate
30
Refinement: Combiner Function
Purpose: reduce data sent over the network
Combiner function: performs partial merging of intermediate data at the map worker
Typically, combiner function == reducer function
Only difference is how the output is handled
E.g., word count
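For word count, the combiner is essentially the reduce function run locally at the map worker, so the worker ships one (word, partial_sum) pair instead of many (word, 1) pairs. A small illustrative sketch:

# Combiner for word count: partial sums computed at the map worker.
from collections import Counter

def map_with_combiner(contents):
    pairs = [(w, 1) for w in contents.split()]       # raw map output
    combined = Counter()
    for word, count in pairs:                        # partial merge before the shuffle
        combined[word] += count
    return list(combined.items())                    # (word, partial_sum) sent over network

print(map_with_combiner("see bob see see bob"))
# [('see', 3), ('bob', 2)]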
31
Performance
Tests run on a cluster of 1,800 machines:
4 GB of memory, dual-processor 2 GHz Xeons
Dual 160 GB IDE disks
Gigabit Ethernet NIC, bisection bandwidth 100 Gbps
Two benchmarks:
Grep: scan 10^10 100-byte records to extract records matching a rare pattern (92K matching records)
M = 15,000 (input split size about 64 MB), R = 1
Sort: sort 10^10 100-byte records
M = 15,000 (input split size about 64 MB), R = 4,000
32
Grep
Locality optimization helps:
1,800 machines read 1 TB at peak ~31 GB/s
Without this, rack switches would limit the rate to 10 GB/s
Startup overhead is significant for short jobs
Total time about 150 seconds; 1 minute startup time
33
Sort
[Figure: Sort data transfer rates; with backup tasks disabled the run takes 44% longer, and with 200 tasks killed it takes 5% longer than normal execution]
34
Experience
Rewrote Google's production indexing system using MapReduce
Set of 10, 14, 17, 21, 24 MapReduce operations
New code is simpler, easier to understand
C++ code: 3,800 lines reduced to 700
Easier to understand and change the indexing process (from months to days)
Easier to operate
MapReduce handles failures, slow machines
Easy to improve performance
Add more machines
35
Conclusion
MapReduce has proven to be a useful abstraction
Greatly simplifies large-scale computations
Fun to use:
focus on problem, let library deal w/ messy details
36
Bigtable Motivation
Storage for (semi-)structured data
e.g., Google Earth, Google Finance, Personalized Search
Scale
Lots of data
Millions of machines
Different projects/applications
Hundreds of millions of users
37
Why Not a DBMS?
Few DBMSs support the requisite scale
Required: a DB with wide scalability, wide applicability, high performance, and high availability
Couldn’t afford it if there was one
Most DBMSs require very expensive infrastructure
DBMSs provide more than Google needs
e.g., full transactions, SQL
Google has highly optimized lower-level systems that could be exploited
GFS, Chubby, MapReduce, job scheduling
38
Bigtable
“A Bigtable is a sparse, distributed, persistent multidimensional sorted map. The map is indexed by a row key, a column key, and a timestamp; each value in the map is an uninterpreted array of bytes.”
39
Data Model
(row, column, timestamp) -> cell contents
Rows
Arbitrary strings
Access to data in a row is atomic
Ordered lexicographically
40
Data Model
Columns
Two-level name structure: column families and columns
Column family is the unit of access control
41
Data Model
Timestamps
Store different versions of data in a cell
Lookup options
Return most recent K values
Return all values
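A small sketch of the (row, column, timestamp) -> value model and the lookup options; the nested-dictionary layout is an assumption for illustration, and the sample row/columns follow the webtable example in the Bigtable paper:

# Sketch of the Bigtable data model: (row, "family:qualifier", timestamp) -> value.
table = {
    "com.cnn.www": {                                   # row key (arbitrary string)
        "contents:": [(6, "<html>v3"), (5, "<html>v2"), (4, "<html>v1")],  # newest first
        "anchor:cnnsi.com": [(9, "CNN")],              # column family "anchor"
    }
}

def lookup(row, column, k=None):
    """Return the most recent k (timestamp, value) versions of a cell, or all if k is None."""
    versions = table[row][column]
    return versions if k is None else versions[:k]

print(lookup("com.cnn.www", "contents:", k=1))   # [(6, '<html>v3')]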
42
Data Model
The row range for a table is dynamically partitioned into “tablets”
A tablet is the unit of distribution and load balancing
43
Building Blocks
Google File System (GFS)
Stores persistent data
Scheduler
Schedules jobs onto machines
Chubby
Lock service: distributed lock manager
e.g., master election, location bootstrapping
MapReduce (optional)
Data processing
Read/write Bigtable data
44
Implementation
Single-master distributed system
Three major components
Library linked into every client
One master server
Assigns tablets to tablet servers
Handles addition and expiration of tablet servers, balancing tablet-server load
Metadata operations
Many tablet servers
Tablet servers handle read and write requests to their tablets
Split tablets that have grown too large
45
Implementation
46
How to locate a Tablet?
Given a row, how do clients find the location of the tablet whose row range covers the target row?
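In the Bigtable design (Chang et al., cited earlier), clients resolve this with a three-level hierarchy: a file in Chubby points to the root tablet, the root tablet indexes the METADATA tablets, and METADATA tablets map row ranges to the user tablets; clients cache these locations. A purely illustrative lookup sketch (toy data, not Bigtable's actual METADATA format):

# Toy three-level tablet-location lookup. Level 1 (the Chubby file holding the root
# tablet's location) is omitted; we start at the root tablet itself.
import bisect

root_tablet = [("meta:m", "server-5"), ("meta:~", "server-8")]   # end key -> METADATA tablet
metadata_tablets = {                                             # end key -> user tablet
    "server-5": [("user:g", "server-11"), ("user:m", "server-12")],
    "server-8": [("user:~", "server-14")],
}

def find(entries, key):
    """Return the location stored in the first entry whose end key is >= key."""
    keys = [k for k, _ in entries]
    return entries[bisect.bisect_left(keys, key)][1]

def locate_tablet(row):
    meta_server = find(root_tablet, "meta:" + row)               # which METADATA tablet?
    return find(metadata_tablets[meta_server], "user:" + row)    # which user tablet?

print(locate_tablet("horse"))   # "server-12" in this toy example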
47
Tablet Assignment
Chubby
Tablet server registers itself by getting a lock in a specific Chubby directory
Chubby gives a “lease” on the lock, which must be renewed periodically
Server loses the lock if it gets disconnected
Master monitors this directory to find which servers exist/are alive
If a server is not contactable/has lost its lock, the master grabs the lock and reassigns its tablets
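A sketch of this registration/liveness protocol using a hypothetical lock-service interface (acquire, list, and lock_is_held are assumed methods, not the real Chubby API):

# Illustrative tablet-server registration via a directory of locks (hypothetical API).
SERVERS_DIR = "/bigtable/servers/"

def tablet_server_start(chubby, server_id):
    # The tablet server registers itself by acquiring a lock on its own file;
    # the lease on this lock must be renewed periodically or it is lost.
    return chubby.acquire(SERVERS_DIR + server_id)

def master_scan(chubby, assignments):
    # The master lists the directory to learn which tablet servers exist.
    for server_id in chubby.list(SERVERS_DIR):
        if not chubby.lock_is_held(SERVERS_DIR + server_id):
            # Server is unreachable or lost its lock: grab the lock and reassign.
            chubby.acquire(SERVERS_DIR + server_id)
            reassign_tablets(assignments, server_id)

def reassign_tablets(assignments, dead_server):
    # Mark tablets that were served by the dead server as unassigned.
    for tablet, server in assignments.items():
        if server == dead_server:
            assignments[tablet] = None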
48